OHDSI / Vocabulary-v5.0

Build process for the OHDSI Standardized Vocabularies. Currently not available as independent release.
The Unlicense

Feature Request: Vocabulary Contribution Linter #871

Open haydenbspence opened 1 year ago

haydenbspence commented 1 year ago

Describe the problem in content or desired feature

This feature introduces an automated linter for community contributions, significantly enhancing the quality control and integration process within the system. Through a series of steps, it standardizes and validates contributions using predefined rules, ensuring alignment with specific templates and the common data model.

How to find it

The working branch dbt-contribution is located on a fork of Vocabulary-v5.0.

A minimal working example can be found here for the lint portion.

Expected adjustments

A clear and concise description of what you expect the Vocabulary Team to change. If you have collected information in a spreadsheet, please share it with us.

Screenshots

Diagram of Process

Additional context: Process

  1. The standard process is followed for submitting a community contribution, as outlined by the Vocabulary WG.
  2. A GitHub Action on a routine schedule triggers a Service Account to pull submitted Contributions through the GDrive API.
  3. The contributions are passed to the Contribution Linter in individual packages (1 contribution = 1 folder = 1 package).
  4. The Contribution Linter is informed by checklists for each template, the common data model, and other prescribed rules.
  5. The Contribution is then stored in a temporary DuckDB flat file, which can later be used by dbt models for integration into the CDM.
  6. A report is saved to each folder (Contribution) detailing where the Contribution Linter made adjustments (e.g. removing trailing whitespace), any cautions (things the submitter or Voc Team should look into), and any errors (things that would prohibit the contribution from inclusion).
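The adjustment/caution/error classification in the last step could be sketched roughly as below. All field names and rules here are illustrative placeholders, not the actual linter's checks:

```python
# Sketch of one linter pass over a contribution: normalize fields, then
# classify findings as adjustments, cautions, or errors.
# REQUIRED_FIELDS is a hypothetical template rule, not the real checklist.
REQUIRED_FIELDS = {"concept_name", "vocabulary_id"}

def lint_contribution(rows):
    """Lint a contribution given as a list of dicts of string fields.

    Returns the cleaned rows plus a report with three severity levels,
    mirroring the adjustments/cautions/errors split described above.
    """
    report = {"adjustments": [], "cautions": [], "errors": []}
    cleaned = []
    for i, row in enumerate(rows):
        fixed = {}
        for key, value in row.items():
            stripped = value.strip()
            if stripped != value:
                # Adjustment: silently fixable, but recorded for the report.
                report["adjustments"].append(f"row {i}: trimmed whitespace in '{key}'")
            fixed[key] = stripped
        missing = REQUIRED_FIELDS - fixed.keys()
        if missing:
            # Error: would prohibit the contribution from inclusion.
            report["errors"].append(f"row {i}: missing required fields {sorted(missing)}")
        if fixed.get("concept_name", "").isupper():
            # Caution: something the submitter or Voc Team should look into.
            report["cautions"].append(f"row {i}: concept_name is all uppercase")
        cleaned.append(fixed)
    return cleaned, report
```

The per-folder report described in step 6 would then just be this `report` dict serialized alongside the contribution.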

Additional context: Next Steps

  1. Create test contributions for linter adjustment features.
  2. Style report for user friendly reading.
  3. Security and Authentications: Discuss with Vocabulary Team and Athena RE: APIs, Service Accounts, GDrive Settings.
jaanli commented 1 year ago

This is amazing work @haydenbspence ! Thank you for following up on our brief discussion about this idea so thoroughly after the Vocabulary Working Group!! Happy to help @onefact as we have been building the COST/PRICE table in consultation with the common data model working group and need this linter as well :)

TinyRickC137 commented 1 year ago

Wow! This is an amazing project! @cgreich @aostropolets @Alexdavv

We will dive into it

jaanli commented 1 year ago

I agree @TinyRickC137 ! We at @onefact are happy to help out on this, whether thinking through the user journey or the technical side.

For tech that might be needed:

I think embeddings from fasttext (https://fasttext.cc/docs/en/support.html) or other simple models will be sufficient for "linting" purposes (finding duplicates or semantically similar things and warning/notifying a user through a CI/CD pipeline). But if needed we have trained several embedding models on EHR/claims/price data and work with one of the large language models I helped build in grad school (https://arxiv.org/abs/1904.05342).
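The duplicate-finding check described above can be sketched with plain cosine similarity. The vectors here are tiny stand-ins; in a real pipeline they would come from a fastText or similar embedding model, and the threshold would need tuning:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_near_duplicates(embeddings, threshold=0.95):
    """Return pairs of terms whose embeddings exceed the similarity threshold.

    `embeddings` maps term -> vector. This is an illustrative sketch: a CI
    check could warn the submitter for each pair returned here.
    """
    terms = list(embeddings)
    pairs = []
    for i, t1 in enumerate(terms):
        for t2 in terms[i + 1:]:
            if cosine_similarity(embeddings[t1], embeddings[t2]) >= threshold:
                pairs.append((t1, t2))
    return pairs
```

The pairwise loop is quadratic, which is fine for single contributions; checking against the whole vocabulary would call for an approximate nearest-neighbor index instead.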

For the PRICE/COST table in OMOP we are building, it is simpler in some ways because the vocabulary is limited to what is already in OHDSI, pending a new vocabulary describing all payors in the United States (and other countries).

This is all very exciting!!

haydenbspence commented 1 year ago

@jaanli

Wondering what your thoughts are on using the DuckDB Postgres Scanner via GitHub Actions to query pgvector for stored embeddings?

I am also looking at Replicate, but it is in Alpha, and I assumed DuckDB would perform reasonably well for this too.
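For reference, the DuckDB-to-Postgres path suggested above might look roughly like the following: attach the Postgres database through DuckDB's postgres extension, then copy the embedding table locally. The connection string, schema, and column names are placeholders, and pgvector's `vector` type may surface as text through the scanner, so treat this purely as a sketch:

```sql
-- Sketch only: connection details and table/column names are placeholders.
INSTALL postgres;
LOAD postgres;

-- Attach the Postgres database where pgvector lives.
ATTACH 'host=localhost dbname=vocab user=ci' AS pg (TYPE postgres);

-- Pull stored embeddings into a local DuckDB table for linting.
-- pgvector columns may come through as text and need casting.
CREATE TABLE local_embeddings AS
SELECT concept_id, embedding
FROM pg.public.concept_embeddings;
```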