haydenbspence opened this issue 1 year ago
This is amazing work @haydenbspence ! Thank you for following up on our brief discussion about this idea so thoroughly after the Vocabulary Working Group!! Happy to help @onefact as we have been building the COST/PRICE table in consultation with the common data model working group and need this linter as well :)
Wow! This is an amazing project! @cgreich @aostropolets @Alexdavv
We will dive into it.
I agree @TinyRickC137 ! We are happy to help out on this @onefact to think through the user journey or technical side.
For tech that might be needed:
I think embeddings from fasttext (https://fasttext.cc/docs/en/support.html) or other simple models will be sufficient for "linting" purposes (finding duplicates or semantically similar things and warning/notifying a user through a CI/CD pipeline). But if needed we have trained several embedding models on EHR/claims/price data and work with one of the large language models I helped build in grad school (https://arxiv.org/abs/1904.05342).
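The "linting" idea above can be sketched as a nearest-neighbor check over embeddings. The sketch below uses a toy character-trigram hashing embedding purely as a stand-in for a real model such as fasttext; the function names, the threshold, and the embedding itself are illustrative assumptions, not the actual pipeline.

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy character-trigram hashing embedding.

    A stand-in for a real model such as fasttext: similar strings share
    trigrams, so they land near each other in vector space.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        bucket = int(hashlib.md5(padded[i:i + 3].encode()).hexdigest(), 16)
        vec[bucket % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lint_duplicates(candidates, existing, threshold=0.9):
    """Warn when a proposed term is semantically close to an existing one."""
    warnings = []
    for cand in candidates:
        for known in existing:
            if cosine(embed(cand), embed(known)) >= threshold:
                warnings.append((cand, known))
    return warnings
```

In a CI/CD pipeline, the returned pairs would surface as warnings on the contribution rather than hard failures, leaving the final call to the submitter or the Vocabulary Team.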
For the PRICE/COST table in OMOP we are building, it is simpler in some ways because the vocabulary is limited to what is already in OHDSI, pending a new vocabulary describing all payors in the United States (and other countries).
This is all very exciting!!
@jaanli
Wondering what your thoughts are on using the DuckDB Postgres Scanner via GitHub Actions to query pgvector for stored embeddings?
I am also looking at Replicate, but it is in alpha, and I assume DuckDB performs reasonably well for this.
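For illustration, a query along those lines might look like the sketch below. The connection string, table, and column names are assumptions; the one real constraint is that pgvector's distance operators only evaluate inside Postgres, so the similarity search has to be pushed down rather than scanned into DuckDB.

```sql
-- Sketch: reading pgvector-stored embeddings from DuckDB in a CI job.
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=vocab host=localhost user=ci' AS pg (TYPE postgres);

-- pgvector operators (e.g. <=> for cosine distance) run inside Postgres,
-- so push the query down with postgres_query() instead of scanning the
-- whole table into DuckDB first.
SELECT * FROM postgres_query('pg', '
    SELECT concept_id, concept_name
    FROM contribution_embeddings
    ORDER BY embedding <=> (SELECT embedding
                            FROM contribution_embeddings
                            WHERE concept_id = 123)
    LIMIT 10
');
```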
Describe the problem or desired feature
This feature introduces an automated linter for community contributions, significantly enhancing the quality control and integration process within the system. Through a series of steps, it standardizes and validates contributions using predefined rules, ensuring alignment with specific templates and the common data model.
How to find it
The working branch dbt-contribution is located on a fork of Vocabulary-v5.0.
A minimal working example can be found here for the lint portion.
Expected adjustments
A clear and concise description of what you expect the Vocabulary Team to change. If you have collected information in a spreadsheet, please share it with us.
Additional context: Process
1. The standard process is followed for submitting a community contribution, as outlined by the Vocabulary WG.
2. A GitHub Action on a routine schedule triggers a Service Account to pull submitted contributions through the GDrive API.
3. The contributions are passed to the Contribution Linter in individual packages (1 contribution = 1 folder = 1 package).
4. The Contribution Linter is informed by checklists for each template, the common data model, and other prescribed rules.
5. The contribution is then stored in a temporary DuckDB flat file, which can later be used by dbt models for integration into the CDM.
6. A report is saved to each folder (contribution) detailing where the Contribution Linter made adjustments (e.g. removing trailing whitespace), any cautions (things the submitter or Voc Team should look into), and any errors (things that would prohibit the contribution from inclusion).
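The linting and reporting steps of the process above might be sketched as follows. The check names, report fields, and file layout are illustrative assumptions; a real implementation would also load the cleaned rows into DuckDB for the dbt models and apply the full template checklists.

```python
import json
from pathlib import Path

def lint_contribution(rows):
    """Apply simple normalization rules to one contribution package.

    Returns the cleaned rows plus a report split into the three
    categories described above: adjustments, cautions, and errors.
    """
    report = {"adjustments": [], "cautions": [], "errors": []}
    cleaned = []
    for i, row in enumerate(rows):
        # Adjustment: silently fixable issues, e.g. stray whitespace.
        fixed = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        if fixed != row:
            report["adjustments"].append(f"row {i}: removed leading/trailing whitespace")
        # Error: violations that block inclusion in the CDM.
        if not fixed.get("concept_name"):
            report["errors"].append(f"row {i}: concept_name is required")
        # Caution: something the submitter or Voc Team should review.
        elif len(fixed["concept_name"]) > 255:
            report["cautions"].append(f"row {i}: concept_name longer than 255 characters")
        cleaned.append(fixed)
    return cleaned, report

def save_report(folder, report):
    """Write the linter report back into the contribution's folder."""
    path = Path(folder) / "lint_report.json"
    path.write_text(json.dumps(report, indent=2))
    return path
```

Writing the report back into each contribution's folder keeps the feedback loop inside the same GDrive structure the submitter already uses.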
Additional context: Next Steps