OHDSI / Athena

Web application for distributing and browsing the Standardized Vocabularies for all instances of an OMOP CDM

Review and fix "required" and "default" flags for vocab download #278

Open mik-ohdsi opened 2 years ago

mik-ohdsi commented 2 years ago

A recent forum post highlighted the issue that some vocabularies are set to "required" and as such will always be part of a download bundle and cannot be deselected. In particular among these, the vocabularies Korean Revenue Code, OSM and SOPT seemed a little off to be indispensable for an OMOP CDM.

These are the ones currently marked as "OMOP required":

vocabulary_id_v5: CDM, Cohort Type, Concept Class, Condition Status, Condition Type, Cost, Cost Type, Death Type, Device Type, Domain, Drug Type, Episode, Korean Revenue Code, Meas Type, Metadata, None, Note Type, Observation Type, Obs Period Type, OSM, Plan, Plan Stop Reason, Procedure Type, Relationship, SOPT, Sponsor, Type Concept, UB04 Point of Origin, UB04 Pri Typ of Adm, UB04 Pt dis status, UB04 Typ bill, UCUM, US Census, Visit, Visit Type, Vocabulary

I guess we can remove all the ones with a "Type" in their name except for the new Type Concepts as they have replaced them. The respective concepts per vocabulary ID could probably also be retired.

There was also the notion to mark more vocabularies as default that have standard concepts.

Here are the ones with standard concepts or classifications and their respective count together with a proposal how to set the default and required flags:

| vocabulary_id | description | S / C | row_count | default now | default future | required now | required future |
|---|---|---|---|---|---|---|---|
| ABMS | Provider Specialty (American Board of Medical Specialties) | S | 85 | X | X | | |
| AMT | Australian Medicines Terminology (NEHTA) | S | 6839 | | | | |
| APC | Ambulatory Payment Classification (CMS) | S | 715 | | | | |
| ATC | WHO Anatomic Therapeutic Chemical Classification | C | 6509 | X | X | | |
| BDPM | Public Database of Medications (Social-Sante) | S | 1106 | | | | |
| Cancer Modifier | Diagnostic Modifiers of Cancer (OMOP) | S | 3251 | | | | |
| CDM | OMOP Common DataModel | S | 1045 | X | | X | X |
| CDT | Current Dental Terminology (ADA) | S | 869 | | | | |
| CMS Place of Service | Place of Service Codes for Professional Claims (CMS) | S | 51 | X | X | | |
| Cohort | Legacy OMOP HOI or DOI cohort | C | 78 | | | | |
| Condition Status | OMOP Condition Status | S | 22 | X | | X | X |
| Cost | OMOP Cost | S | 51 | | | X | X |
| CPT4 | Current Procedural Terminology version 4 (AMA) | C | 3492 | X | X | | |
| CPT4 | Current Procedural Terminology version 4 (AMA) | S | 12922 | X | X | | |
| Currency | International Currency Symbol (ISO 4217) | S | 180 | X | X | | |
| CVX | CDC Vaccine Administered CVX (NCIRD) | S | 217 | | | | |
| DA_France | Disease Analyzer France (IQVIA) | S | 6366 | | | | |
| dm+d | Dictionary of Medicines and Devices (NHS) | S | 21071 | | | | |
| DRG | Diagnosis-related group (CMS) | S | 752 | | | | |
| EphMRA ATC | Anatomical Classification of Pharmaceutical Products (EphMRA) | C | 895 | | | | |
| Episode | OMOP Episode | S | 14 | X | | X | X |
| ETC | Enhanced Therapeutic Classification (FDB) | C | 2755 | | | | |
| Ethnicity | OMOP Ethnicity | S | 2 | X | X | | |
| Gemscript | Gemscript (Resip) | S | 64761 | | | | |
| Gender | OMOP Gender | S | 2 | X | | | X |
| GGR | Commented Drug Directory (BCFI) | S | 751 | | | | |
| GRR | Global Reference Repository (IQVIA) | S | 138739 | | | | |
| HCPCS | Healthcare Common Procedure Coding System (CMS) | S | 8427 | X | X | | |
| HemOnc | HemOnc | C | 367 | | | | |
| HemOnc | HemOnc | S | 2015 | | | | |
| HES Specialty | Hospital Episode Statistics Specialty (NHS) | S | 57 | | | | |
| ICD10PCS | ICD-10 Procedure Coding System (CMS) | S | 194874 | | | | |
| ICD9Proc | International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 3 (NCHS) | S | 2223 | X | X | | |
| ICDO3 | International Classification of Diseases for Oncology, Third Edition (WHO) | S | 56972 | | | | |
| Indication | Indications and Contraindications (FDB) | C | 4739 | | | | |
| ISBT | Information Standard for Blood and Transplant 128 Product (ICCBBA) | S | 17336 | | | | |
| ISBT Attribute | Information Standard for Blood and Transplant 128 Product Attribute (ICCBBA) | C | 1657 | | | | |
| JMDC | Japan Medical Data Center Drug Code (JMDC) | S | 1313 | | | | |
| KDC | Korean Drug Code (HIRA) | S | 112 | | | | |
| KNHIS | Korean National Health Information System | S | 3 | | | | |
| Korean Revenue Code | Korean Revenue Code | S | 7 | X | | | |
| LOINC | Logical Observation Identifiers Names and Codes (Regenstrief Institute) | C | 48305 | X | X | | |
| LOINC | Logical Observation Identifiers Names and Codes (Regenstrief Institute) | S | 110702 | X | X | | |
| LPD_Australia | Longitudinal Patient Data Australia (IQVIA) | S | 1620 | | | | |
| MDC | Major Diagnostic Categories (CMS) | S | 26 | | | | |
| MedDRA | Medical Dictionary for Regulatory Activities (MSSO) | C | 76939 | | | | |
| Medicare Specialty | Medicare provider/supplier specialty codes (CMS) | S | 112 | X | X | | |
| Metadata | Metadata | S | 1 | X | | X | X |
| MMI | Modernizing Medicine (MMI) | S | 4 | | | | |
| NAACCR | Data Standards & Data Dictionary Volume II (NAACCR) | S | 26105 | | | | |
| NCIt | NCI Thesaurus (National Cancer Institute) | S | 1899 | | | | |
| NDC | National Drug Code (FDA and manufacturers) | S | 11219 | X | X | | |
| Nebraska Lexicon | Nebraska Lexicon | S | 4187 | | | | |
| NFC | New Form Code (EphMRA) | C | 692 | | | | |
| NUCC | National Uniform Claim Committee Health Care Provider Taxonomy Code Set (NUCC) | S | 674 | X | X | | |
| OMOP Extension | OMOP Extension (OHDSI) | S | 553 | X | X | | |
| OMOP Genomic | OMOP Genomic vocabulary | S | 79791 | | | | |
| OPCS4 | OPCS Classification of Interventions and Procedures version 4 (NHS) | S | 2373 | | | | |
| OSM | OpenStreetMap | S | 203339 | | | X | |
| PCORNet | National Patient-Centered Clinical Research Network (PCORI) | S | 2 | | | | |
| Plan | Health Plan - contract to administer healthcare transactions by the payer, facilitated by the sponsor | S | 11 | X | | X | X |
| Plan Stop Reason | Plan Stop Reason - Reason for termination of the Health Plan | S | 13 | X | | X | X |
| PPI | AllOfUs_PPI (Columbia) | S | 2120 | | | | |
| Provider | OMOP Provider | S | 6 | | | | X |
| Race | Race and Ethnicity Code Set (USBC) | S | 50 | X | X | | |
| Relationship | OMOP Relationship | S | 14 | X | | X | X |
| Revenue Code | UB04/CMS1450 Revenue Codes (CMS) | S | 538 | X | X | | |
| RxNorm | RxNorm (NLM) | C | 35087 | X | X | | |
| RxNorm | RxNorm (NLM) | S | 148139 | X | X | | |
| RxNorm Extension | RxNorm Extension (OHDSI) | S | 1819247 | X | X | | |
| SMQ | Standardised MedDRA Queries (MSSO) | C | 318 | | | | |
| SNOMED | Systematic Nomenclature of Medicine - Clinical Terms (IHTSDO) | S | 540590 | X | X | | |
| SNOMED Veterinary | SNOMED Veterinary | S | 31994 | | | | |
| SOPT | Source of Payment Typology (PHDSC) | S | 162 | X | X | X | |
| SPL | Structured Product Labeling (FDA) | C | 573209 | X | X | | |
| Sponsor | Sponsor - institution or individual financing healthcare transactions | S | 6 | X | X | X | X |
| Type Concept | OMOP Type Concept | S | 79 | X | | X | X |
| UB04 Pri Typ of Adm | UB04 Claim Inpatient Admission Type Code (CMS) | S | 6 | | | X | X |
| UB04 Typ bill | UB04 Type of Bill - Institutional (USHIK) | S | 4 | | | X | X |
| UCUM | Unified Code for Units of Measure (Regenstrief Institute) | S | 922 | X | | X | X |
| UK Biobank | UK Biobank | C | 292 | | | | |
| UK Biobank | UK Biobank | S | 3837 | | | | |
| US Census | United States Census Bureau | S | 13 | | | X | X |
| Visit | OMOP Visit | S | 19 | X | | X | X |
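
To make the effect of the two flags concrete, here is a minimal Python sketch of how a download bundle could be resolved from them. `VocabFlags` and `resolve_bundle` are hypothetical names used only for illustration and are not taken from Athena's code; the sample values are the "default now" and "required now" entries for four rows of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabFlags:
    vocabulary_id: str
    default: bool   # pre-selected, but the user may deselect it
    required: bool  # always bundled, the user cannot deselect it


# "now" flags for a few rows of the proposal table (illustrative subset)
FLAGS = [
    VocabFlags("SNOMED", default=True, required=False),
    VocabFlags("OSM", default=False, required=True),
    VocabFlags("CDM", default=True, required=True),
    VocabFlags("CVX", default=False, required=False),
]


def resolve_bundle(flags, selected=None, deselected=None):
    """Return the sorted vocabulary IDs that end up in the download."""
    selected = set(selected or ())
    deselected = set(deselected or ())
    bundle = set()
    for f in flags:
        if f.required:
            # Required vocabularies ignore any deselection.
            bundle.add(f.vocabulary_id)
        elif f.vocabulary_id in selected:
            bundle.add(f.vocabulary_id)
        elif f.default and f.vocabulary_id not in deselected:
            bundle.add(f.vocabulary_id)
    return sorted(bundle)
```

Under this model, a user who deselects SNOMED and OSM and adds CVX still receives OSM, which is the behaviour the forum post complained about.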

Please review @cgreich and @fdefalco !

Thanks - mik

fdefalco commented 2 years ago

I agree with most of the entries in the table. Ones I would question if they should be required in the future:

US Census UB04 Typ bill UB04 Pri Typ of Adm Sponsor

We could also remove the idea of "Required" in the interest of transparency and have a note appear on the page that a vocabulary is "Highly Recommended" when it is what we currently consider "Required" but still afford the user the opportunity to deselect it.

Then we would only have a boolean for "Default" for each vocabulary that can be edited by the user when creating their vocabulary download.
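
A rough Python sketch of this default-only model, assuming a "Highly Recommended" note is shown instead of blocking the deselection. The set contents and the `build_selection` function are hypothetical and only illustrate the idea; they do not reflect an agreed list.

```python
# Illustrative subset of what is currently treated as "required".
HIGHLY_RECOMMENDED = {"CDM", "Metadata", "Type Concept"}

# Everything that is pre-selected; the user may edit this freely.
DEFAULTS = HIGHLY_RECOMMENDED | {"SNOMED", "RxNorm"}


def build_selection(deselected=(), extra=()):
    """Return (chosen vocabularies, warnings) for one download request."""
    chosen = (DEFAULTS - set(deselected)) | set(extra)
    warnings = [
        f"'{v}' is Highly Recommended; remove it at your own risk."
        for v in sorted(HIGHLY_RECOMMENDED - chosen)
    ]
    return sorted(chosen), warnings
```

Nothing is forced into the bundle here: deselecting CDM is allowed, and the only consequence is a warning the page can display.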

mik-ohdsi commented 2 years ago

Hi, thanks for the input, @fdefalco. I think we have to keep "required" for the very foundational data that a CDM needs in order to function (CDM, Metadata, a couple of others). The ones you listed we would keep in for ease of use (they are small and in most cases needed), as it would not really make sense NOT to load them. @cgreich, can you provide more input? Thanks - Mik

fdefalco commented 2 years ago

Is there a timeline for implementation of this particular feature?

mik-ohdsi commented 2 years ago

I had hoped @cgreich would give us his final "placet". I would then hand over the above list for processing by the vocab team, and it should go to Athena with the next release.

mik-ohdsi commented 2 years ago

bumping up this issue, @cgreich and @fdefalco
What is the verdict? I would also add the CVX vocabulary to default. And we have that funny OMOP supplier vocabulary with one non-standard concept in it... Do we need that?

mik-ohdsi commented 2 years ago

@ssuvorov-fls - could you check whether the new settings above would break anything once they end up in Athena? Can we test run this in a QA instance?

Alexdavv commented 2 years ago

> Korean Revenue Code, OSM and SOPT seemed a little off to be indispensable for an OMOP CDM

I think the unspoken convention was to include everything that goes to a Domain lacking its own table, so that you don't miss the concepts for such "service" things as gender_concept_id, unit_concept_id, modifier_concept_id, route_concept_id, etc. It's not really obvious which vocabularies to pick if you want to add one more table/domain to your CDM. Region_concept_id somehow didn't materialize into a field, but it explains why OSM and US Census are there.

> I guess we can remove all the ones with a "Type" in their name except for the new Type Concepts as they have replaced them

I wouldn't do it, because users who are updating their ETLs from old vocabulary versions would simply lose the concepts that appear in their mappings. I would never do it for the small "service" vocabularies.

> Here are the ones with standard concepts or classifications and their respective count together with a proposal how to set the default and required flags

I don't get the logic behind this. How is Gender more important than Race? And why is Sponsor better than Geography? We need to come up with clear rules.

> There was also the notion to mark more vocabularies as default that have standard concepts

I don't think it's a great choice before we have cleaned up the EAV data. Otherwise, people will start mapping to UKB, PPI and NAACCR. And that is already the case.

mik-ohdsi commented 2 years ago

> I think the unspoken convention was to include everything that goes to a Domain lacking its own table, so that you don't miss the concepts for such "service" things as gender_concept_id, unit_concept_id, modifier_concept_id, route_concept_id, etc. It's not really obvious which vocabularies to pick if you want to add one more table/domain to your CDM. Region_concept_id somehow didn't materialize into a field, but it explains why OSM and US Census are there.

OSM is, however, one of the reasons this whole discussion started... I guess I would still take it out of "required".

> I wouldn't do it, because users who are updating their ETLs from old vocabulary versions would simply lose the concepts that appear in their mappings. I would never do it for the small "service" vocabularies.

Hmm... have we mapped the old type concepts over to the new ones? If so, it would make sense to keep them. But otherwise, aren't they simply useless now and all non-standard?
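
Whether the old type concepts have been mapped can be checked against the vocabulary tables themselves. The following is a minimal sketch using Python's `sqlite3` with an in-memory database: the `concept` and `concept_relationship` tables and the `'Maps to'` relationship follow the OMOP CDM, but the rows are made-up toy data (the concept IDs are not real), and the `LIKE '% Type'` filter is only an assumed way to pick out the old "... Type" vocabularies.

```python
import sqlite3

# For each old "... Type" vocabulary: how many concepts it has, and how many
# of them have a valid 'Maps to' link into the 'Type Concept' vocabulary.
QUERY = """
SELECT c.vocabulary_id,
       COUNT(DISTINCT c.concept_id) AS n_concepts,
       COUNT(DISTINCT CASE WHEN t.concept_id IS NOT NULL
                           THEN c.concept_id END) AS n_mapped
FROM concept c
LEFT JOIN concept_relationship r
       ON r.concept_id_1 = c.concept_id
      AND r.relationship_id = 'Maps to'
      AND r.invalid_reason IS NULL
LEFT JOIN concept t
       ON t.concept_id = r.concept_id_2
      AND t.vocabulary_id = 'Type Concept'
WHERE c.vocabulary_id LIKE '% Type'
GROUP BY c.vocabulary_id
ORDER BY c.vocabulary_id
"""


def make_toy_db():
    """Build an in-memory database with invented rows (not real concept IDs)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE concept (concept_id INTEGER, concept_name TEXT, "
        "vocabulary_id TEXT, standard_concept TEXT)"
    )
    conn.execute(
        "CREATE TABLE concept_relationship (concept_id_1 INTEGER, "
        "concept_id_2 INTEGER, relationship_id TEXT, invalid_reason TEXT)"
    )
    conn.executemany(
        "INSERT INTO concept VALUES (?, ?, ?, ?)",
        [
            (1, "old visit type A", "Visit Type", None),
            (2, "old visit type B", "Visit Type", None),
            (3, "old drug type A", "Drug Type", None),
            (100, "new type concept", "Type Concept", "S"),
        ],
    )
    conn.execute(
        "INSERT INTO concept_relationship VALUES (1, 100, 'Maps to', NULL)"
    )
    return conn


def mapping_coverage(conn):
    """Return (vocabulary_id, n_concepts, n_mapped) per old type vocabulary."""
    return conn.execute(QUERY).fetchall()
```

If `n_mapped` equals `n_concepts` for a vocabulary, existing ETL mappings can be carried forward and keeping the old vocabulary has a purpose; if it is zero, the concepts are dead weight, as suspected above.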

> I don't get the logic behind this. How is Gender more important than Race? And why is Sponsor better than Geography? We need to come up with clear rules.

Well, this is derived a little from how it was before. Gender is really indispensable, whereas Race & Ethnicity is, as we know, US centric... and they are still marked as default, so most people will keep them in their download. They just have a choice to deselect.

> > There was also the notion to mark more vocabularies as default that have standard concepts

> I don't think it's a great choice before we have cleaned up the EAV data. Otherwise, people will start mapping to UKB, PPI and NAACCR. And that is already the case.

Of course we would not follow that notion blindly and hence the above are not marked as default. But you cannot prevent people from selecting them for download, unless we would make them something like license restricted (only not license but something else).

fdefalco commented 2 years ago

The original intent of the discussion was to promote transparency and flexibility in vocabulary download. As it stands, vocabularies that are not listed or selected are included in the download, so for transparency, they should be listed and selected by default. For flexibility, the user can have the option to unselect vocabularies. I'm not sure what benefit preventing a user from unselecting a vocabulary would provide; if you reject defaults, you should be doing so for a well-understood reason. Perhaps a warning on the page that says 'Default vocabularies are selected to provide important concepts to most ETL processes, remove them from the selected vocabularies at your own risk.' :)

mik-ohdsi commented 2 years ago

@cgreich has an even stricter view on this. I think he used the word "dogmatic". Let's hear him out. (Christian, one exception to the rule should be vocabularies that have standard items but are also license restricted such as CDT or ISBT).

fdefalco commented 2 years ago

I think Patrick echoed my concern on transparency here: https://forums.ohdsi.org/t/osm-vocabulary/16303/11

cgreich commented 2 years ago

Are we debating here or there?

fdefalco commented 2 years ago

We are discussing the changes to be made as part of this issue here, informed by the conversation there. I don't think there is any debate regarding the need for transparency of vocabularies that are included in a download. I imagine the remaining debate is whether or not to give the user the ability to control whether 'default' vocabularies are included. My vote is that the user is provided control, with a stern warning about why defaults should be left as is.

cgreich commented 2 years ago

@fdefalco:

Hang on a sec. Right now, the thinking is we have three categories (not two):

The proprietary vocabularies are in the Rest category, since they need to be individually clicked and processed anyway.

We will have to change Athena to always include all standard concepts (easy), and create different sets of recommended vocabularies (North America, Europe, Rest of World maybe). Not a big deal, but will require some work.
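
As a rough illustration of that last point, the sketch below always keeps standard concepts and adds a per-region recommended set on top. Everything in it is an assumption for the sake of the example: the `RECOMMENDED` memberships are invented, `rows_for_download` is a hypothetical function, and "standard" is taken to mean `standard_concept == "S"` as in the OMOP `concept` table.

```python
# Invented regional presets; the real sets would have to be agreed on.
RECOMMENDED = {
    "North America": {"SNOMED", "RxNorm", "LOINC", "CPT4"},
    "Europe": {"SNOMED", "RxNorm", "LOINC", "dm+d"},
    "Rest of World": {"SNOMED", "RxNorm", "LOINC"},
}


def rows_for_download(concepts, region, extra=()):
    """Keep every standard concept, plus all concepts of the wanted vocabularies.

    `concepts` is an iterable of dicts with at least the keys
    'vocabulary_id' and 'standard_concept'.
    """
    wanted = RECOMMENDED[region] | set(extra)
    return [
        c for c in concepts
        if c["standard_concept"] == "S" or c["vocabulary_id"] in wanted
    ]
```

With this rule a standard OSM concept would always be delivered, while non-standard source codes arrive only for the vocabularies in the chosen regional set or those the user adds explicitly.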