cancerDHC / ccdhmodel

CRDC-H model in LinkML, developed by the Center for Cancer Data Harmonization (CCDH)
https://cancerdhc.github.io/ccdhmodel/
BSD 3-Clause "New" or "Revised" License

Determine how enumerations and recommended codes should be handled in the CRDCH model #159

Open gaurav opened 2 years ago

gaurav commented 2 years ago

Given the move to CodeableConcepts as the way in which the CRDCH model handles codes, we should put some thought into how enumerations and recommended codes are handled in the CRDCH model.

At the moment:

  1. The Terminology Service picks up enumerations (i.e. lists of permissible values = PVs) from all the node dictionaries.
  2. The CRDCH model contains information about which CRDCH fields are direct/exact mappings to node fields. For example, CRDCH:BodySite.site is directly mapped to GDC:Case.primary_site.
  3. Some enumerations, such as CancerStageObservationSet.method_type, are defined in the CRDCH model itself.
  4. It was always a little unclear what should happen with enumerations that are not defined in the CRDCH model. For CRDC-H v1.0, we decided to provide some values that could be used for CRDCH:BodySite.site by starting with the PVs for GDC:Case.primary_site as well as the PVs for all the other fields that CRDCH:BodySite.site is mapped to.
    • I think part of the reason for doing this was that Dazhi got a bunch of mappings from GDC fields to caDSR and NCIt terms from NCI EVS, which is helpful in allowing us to use codes (e.g. NCIt:C12666 rather than “Adrenal gland”).
    • I think the reasoning behind keeping this code in the Terminology Service rather than the CRDCH Model generation code (sheet2linkml) was that this data could be seamlessly replaced by manually curated PVs if need be -- e.g. for CRDCH:BodySite.site, we could replace the list of PVs with a list of anatomical sites from, say, UBERON or another ontology or by manually curating a list ourselves.
    • In some of these cases, it would make sense for the Terminology Service to store the list of all possible values (e.g. from the UBERON ontology). In other cases, it would make more sense for the Terminology Service to dynamically query another service, such as the NCI Thesaurus or EMBL-EBI OxO.
  5. The CRDCH model generation code (sheet2linkml) asks the CCDH Terminology Service for the list of PVs for CRDCH fields whose range is given as “CodeableConcept”. It creates an enumeration in the model that stores “recommended” values for a field.
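The aggregation described in steps 2 and 4 above can be sketched roughly as follows. This is only an illustration, not the Terminology Service's actual code: the `EXACT_MAPPINGS` and `NODE_PVS` tables, the `recommended_pvs` function, and every PV other than "Adrenal gland" / NCIt:C12666 are made up for the example.

```python
# Hypothetical exact mappings from a CRDCH field to node dictionary fields.
EXACT_MAPPINGS = {
    "CRDCH:BodySite.site": ["GDC:Case.primary_site", "PDC:Case.primary_site"],
}

# Hypothetical PVs per node field as (label, code) pairs; code is None when
# no NCIt mapping is known. Only the "Adrenal gland" row comes from the issue.
NODE_PVS = {
    "GDC:Case.primary_site": [("Adrenal gland", "NCIt:C12666"), ("Unknown", None)],
    "PDC:Case.primary_site": [("Adrenal Gland", "NCIt:C12666"), ("Not Reported", None)],
}


def recommended_pvs(crdch_field, mappings=EXACT_MAPPINGS, node_pvs=NODE_PVS):
    """Concatenate the PVs of every mapped node field, merging duplicates.

    PVs are merged on their code when one exists, otherwise on the
    case-folded label. The first label seen for a key is the one kept.
    """
    merged = {}
    for node_field in mappings.get(crdch_field, []):
        for label, code in node_pvs.get(node_field, []):
            key = code if code is not None else label.casefold()
            entry = merged.setdefault(
                key, {"label": label, "code": code, "sources": []}
            )
            entry["sources"].append(node_field)
    return list(merged.values())


if __name__ == "__main__":
    for pv in recommended_pvs("CRDCH:BodySite.site"):
        print(pv)
```

Merging on the code rather than the label is one possible choice: it collapses spelling variants such as "Adrenal gland" and "Adrenal Gland" into one entry while still recording which node fields contributed it.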

There are three main problems with this approach:

  1. There isn't a canonical way to indicate "these are recommended codes for this list of CodeableConcepts".
    • We can specify that an enumerated field should be specified as "a list of codings" (see option FHIR_CODING in pv_formula_options), but I don't think that's exactly the same thing. If that were to correctly represent a list of CodeableConcepts, however, then I think there is an upcoming LinkML feature that would allow us to indicate that an enumeration is "open", i.e. the included codes are recommended but not required.
    • Instead, we currently use the LinkML values_from field to indicate the enumeration that we recommend a CodeableConcept field should "use values from". It is not clear if this is the intended use of this field.
  2. It is unclear whether this strategy of "concatenating all PVs from all data dictionaries" really is the best strategy to generate recommended PVs at the moment.
  3. Implementing this requires a complex, circular interplay between sheet2linkml and the Terminology Service: (1) the Terminology Service needs to read the CRDCH model in LinkML to get the mappings from CRDCH fields to node dictionary fields, which it uses to generate lists of PVs, but (2) the CRDCH model in LinkML cannot be fully generated until sheet2linkml can read the enumerations from the Terminology Service. We're currently working around this problem by running sheet2linkml -> terminology server regenerate -> sheet2linkml, but if this strategy is no longer needed, then we can look into alternatives, such as:
    1. Option 1: Get rid of “recommended” values in the CRDCH model entirely -- all users should just query the CCDH Terminology Service for recommended values for CRDCH fields.
    2. Option 2: Move the logic described in step 4 above to the CRDCH generation code -- i.e. rather than querying the CCDH Terminology Service for PVs for “CRDCH:BodySite.site”, query the Terminology Service for PVs for “GDC:Case.primary_site”, “PDC:Case.primary_site”, and so on, and then concatenate the resulting values in some way.
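To make the "recommended but not required" idea from problem 1 concrete, here is a minimal sketch of how a consumer might treat an "open" enumeration. It does not use any LinkML API: the `Coding` class, `check_coding` function, and `RECOMMENDED_SITES` set are hypothetical names, and the `EXAMPLE`/`X1` coding is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coding:
    """A simplified stand-in for one coding inside a CodeableConcept."""
    system: str
    code: str


# Hypothetical recommended codes for CRDCH:BodySite.site.
RECOMMENDED_SITES = {("NCIt", "C12666")}


def check_coding(coding, recommended, is_open=True):
    """Classify a coding as 'recommended', 'allowed', or 'invalid'.

    With an open enumeration, codes outside the recommended set are still
    accepted. With a closed one, they are rejected.
    """
    if (coding.system, coding.code) in recommended:
        return "recommended"
    return "allowed" if is_open else "invalid"


if __name__ == "__main__":
    print(check_coding(Coding("NCIt", "C12666"), RECOMMENDED_SITES))
    print(check_coding(Coding("EXAMPLE", "X1"), RECOMMENDED_SITES))
    print(check_coding(Coding("EXAMPLE", "X1"), RECOMMENDED_SITES, is_open=False))
```

The only difference between the open and closed cases is what happens to a code outside the list, which is why a plain enumeration does not capture the intended meaning.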

Plan to tackle this:

gaurav commented 2 years ago

Related to #161, since part of the benefit of solving this problem would be to ensure that all mapping values are documented.

joeflack4 commented 2 years ago

Personally, I'm leaning towards option 2. It is less work for me, but it's in line with what I think the terminology service should be doing. I feel like it should be as dumb as possible, serving up things that have been pre-calculated. I feel like the better design is to reduce bidirectional dependencies, so to speak, which I think is what we have right now.