cancerDHC / ccdhmodel

CRDC-H model in LinkML, developed by the Center for Cancer Data Harmonization (CCDH)
https://cancerdhc.github.io/ccdhmodel/
BSD 3-Clause "New" or "Revised" License

Finalize IRI identifiers for CRDC-H model as well as nodes #61

Open gaurav opened 3 years ago

gaurav commented 3 years ago

We currently use a number of dummy prefixes in the CCDH model:

prefixes:
  linkml: https://w3id.org/linkml/
  ccdh: https://example.org/ccdh/
  NCIT: http://purl.obolibrary.org/obo/NCIT_
  GDC: http://example.org/gdc/
  PDC: http://example.org/pdc/
  ICDC: http://example.org/icdc/
  HTAN: http://example.org/htan/

We should replace these with actual IRIs.
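
As a rough way to track which prefixes still need real IRIs, here is a minimal Python sketch that flags placeholder entries in the prefix map above. The `placeholder_prefixes` helper and the rule that "placeholder" means a host of `example.org` are assumptions made for illustration; they are not part of the CCDH tooling.

```python
from urllib.parse import urlparse

# Prefix map copied from the model header quoted above.
PREFIXES = {
    "linkml": "https://w3id.org/linkml/",
    "ccdh": "https://example.org/ccdh/",
    "NCIT": "http://purl.obolibrary.org/obo/NCIT_",
    "GDC": "http://example.org/gdc/",
    "PDC": "http://example.org/pdc/",
    "ICDC": "http://example.org/icdc/",
    "HTAN": "http://example.org/htan/",
}

# Assumption: a prefix is a placeholder if its host is example.org.
PLACEHOLDER_HOSTS = {"example.org"}


def placeholder_prefixes(prefixes):
    """Return the sorted names of prefixes that still point at a placeholder host."""
    return sorted(
        name
        for name, iri in prefixes.items()
        if urlparse(iri).hostname in PLACEHOLDER_HOSTS
    )


if __name__ == "__main__":
    for name in placeholder_prefixes(PREFIXES):
        print(f"{name}: {PREFIXES[name]}")
```

Running this lists every prefix except `linkml` and `NCIT`, which already resolve to real namespaces.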

For the CCDH IRIs, we should probably register a ccdh or crdc-h or crdch prefix at w3id.org and use that.

As per the Identifier Recommendations, the CRDC prefix will be at https://w3id.org/crdc/ and will be used in identifiers such as crdc:su0000001 (for a subject), crdc:st000002 (for a study), and so on. So it might make sense to reserve a two-letter code for the model (dm?) and mint property identifiers based on that, but I think we'd prefer e.g. ccdh:BodySite__site over crdc:dm0000431.
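
To make the CURIE-to-IRI relationship concrete, here is a minimal Python sketch of prefix expansion. It assumes only the `crdc` prefix IRI quoted above; the `expand_curie` helper is hypothetical, and the `ccdh` prefix is deliberately left out because its IRI is still undecided.

```python
# Assumption: only the crdc prefix has a settled IRI at this point.
PREFIXES = {"crdc": "https://w3id.org/crdc/"}


def expand_curie(curie, prefixes):
    """Expand a CURIE such as 'crdc:su0000001' into a full IRI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"Cannot expand {curie!r}: unknown or missing prefix")
    return prefixes[prefix] + local_id


if __name__ == "__main__":
    print(expand_curie("crdc:su0000001", PREFIXES))
    print(expand_curie("crdc:dm0000431", PREFIXES))
```

With a human-readable scheme, the same function would expand ccdh:BodySite__site once a real `ccdh` (or `crdch`) IRI is registered and added to the map.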

The node IRIs are primarily a convenience so that we can use LinkML mapping fields, which expect CURIEs. We could ask LinkML for non-CURIE mapping fields and use those instead, or we could try to find actual IRIs that make sense. For example, for Sample.sample_type we can construct the pretty odd IRI https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=sample&anchor=sample_type, which leads to its documentation. We could also ask GDC to mint identifiers for their properties.
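
For reference, the GDC documentation IRI mentioned above can be built mechanically from an entity and property name. This is only a sketch: the `gdc_property_doc_url` helper is hypothetical, and the URL pattern is inferred from the single Sample.sample_type example, so it may not hold for every GDC property.

```python
from urllib.parse import quote

# Base of the GDC Data Dictionary viewer, taken from the example IRI above.
GDC_VIEWER = "https://docs.gdc.cancer.gov/Data_Dictionary/viewer/"


def gdc_property_doc_url(entity, prop):
    """Build the viewer IRI for a GDC property, e.g. ('sample', 'sample_type')."""
    # Assumption: the fragment always follows this pattern.
    return (
        f"{GDC_VIEWER}#?view=table-definition-view"
        f"&id={quote(entity)}&anchor={quote(prop)}"
    )


if __name__ == "__main__":
    print(gdc_property_doc_url("sample", "sample_type"))
```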

We actually have another odd possibility for node properties: many of them are present in the caDSR and the NCI Thesaurus, so instead of GDC:sample.sample_type we could say caDSR:3111302v2.0 or NCIT:C70713.

fragosog commented 2 years ago

I've been looking at the .ttl output, and there's another IRI in there, http://UNKNOWN.org/, which is used in the enumerations. (Maybe it needs to be the same as ccdh's?)

gaurav commented 2 years ago

In the DMH call just now, Matt and Brian mentioned that these should use the standards laid out in the DST publication on identifiers.

gaurav commented 2 years ago

  1. We prefer crdch:Entity.attribute. Entity should be capitalized (which doesn't appear to be the case right now).
  2. We need to deal with versioning as well, so maybe something like https://w3id.org/crdc/v1.1/Entity.attribute?
  3. What about numerical identifiers? e.g. crdc:dm0000431
    • Upsides: versioning not required, labels can be changed over time to improve meaning
    • Downsides: not human-readable, more work maintaining the numerical codes, developers prefer human-readable attribute names
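
To compare the options side by side, here is a minimal Python sketch that builds both identifier styles. The `versioned_iri` and `numeric_curie` helpers are hypothetical, and the zero-padded width of seven digits is an assumption read off the crdc:dm0000431 example rather than a documented rule.

```python
BASE = "https://w3id.org/crdc/"


def versioned_iri(version, entity, attribute):
    """Option 2: a versioned, human-readable IRI with a capitalized entity name."""
    # Capitalize only the first letter so names like 'BodySite' keep their casing.
    entity_name = entity[:1].upper() + entity[1:]
    return f"{BASE}v{version}/{entity_name}.{attribute}"


def numeric_curie(number, code="dm", width=7):
    """Option 3: an opaque numerical CURIE such as crdc:dm0000431."""
    # Assumption: a two-letter code followed by a zero-padded number.
    return f"crdc:{code}{number:0{width}d}"


if __name__ == "__main__":
    print(versioned_iri("1.1", "sample", "sample_type"))
    print(numeric_curie(431))
```

The versioned form changes whenever the version segment changes, while the numeric form stays stable but says nothing about what it identifies, which is the trade-off listed above.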

@cmungall @jmcmurry @majensen Thoughts? We could schedule some time to discuss this on one of our calls, but it'd be great to have initial thoughts in this GitHub issue.

jmcmurry commented 2 years ago

I don't quite understand the options; let's discuss over Slack.