airr-knowledge / issues

Issues and project management for the AKC

How do we handle IDs that point back to originating repository #56

Open bcorrie opened 1 month ago

bcorrie commented 1 month ago

When converting a Repertoire into its constituent parts in the AKC data model, how are we planning to keep track of where the original data came from? At a minimum, for data provenance and reproducibility, the AKC needs to track the originating source of the data.

Currently, when I convert an ADC entity, I add fields to the objects generated from the AKC LinkML, named by prefixing adc_ onto the ADC ID field name. For example:

    "Specimen": {
      "M369-S008": {
        "akc_id": "",
        "specimen_type": "Venipuncture blood samples were collected in K2EDTA-coated vacutainers",
        "tissue": {
          "label": "venous blood",
          "id": "UBERON:0013756"
        },
        "adc_repertoire_id": "5ed6859e99011334ac05e847",
        "adc_sample_processing_id": "5ed6859e99011334ac05e847",
        "adc_data_processing_id": "5ed6859e99011334ac05e847",
        "adc_study_id": "PRJNA628125",
        "adc_subject_id": "7450",
        "adc_sample_id": "M369-S008"
      }
    }

schristley commented 3 weeks ago

ForeignObject is a class I added to represent such objects. The question is how to organize them. For example, do we create a separate slot for each ID, as in your example above with adc_repertoire_id and so on? I feel this is "messy": we would get a proliferation of those slots, all highly customized to the repository being integrated. It would be nice to have something more generic.

I was thinking of something simple, like a source_uris slot that holds them using CURIEs/prefixes. Adjusting your example above:

    "Specimen": {
      "M369-S008": {
        "akc_id": "",
        "specimen_type": "Venipuncture blood samples were collected in K2EDTA-coated vacutainers",
        "tissue": {
          "label": "venous blood",
          "id": "UBERON:0013756"
        },
        "source_uris": [
          "ADC_REPERTOIRE:5ed6859e99011334ac05e847",
          "ADC_SP:5ed6859e99011334ac05e847",
          "ADC_DP:5ed6859e99011334ac05e847"
        ]
      }
    }

but I haven't thought through it completely. What do you think?
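
As a rough illustration of the source_uris idea, here is a minimal Python sketch that folds adc_-prefixed fields into a list of CURIE-style strings. The function name `to_source_uris` and the mapping `ADC_FIELD_TO_PREFIX` are hypothetical and not part of the AKC schema; only the three prefixes shown in the example above are mapped, and any other adc_ field is left untouched because no prefix has been proposed for it.

```python
# Hypothetical mapping from adc_-prefixed field names to CURIE prefixes.
# Only the three prefixes used in the example above are included.
ADC_FIELD_TO_PREFIX = {
    "adc_repertoire_id": "ADC_REPERTOIRE",
    "adc_sample_processing_id": "ADC_SP",
    "adc_data_processing_id": "ADC_DP",
}


def to_source_uris(record, field_to_prefix=ADC_FIELD_TO_PREFIX):
    """Return a copy of record with mapped adc_ fields folded into source_uris."""
    # Keep every field that has no mapping (including unmapped adc_ fields).
    converted = {k: v for k, v in record.items() if k not in field_to_prefix}
    uris = list(converted.get("source_uris", []))
    for field, prefix in field_to_prefix.items():
        value = record.get(field)
        if value:  # skip missing or empty IDs
            uris.append(f"{prefix}:{value}")
    converted["source_uris"] = uris
    return converted


specimen = {
    "akc_id": "",
    "tissue": {"label": "venous blood", "id": "UBERON:0013756"},
    "adc_repertoire_id": "5ed6859e99011334ac05e847",
    "adc_sample_processing_id": "5ed6859e99011334ac05e847",
    "adc_data_processing_id": "5ed6859e99011334ac05e847",
    "adc_study_id": "PRJNA628125",
}

converted = to_source_uris(specimen)
print(converted["source_uris"])
```

The mapping table is the only repository-specific piece in this sketch, so supporting another repository would mean adding entries to it rather than adding new slots to the schema.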

For the other repositories it is straightforward, but for the ADC it is complicated: because we have many repositories, we get a proliferation of prefixes, which goes to my concerns in #32.
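
To make the prefix concern concrete, here is a small Python sketch, assuming each ADC repository would need its own prefix for each ID type. The repository labels (`REPO_A` and so on) and the `ADC_<repo>_<type>` naming pattern are placeholders invented for illustration, not real ADC repository names or an agreed convention.

```python
from itertools import product

# Placeholder repository labels; not real ADC repository names.
repositories = ["REPO_A", "REPO_B", "REPO_C"]
# ID types taken from the prefixes in the example above.
id_types = ["REPERTOIRE", "SP", "DP"]

# One prefix per (repository, ID type) pair under the assumed naming pattern.
per_repo_prefixes = [
    f"ADC_{repo}_{id_type}" for repo, id_type in product(repositories, id_types)
]

print(len(per_repo_prefixes))
```

Under this assumption the number of prefixes is the number of repositories times the number of ID types, so it grows with every repository that joins.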