Storing researcher assigned identifiers

bcorrie commented 3 months ago

When we convert something like an AIRR Study to an AKC Investigation, there are identifiers from the AIRR world (e.g. study_id) that need to be maintained in the AKC. In this case, the AIRR study ID is supposed to be a BioProject-like PID, and Investigation has an archival_id which seems to match, although I am not 100% sure that is the intent.

The other entities such as Participant and Specimen do not have such terms. In the ADC we store both a subject_id and a sample_id. These are actually quite valuable fields as they are typically defined in the study and allow the researcher to map back to findings in the paper. We don't have any "researcher assigned" fields in the AKC model. Almost all the "IDs" are internal IDs.

bcorrie commented 3 months ago

Currently I am storing these in the AKC LinkML that is generated with an adc_ prefix and the field name. For example:

    "Specimen": {
      "TW01A_B_naive": {
        "akc_id": "",
        "tissue": {
          "id": "UBERON:0013756",
          "label": "venous blood"
        },
        "adc_repertoire_id": "2564613624180576746-242ac113-0001-012",
        "adc_data_processing_id": "6414d653-edd2-4d26-be1d-98a82f5e9c98-007",
        "adc_study_id": "PRJNA300878",
        "adc_subject_id": "TW01A",
        "adc_sample_id": "TW01A_B_naive"
      },

bcorrie commented 3 months ago

This is related to #56 and to storing the source repository IDs such as repertoire_id.

schristley commented 3 months ago

Do we need to store them? At least for the example you give, those IDs can be retrieved from the repertoire object.

bcorrie commented 2 weeks ago

I would think the AKC would want to store these. Otherwise a user would need to jump back and forth between the AKC and the ADC to understand the data at the level they would expect when working with the AKC.

If we solve the problem of having URIs (as in https://github.com/airr-knowledge/issues/issues/56) for external objects in external repositories, then yes, we could certainly leave a bunch of this type of info in the ADC repositories. But I would then ask: what is the AKC doing if not integrating this type of data across the repositories at large (ADC, IEDB, IRAD, OGRDB, VDJBASE)?

It doesn't make a lot of sense to me to not store this piece of what I consider pretty critical metadata that describes the investigation in question.

bcorrie commented 2 weeks ago

Let me put it this way. If you didn't have an investigator-assigned name for a Participant, then as far as I can tell, when you provide a list of Participants and the metadata about them, the only thing the user would have to differentiate between two participants is the UUID. I suppose that works, but ugghhh 8-)

schristley commented 2 weeks ago

> Currently I am storing these in the AKC LinkML that is generated with an adc_ prefix and the field name. For example:
>
>     "Specimen": {
>       "TW01A_B_naive": {
>         "akc_id": "",
>         "tissue": {
>           "id": "UBERON:0013756",
>           "label": "venous blood"
>         },
>         "adc_repertoire_id": "2564613624180576746-242ac113-0001-012",
>         "adc_data_processing_id": "6414d653-edd2-4d26-be1d-98a82f5e9c98-007",
>         "adc_study_id": "PRJNA300878",
>         "adc_subject_id": "TW01A",
>         "adc_sample_id": "TW01A_B_naive"
>       },

There are multiple issues involved (as indicated by #56 and #63), and each of these fields has a slightly different mapping.

adc_repertoire_id is meant to be a globally unique ID, which is why it's a ForeignObject.
adc_data_processing_id is needed for provenance, so it is also a ForeignObject.

These two are sufficient to uniquely identify the "source data" from an ADC repository.

adc_study_id is mapped to archival_id. Technically this is a ForeignObject too, but we elevate it to its own slot because we expect it to be mentioned in papers and the like.
adc_subject_id is also likely to be mentioned in papers and/or supplementary material. Participant is a NamedThing, so I'd map this to the name attribute for Participant. This shouldn't be in Specimen. It isn't an identifier per se in the AKC data model; at least the AK will never use it as one, since it has the akc_id for that. It's definitely needed to properly map from AIRR Subject to AKC Participant though.
adc_sample_id is handled like adc_subject_id: I'd map it to the name attribute for the Specimen. In a similar vein, it's definitely needed to properly map from AIRR Sample to AKC Specimen.
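
To make the mapping above concrete, here is a rough Python sketch. It is only an illustration under stated assumptions, not the actual AKC conversion code: the dataclass shapes, the foreign_objects attribute, the source_field/source_id fields, and the split_adc_fields helper are all hypothetical, and akc_id is assumed to be a freshly generated UUID. Only the field names and values in the sample record come from the example earlier in this thread.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ForeignObject:
    # Hypothetical shape: the source repository field and the ID it held.
    source_field: str
    source_id: str


@dataclass
class Investigation:
    akc_id: str
    archival_id: str  # receives adc_study_id


@dataclass
class Participant:
    akc_id: str
    name: str  # receives adc_subject_id (a name, not an AKC identifier)


@dataclass
class Specimen:
    akc_id: str
    name: str  # receives adc_sample_id
    tissue: dict
    foreign_objects: list = field(default_factory=list)


def new_akc_id() -> str:
    # Assumption: AKC internal IDs are UUIDs.
    return str(uuid.uuid4())


def split_adc_fields(record: dict):
    """Route the adc_-prefixed fields of one flattened record to AKC entities."""
    investigation = Investigation(new_akc_id(), record["adc_study_id"])
    participant = Participant(new_akc_id(), record["adc_subject_id"])
    specimen = Specimen(
        akc_id=new_akc_id(),
        name=record["adc_sample_id"],
        tissue=record["tissue"],
        foreign_objects=[
            ForeignObject("repertoire_id", record["adc_repertoire_id"]),
            ForeignObject("data_processing_id", record["adc_data_processing_id"]),
        ],
    )
    return investigation, participant, specimen


# Sample record taken from the Specimen example in this thread.
record = {
    "tissue": {"id": "UBERON:0013756", "label": "venous blood"},
    "adc_repertoire_id": "2564613624180576746-242ac113-0001-012",
    "adc_data_processing_id": "6414d653-edd2-4d26-be1d-98a82f5e9c98-007",
    "adc_study_id": "PRJNA300878",
    "adc_subject_id": "TW01A",
    "adc_sample_id": "TW01A_B_naive",
}

investigation, participant, specimen = split_adc_fields(record)
print(participant.name, specimen.name, investigation.archival_id)
```

With this split, the Specimen no longer carries adc_subject_id or adc_study_id; the researcher-assigned labels live in name, and only the two source-data IDs stay attached as foreign objects.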

Storing researcher assigned identifiers #55