airr-knowledge / issues

Issues and project management for the AKC

Storing researcher assigned identifiers #55

Open bcorrie opened 3 months ago

bcorrie commented 3 months ago

When we convert something like an AIRR Study to an AKC Investigation, there are identifiers from the AIRR world (e.g. study_id) that need to be maintained in the AKC. In this case, the AIRR study ID is supposed to be a BioProject-like PID, and Investigation has an archival_id, which seems to match, although I am not 100% sure that is the intent.

The other entities such as Participant and Specimen do not have such terms. In the ADC we store both a subject_id and a sample_id. These are actually quite valuable fields as they are typically defined in the study and allow the researcher to map back to findings in the paper. We don't have any "researcher assigned" fields in the AKC model. Almost all the "IDs" are internal IDs.

bcorrie commented 3 months ago

Currently I am storing these in the AKC LinkML that is generated with an adc_ prefix and the field name. For example:

    "Specimen": {
      "TW01A_B_naive": {
        "akc_id": "",
        "tissue": {
          "id": "UBERON:0013756",
          "label": "venous blood"
        },
        "adc_repertoire_id": "2564613624180576746-242ac113-0001-012",
        "adc_data_processing_id": "6414d653-edd2-4d26-be1d-98a82f5e9c98-007",
        "adc_study_id": "PRJNA300878",
        "adc_subject_id": "TW01A",
        "adc_sample_id": "TW01A_B_naive"
      },

bcorrie commented 3 months ago

This is related to #56 and storing the source repository IDs such as repertoire_id etc.
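For illustration, here is a minimal sketch of how the source repository IDs could be carried over under an adc_ prefix when building Specimen records. The input field names (repertoire_id, study.study_id, subject.subject_id, sample[].sample_id, data_processing[].data_processing_id) follow the AIRR Repertoire schema. The helper `specimen_from_repertoire`, the choice of the first data_processing entry, and the flat output layout are assumptions modeled on the Specimen example earlier in this thread, not the actual conversion code.

```python
ADC_PREFIX = "adc_"  # prefix used in the Specimen example in this thread


def specimen_from_repertoire(repertoire: dict) -> dict:
    """Build Specimen entries keyed by sample_id, keeping the ADC identifiers.

    Hypothetical helper: emits one entry per sample in the repertoire.
    """
    shared = {
        "repertoire_id": repertoire["repertoire_id"],
        "study_id": repertoire["study"]["study_id"],
        "subject_id": repertoire["subject"]["subject_id"],
    }
    # Assumption: use the first data_processing entry, if there is one.
    processing = repertoire.get("data_processing") or []
    if processing:
        shared["data_processing_id"] = processing[0]["data_processing_id"]

    specimens = {}
    for sample in repertoire["sample"]:
        ids = dict(shared, sample_id=sample["sample_id"])
        record = {"akc_id": ""}  # internal ID, assigned elsewhere
        record.update({ADC_PREFIX + name: value for name, value in ids.items()})
        specimens[sample["sample_id"]] = record
    return specimens


# Identifier values taken from the Specimen example in this thread.
repertoire = {
    "repertoire_id": "2564613624180576746-242ac113-0001-012",
    "study": {"study_id": "PRJNA300878"},
    "subject": {"subject_id": "TW01A"},
    "sample": [{"sample_id": "TW01A_B_naive"}],
    "data_processing": [
        {"data_processing_id": "6414d653-edd2-4d26-be1d-98a82f5e9c98-007"}
    ],
}
specimens = specimen_from_repertoire(repertoire)
```

Keying the output by sample_id mirrors the example, where the researcher-assigned sample name is what a reader of the paper would recognize.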

schristley commented 3 months ago

Do we need to store them? At least for the example you give, those IDs can be retrieved from the repertoire object.

bcorrie commented 2 weeks ago

I would think the AKC would want to store these. Otherwise one would need to jump back and forth between the AKC and the ADC to understand the data at the level a user working with the AKC would expect.

If we solve the problem of having URIs (as in https://github.com/airr-knowledge/issues/issues/56) for external objects in external repositories, then yes, we could certainly leave a bunch of this type of info in the ADC repositories. But I would then ask: what is the AKC doing, if not integrating this type of data across the repositories at large (ADC, IEDB, IRAD, OGRDB, VDJBASE)?

It doesn't make a lot of sense to me to not store this piece of what I consider pretty critical metadata that describes the investigation in question.

bcorrie commented 2 weeks ago

Let me put it this way. If you didn't have an investigator-assigned name for a Participant, then as far as I can tell, when you provide a list of Participants and their metadata, the only thing the user would have to differentiate between two participants is the UUID. I suppose that works, but ugghhh 8-)
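To make the point concrete, here is a small sketch of what a Participant listing would show with and without a researcher-assigned name. The `display_label` helper is hypothetical, the akc_id values are random placeholder UUIDs, and "TW02A" is a made-up second subject. Only adc_subject_id and "TW01A" come from the Specimen example earlier in this thread.

```python
import uuid


def display_label(participant: dict) -> str:
    """Prefer the researcher-assigned ID; fall back to the internal akc_id."""
    return participant.get("adc_subject_id") or participant["akc_id"]


# Two participants with placeholder internal IDs (random UUIDs).
# "TW02A" is a made-up second subject, for illustration only.
participants = [
    {"akc_id": str(uuid.uuid4()), "adc_subject_id": "TW01A"},
    {"akc_id": str(uuid.uuid4()), "adc_subject_id": "TW02A"},
]

# Labels when the researcher-assigned ID is stored with the Participant.
with_names = [display_label(p) for p in participants]

# Labels when only the internal ID is available.
uuid_only = [display_label({"akc_id": p["akc_id"]}) for p in participants]

print(with_names)  # IDs a reader can map back to the paper
print(uuid_only)   # opaque UUIDs only
```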

schristley commented 2 weeks ago

> Currently I am storing these in the AKC LinkML that is generated with an adc_ prefix and the field name. For example:
>
>     "Specimen": {
>       "TW01A_B_naive": {
>         "akc_id": "",
>         "tissue": {
>           "id": "UBERON:0013756",
>           "label": "venous blood"
>         },
>         "adc_repertoire_id": "2564613624180576746-242ac113-0001-012",
>         "adc_data_processing_id": "6414d653-edd2-4d26-be1d-98a82f5e9c98-007",
>         "adc_study_id": "PRJNA300878",
>         "adc_subject_id": "TW01A",
>         "adc_sample_id": "TW01A_B_naive"
>       },

There are multiple issues involved (as indicated by #56 and #63), and each one has a slightly different mapping.

These two are sufficient to uniquely identify the "source data" from an ADC repository.