Open bcorrie opened 3 months ago
Currently I am storing these in the AKC LinkML that is generated with an adc_
prefix and the field name. For example:
"Specimen": {
  "TW01A_B_naive": {
    "akc_id": "",
    "tissue": {
      "id": "UBERON:0013756",
      "label": "venous blood"
    },
    "adc_repertoire_id": "2564613624180576746-242ac113-0001-012",
    "adc_data_processing_id": "6414d653-edd2-4d26-be1d-98a82f5e9c98-007",
    "adc_study_id": "PRJNA300878",
    "adc_subject_id": "TW01A",
    "adc_sample_id": "TW01A_B_naive"
  },
This is related to #56 and to storing the source repository IDs such as repertoire_id, etc.
Do we need to store them? At least for the example you give, those IDs can be retrieved from the repertoire object.
I would think the AKC would want to store these. Otherwise one would need to jump back and forth between the AKC and the ADC to understand the data at the level a user working with the AKC would expect.
If we solve the problem of having URIs (as in https://github.com/airr-knowledge/issues/issues/56) for external objects in external repositories, then yes, we could certainly leave a bunch of this type of info in the ADC repositories. But I would then ask the question what is the AKC doing if not integrating this type of data across the repositories at large (ADC, IEDB, IRAD, OGRDB, VDJBASE).
It doesn't make a lot of sense to me to not store this piece of what I consider pretty critical metadata that describes the investigation in question.
Let me put it this way. If you didn't have an investigator-assigned name for a Participant, then as far as I can tell, when you provide a list of Participants and the metadata about them, the only thing the user would have to differentiate between two participants is the UUID. I suppose that works, but ugghhh 8-)
There are multiple issues involved (as indicated by #56 and #63), and each one has a slightly different mapping.
adc_repertoire_id is meant to be a globally unique id, so that's why it's a ForeignObject.
adc_data_processing_id is needed for provenance, so it is also a ForeignObject.
These two are sufficient to uniquely identify the "source data" from an ADC repository.
adc_study_id is mapped to archival_id. Technically we can say this is a ForeignObject too, but we elevate its status to its own slot because we expect it to be mentioned in papers and such.
adc_subject_id is also likely to be mentioned in papers and/or supplementary material. Participant is a NamedThing, so I'd map this to the name attribute for Participant. This shouldn't be in Specimen. It isn't an identifier per se in the AKC data model; at least the AK will never use it as one, since it has the akc_id to do that. It's definitely needed to properly map from AIRR Subject to AKC Participant though.
adc_sample_id is handled likewise: as with adc_subject_id, I'd map this to the name attribute for the Specimen. In a similar vein, it's definitely needed to properly map from AIRR Sample to AKC Specimen.
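The mapping proposed in this comment can be sketched roughly as follows. This is only an illustration, not the actual AKC LinkML or its generated classes: the dataclasses, the `source`/`field_name`/`value` attributes on `ForeignObject`, and the `map_adc_fields` helper are all hypothetical names made up for the example. Only the slot names discussed above (`name`, `archival_id`, `akc_id`) and the `adc_` field names come from the thread.

```python
from dataclasses import dataclass


@dataclass
class ForeignObject:
    # Hypothetical shape: the thread only says these IDs "are ForeignObject".
    source: str
    field_name: str
    value: str


@dataclass
class NamedThing:
    akc_id: str = ""   # internal AKC identifier, left empty as in the example
    name: str = ""     # researcher-assigned label, never used as an identifier


@dataclass
class Participant(NamedThing):
    pass


@dataclass
class Specimen(NamedThing):
    pass


@dataclass
class Investigation:
    akc_id: str = ""
    archival_id: str = ""  # e.g. a BioProject accession


def map_adc_fields(record: dict) -> dict:
    """Split a flat adc_-prefixed record into the slots proposed above."""
    foreign_objects = [
        ForeignObject("ADC", key.removeprefix("adc_"), record[key])
        for key in ("adc_repertoire_id", "adc_data_processing_id")
        if record.get(key)
    ]
    return {
        "investigation": Investigation(archival_id=record.get("adc_study_id", "")),
        "participant": Participant(name=record.get("adc_subject_id", "")),
        "specimen": Specimen(name=record.get("adc_sample_id", "")),
        "foreign_objects": foreign_objects,
    }


# Values taken from the Specimen example earlier in the thread.
flat_record = {
    "adc_repertoire_id": "2564613624180576746-242ac113-0001-012",
    "adc_data_processing_id": "6414d653-edd2-4d26-be1d-98a82f5e9c98-007",
    "adc_study_id": "PRJNA300878",
    "adc_subject_id": "TW01A",
    "adc_sample_id": "TW01A_B_naive",
}
mapped = map_adc_fields(flat_record)
print(mapped["participant"].name, mapped["specimen"].name)
```

Where the two ForeignObject entries should attach (Specimen, an assay, or somewhere else) is not settled in this comment, so the sketch simply returns them alongside the other entities.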
When we convert something like an AIRR Study to an AKC Investigation, there are identifiers from the AIRR world (e.g. study_id) that need to be maintained in the AKC. In this case, the AIRR study ID is supposed to be a BioProject-like PID, and Investigation has an archival_id, which seems to match, although I am not 100% sure that is the intent.
The other entities such as Participant and Specimen do not have such terms. In the ADC we store both a subject_id and a sample_id. These are actually quite valuable fields as they are typically defined in the study and allow the researcher to map back to findings in the paper. We don't have any "researcher assigned" fields in the AKC model. Almost all the "IDs" are internal IDs.