what's the scope of the identifiers in the germline schema objects?

schristley commented 2 years ago

This is important in relation to the ADC, both for the API, which will allow queries, and for repositories that store multiple studies that use multiple germline sets.

[x] germline_set_id. I believe this should be a globally unique identifier. It's important that this identifier be resolvable, i.e. that there is a clear procedure to retrieve the germline set data from its id. Services that manage and generate identifiers should probably use the decentralized identifier standard. Individuals who wish to publish their own germline sets should likely use such a service so they can be provided with a resolvable, persistent identifier.
[x] allele_description_id. Is this only for internally linking GermlineSet and AlleleDescription? If so, are ADC repositories allowed to assign whatever value they wish to ensure uniqueness? Also, this implies that GermlineSet and AlleleDescription data must always be kept "together", otherwise the linkage can be lost.
[x] sequence_id. Is this only for internally linking in AlleleDescription?
[x] sequence_delineation_id. Same as sequence_id?
[x] receptor_genotype_set_id and receptor_genotype_id?
[x] mhc_genotype_set_id and mhc_genotype_id?
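
To make the linkage concern around allele_description_id concrete, here is a minimal sketch. It assumes a simplified, hypothetical layout in which a GermlineSet record lists the allele_description_id values it uses and the AlleleDescription records are stored separately. The field name `allele_description_ids` and the function `resolve_alleles` are illustrative only and are not taken from the schema.

```python
def resolve_alleles(germline_set, allele_descriptions):
    """Return (found, missing) for the allele ids referenced by a germline set.

    Assumption: 'allele_description_ids' is a hypothetical list field used
    here only to illustrate an internal link between the two objects.
    """
    by_id = {a["allele_description_id"]: a for a in allele_descriptions}
    found, missing = [], []
    for ref_id in germline_set["allele_description_ids"]:
        if ref_id in by_id:
            found.append(by_id[ref_id])
        else:
            # The id only has meaning next to its AlleleDescription records;
            # once they are separated, the link cannot be followed.
            missing.append(ref_id)
    return found, missing


# Hypothetical data: the set references two alleles, but only one travelled with it.
germline_set = {"germline_set_id": "set-1", "allele_description_ids": ["a1", "a2"]}
alleles = [{"allele_description_id": "a1", "label": "hypothetical allele 1"}]

found, missing = resolve_alleles(germline_set, alleles)
print(len(found), missing)
```

If the id is only an internal linker, anything in `missing` is unrecoverable, which is why the question of a resolvable identifier matters.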

Additional questions regarding the ADC API:

[x] What query endpoints should we have? A germline endpoint seems obvious. An allele endpoint?
[x] If germline_set_id is a globally unique identifier, then the API should be straightforward.
[x] For an allele endpoint, what is the identifier that will return the unique record?

javh commented 2 years ago

sequence_id is used as the primary key across objects for any observed sequence. Some of these _id fields seem functionally redundant with sequence_id. E.g., RearrangedSequence.repository_id and UnrearrangedSequence.assembly_id may be more appropriate as sequence_id.

germline_set_id would be what populates DataProcessing.germline_database, correct?

williamdlees commented 2 years ago

RearrangedSequence.repository_id would be, for example, the accession id in GenBank. We don't have control over its value. The UnrearrangedSequence fields such as assembly_id match this, in that they reflect the common way the sequence is referred to in a database such as Ensembl.

Keep separate id and ref fields for germline and allele_description: ids stay in as internal linkers, and ref is an external CURIE-type reference. Add a ref to AlleleDescription.

williamdlees commented 2 years ago

As agreed in the standards call on the 21st:

the ids are unique identifiers
refs are CURIE-style references
both GermlineSet and AlleleDescription should have refs, so that they can be queried via an API
having reviewed the schema, I propose to retain all defined unique identifiers, as they will be helpful in composing and parsing the files (for example to identify repeated use of the same values)
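
As a rough illustration of the id/ref split agreed above, the sketch below splits a CURIE-style reference into prefix and local id, and flags id values that are reused within a file. The function names `split_curie` and `repeated_ids` and the `EXAMPLE` prefix are assumptions made for this example; nothing here is part of the AIRR library.

```python
from collections import Counter


def split_curie(ref):
    """Split a CURIE-style reference 'prefix:local_id' into its two parts."""
    prefix, sep, local_id = ref.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE-style reference: {ref!r}")
    return prefix, local_id


def repeated_ids(records, id_field):
    """Return the id values that occur more than once in a list of records."""
    counts = Counter(record[id_field] for record in records)
    return sorted(value for value, n in counts.items() if n > 1)


# 'EXAMPLE' is a hypothetical prefix, not a registered namespace.
print(split_curie("EXAMPLE:12345"))

records = [{"sequence_id": "s1"}, {"sequence_id": "s2"}, {"sequence_id": "s1"}]
print(repeated_ids(records, "sequence_id"))
```

The ref is meant to be followed outside the file, while the id only needs to be checked for consistent use inside it.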

javh commented 2 years ago

Seems good, so this means:

RearrangedSequence:sequence_id - clarify wording and x-airr attributes.
RearrangedSequence:repository_id - rename to sequence_ref.
UnrearrangedSequence:sequence_id - clarify wording and x-airr attributes.
UnrearrangedSequence:assembly_id - rename to sequence_ref.
And so on...

Correct?

Note, this also defines what _ref would mean in #347 and excludes _ref from the suffix options for the ADC extension solution in #589. Which I don't personally have any problem with.
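
For anyone holding data in the old layout, the renames listed above could be applied mechanically. This is only a sketch under the assumption that records are plain dictionaries keyed by field name; `RENAMES` and `apply_renames` are hypothetical names, and the mapping covers just the two renames spelled out in this comment.

```python
# Renames proposed in this thread: object name -> {old field: new field}.
RENAMES = {
    "RearrangedSequence": {"repository_id": "sequence_ref"},
    "UnrearrangedSequence": {"assembly_id": "sequence_ref"},
}


def apply_renames(object_name, record):
    """Return a copy of record with the proposed field renames applied.

    Fields that are not in the mapping, such as sequence_id, are kept as-is.
    """
    mapping = RENAMES.get(object_name, {})
    return {mapping.get(key, key): value for key, value in record.items()}


# Hypothetical record; the accession value is a placeholder, not a real one.
old = {"sequence_id": "s1", "repository_id": "PLACEHOLDER_ACCESSION"}
print(apply_renames("RearrangedSequence", old))
```

A record that already contains both the old and the new field name would need explicit handling, which this sketch does not attempt.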

williamdlees commented 2 years ago

I have added longer descriptions and rationalised the use of _id and _ref. Not sure what you are asking for in terms of x-airr: the only attribute used is nullable in v2, and its use seems consistent to me.

javh commented 2 years ago

Because this will initially be an experimental release we might be able to push back sorting out the proper x-airr tags. But, to answer your question, if you look at Rearrangement:sequence_id you will see:

sequence_id:
  x-airr:
    adc-query-support: true
    identifier: true

These attributes denote whether the field is required and what its purpose is in the AIRR Data Model and ADC contexts. We'll need to sort that out for the various _id fields at some point. Maybe _ref too, but I don't know how they'll fit in.
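
To show how such tags could be consumed once they are sorted out, here is a small sketch that lists the fields carrying a given x-airr attribute. It assumes the schema has already been loaded into a plain dictionary. The `sequence_id` entry mirrors the fragment quoted above, while the `sequence_ref` entry and the function name `fields_with_tag` are hypothetical.

```python
# Schema properties as a plain dict, as they might look after loading the YAML.
schema_fragment = {
    "sequence_id": {"x-airr": {"adc-query-support": True, "identifier": True}},
    # Hypothetical entry: how a _ref field would be tagged is still undecided.
    "sequence_ref": {"x-airr": {"nullable": True}},
}


def fields_with_tag(properties, tag):
    """Return the names of fields whose x-airr block sets `tag` to true."""
    return sorted(
        name
        for name, spec in properties.items()
        if spec.get("x-airr", {}).get(tag) is True
    )


print(fields_with_tag(schema_fragment, "identifier"))
print(fields_with_tag(schema_fragment, "adc-query-support"))
```

A check like this would make it easy to see which _id fields still lack an `identifier` tag.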

williamdlees commented 2 years ago

OK. I agree that sounds like a longer-term issue that may involve the creation of some new tags.

williamdlees commented 1 year ago

Closing this as all issues apart from long-term use of x-airr tags were addressed. Will raise a new issue for the x-airr tags.

airr-community/airr-standards, issue #562