Closed by schristley 1 year ago
sequence_id is used as the primary key across objects for any observed sequence. Some of these _id fields seem functionally redundant with sequence_id. E.g., RearrangedSequence.repository_id and UnrearrangedSequence.assembly_id may be more appropriate as sequence_id.
germline_set_id would be what populates DataProcessing.germline_database, correct?
RearrangedSequence.repository_id would be, for example, the accession id in GenBank. We don't have control over its value.
The UnrearrangedSequence fields such as assembly_id match this, in that they reflect the common way the sequence is referred to in a database such as Ensembl.
As agreed in the standards call on the 21st: keep separate id and ref fields for germline and allele_description. Keep ids in as the internal linker; ref is the external CURIE type. Add a ref to AlleleDescription.
Seems good, so this means:
- RearrangedSequence:sequence_id - clarify wording and x-airr attributes.
- RearrangedSequence:repository_id - rename to sequence_ref.
- UnrearrangedSequence:sequence_id - clarify wording and x-airr attributes.
- UnrearrangedSequence:assembly_id - rename to sequence_ref.
Correct?
Note, this also defines what _ref would mean in #347 and excludes _ref from the suffix options for the ADC extension solution in #589, which I don't personally have any problem with.
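To make the _id versus _ref split concrete, here is a minimal sketch, assuming that a _ref value is a CURIE-style string of the form prefix:local_id and that a _id value is only meaningful inside the local data set. The SequenceRecord class, the split_curie helper and the example values are all hypothetical; none of them are part of the AIRR schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def split_curie(ref: str) -> Tuple[str, str]:
    """Split a CURIE-style reference 'prefix:local_id' into its two parts."""
    prefix, sep, local_id = ref.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE-style reference: {ref!r}")
    return prefix, local_id


@dataclass
class SequenceRecord:
    # Hypothetical container for illustration, not an AIRR schema object.
    sequence_id: str                    # internal linker, assigned locally
    sequence_ref: Optional[str] = None  # external reference, CURIE-style


# Made-up values: the prefix and accession below are placeholders.
record = SequenceRecord(sequence_id="seq-0001", sequence_ref="EXAMPLE:ABC123")
prefix, local_id = split_curie(record.sequence_ref)
```

The point of the split is that sequence_id can be anything the data producer likes, while sequence_ref carries a value controlled by the external database.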
I have added longer descriptions and rationalised the use of _id and _ref. Not sure what you are asking for in terms of x-airr: the only attribute used is nullable in v2, and its use seems consistent to me.
Because this will initially be an experimental release, we might be able to push back sorting out the proper x-airr tags. But, to answer your question, if you look at Rearrangement:sequence_id you will see:

    sequence_id:
        x-airr:
            adc-query-support: true
            identifier: true

These denote whether the field is required and what its purpose is in the AIRR Data Model and ADC contexts. We'll need to sort that out for the various _id fields at some point. Maybe _ref too, but I don't know how they'll fit in.
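As a rough illustration of how such tags could be audited across the _id fields, the sketch below walks a schema fragment held as a plain Python dict and lists the fields carrying a given x-airr flag. The sequence_id entry mirrors the snippet quoted above; the sequence_ref entry and its nullable flag are assumptions made up for the example, and fields_with_flag is a hypothetical helper.

```python
# Schema fragment as a plain dict. The sequence_ref entry is an assumption
# for illustration only; it is not taken from the actual schema.
schema = {
    "sequence_id": {
        "x-airr": {"adc-query-support": True, "identifier": True},
    },
    "sequence_ref": {
        "x-airr": {"nullable": True},
    },
}


def fields_with_flag(schema: dict, flag: str) -> list:
    """Return the sorted names of fields whose x-airr block sets `flag` to true."""
    return sorted(
        name
        for name, spec in schema.items()
        if spec.get("x-airr", {}).get(flag) is True
    )


identifier_fields = fields_with_flag(schema, "identifier")
```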
OK. I agree that sounds like a longer-term issue that may involve the creation of some new tags.
Closing this as all issues apart from long-term use of x-airr tags were addressed. Will raise a new issue for the x-airr tags.
This is important in relation to the ADC, both for the API, which will allow queries, and for repositories that store multiple studies that use multiple germline sets.
- germline_set_id. I believe this should be a globally unique identifier. It's important that this identifier be resolvable, i.e. there is a clear procedure to retrieve the germline set data from its id. Services that manage and generate identifiers should probably use the decentralized identifier standard. Individuals that wish to publish their own germline sets should likely use such a service so they can be provided with a resolvable, persistent identifier.
- allele_description_id. Is this only for internally linking GermlineSet and AlleleDescription? If so, are ADC repositories allowed to assign whatever value they wish to ensure uniqueness? Also, this implies that GermlineSet and AlleleDescription data must always be kept "together", otherwise the linkage can be lost.
- sequence_id. Is this only for internally linking in AlleleDescription?
- sequence_delineation_id. Same as sequence_id?
- receptor_genotype_set_id and receptor_genotype_id?
- mhc_genotype_set_id and mhc_genotype_id?
Additional questions regarding the ADC API:
- A germline end point seems obvious. An allele end point?
- If germline_set_id is a globally unique identifier, then the API should be straightforward.
- For an allele end point, what is the identifier that will return the unique record?
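The concern about GermlineSet and AlleleDescription data having to travel together can be sketched as a simple consistency check. This assumes allele_description_id is purely an internal linker; the allele_description_ids field name, the record layout and the example values are hypothetical and only stand in for however the set actually lists its members.

```python
def missing_allele_descriptions(germline_set: dict, allele_descriptions: list) -> list:
    """Return ids referenced by the germline set that are absent from the
    allele description records supplied alongside it."""
    available = {a["allele_description_id"] for a in allele_descriptions}
    return [
        ref_id
        for ref_id in germline_set["allele_description_ids"]
        if ref_id not in available
    ]


# Made-up example: the set refers to three descriptions, only two are shipped.
germline_set = {
    "germline_set_id": "example-set-1",
    "allele_description_ids": ["ad-1", "ad-2", "ad-3"],
}
allele_descriptions = [
    {"allele_description_id": "ad-1"},
    {"allele_description_id": "ad-2"},
]

broken_links = missing_allele_descriptions(germline_set, allele_descriptions)
```

If the ids are only locally unique, a non-empty result like this cannot be repaired by looking the missing records up elsewhere, which is why the question of a resolvable identifier matters for an allele end point.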