airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

what's the scope of the identifiers in the germline schema objects? #562

Closed schristley closed 1 year ago

schristley commented 2 years ago

This is important in relation to the ADC, both for the API, which will allow queries, and for repositories that store multiple studies using multiple germline sets.

Additional questions regarding the ADC API:

javh commented 2 years ago

sequence_id is used as the primary key across objects for any observed sequence. Some of these _id fields seem functionally redundant with sequence_id. E.g., RearrangedSequence.repository_id and UnrearrangedSequence.assembly_id may be more appropriate as sequence_id.

germline_set_id would be what populates DataProcessing.germline_database, correct?

williamdlees commented 2 years ago

RearrangedSequence.repository_id would be, for example, the accession ID in GenBank. We don't have control over its value. The UnrearrangedSequence fields such as assembly_id match this, in that they reflect the common way the sequence is referred to in a database such as Ensembl.

Keep separate id and ref fields for germline and allele_description: the ids stay in as internal linkers, while the ref is an external, CURIE-type identifier. Add a ref to AlleleDescription.
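To make the id/ref split concrete, here is a minimal Python sketch. It assumes that a _ref value is a CURIE of the form prefix:local_id and that _id values only need to be unique within a single document. The record layout, the EXAMPLE prefix, the allele_description_ids list, and the helper names are all hypothetical and are not taken from the AIRR schema.

```python
# Hypothetical illustration only: the record layout, the "EXAMPLE" prefix and
# the helper names are assumptions for this sketch, not the AIRR schema.

def split_curie(ref: str) -> tuple[str, str]:
    """Split a CURIE such as 'PREFIX:local_id' into (prefix, local_id)."""
    prefix, sep, local_id = ref.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {ref!r}")
    return prefix, local_id

# _id values act as internal linkers: they only need to be unique in this document.
allele_descriptions = {
    "ad-001": {
        "allele_description_id": "ad-001",
        # _ref is the external identifier, meaningful outside this document.
        "allele_description_ref": "EXAMPLE:IGHV1-2*01",
    },
}

germline_set = {
    "germline_set_id": "gs-001",
    "germline_set_ref": "EXAMPLE:Human_IGH.1",
    "allele_description_ids": ["ad-001"],  # hypothetical linking field
}

def resolve_alleles(gset: dict, descriptions: dict) -> list[dict]:
    """Follow the internal _id links from a germline set to its allele descriptions."""
    return [descriptions[i] for i in gset["allele_description_ids"]]
```

Under these assumptions, a repository holding several studies could reuse the same internal id in different documents without conflict, while the ref stays globally meaningful.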

williamdlees commented 2 years ago

As agreed in the standards call on the 21st:

javh commented 2 years ago

Seems good, so this means:

Correct?

Note, this also defines what _ref would mean in #347 and excludes _ref from the suffix options for the ADC extension solution in #589, which I don't personally have any problem with.

williamdlees commented 2 years ago

I have added longer descriptions and rationalised the use of _id and _ref. Not sure what you are asking for in terms of x-airr: the only attribute used is nullable in v2, and its use seems consistent to me.

javh commented 2 years ago

Because this will initially be an experimental release, we might be able to push back sorting out the proper x-airr tags. But, to answer your question, if you look at Rearrangement:sequence_id you will see:

sequence_id:
  x-airr:
    adc-query-support: true
    identifier: true

These tags denote whether the field is required and what its purpose is in the AIRR Data Model and ADC contexts. We'll need to sort that out for the various _id fields at some point. Maybe _ref too, but I don't know how they'll fit in.
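As a rough illustration of how such tags could be audited across the _id fields, the sketch below scans a schema fragment held as a plain Python dict and lists the fields that carry a given x-airr flag. The sequence_id entry mirrors the snippet above. The germline_set_id entry, its nullable value, and the fields_with_flag helper are assumptions made up for the example.

```python
# Schema fragment as a plain dict. The sequence_id entry mirrors the snippet
# above; the germline_set_id entry is a hypothetical placeholder.
schema_fragment = {
    "sequence_id": {
        "x-airr": {"adc-query-support": True, "identifier": True},
    },
    "germline_set_id": {
        "x-airr": {"nullable": True},  # assumed value, for illustration only
    },
}

def fields_with_flag(schema: dict, flag: str) -> list[str]:
    """Return the sorted names of fields whose x-airr block sets `flag` to True."""
    return sorted(
        name
        for name, spec in schema.items()
        if spec.get("x-airr", {}).get(flag) is True
    )

# Fields tagged as identifiers, and _id fields that are not tagged yet.
identifiers = fields_with_flag(schema_fragment, "identifier")
untagged_ids = [
    name for name in schema_fragment
    if name.endswith("_id") and name not in identifiers
]
```

In this toy fragment, untagged_ids would flag germline_set_id as an _id field that does not yet have an identifier tag, which is the kind of gap discussed here.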

williamdlees commented 2 years ago

OK. I agree that sounds like a longer-term issue that may involve the creation of some new tags.

williamdlees commented 1 year ago

Closing this as all issues apart from long-term use of x-airr tags were addressed. Will raise a new issue for the x-airr tags.