ga4gh-beacon / beacon-v2-Models

Models that leverage the Beacon Framework v2
Apache License 2.0
4 stars 7 forks source link

VRS integration into genomicVariations default schema #72

Closed mbaudis closed 2 years ago

mbaudis commented 2 years ago

This is an attempt to allow the use of VRS variation classes, while also keeping an updated version of the current Beacon schema (using VRS location, which was agreed upon) as an option. This proposal tries to address feedback from the PRC process.

All the additional properties beyond the definition of the variant itself (annotations, caseLevel data etc.) are kept in place.

The change here introduces the root level variation property but other options could be possible.

mrueda commented 2 years ago

Hi Michael, just getting into this now...there are terms in "VRS.location" that are easily transferable (e.g., interval or type), but what about others such as sequence_id?

mbaudis commented 2 years ago

@mrueda VRS does not require computed identifiers, only that CURIEs are being used. See e.g. https://vrs.ga4gh.org/en/stable/impl-guide/example.html . So sequenceId should usually be a refseq id, which then also obviates the need for the definition of a genome edition ("sequenceId": "refseq:NC_000001.10" for chr1, GRCh37).

mbaudis commented 2 years ago

As discussed today (and in several conversations, also with @ahwagner) this seems to be the best way to go ATM. I'm +1 on merge, but leave the step to @jrambla / @MrRobb .

Tom-Shorter commented 2 years ago

I believe that this is a good change that has to be made in one form or another, I'll wait to see what @jrambla says before approving this PR though but I can't see any problems with it currently.

I echo the sentiments made by @mrueda about the mapping of current beacon terms onto VRS. Almost all of these are 1-1 mappings but as pointed out sequence_id is created by a combination of current fields.

I think it's worth taking some time to make a document with all VRS fields and then how these are created from the current beacon fields. Not only will this ensure we are all happy that everything that beacon offers can be handled by VRS it also makes the life of implementers, who need to make the changes, easier as they wont have to do the mappings themselves. This document might also be useful as part of the re-submission to show how Beacon has been changed to better align with external standards.

The below is related but also probably a separate issue

Do we also need to think about the beacons which use different names/identifiers for chromosomes? Some people wont store the refseq id of chromosomes, instead just stating they use (chr)1 of GrCH3X which could be turned into a CURIE along the lines of chr:1 or chromosome:1 or grch37:1. This causes no problems when querying a single beacon instance from a interface designed to query that beacon but as soon as you start querying across multiple beacons this becomes an issue.

I would recommend the inclusion of a statement such as:

Beacon recommends the use of refseq for identifying chromosomes.

This problem will likely appear in more places than just chromosome.

Beacon has done something similar before, in beacon v1 genomic positions were denoted in 0 based half closed (or something similar) notation and this was stated in the documentation so implementers were aware they would need to do some form of conversion to fit the query to their data.

If anyone agrees with this then we should split this off into another issue but if it's not of concern then I'm happy to leave it here.

edit:

just checked and the spec does recommend use of refseq within the description field so this can probably be ignored but it's still worth keeping in mind.

mbaudis commented 2 years ago

@Tom-Shorter Thanks for the elaboration - indeed, this is a "we have to be more prescriptive due to alignment with external standards" issue. But this is on the model side, so supporting clean data delivery... I would argue that we should recommend & document the CURIE-format refseq-id or alternative ga4gh/VRS identifier use.

(The alternative would be to keep the Position object in its previous form for the LegacyVariation while keeping the suggested change to a oneOf root variation => It probably would have made sense to keep the 2 changes for separate discussion.)

jrambla commented 2 years ago

+1 to the mapping documentation wisely suggested by Tom. Part of the conversation above lies in the fundamental design decision on: As Beacon is a discovery but also an entry tool for GA4GH and data sharing, to which end of the "easy vs standard" (or "easy entry vs. precise discovery") continuum it must be shifting. So far, I'd favored "easy" as I believe that using powerful standards comes naturally once you have started to share. Of course that is having a cost on the fostering (enforcing?) standards end of the continuum. Beacon has plenty of such "debates" as it is in a centric position. They are welcome for that reason ;-)

jrambla commented 2 years ago

@mbaudis, naive question: SequenceStateis to be deprecated in VRS2.0. Have we discussed that already?