ga4gh-beacon / specification

GA4GH Beacon specification.
Apache License 2.0

About assemblies #270

Open teemukataja opened 5 years ago

teemukataja commented 5 years ago

This is related to https://github.com/ga4gh-beacon/specification/issues/222

Currently the specification states that assemblyId should be given in GRCh format. But what if a dataset that is older than GRCh is shared via Beacons, and isn't sequenced using an assembly that is directly translatable to modern assemblies? In beacon-python we started to use the regex ^((GRCh|hg)[0-9]+([.]?p[0-9]+)?)$ for assemblyId validation, which allows the following formats:

GRCh37
GRCh37p13
GRCh37.p13
hg19
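
For illustration, the regex above can be exercised with a small Python sketch. The pattern is copied verbatim from this comment; the function name `is_valid_assembly_id` is hypothetical and not taken from beacon-python, so treat this as a rough check rather than the project's actual validation code.

```python
import re

# Pattern quoted in the comment above, unchanged.
ASSEMBLY_ID_RE = re.compile(r"^((GRCh|hg)[0-9]+([.]?p[0-9]+)?)$")


def is_valid_assembly_id(assembly_id: str) -> bool:
    """Return True if assembly_id matches the pattern (case-sensitive).

    Hypothetical helper name, used only for this sketch.
    """
    return ASSEMBLY_ID_RE.fullmatch(assembly_id) is not None


if __name__ == "__main__":
    for candidate in ["GRCh37", "GRCh37p13", "GRCh37.p13", "hg19", "NCBI36", "grch37"]:
        print(candidate, is_valid_assembly_id(candidate))
```

Note that the pattern is case-sensitive, rejects pre-GRC names such as NCBI36, and also accepts a patch suffix on hg names (for example hg19.p13), which may or may not be intended.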

I found out that the hg notation can be used to some extent, as it has a translation for both NCBI and GRC assemblies. Are there other common assemblies that are used and should be supported? Is there a reason that only GRC notation should be enforced, or should we broaden the allowed assemblies?

I believe @mbaudis might have some knowledge on this matter?

cyenyxe commented 5 years ago

I would strongly recommend using sequence accessions instead of names, because they are completely unambiguous (they clearly refer to a unique version of an assembly), and at the same time can be mapped against multiple names in a GUI for user convenience.

Using sequence accessions would also make it possible to support non-human species and sequences that are not just assemblies.

mbaudis commented 5 years ago

@cyenyxe A problem here is the support at the resource level, especially when doing federated queries. With the original Beacon being more a "social experiment", it was easier to provide limited, fixed options.

I agree that attributes like assemblyId or chromosome ... should be specified by referencing some external standard, and then specific environments, networks ... can document which values will be supported. I guess this will be part of the current "re-thinking" for v2. Pinging @sdelatorrep @jrambla for taking note.

cyenyxe commented 5 years ago

The standard for sequence accessioning would be the one defined by the INSDC consortium, which is made up of the ENA, GenBank and DDBJ. For instance, GCA_000001405.14 identifies GRCh37.p13.
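
A minimal Python sketch of how an accession could serve as the canonical identifier while names remain available for display might look like the following. The lookup table, the function names, and the accession pattern (`GCA_`, nine digits, a dot, a version number) are assumptions made for this example. Only the GCA_000001405.14 / GRCh37.p13 pair comes from this thread, and the second alias merely reuses the dot-less notation listed in the first comment.

```python
import re
from typing import List, Optional

# Assumed shape of a GenBank assembly accession: "GCA_" + 9 digits + "." + version.
ACCESSION_RE = re.compile(r"GCA_\d{9}\.\d+")

# Hypothetical lookup table. A real service would load this from an
# external resource; only this single pairing is taken from the thread.
ACCESSION_TO_NAMES = {
    "GCA_000001405.14": ["GRCh37.p13", "GRCh37p13"],
}


def names_for(accession: str) -> List[str]:
    """Return the display names known for an accession (empty list if unknown)."""
    if ACCESSION_RE.fullmatch(accession) is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    return list(ACCESSION_TO_NAMES.get(accession, []))


def accession_for(name: str) -> Optional[str]:
    """Return the accession a display name maps to, or None if unknown."""
    for accession, names in ACCESSION_TO_NAMES.items():
        if name in names:
            return accession
    return None


if __name__ == "__main__":
    print(names_for("GCA_000001405.14"))
    print(accession_for("GRCh37p13"))
```

With this arrangement, queries would carry the unambiguous accession, and a GUI could translate to and from whichever names its users expect.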