teemukataja opened this issue 5 years ago (status: Open)
I would strongly recommend using sequence accessions instead of names, because they are completely unambiguous (they clearly refer to a unique version of an assembly), and at the same time can be mapped against multiple names in a GUI for user convenience.
Using sequence accessions would also make it possible to support non-human species and sequences that are not just assemblies.
@cyenyxe A problem here is the support at the resource level, especially when doing federated queries. With the original Beacon being more of a "social experiment", it was easier to provide limited, fixed options.
I agree that attributes like `assemblyId` or `chromosome` ... should be specified by referencing some external standard, and then specific environments, networks ... can document which values will be supported. I guess this will be part of the current "re-thinking" for v2. Pinging @sdelatorrep @jrambla for taking note.
The standard for sequence accessioning would be the one defined by the INSDC, which is made up of ENA, GenBank and DDBJ. For instance, GCA_000001405.14 identifies GRCh37.p13.
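To make the accession idea concrete, here is a minimal sketch of how a Beacon could validate an accession and map it to display names. The accession pattern (`GCA_` plus nine digits plus a version) is my assumption about the format, and `ACCESSION_TO_NAMES`, `parse_accession` and `display_names` are hypothetical names, not part of any Beacon implementation. The only table entry is the one mentioned above.

```python
import re

# Assumed shape of an assembly accession: "GCA_" + 9 digits + "." + version.
ACCESSION_RE = re.compile(r"^(GCA_\d{9})\.(\d+)$")

# Hypothetical lookup table; it holds only the example given in this thread.
ACCESSION_TO_NAMES = {
    "GCA_000001405.14": ["GRCh37.p13"],
}


def parse_accession(value):
    """Split an accession into (base accession, version) or raise ValueError."""
    match = ACCESSION_RE.match(value)
    if match is None:
        raise ValueError(f"not an assembly accession: {value!r}")
    return match.group(1), int(match.group(2))


def display_names(accession):
    """Return the human-readable names known for a valid accession."""
    parse_accession(accession)  # reject malformed input first
    return ACCESSION_TO_NAMES.get(accession, [])
```

The version suffix is what makes the identifier unambiguous: a GUI can show any of the mapped names while the query itself carries the accession.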
This is related to https://github.com/ga4gh-beacon/specification/issues/222
Currently the specification states that `assemblyId` should be given in `GRCh` format. But what if a dataset that is older than GRCh is shared via Beacons, and isn't sequenced using an assembly that is directly translatable to modern assemblies?
In `beacon-python` we started to use the regex `^((GRCh|hg)[0-9]+([.]?p[0-9]+)?)$` for `assemblyId` validation, which accepts both `GRCh` and `hg` identifiers with an optional patch suffix.
I found out that the `hg` notation can be used to some extent, as it has a translation for both `NCBI` and `GRC` assemblies. Are there other common assemblies that are used and should be supported? Is there a reason that only GRC notation should be enforced, or should we broaden the allowed assemblies?
I believe @mbaudis might have some knowledge on this matter?
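For reference, a small sketch of what the regex above accepts and rejects. The helper name `is_valid_assembly_id` and the sample values are only illustrative; this is not the actual `beacon-python` code.

```python
import re

# The regex quoted in this issue, compiled unchanged.
ASSEMBLY_ID_RE = re.compile(r"^((GRCh|hg)[0-9]+([.]?p[0-9]+)?)$")


def is_valid_assembly_id(value):
    """Return True if value matches the GRCh/hg pattern from this issue."""
    return ASSEMBLY_ID_RE.match(value) is not None


# Illustrative inputs: the first four match, the last two do not.
for candidate in ["GRCh38", "GRCh37.p13", "GRCh37p13", "hg19", "NCBI36", "grch38"]:
    print(candidate, is_valid_assembly_id(candidate))
```

Note that the pattern is case-sensitive and rejects older NCBI-style names such as `NCBI36`, which is the kind of case the question above is about.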