Open jmarshall opened 4 years ago
For reference sequence names, SAM and VCF (and hence BAM, CRAM, and BCF) have a very specific regular expression that disallows whitespace, backslashes, commas, various quotation marks, and brackets, and also =
or *
as the first character (see samtools/hts-specs#333 and samtools/hts-specs#379):
[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
In htsget and refget, these appear as double-quoted JSON strings. So the SAM regex values would fit in these formats without additional escaping or syntax. OTOH neither htsget nor refget suggests that these values are anything other than arbitrary strings.
For sample identifiers, both SAM and VCF disallow tabs, and VCF currently de facto disallows commas. It is proposed (samtools/hts-specs#414) to be explicit about VCF's commas.
Htsget does not currently tie this down in any way, though it would probably be impractical to use a ?
character and whitespace would be semi-impractical. One proposal for the query
part would also disallow the comma character.
In phenopackets, this appears to be described only as an “arbitrary identifier”.
It might be good to try to restrict the character set further (beyond \
Probably not relevant, but here are some identifier rules in the http://bit.ly/ga4gh-passport-v1 spec:
https://github.com/ga4gh-duri/ga4gh-duri.github.io/blob/master/researcher_ids/ga4gh_passport_v1.md#affiliationandrole -- specifically the character restrictions on custom roles.
I think GA4GH needs a common set of core data types, with identifier being one of those. The types should be based on existing standards (e.g., ISO) so we don't reinvent the wheel, and they should be as technology/language-agnostic as possible to support implementations in a variety of systems.
Just my 2 cents. Looking forward to the discussion.
@cdvoisin: Thanks, that's interesting. It's a bit different from the very particular individual classes of metadata item that this issue is trying to focus on, as they inherit the existing defined syntax of URLs.
@rrfreimuth: Those are good general principles. But as was hopefully clear in the discussion, this issue is intended to be about specific items of metadata individually and e.g. is just referring to this group of particular items collectively as “identifiers”.
Next steps on this IMHO is to start with (say) reference sequence names, and try to answer the questions posed in the presentation in the April meeting:
Do other working groups (besides LSG with SAM & VCF) have any similar restrictions on their equivalents of reference sequence names?
(We would want to harmonise to the lowest common denominator.)
Should this definition of RNAME identifier be promoted as the GA4GH portable reference sequence name?
If so, how? Ways of doing this include:
Re sample identifiers, I believe Phenopackets has a representation for such a field:
A Biosample refers to a unit of biological material from which the substrate molecules (e.g. genomic DNA, RNA, proteins) for molecular analyses (e.g. sequencing, array hybridisation, mass-spectrometry) are extracted. […] Field Type Status Description id string required arbitrary identifier […]
Example
{ "id": "sample1", "individualId": "patient1", "description": "", […]
@frafrx or other Phenopackets experts: Biosample describes this id
field simply as an “arbitrary identifier”. Does Phenopackets have any other rules about how these ids may be formed? Would Phenopackets wish to align with VCF's rules disallowing tabs and commas (and possibly other punctuation characters to be determined)?
@jmarshall AFAIK in Phenopackets & Beacon we follow the principles of
id
is resource-local (but could use namespaced identifier)id
of a schema defined object (biosample, subject/individual, ...) can be referenced in other schemas with the schema's name (i.e. the id
of a biosample can be referenced in a derived analysis as biosampleId
or biosample_id
)Late to the party, and too late to change for VCF too, but I dislike the word "contig" being used as just another form of sequence. So did Rodger Staden, the person who coined the word "contig": https://staden.sourceforge.net/contig.html Note for the curious this also uses "gel readings", which later just got shortened to the "reads" we know today.
The original definitions have one thing very clear - it's contiguous, without gaps (if I recall Rodger later adimtted he probably meant continous as the reads don't have to abut, just overlap). Genome browsers started muddling things when they didn't understand the difference between a set of overlapping reads forming a contig, and their consensus sequence. They started just using contig instead of consensus sequence, and in doing so lost of the original meaning and caused more confusion. It then got corrupted even further when they stopped caring whether the sequence was even contiguous or not. Sadly that's where VCF ended up. (Although I note it sometimes uses "Chromosome" instead.)
So my preference is definitely for "reference sequence" or similar, and this also fits far better with most of the other use cases here (SAM, BAM, CRAM, Refget, etc). If we're talking about assemblies, then sometimes "consensus sequence" is more appropriate, but the two have largely interchangeable use cases.
There are a number of short identifier-sized pieces of metadata that are used across many GA4GH products. For example:
Reference sequence names
In SAM/BAM/CRAM, this is the
@SQ-SN
header field andRNAME
/RNEXT
/etc fields. In VCF/BCF, it's the##contig
ID
. In htsget, it'sreferenceName
. In refget, it may be returned as analias
.Sample identifiers
In SAM/BAM/CRAM, this is the
@RG-SM
header field. In VCF/BCF, it's the##SAMPLE
ID
and it also appears on the#CHROM
header line. In htsget, it forms the bulk of thepath
part of request URLs, and there is a proposal to encode samples in thequery
part as well (samtools/hts-specs#430). In Phenopackets, it's a Biosample'sid
field.These items of metadata are embedded within the surrounding text using various delimiters in these various formats and protocols. So there are various restrictions on what characters may appear in them so as to avoid conflicting with the delimiter characters or otherwise requiring complicated escaping or encoding mechanisms. It would be good to harmonise these restrictions across GA4GH products, so that a value that was e.g. a valid Sample identifier in one product could be assumed to also be valid in other products.