Harmonise identifier (e.g. RNAME/Contig, Sample) rules across formats and protocols

ga4gh / TASC

TASC aids the harmonisation of aspects of GA4GH's various products that would otherwise prevent different products from being used together conveniently.

https://www.ga4gh.org

8 stars 7 forks source link

Harmonise identifier (e.g. RNAME/Contig, Sample) rules across formats and protocols #5

Open jmarshall opened 4 years ago

jmarshall commented 4 years ago

There are a number of short identifier-sized pieces of metadata that are used across many GA4GH products. For example:

Reference sequence names

In SAM/BAM/CRAM, this is the @SQ-SN header field and RNAME/RNEXT/etc fields. In VCF/BCF, it's the ##contig ID. In htsget, it's referenceName. In refget, it may be returned as an alias.
Sample identifiers

In SAM/BAM/CRAM, this is the @RG-SM header field. In VCF/BCF, it's the ##SAMPLE ID and it also appears on the #CHROM header line. In htsget, it forms the bulk of the path part of request URLs, and there is a proposal to encode samples in the query part as well (samtools/hts-specs#430). In Phenopackets, it's a Biosample's id field.

These items of metadata are embedded within the surrounding text using various delimiters in these various formats and protocols. So there are various restrictions on what characters may appear in them so as to avoid conflicting with the delimiter characters or otherwise requiring complicated escaping or encoding mechanisms. It would be good to harmonise these restrictions across GA4GH products, so that a value that was e.g. a valid Sample identifier in one product could be assumed to also be valid in other products.

jmarshall commented 4 years ago

For reference sequence names, SAM and VCF (and hence BAM, CRAM, and BCF) have a very specific regular expression that disallows whitespace, backslashes, commas, various quotation marks, and brackets, and also = or * as the first character (see samtools/hts-specs#333 and samtools/hts-specs#379):

[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*

In htsget and refget, these appear as double-quoted JSON strings. So the SAM regex values would fit in these formats without additional escaping or syntax. OTOH neither htsget nor refget suggests that these values are anything other than arbitrary strings.

jmarshall commented 4 years ago

For sample identifiers, both SAM and VCF disallow tabs, and VCF currently de facto disallows commas. It is proposed (samtools/hts-specs#414) to be explicit about VCF's commas.

Htsget does not currently tie this down in any way, though it would probably be impractical to use a ? character and whitespace would be semi-impractical. One proposal for the query part would also disallow the comma character.

In phenopackets, this appears to be described only as an “arbitrary identifier”.

It might be good to try to restrict the character set further (beyond \ and \) to free up more punctuation characters for convenient use as format delimiters and on tools' command lines, but this would require analysis of what constitutes a sample name in the wild at present. For reference sequence names we collected fairly extensive statistics (see https://github.com/samtools/hts-specs/pull/333#issuecomment-422778596), but I am unaware of that having been done for sample name identifiers.

cdvoisin commented 4 years ago

Probably not relevant, but here are some identifier rules in the http://bit.ly/ga4gh-passport-v1 spec:

https://github.com/ga4gh-duri/ga4gh-duri.github.io/blob/master/researcher_ids/ga4gh_passport_v1.md#custom-passport-visa-types
https://github.com/ga4gh-duri/ga4gh-duri.github.io/blob/master/researcher_ids/ga4gh_passport_v1.md#url-fields
https://github.com/ga4gh-duri/ga4gh-duri.github.io/blob/master/researcher_ids/ga4gh_passport_v1.md#affiliationandrole -- specifically the character restrictions on custom roles.

rrfreimuth commented 4 years ago

I think GA4GH needs a common set of core data types, with identifier being one of those. The types should be based on existing standards (e.g., ISO) so we don't reinvent the wheel, and they should be as technology/language-agnostic as possible to support implementations in a variety of systems.

Just my 2 cents. Looking forward to the discussion.

jmarshall commented 4 years ago

@cdvoisin: Thanks, that's interesting. It's a bit different from the very particular individual classes of metadata item that this issue is trying to focus on, as they inherit the existing defined syntax of URLs.

@rrfreimuth: Those are good general principles. But as was hopefully clear in the discussion, this issue is intended to be about specific items of metadata individually and e.g. is just referring to this group of particular items collectively as “identifiers”.

jmarshall commented 4 years ago

Next steps on this IMHO is to start with (say) reference sequence names, and try to answer the questions posed in the presentation in the April meeting:

Do other working groups (besides LSG with SAM & VCF) have any similar restrictions on their equivalents of reference sequence names?

(We would want to harmonise to the lowest common denominator.)
Should this definition of RNAME identifier be promoted as the GA4GH portable reference sequence name?

If so, how? Ways of doing this include:
- Incorporate this regexp or equivalent into other GA4GH standards explicitly
- Add references in other GA4GH standards to some pan-GA4GH location describing this pan-GA4GH standard RNAME building block
- Or simply note in other GA4GH standards that “SAM has rules restricting the characters used in RNAMEs — see the SAM specification for details”

jmarshall commented 4 years ago

Re sample identifiers, I believe Phenopackets has a representation for such a field:

A Biosample refers to a unit of biological material from which the substrate molecules (e.g. genomic DNA, RNA, proteins) for molecular analyses (e.g. sequencing, array hybridisation, mass-spectrometry) are extracted. […] Field Type Status Description

id string required arbitrary identifier

[…]

Example
{
  "id": "sample1",
  "individualId": "patient1",
  "description": "",
  […]

A Biosample refers to a unit of biological material from which the substrate molecules (e.g. genomic DNA, RNA, proteins) for molecular analyses (e.g. sequencing, array hybridisation, mass-spectrometry) are extracted. […] Field	Type	Status	Description
id	string	required	arbitrary identifier

@frafrx or other Phenopackets experts: Biosample describes this id field simply as an “arbitrary identifier”. Does Phenopackets have any other rules about how these ids may be formed? Would Phenopackets wish to align with VCF's rules disallowing tabs and commas (and possibly other punctuation characters to be determined)?

mbaudis commented 1 year ago

@jmarshall AFAIK in Phenopackets & Beacon we follow the principles of

id is resource-local (but could use namespaced identifier)
the id of a schema defined object (biosample, subject/individual, ...) can be referenced in other schemas with the schema's name (i.e. the id of a biosample can be referenced in a derived analysis as biosampleId or biosample_id)

jkbonfield commented 1 year ago

Late to the party, and too late to change for VCF too, but I dislike the word "contig" being used as just another form of sequence. So did Rodger Staden, the person who coined the word "contig": https://staden.sourceforge.net/contig.html Note for the curious this also uses "gel readings", which later just got shortened to the "reads" we know today.

The original definitions have one thing very clear - it's contiguous, without gaps (if I recall Rodger later adimtted he probably meant continous as the reads don't have to abut, just overlap). Genome browsers started muddling things when they didn't understand the difference between a set of overlapping reads forming a contig, and their consensus sequence. They started just using contig instead of consensus sequence, and in doing so lost of the original meaning and caused more confusion. It then got corrupted even further when they stopped caring whether the sequence was even contiguous or not. Sadly that's where VCF ended up. (Although I note it sometimes uses "Chromosome" instead.)

So my preference is definitely for "reference sequence" or similar, and this also fits far better with most of the other use cases here (SAM, BAM, CRAM, Refget, etc). If we're talking about assemblies, then sometimes "consensus sequence" is more appropriate, but the two have largely interchangeable use cases.