What characters should be allowed in sequence names?

nsheff commented 3 years ago

The seqcol algorithm will include the names of the sequences within the strings that we hash to form the seqcol digests. For example, chr1. You can read the draft specification here. What should be the allowable set of characters included in these sequence names?

The current proposal is to follow the specification defined for SAM headers: https://samtools.github.io/hts-specs/SAMv1.pdf#subsubsection.1.2.1

In short that is:

Reference sequence names may contain any printable ASCII characters in the range[!-~]apart frombackslashes, commas, quotation marks, and brackets—i.e., apart from ‘\ , "‘’ () [] {} <>’—and may notstart with ‘*’ or ‘=’.4

In theory, we can be more relaxed because seqcol itself doesn't have the same restrictions of a sam file. So, we could define a more permissive set of allowable characters for seqcol purposes. The advantage would be that seqcol could then be used for a wider array of possible sequence collections, even ones that wouldn't work in a sam file; the disadvantage is that perhaps we lose an opportunity to encourage people to standardize, which could prevent some headaches downstream.

andrewyatz commented 3 years ago

Sorry @jmarshall but I think you also have some input here. @daviesrob pointed us towards the SAM spec and I think this would be a good starting point to say we set ourselves against this. Only concern might be VCF compatibility?

jmarshall commented 3 years ago

VCF uses the same rules as SAM for sequence names.

This has been brought to TASC (see ga4gh/TASC#5) with a view to harmonising sequence name rules across GA4GH. As yet no-one from other working groups has identified any similar rules from their formats or protocols with which any harmonisation would be necessary, so I intend soon to propose that the SAM rule be designated as the “GA4GH standard sequence name identifier”.

Formats and protocols that use e.g. JSON have the benefit that they only particularly need to restrict quote characters (") or have escaping rules for them. So they could indeed be more permissive. However that would just lead users to have problems later, when producing SAM or VCF files, rather than having the problem made apparent up front. So IMHO it is better to standardise on the lowest common denominator, to avoid submarine headaches.

nsheff commented 3 years ago

Thanks for the comments. After further discussion today in the refget API meeting, it seems that we are approaching a consensus that we should standardize on SAM sequence name rules.

Before a final decision, we wanted to reach out to a few other communities (plant genomics, microbiomes, proteomics), particularly looking for use cases that may not rely on the SAM toolchain, to see if there's an existing community with a use case that would argue for making a more lenient specification.

@andrewyatz can you ping people who may be able to weigh in?

forum post at biobakery

andrewyatz commented 3 years ago

Already pinged them 👍

nsheff commented 3 years ago

Alright, I've now talked to multiple people from the plant and microbial genomics communities, and have repeatedly heard that following the SAM spec will not be a problem.

I have not yet talked to anyone from the proteomics community, though. @andrewyatz has anyone raised any issues with you? I'd like to finalize this point soon so our discussions in other issues can become more concrete.

andrewyatz commented 3 years ago

My plant, microbes and proteomics contacts have all said they see no issues

sveinugu commented 3 years ago

Just felt like throwing in a hot potato here:

UTF8?

nsheff commented 3 years ago

Just felt like throwing in a hot potato here:

UTF8?

Not sure what you're getting at...

sveinugu commented 3 years ago

Just that non-ascii letters tend to appear in the strangest places, breaking things. There is really no reason for a global standard to require that sequences are named with only ASCII letters, except that most tools would probably fail if people do (which is a good reason). So I am not very serious here. But one could potentially explicitly add the possibility of URL-encoding UTF-8 characters and reserve the %-character for that. To be truly global! Then one can use emoji for naming sequences, which is (let's face it) what we all really want to do anyway! 😉 (edit: had to have a real emoji here)

jmarshall commented 3 years ago

If Unicode characters were allowed in these sequence names, then I hope it's obvious that UTF-8 would be the correct encoding for them :smile: — UTF-8 was carefully constructed so that its representations of non-ASCII characters do not collide with any ASCII characters, so there would be no need to further encode them to avoid colliding with commas or US/RS or any other ASCII delimiter characters.

So there's no technical reason to disallow UTF-8-encoded Unicode characters.

It's basically another facet of the design choice of whether to allow arbitrary characters here in this field in this protocol or whether to align with other common restrictions on sequence names.

BTW e.g. SAM and VCF allow UTF-8 in particular description field values (SAM) and in most field values but not INFO/FORMAT keys (VCF), though not in their equivalents of sequence names.

andrewyatz commented 3 years ago

I like the way you're saying this @jmarshall. Alignment with other common restrictions is a good reason

sveinugu commented 3 years ago

I like the way you're saying this @jmarshall. Alignment with other common restrictions is a good reason

Definitely. Which is why I suggested url-encoding them as that should not create any trouble in any tools, however there will be trouble when such sequence names are visualized in e.g. a GUI, so I guess that might be a bad idea. Which means UTF-8 is probably out of the question. Just wanted to raise the issue though.

daviesrob commented 3 years ago

Another reason to ba careful about Unicode that that there's more than one way to write the same visual character, so you have to introduce normalization rules if you want to test for equivalence. For example "Ç" is u+0037 and "Ç" is u+0043u+0327 (or at least was when I pasted it in).

Adding normalization rules makes the specification much more complicated, and means implementers have to depend on extra libraries to implement it. And that's ignoring sorting and collation, which would need their own set of rules to get a consistent outcome.

jmarshall commented 3 years ago

@daviesrob: Good point — that's a very good technical reason to disallow arbitrary Unicode in fields that you need to do comparisons/sorting/etc on, while perhaps allowing it in free-form text fields (like descriptions) that the computer doesn't need to do anything with other than display to the user (of which, by definition, the unhashed digest string doesn't have any).

nsheff commented 3 years ago

See https://github.com/ga4gh/seqcol-spec/pull/9

nsheff commented 2 years ago

This issue was decided with an ADR entry in PR #9.

ga4gh / refget

What characters should be allowed in sequence names? #2