fabianegli / bioenc

Expanding the Bio-Sequence Encoding

Have combining diacritic marks been considered? #1

Open JervenBolleman opened 5 years ago

fabianegli commented 5 years ago

I did consider them at some point, but they seem too limited to represent all possible, or even the most prevalent, modifications, and they bring additional problems of their own. There are a few ways diacritics could be used, and all of them seem to either miss the point of this project or introduce more problems than they solve:

(A) One possibility would be to use precomposed Unicode characters (a letter and its diacritic as one codepoint). Each modified sequence component would then be a single codepoint, in line with the intended reduction of positional sequence information to one codepoint per position. This ensures a legible visual representation, but the set of precomposed characters is very limited and would allow only a small expansion of the sequence alphabet compared to the proposed expansion into other Unicode codepoint planes.

(B) Another possibility would be to use single combining diacritics on letters for which Unicode has no precomposed character. While this vastly expands the number of encodable modifications, the approach is still restricted to the ~30 diacritics of the Latin alphabet and thus does not allow that many PTMs to be encoded. If we make the effort to expand the encoding to allow PTMs, we should try to allow the encoding of a considerable number of PTMs, including the various branched glycosylations.

(C) Using combinations of diacritics to encode the modifications would provide such a large increase. However, combinations of different diacritics can be misleading and can cause rendering problems when the base glyph is not designed to carry one or more of the diacritics, or when the diacritics are incompatible with each other. In addition, this approach requires separate codepoints for the diacritics, which contradicts the basic idea of this project: to encode each biosequence component in a single codepoint.
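To make the codepoint argument in (A) to (C) concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper names (`nfc_length`, `split_residues`) and the example strings are made up for illustration, and the private-use codepoint U+E000 is only a hypothetical stand-in for a single-codepoint modified residue, not an assignment made by this project.

```python
from __future__ import annotations

import unicodedata


def nfc_length(text: str) -> int:
    """Return the number of codepoints after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


def split_residues(sequence: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    residues: list[str] = []
    for char in sequence:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if unicodedata.combining(char) and residues:
            residues[-1] += char
        else:
            residues.append(char)
    return residues


# (A) "S" + COMBINING ACUTE ACCENT has a precomposed form (U+015A),
#     so NFC collapses it to one codepoint.
s_acute = "S\u0301"

# (B) "Q" + COMBINING ACUTE ACCENT has no precomposed form,
#     so it stays two codepoints even after NFC.
q_acute = "Q\u0301"

# (C) Two stacked marks on the middle residue of a 3-residue sequence.
stacked = "AS\u0301\u0308K"

# Hypothetical single-codepoint encoding of the same modified residue
# (U+E000 is a private-use codepoint chosen only for this example).
single = "A\ue000K"

print(nfc_length(s_acute), nfc_length(q_acute))
print(len(stacked), split_residues(stacked))
print(len(single), single[2] == "K")
```

With combining marks, `len()` and plain indexing no longer correspond to sequence positions, so every consumer would need something like `split_residues` first. With one codepoint per component, position `i` in the string is position `i` in the sequence.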

In addition to the points above, some diacritics are visually similar and can easily be mixed up or misread. This is especially the case when a letter and its diacritic, or two diacritics on the same letter, interact visually, when the text is rendered at a small font size, or when the font is not designed to support the chosen diacritics on the letters they occur with. The last issue might lead to plainly wrong visual representations of sequences.

fabianegli commented 5 years ago

A side note: The expanded IUPAC standard for nucleotide sequences (doi) defines how to represent ambiguities in nucleotide sequences using various ASCII characters, including diacritics and symbols, as well as case and formatting (bold, underline). However, formatting is not conveyed in plain text files, at least not without extra effort to encode it. The desire to remain legible/printable and the continued use of ASCII for ambiguity codes were presumably the main reasons for choosing different formatting to distinguish the letters summarizing up to 3 nucleotides. The same constraints apparently led to the use of diacritics (with and without letters), punctuation, and mathematical symbols as symbols/codepoints for ambiguities including all 4 nucleotides. A short search for citing research (only 4 citations on Crossref) showed some adoption of the fundamental ambiguity codes, but not of the quantitative ambiguities. So the question arises: are there datasets, databases, or software that use the quantitative ambiguity codes?
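For reference, the fundamental (non-quantitative) IUPAC ambiguity codes can be written as a plain ASCII lookup table. This is only a sketch: the names `IUPAC_AMBIGUITY`, `expand` and `compatible` are made up here, and the quantitative codes of the expanded standard are deliberately left out because they are not reproduced in this thread.

```python
# Fundamental IUPAC nucleotide ambiguity codes: one ASCII letter per base set.
IUPAC_AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expand(code: str) -> set:
    """Return the set of nucleotides an ambiguity code stands for."""
    return set(IUPAC_AMBIGUITY[code.upper()])


def compatible(code: str, base: str) -> bool:
    """Check whether a concrete base is allowed by an ambiguity code."""
    return base.upper() in expand(code)


print(sorted(expand("R")))
print(compatible("Y", "A"))
```

Because each code is a single ASCII letter, this part of the standard survives plain text unchanged, which may explain why it saw adoption while the formatting-dependent codes did not.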