Open andrewyatz opened 3 years ago
I believe this is just one of many examples of issues that we would need to take into account if we include a molecule type array. Limiting ourselves to the alphabet array would make it easier to draw a line in the sand, as we would then limit ourselves to fields that are about data representation/models and not referring directly to the biological domain. In the alphabet array, I don't believe there will be any difference between single- and double-stranded DNA, but it will be useful for differentiating between the major classes of sequences, e.g. DNA, RNA, and proteins.
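To illustrate the point about the alphabet array, here is a minimal sketch. It assumes the alphabet is simply the sorted set of symbols used in each sequence; the dictionary layout, the example names, and the `alphabet_of` helper are hypothetical and not taken from the seqcol spec.

```python
# Hypothetical sketch: attribute names and the helper are illustrative only,
# not part of the seqcol specification.

def alphabet_of(sequence: str) -> str:
    """Return the sorted set of symbols used in a sequence."""
    return "".join(sorted(set(sequence.upper())))


# A toy collection with made-up sequences.
collection = {
    "names": ["dsDNA_example", "ssDNA_example", "RNA_example", "protein_example"],
    "sequences": ["ACGTTGCA", "ACGTTGCA", "ACGUUGCA", "MKVLAAGIVG"],
}

# One alphabet entry per sequence, parallel to the other arrays.
collection["alphabet"] = [alphabet_of(s) for s in collection["sequences"]]

for name, alphabet in zip(collection["names"], collection["alphabet"]):
    print(name, alphabet)
```

Under this assumption the single- and double-stranded DNA entries get identical alphabets, while DNA, RNA, and protein entries differ, so strandedness would have to be recorded somewhere else.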
What are the usage scenarios where such a difference is important? I mean, DNA is by definition double-stranded in its stable form, so isn't single-stranded DNA then a description of the state of the DNA, and really information that should be stored as annotations/tracks? Please forgive my ignorance here, as I am a computer scientist without any lab experience.
On the last point, single-stranded genomes do exist. SARS-CoV-2, for example, is a single-stranded RNA virus, and there are single-stranded DNA viruses as well. It would be useful to know whether a reverse strand is available or not, but it might be fair to say this is out of bounds.
And ignorant I was! :) Forgetting about viruses in this day and age is not good optics... It is then a more central question than I initially believed.
However, https://github.com/ga4gh/seqcol-spec/issues/15#issuecomment-874643381 provides a possible solution without the use of a molecule_type array.
Working on SARS-CoV-2 has opened my eyes to a lot of other issues out there, so don't worry.
@ahwagner asked whether the molecule type field could differentiate between single- and double-stranded DNA.