Would molecule designation also know about how many strands there are?

andrewyatz commented 3 years ago

@ahwagner asked whether the molecule type field could differentiate between single- and double-stranded DNA.

sveinugu commented 3 years ago

I believe this is just one of many issues that we would need to take into account if we include a molecule type array. Limiting ourselves to the alphabet array would make it easier to draw a line in the sand, as we would then restrict ourselves to fields that concern data representation/models and do not refer directly to the biological domain. In the alphabet array, I don't believe there will be any difference between single- and double-stranded DNA, but it will be useful for differentiating between the major classes of sequences, e.g. DNA, RNA, proteins, etc.
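To make the point concrete, here is a minimal sketch, assuming a collection that carries one alphabet string per sequence. The attribute names, sequence names and alphabet strings are all hypothetical and not taken from any spec; the only point is that the single- and double-stranded DNA entries come out identical.

```python
# Hypothetical collection: one alphabet string per sequence (illustrative only).
collection = {
    "names": ["ds_dna_chromosome", "ss_dna_phage", "mrna", "protein"],
    "alphabets": ["ACGTN", "ACGTN", "ACGUN", "ACDEFGHIKLMNPQRSTVWY"],
}

DNA_LETTERS = set("ACGTN")
RNA_LETTERS = set("ACGUN")


def sequence_class(alphabet: str) -> str:
    """Map an alphabet string to a major sequence class (assumed categories)."""
    letters = set(alphabet)
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    return "protein"


classes = [sequence_class(a) for a in collection["alphabets"]]
# The first two entries get the same class: the alphabet alone
# carries no information about the number of strands.
```

So the alphabet separates DNA, RNA and protein, but strandedness would need to be recorded somewhere else.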

What are the usage scenarios where such a difference is important? I mean, DNA is by definition double-stranded in its stable form, so isn't "single-stranded" then a description of the state of the DNA, and really information that should be stored as annotations/tracks? Please forgive my ignorance here, as I am a computer scientist without any lab experience.

andrewyatz commented 3 years ago

On the last point, it is possible to have single-stranded DNA, and more generally single-stranded genomes, such as SARS-CoV-2, which is a single-stranded RNA virus. It would be useful to know whether a reverse strand is available or not, but it might be fair to say this is out of scope.
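As a rough illustration of why a client might care, here is a sketch, assuming a hypothetical `double_stranded` flag that is not part of any existing field. A consumer could use it to decide whether deriving a reverse strand makes sense at all.

```python
from typing import Optional

# Complement table for a plain DNA alphabet (N maps to itself).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_strand(sequence: str, double_stranded: bool) -> Optional[str]:
    """Return the reverse complement, or None when no reverse strand exists.

    `double_stranded` is a hypothetical flag used for illustration only.
    """
    if not double_stranded:
        return None
    return sequence.translate(COMPLEMENT)[::-1]
```

Without such a flag, the client has to assume that every DNA sequence has a reverse strand.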

sveinugu commented 3 years ago

And ignorant I was! :) Forgetting about viruses in this day and age is not good optics... It is then a more central question than I initially believed.

However, https://github.com/ga4gh/seqcol-spec/issues/15#issuecomment-874643381 provides a possible solution without the use of a molecule_type array.

andrewyatz commented 3 years ago

Working on SARS-CoV-2 has opened my eyes to a lot of other issues out there, so don't worry.

ga4gh / refget, issue #16