SynBioDex / SBOL-specification

The Synthetic Biology Open Language (SBOL)
http://sbolstandard.org

Should ComponentDefinition be allowed to have multiple Sequences? #25

Closed jakebeal closed 4 years ago

jakebeal commented 9 years ago

Neil Swainston wrote:

Multiple Sequences per ComponentDefinition. I’m a little troubled by this. As far as I can see, this introduction is causing problems elsewhere in the specification. And the only use case appears to be Matthew’s use of multiple Sequences to define small molecules…

But as a real example for compounds, CHEBI:1606 has the following sequences/structures:

Formula: C6H8N2O2
InChI: InChI=1S/C6H8N2O2/c1-8-3-5(7-4-8)2-6(9)10/h3-4H,2H2,1H3,(H,9,10)
InChI Key: ZHCKPJGJQOPTLB-UHFFFAOYSA-N
SMILES: Cn1cnc(CC(O)=O)c1

I would question whether this is necessary. There is a kind of hierarchy of detail in these definitions - if we have an InChI or InChI Key, we can extract the formula and SMILES string cheminformatically if need be (but we can’t go from formula or SMILES to InChI, for example). I’d personally stick to a single Sequence, and make this as informative as possible. Alternatively, could any of the additional fields be optionally supplied as Annotations? In short, I’m not convinced that this single use-case justifies a change to the spec that has knock-on effects elsewhere. Should we be allowing a ComponentDefinition of a DNA sequence to have multiple Sequences just to encode a number of (maybe redundant) small molecule definitions?

[imported from mailing list]

drdozer commented 9 years ago

I'm finding it invaluable for 2 cases.

I strongly suspect that once we need to store information in addition to the primary sequence of DNA or RNA, such as methylation or other modifications, that we will need a secondary sequence record for biopolymer CDs also. A variant on this would be a protein primary sequence and a secondary sequence record attached to the protein's CD that specifies a phosphorylation state, for example. You can then instantiate many CDs pointing to the same primary amino-acid sequence, pointing to different phosphorylation masks.

The conceptual divergence, I expect, is that I see the Sequence record as a lump of descriptive data and the CD as the entity standing in for the material stuff. Others may see the DNA sequence itself as being the stand-in for the material stuff and the CD as some ephemeral bookkeeping.
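
To make the phosphorylation case concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Sequence` and `ComponentDefinition` classes are plain dataclasses rather than any SBOL library API, the encoding labels `iupac-protein` and `phospho-mask` are placeholders rather than official SBOL encoding URIs, and the mask format (`P` for a phosphorylated residue, `-` otherwise) is invented here. It shows two CDs that share one primary amino-acid Sequence but carry different mask Sequences.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Sequence:
    """A lump of descriptive data: the elements plus how they are encoded."""
    identity: str
    elements: str
    encoding: str  # placeholder label, not an official SBOL encoding URI


@dataclass
class ComponentDefinition:
    """Stands in for the material entity; may reference several Sequences."""
    identity: str
    sequences: List[Sequence] = field(default_factory=list)


# One primary amino-acid sequence shared by every phospho-form.
primary = Sequence("seq/primary", "MSTAYK", "iupac-protein")

# Hypothetical mask encoding: 'P' = phosphorylated residue, '-' = unmodified.
mask_a = Sequence("seq/mask_a", "-P----", "phospho-mask")
mask_b = Sequence("seq/mask_b", "-PP---", "phospho-mask")

form_a = ComponentDefinition("cd/form_a", [primary, mask_a])
form_b = ComponentDefinition("cd/form_b", [primary, mask_b])


def phosphorylated_residues(cd):
    """Return (1-based position, residue) pairs flagged by the CD's mask."""
    seq = next(s for s in cd.sequences if s.encoding == "iupac-protein")
    mask = next(s for s in cd.sequences if s.encoding == "phospho-mask")
    if len(seq.elements) != len(mask.elements):
        raise ValueError("mask and primary sequence differ in length")
    return [
        (i + 1, aa)
        for i, (aa, m) in enumerate(zip(seq.elements, mask.elements))
        if m == "P"
    ]
```

Calling `phosphorylated_residues(form_a)` and `phosphorylated_residues(form_b)` reads two different modification states off the same primary sequence object, which is the sharing described above.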

drdozer commented 9 years ago

Another use-case we can support with multiple sequences for, e.g., a DNA CD is where the same DNA is stored in different formats in different sequence instances. So, it can be stored as raw IUPAC in one (as recommended), but as a FASTA record in another, a GENBANK numbered sequence block in another and so on. As long as the encoding is specified, this is all technically legal. It would be a shame to outlaw this entirely.
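
A rough sketch of what a tool might do with such redundant records, assuming only two encodings and using placeholder labels (`iupac-dna`, `fasta`) instead of real SBOL encoding URIs. The helper names `decode` and `sequences_agree` are hypothetical. The idea is to normalise each record to bare IUPAC letters and check that all records attached to one CD describe the same DNA.

```python
def decode(elements: str, encoding: str) -> str:
    """Normalise a sequence record to bare lowercase IUPAC letters."""
    if encoding == "iupac-dna":
        return elements.strip().lower()
    if encoding == "fasta":
        lines = elements.strip().splitlines()
        # Drop the '>' header line, then join the wrapped sequence lines.
        body = [ln.strip() for ln in lines if not ln.startswith(">")]
        return "".join(body).lower()
    raise ValueError(f"unsupported encoding: {encoding}")


def sequences_agree(records) -> bool:
    """True when every (elements, encoding) pair decodes to the same DNA."""
    decoded = {decode(elements, encoding) for elements, encoding in records}
    return len(decoded) == 1


# The same DNA stored twice: once raw, once as a wrapped FASTA record.
records = [
    ("atgcatgcatgc", "iupac-dna"),
    (">example\nATGCAT\nGCATGC\n", "fasta"),
]
```

Here `sequences_agree(records)` holds because both records decode to the same string. A GenBank numbered block would need its own branch in `decode`, which is omitted from this sketch.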

mikebissell commented 9 years ago

One component might have DNA, RNA, and AA sequences.

We routinely store DNA beside AA for internal use.

jakebeal commented 4 years ago

I believe this is now handled by the explicit relationship between Locations and Sequences in SBOL 3.

cjmyers commented 4 years ago

Agreed. Closing.