SEP 033: Concrete descriptions of non-canonical DNA, RNA, and proteins

SEP 033: Concrete descriptions of non-canonical DNA, RNA, and proteins #77

Open jonrkarr opened 5 years ago

jonrkarr commented 5 years ago

Full details in the pull request, in the file https://github.com/SynBioDex/SEPs/blob/master/sep_033.md

jakebeal commented 5 years ago

@jonrkarr In theory, supporting BpForms seems pretty straightforward to me: it's just another textual sequence format.

Some things about the recommended implementation in the SEP are not clear to me, however:

Is the recommendation to add BpForms as an optional format or to replace IUPAC as the recommended format?
The SEP recommends incorporating the BpForms library into SBOL libraries. BpForms currently appears to be available only in Python, but SBOL libraries are available in multiple languages (Java, Python, C++, JavaScript, F#). How do you recommend navigating this issue?

jonrkarr commented 5 years ago

Jake, thanks for the clarifying questions. I'll revise the SEP to try to address these questions.

Yes, the proposal is to add BpForms as a recommended encoding for non-canonical DNA, RNA, and proteins. More concretely, the proposal is simply to add BpForms to Table 1 of the SBOL specification.

Although BpForms can represent any IUPAC sequence, BpForms is not a standard, at least not yet, so we are not recommending replacing IUPAC with BpForms. Allowing both IUPAC and BpForms would let SBOL continue to support existing users, as well as accommodate users who require more chemical information and precision.

Going forward, I think it would be useful to incorporate the capabilities to describe and validate non-canonical DNA, RNA, and proteins more directly into SBOL/libSBOL. One potential way is to incorporate the BpForms software into each flavor of libSBOL. Another potential way is to encode non-canonical DNA, RNA, and proteins into RDF. Because there are multiple ways to achieve this, I think this is something that the community should discuss. Until then, users who need to capture more chemical detail could use the BpForms encoding. This would allow users to begin to explore use cases that require more chemical information, which could anchor discussion about how SBOL should proceed.

Although I think it would be helpful to incorporate the interpretation of BpForms and the other encodings (SMILES, IUPAC, IUBMB) into libSBOL (e.g., this would enable verification of encoded strings), I am not recommending incorporating the BpForms software into libSBOL at this time because libSBOL does not currently have the ability to interpret the other sequence encodings. To follow this separation, the BpForms software should remain separate from libSBOL.

jakebeal commented 5 years ago

Thanks: this clarification is very helpful.

If we're going to support adding BpForms to Table 1, then we'll definitely need to have some form of support for dealing with BpForms in the SBOL libraries. The full BpForms library might not be necessary, but a number of library operations depend on the ability to reason about locations in sequence strings, e.g., to annotate a location, to check if a sub-component is correctly aligned, or to compose sequences together (in ways more complex than just concatenating).

If I understand correctly, BpForms does not have a 1-to-1 alignment between string index and sequence location. Is there a simple way of computing locations in BpForms, or does that require a significant amount of the BpForms library code?

jonrkarr commented 5 years ago

Correct, there is no one-to-one alignment between string indices and sequence locations. This is also true of SMILES. However, the alignment is simple (the monomer indices increase from left to right as in IUPAC). We can provide a function to compute locations, which could be incorporated into libSBOL. This just requires counting the number of single characters, square brackets, and curly brackets. This can be done without fully interpreting encoded sequences with the BpForms library.
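
To make the counting idea concrete, here is a minimal sketch in Python. It is not the BpForms library's own code. The function names `monomer_spans` and `monomer_at` are hypothetical, and the sketch assumes that a monomer is either a single character, a code in curly brackets, or an inline definition in square brackets (which may contain nested brackets and double-quoted strings), and that a top-level `|` ends the monomer sequence.

```python
def monomer_spans(seq: str) -> list[tuple[int, int]]:
    """Return one (start, end) string span per monomer, in left-to-right order.

    Assumed token shapes (not taken from the official grammar):
    a single character, a {...} code, or a [...] inline definition.
    """
    spans = []
    i, n = 0, len(seq)
    while i < n:
        ch = seq[i]
        if ch.isspace():
            i += 1
            continue
        if ch == "|":
            # Assumption: a top-level '|' starts global attributes, not monomers.
            break
        if ch == "{":
            j = seq.find("}", i)
            if j == -1:
                raise ValueError(f"unclosed '{{' at index {i}")
            spans.append((i, j + 1))
            i = j + 1
        elif ch == "[":
            depth, in_quote, j = 0, False, i
            while j < n:
                c = seq[j]
                if in_quote:
                    if c == "\\":
                        j += 1  # skip the escaped character
                    elif c == '"':
                        in_quote = False
                elif c == '"':
                    in_quote = True
                elif c == "[":
                    depth += 1
                elif c == "]":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if j >= n:
                raise ValueError(f"unclosed '[' at index {i}")
            spans.append((i, j + 1))
            i = j + 1
        else:
            spans.append((i, i + 1))
            i += 1
    return spans


def monomer_at(seq: str, string_index: int) -> int:
    """Map a string index to a 1-based monomer position."""
    for number, (start, end) in enumerate(monomer_spans(seq), start=1):
        if start <= string_index < end:
            return number
    raise IndexError(f"string index {string_index} is not inside any monomer")
```

For example, under these assumptions `monomer_spans("AC{m2A}G")` yields four spans, and `monomer_at("AC{m2A}G", 4)` maps a character inside the curly-bracket code to monomer 3. A library could use such spans to translate between sequence locations and string offsets without interpreting the chemistry.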

As an aside, the SMILES situation is more complicated because there are multiple versions of SMILES. Different software packages assign different numbers to atoms when they interpret SMILES. For example, OpenBabel and Marvin assign different atom numbers for dGMP (OC1CC(OC1COP(=O)([O-])[O-])n1cnc2c1nc(N)[nH]c2=O). In BpForms, we deal with this by specifying that the atom numbers are in the basis established by OpenBabel.

cjmyers commented 5 years ago

I'm not concerned about the one-to-one alignment issue. Our libraries already allow SMILES, which, as Jonathan points out, does not have this property. We only require it for IUPAC sequences, so we could continue to do so. Indeed, I would expect each object to still have an IUPAC sequence. The BpForms sequence, if provided, would be added information.

I would, though, want the libraries to at least be able to tell whether a BpForms-encoded string is syntactically correct. libSBOLj at least does this for SMILES. I'm not sure if the other libraries do, but for validation I would want to have this ability at least. Do you have a grammar file? Or could you provide a Java library that we could link to that would validate the syntax?
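
The arrangement described above, where each object keeps its IUPAC sequence and may carry a BpForms sequence as extra information, could be sketched as follows. This is an illustrative data model only, not the libSBOL API: the class names, the `has_required_iupac` helper, and the short encoding labels are all hypothetical (real SBOL documents identify encodings by the URIs listed in Table 1 of the specification).

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative labels only; SBOL itself identifies encodings by URI.
IUPAC_DNA = "iupac-dna"
BPFORMS = "bpforms"


@dataclass
class SequenceRecord:
    encoding: str  # which textual format `elements` is written in
    elements: str  # the encoded sequence string


@dataclass
class Component:
    name: str
    sequences: list[SequenceRecord] = field(default_factory=list)

    def sequence_for(self, encoding: str) -> Optional[str]:
        """Return the elements stored under `encoding`, or None if absent."""
        for record in self.sequences:
            if record.encoding == encoding:
                return record.elements
        return None


def has_required_iupac(component: Component) -> bool:
    """Check the policy sketched in the comment: IUPAC is always present."""
    return component.sequence_for(IUPAC_DNA) is not None


part = Component(
    name="example_part",
    sequences=[
        SequenceRecord(IUPAC_DNA, "acga"),
        SequenceRecord(BPFORMS, "AC{m2A}A"),  # optional, richer description
    ],
)
```

Tools that only understand IUPAC would keep reading the first record, while tools that understand BpForms could look up the second.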

jonrkarr commented 5 years ago

It seems like the encodings need to have well-defined locations, which BpForms does. BpForms uses the same location conventions as IUPAC and SMILES. These are also fairly human-readable as the locations increase monotonically from left to right.

Yes, there's a grammar. FYI, there are a few more sophisticated validations that aren't encoded in the grammar. For example, verifying that 5' caps only appear at the last position. SMILES has the same issue: it is possible to encode molecules in SMILES that are not physically realistic. I think grammar validation would be a good start. Ideally, validation would go further to verify that the encoded molecules are realistic.
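
A positional rule like the cap example is the kind of check that runs after parsing, on the ordered list of monomers. The sketch below illustrates that idea under stated assumptions: the function name `check_terminal_only`, the monomer code `cap`, and the `allowed_position` parameter are hypothetical and are not part of the BpForms software.

```python
def check_terminal_only(
    codes: list[str],
    terminal_only: set[str],
    allowed_position: str,
) -> list[str]:
    """Report monomers that appear away from their one permitted position.

    codes: monomer codes in sequence order (already tokenized).
    terminal_only: codes that may appear at a single terminal position.
    allowed_position: "first" or "last", the position those codes may occupy.
    """
    if allowed_position not in ("first", "last"):
        raise ValueError("allowed_position must be 'first' or 'last'")
    allowed_index = 0 if allowed_position == "first" else len(codes) - 1
    errors = []
    for index, code in enumerate(codes):
        if code in terminal_only and index != allowed_index:
            errors.append(
                f"monomer {index + 1} ({code}) is only allowed "
                f"at the {allowed_position} position"
            )
    return errors
```

An empty result means the rule is satisfied. Keeping such checks separate from the grammar lets a library offer syntax validation first and add semantic checks later.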

jonrkarr commented 5 years ago

I updated the SEP to clarify these questions. See PR #83.

cjmyers commented 5 years ago

Sounds good to me. Do you have a Java library that we can use for this?

Chris

jonrkarr commented 5 years ago

The grammar is written for Lark in EBNF (with a few slight modifications described here).

We plan to make a more standard version of the grammar that can be used with other parser generators. I think this will only require a few small changes. There are several parser generators for C, C++, Java, Python, etc. that support EBNF such as:

ANTLR
Beaver
Coco/R
Grammatica
JavaCC
UniCC

If you have a preferred library, we can target that.

graik commented 4 years ago

This proposes that the sequence string of an SBOL record could be embedded with an orthogonal data structure that has nothing to do with either RDF or SBOL. BpForms, it seems, was only proposed this year. No offense, Jon, but writing this SEP while also being the author of BpForms... I think this mostly serves the promotion of the BpForms project.

I believe it would be a bad idea to allow a complex and, so far, non-standard syntax to enter SBOL sequence strings. Right now, sequence strings are IUPAC compatible and can be parsed and processed by pretty much any bioinformatics tool. With this SEP, this compatibility would be broken.

Note also that all examples in the SEP are highly specialized natural modifications of natural molecules. None of the examples comes from engineered systems. I would argue that there is no urgent need for the description of this kind of non-canonical chemistry in bioengineering (and we do support SMILES, which is a standard already supported by many tools). It would be nice if we could describe a handful of very common modifications, but that could be achieved much more easily (and without breaking the sequence record).

cjmyers commented 4 years ago

@graik You are correct that this SEP is about promotion of BpForms for expressing this type of information. Without this SEP, it is already perfectly legal to use BpForms in SBOL. The Sequence encoding types are not restricted to just IUPAC and SMILES, but these are currently the only two that the SBOL community recommends. This SEP does not say that people must use BpForms; rather, it says that if you want to express the types of structure that BpForms can represent, it is a suggested way to do so. BpForms would not replace any of our existing encodings, but it would be an alternative encoding.

To be honest, I'm not sure how useful this is to synthetic biologists or how often this type of information needs to be recorded. Since it is already allowed, the real question is whether there is a better way to express this information.

jamesamcl commented 4 years ago

I am reviewing this on behalf of the SBOL editors. From what I can gather, there are no outstanding questions about the content of the SEP, but only about how useful BpForms would be - something I don't think is really up to us to decide.

We can either take this to a vote now for SBOL 2.4, or we can defer it until after 3.0. @jonrkarr - do you have a preference?

jonrkarr commented 4 years ago

Hi James,

Whatever the editors recommend works for me.

Regarding the utility, this SEP is about enhancing the chemical precision of macromolecules in synthetic designs. While some projects may not need more precision at the current early stage of synthetic biology, the SEP is about facilitating more chemical precision for the projects which need it, which I think will grow in number in the future. This could range from describing critical RNA and protein modifications to describing entirely new genetic codes. For example, we have a project to build cells with mirror chiral proteins that would require something like BpForms to describe the design for the cells. At the moment, there does not appear to be a concrete way to describe modifications within SBOL, let alone designs that involve new genetic codes with new amino acids and/or new peptide bonds.

I think one underlying question for the community is whether SBOL should be able to capture designs for entirely new organisms that may depart significantly from natural biology and that may involve entirely new parts. For example, should SBOL be able to describe organisms that involve new genetic codes? In that case, I think it will be important for SBOL to capture more information that is normally implicit in our shared biochemical knowledge.

Regards Jonathan

cjmyers commented 4 years ago

@jonrkarr Given the concerns raised, I think maybe we should defer this to SBOL3 for now. It can already be used by SBOL2 now, since this is just another possible sequence encoding. I think to advocate its use as a best practice, the community would need to see some (perhaps at least 3) uses of BpForms in synthetic biology examples. If you or collaborators have some examples to share, please do so. We also started a conversation earlier about using constraints and/or interactions to represent some elements of BpForms. This conversation would be interesting to begin again, especially if it may suggest some changes that we should consider for SBOL3. Thanks again for your contributions.