"U" (selenocysteine) in aa sequence throws an exception

jamesamcl commented 7 years ago

Reproducible with:

String aa = "MSFSRLYRLLKPALLCGALAAPGLASTMCASRDDWRCARSMHEFSAKDIDGRMVNLDKYRGHVCIVTNVASQUGKTDVNYTQLVDLHARYAECGLRILAFPCNQFGRQEPGSNAEIKEFAAGYNVKFDLFSKICVNGDDAHPLWKWMKVQPKGRGMLGNAIKWNFTKFLIDKNGCVVKRYGPMEEPLVIEKDLPCYL";

Sequence sequence = doc.createSequence("Q9N2J2", "1", aa, Sequence.IUPAC_PROTEIN);

Example protein: http://www.uniprot.org/uniprot/Q9N2J2

Exception in thread "main" org.sbolstandard.core2.SBOLValidationException: sbol-10405: The elements property of a Sequence MUST be consistent with its encoding property. Reference: SBOL Version 2.1.0 Section 7.6 on page 20 : http://synbiohub.org/swissprot/Q9N2J2/1
    at org.sbolstandard.core2.Sequence.setElements(Sequence.java:115)
    at org.sbolstandard.core2.Sequence.<init>(Sequence.java:63)
    at org.sbolstandard.core2.SBOLDocument.createSequence(SBOLDocument.java:1048)
    at org.sbolstandard.core2.SBOLDocument.createSequence(SBOLDocument.java:1018)
    at UniProtToSbol.parseEntry(UniProtToSbol.java:119)
    at UniProtToSbol.main(UniProtToSbol.java:45)

jamesamcl commented 7 years ago

Actually, rejecting "U" is compliant with the IUPAC protein spec. Closing.

jamesamcl commented 7 years ago

...and re-opening. Further investigation reveals the IUPAC protein spec supports "U" for selenocysteine.

http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html#AA212

jamesamcl commented 7 years ago

It might be worth looking at what BioPython does for this: http://biopython.org/DIST/docs/api/Bio.Alphabet.IUPAC-pysrc.html

  """Extended uppercase IUPAC protein single letter alphabet including X etc. 

  In addition to the standard 20 single letter protein codes, this includes: 

   - B = "Asx";  Aspartic acid (R) or Asparagine (N) 
   - X = "Xxx";  Unknown or 'other' amino acid 
   - Z = "Glx";  Glutamic acid (E) or Glutamine (Q) 
   - J = "Xle";  Leucine (L) or Isoleucine (I), used in mass-spec (NMR) 
   - U = "Sec";  Selenocysteine 
   - O = "Pyl";  Pyrrolysine 

  This alphabet is not intended to be used with X for Selenocysteine 
  (an ad-hoc standard prior to the IUPAC adoption of U instead). 
  """

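To make the difference concrete, here is a minimal Java sketch of checking a sequence against the 20 standard codes versus the extended set listed in the docstring above. The class `ExtendedProteinAlphabet` and its method are hypothetical names for illustration; this is not how libSBOLj or BioPython implement the check.

```java
// Hypothetical helper for illustration; not part of libSBOLj or BioPython.
class ExtendedProteinAlphabet {
    // The 20 standard one-letter amino acid codes.
    static final String STANDARD = "ACDEFGHIKLMNPQRSTVWY";
    // Standard codes plus the extra letters from the docstring: B, X, Z, J, U, O.
    static final String EXTENDED = STANDARD + "BXZJUO";

    // Returns true if every character of the sequence appears in the given alphabet.
    // Assumes uppercase input; lowercase letters are rejected.
    static boolean isValid(String sequence, String alphabet) {
        for (int i = 0; i < sequence.length(); i++) {
            if (alphabet.indexOf(sequence.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Fragment of Q9N2J2 around the selenocysteine residue.
        String fragment = "NVASQUGKTD";
        System.out.println(isValid(fragment, STANDARD)); // "U" is not a standard code
        System.out.println(isValid(fragment, EXTENDED)); // "U" is in the extended set
    }
}
```

With a check like this, the fragment fails against the standard alphabet only because of the "U", which matches the exception reported above.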
cjmyers commented 7 years ago

I based it off of this page:

http://www.bioinformatics.org/sms2/iupac.html

However, you are correct that the specification we cite does indeed include "U". I could not find J and O in our cited document, though. Could you?

In any case, I will go ahead and update the validator to allow any alphabetic character in a protein sequence.
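A permissive check along those lines might look roughly like the sketch below. The class and method names are hypothetical, and the choice to accept ASCII letters in either case (and an empty string) is an assumption; the actual libSBOLj change is not shown in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical sketch for illustration; not the actual libSBOLj validator.
class PermissiveProteinCheck {
    // Assumption: "any alpha character" means ASCII letters, upper or lower case.
    private static final Pattern LETTERS_ONLY = Pattern.compile("[A-Za-z]*");

    // Returns true if the elements string contains only letters.
    static boolean isProteinElements(String elements) {
        return LETTERS_ONLY.matcher(elements).matches();
    }

    public static void main(String[] args) {
        System.out.println(isProteinElements("NVASQUGKTD")); // letters only, including U
        System.out.println(isProteinElements("NVASQ*GKTD")); // contains a non-letter
    }
}
```

The trade-off is that a letters-only rule also accepts J and O without needing them to appear in the cited document, at the cost of no longer rejecting letters that have no assigned amino acid code.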

On Nov 23, 2016, at 1:28 PM, James Alastair McLaughlin wrote:

It might be worth looking at what BioPython does for this: http://biopython.org/DIST/docs/api/Bio.Alphabet.IUPAC-pysrc.html

cjmyers commented 7 years ago

Fixed in develop branch.

SynBioDex / libSBOLj

"U" (selenocysteine) in aa sequence throws an exception #411