SynBioDex / pySBOL

SWIG-Python wrappers implementing SBOL 2.2
Apache License 2.0

write/export functions crash Colab Notebook #133

Closed eyesmo closed 2 years ago

eyesmo commented 3 years ago

Hi! I'm working on learning to use pySBOL for my wetware design work. Towards that end, I've been trying to take some genetic designs stored in Benchling, and re-build them as SBOL documents with pySBOL, in a Colab notebook.

I've gotten to the 'last' step: writing the design to an .xml file or exporting it to a .gb file. However, when I try to run either

result = doc.write(folderPath + 'AF_SpisPink_Cassette.xml')

or

doc.exportToFormat('GenBank', folderPath + 'AF_SpisPink_Cassette.gb')

the Colab notebook pops up a little window that says 'your session crashed for an unknown reason,' like so:

Screen Shot 2021-02-02 at 6 11 18 PM

When I then click on "View runtime logs," here's what I see:

Screen Shot 2021-02-02 at 6 11 45 PM

Below is the code I was working on to create the SBOL design. Apologies if it's ugly/wrong--I'm just in the early stages of learning this package and the SBOL data model.

As a side note, are there any repositories of example SBOL designs created and visualized with pySBOL and related packages, similar to the sample plots gallery for Matplotlib?

Thanks!

Code:

#Import pySBOL package
import sbol as sb
from sbol import *

#Now, let's try to build a FreeGenes part in pySBOL
#Let's try to build the AF_SpisPink_Cassette part, which allows people to do
#pink/white colony screening to increase cloning efficiency:
#https://benchling.com/openbioeconomy/f/lib_RSHKnK2W-destination-vector/seq_MMuUpcqh-af_spispink_cassette/edit

#Set the homespace to allow for SBOL-compliant URIs
#Just use the tutorial's homespace for now
setHomespace('http://sys-bio.org')

#Define a document object where the SBOL design will be created/stored/annotated
doc = Document()

#Define the components that will go in this SBOL design

#Here's the part that defines the overall sequence of this backbone plasmid
AF_SpisPink_Cassette = ComponentDefinition('AF_SpisPink_Cassette')

#And here are various parts that are sub-components of this backbone vector
#CG126_F_Primer = ComponentDefinition('CG126') #These might need to be SequenceAnnotations rather than ComponentDefinitions, actually
#CG127_R_Primer = ComponentDefinition('CG127')

#Define a BsaI 'reverse' binding site that will cut 5' / upstream of the binding
#site on the top strand
BsaI_bind_R = ComponentDefinition('BsaI_R')
#Define a BsaI 'forward' binding site that will cut 3'/downstream of the binding
#sequence defined on the top strand
BsaI_bind_F = ComponentDefinition('BsaI_F')
#Define an object to represent the BsaI cut site upstream/5' to the insert site
BsaI_cutSite_upstream = ComponentDefinition('OH_BB_1') #Overhang Backbone 1
#And do the same for the downstream BsaI cut site
BsaI_cutSite_downstream = ComponentDefinition('OH_BB_2') #Overhang Backbone 2
#(Just note for later that ideally, restriction enzyme cutting and binding sites
# and the gap between them should be part of the same SBOL object)

#Define component for the constitutive promoter J23119
J23119_promoter = ComponentDefinition('J23119')

#Define a component for the ribosome binding site B0034
B0034_RBS = ComponentDefinition('B0034')

#Define a component for the spisPink chromoprotein CDS
spisPink_CDS = ComponentDefinition('spisPink')

#Define a component for the spisPink Forward and Reverse Primers
#sPinkF_Primer = ComponentDefinition('sPinkF')
#sPinkR_Primer = ComponentDefinition('sPinkR')

#I think the HindIII at the end of the spisPink CDS is relevant, so add 
# an object for it
#HindIII_cutSite = ComponentDefinition('HindIII')

#Define the two transcriptional terminator components
rrnb_T1_terminator = ComponentDefinition('rrnb_T1_terminator')
T7TE_terminator = ComponentDefinition('T7TE_terminator')

#I think, for the purposes of assembling the full part, it's easiest to just
# define and specify any un-annotated sequences as gaps or spacer sequences

#spacer between upstream BsaI cut and binding site
spacer1 = ComponentDefinition('spacer1')
#spacer between BsaI binding site and promoter start site
spacer2 = ComponentDefinition('spacer2')
#spacer between the end of spisPink CDS and the start of the first terminator
spacer3 = ComponentDefinition('spacer3')
#spacer between first and second terminator
spacer4 = ComponentDefinition('spacer4')
#spacer between end of 2nd terminator and start of downstream BsaI binding site
spacer5 = ComponentDefinition('spacer5')
#spacer between downstream BsaI binding and cut sites
spacer6 = ComponentDefinition('spacer6')
#Okay! All the components of this genetic device are defined. 
#Now, define the roles of all the subcomponents.
#Use the sequence ontology database of terms
#http://www.sequenceontology.org/browser/current_release/term/SO:0001976

AF_SpisPink_Cassette.roles = [sb.SO + '0000352', sb.SO + '0000546'] #SO terms for DNA and engineered sequence

BsaI_bind_R.roles = sb.SO + '0000061' #SO term for restriction enzyme bind site
BsaI_bind_F.roles = sb.SO + '0000061' #SO term for restriction enzyme bind site                                                        5'-NNNN NNNNNNNNN-3'
BsaI_cutSite_upstream.roles = sb.SO + '0001975' #Term for restriction enzyme cut site that leaves a 5' overhang, in this case like so       3'-NNNNNNNNN-5'
BsaI_cutSite_downstream.roles = sb.SO + '0001975' #Term for restriction enzyme cut site that leaves a 5' overhang, in this case  like so  5'-NNNNNNNNN-3'
J23119_promoter.roles = SO_PROMOTER                                                                                                #      3'-NNNNNNNNN NNNN-5'
B0034_RBS.roles = SO_RBS
spisPink_CDS.roles = SO_CDS

rrnb_T1_terminator.roles = SO_TERMINATOR
T7TE_terminator.roles = SO_TERMINATOR

spacer1.roles = sb.SO + '0002223' #SO term for inert spacer sequence
spacer2.roles = sb.SO + '0002223'
spacer3.roles = sb.SO + '0002223'
spacer4.roles = sb.SO + '0002223'
spacer5.roles = sb.SO + '0002223'
spacer6.roles = sb.SO + '0002223'
#Okay! Now start to assemble the whole design in the document
#Note the following objects recapitulate the whole sequence:
#OH_BB_1, spacer1, BsaI_R, spacer2, J23119, B0034,
# spisPink, spacer3, rrnb_T1_terminator, spacer4, T7TE_terminator, spacer5, BsaI_F, spacer6, OH_BB_2

#Add the ComponentDefinitions of the sub-parts of the SBOL design to the doc
doc.addComponentDefinition(AF_SpisPink_Cassette)
doc.addComponentDefinition([BsaI_bind_R, BsaI_bind_F, BsaI_cutSite_upstream,
                            BsaI_cutSite_downstream, J23119_promoter, B0034_RBS,
                            spisPink_CDS, rrnb_T1_terminator, T7TE_terminator,
                            spacer1, spacer2, spacer3, spacer4, spacer5, spacer6])
#Great! Now assemble the primary structure (the order) of the design.
#Assemble together the ComponentDefinitions of the sub-parts
#Note the following objects recapitulate the whole sequence:
#OH_BB_1, spacer1, BsaI_R, spacer2, J23119, B0034,
# spisPink, spacer3, rrnb_T1_terminator, spacer4, T7TE_terminator, spacer5, BsaI_F, spacer6, OH_BB_2
AF_SpisPink_Cassette.assemblePrimaryStructure(['OH_BB_1', 'spacer1', 'BsaI_R', 'spacer2', 'J23119', 'B0034',
                                                'spisPink', 'spacer3', 'rrnb_T1_terminator', 'spacer4',
                                                'T7TE_terminator', 'spacer5', 'BsaI_F', 'spacer6', 'OH_BB_2'])
#Now add the sequence information for all the parts in the design
#Define the sequences of the different parts in the SBOL design

BsaI_bind_R.sequence = Sequence('BsaI_R', 'GAGACC')
BsaI_bind_F.sequence = Sequence('BsaI_F', 'GGTCTC')
BsaI_cutSite_upstream.sequence = Sequence('OH_BB_1', 'GGAG')
BsaI_cutSite_downstream.sequence = Sequence('OH_BB_2', 'CGCT')
J23119_promoter.sequence = Sequence('J23119', 'TTGACAGCTAGCTCAGTCCTAGGTATAATGCTAGC')
B0034_RBS.sequence = Sequence('B0034', 'TACTAGAGAAAGAGGAGAAATACTAA')
spisPink_CDS.sequence = Sequence('spisPink', 'ATGTCGCACTCAAAACAAGCACTGGCAGATACGATGAAAATGACGTGGCTGATGGAAGGTTCGGTCAACGGTCACGCCTTTACGATTGAAGGCGAAGGTACCGGCAAACCGTATGAAGGTAAACAGAGTGGCACCTTTCGTGTTACGAAAGGCGGTCCGCTGCCGTTTGCGTTTGATATTGTCGCTCCGACCCTGAAATACGGTTTCAAATGCTTCATGAAATACCCGGCGGATATTCCGGACTACTTTAAACTGGCCTTCCCGGAAGGCCTGACGTACGATCGTAAAATTGCGTTTGAGGACGGCGGTTGTGCGACCGCCACGGTCGAAATGAGCCTGAAGGGTAACACCCTGGTGCATAAAACGAACTTTCAGGGCGGTAATTTCCCGATTGATGGTCCGGTGATGCAAAAACGCACCCTGGGCTGGGAACCGACCTCTGAAAAAATGACGCCGTGCGATGGTATTATCAAAGGCGACACCATCATGTACCTGATGGTTGAAGGCGGTAAAACGCTGAAATGTCGTTATGAAAACAATTACCGCGCCAACAAACCAGTGCTGATGCCGCCGAGCCACTTTGTGGATCTGCGTCTGACCCGCACGAATCTGGATAAAGAAGGTCTGGCGTTCAAACTGGAAGAATATGCTGTTGCCCGTGTGCTGGAAGTGTAATAA')
rrnb_T1_terminator.sequence = Sequence('rrnb_T1_terminator', 'CAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAACGCTCTC')
T7TE_terminator.sequence = Sequence('T7TE_terminator', 'GGCTCACCTTCGGGTGGGCCTTTCTGCG')
spacer1.sequence = Sequence('spacer1', 'C')
spacer2.sequence = Sequence('spacer2', 'ATTGACAGCGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCG')
spacer3.sequence = Sequence('spacer3', 'GCTT')
spacer4.sequence = Sequence('spacer4', 'TACTAGAGTCACACT')
spacer5.sequence = Sequence('spacer5', 'C')
spacer6.sequence = Sequence('spacer6', 'A')
#Assign to AF_SpisPink_Cassette the sequence generated by assembling the primary structure
#of the other genetic parts
fullDesign_sequence = AF_SpisPink_Cassette.compile()
print(AF_SpisPink_Cassette.sequence.elements)
#Excellent, I've been able to compile the complete sequence. Now, can I add the
# primers as sequence annotations?

#Define the SequenceAnnotation object
CG126_F_Primer = SequenceAnnotation('CG126')
#Define a role for this sequence annotation
CG126_F_Primer.roles = sb.SO + '0000121' #The SO term for forward primer

#Define which ComponentDefinition the SequenceAnnotation refers to
CG126_F_Primer.component = 'AF_SpisPink_Cassette'

#Define a Range object for where the SequenceAnnotation will go on its component
CG126_F_PrimerRange = Range('CG126_F_PrimerRange')
CG126_F_PrimerRange.start = 1
CG126_F_PrimerRange.end = 19

#Define a Location object which will hold the Range start/end information
CG126_F_PrimerLoc = Location('CG126_F_PrimerLoc')
CG126_F_PrimerLoc.Range = CG126_F_PrimerRange
#Define an orientation for the Location (top strand or bottom strand)
CG126_F_PrimerLoc.orientation = SBOL_ORIENTATION_INLINE

#Now attach the Location to the CG126 sequence annotation
CG126_F_Primer.locations.add(CG126_F_PrimerLoc)

#And for reference purposes, define the sequence in the testPrimer SequenceAnnotation
CG126_F_Primer.sequence = 'GGAGCGAGACCATTGACAG'
#And link to the Benchling file from which this design was originally derived
CG126_F_Primer.wasDerivedFrom = 'https://benchling.com/openbioeconomy/f/lib_RSHKnK2W-destination-vector/seq_MMuUpcqh-af_spispink_cassette/edit'
#I think this works as a way to add annotations?
AF_SpisPink_Cassette.sequenceAnnotations.add(CG126_F_Primer)
#Excellent, I've been able to add primers as sequence annotations.
#Now add the other primers.

#Define the SequenceAnnotation object
CG127_R_Primer = SequenceAnnotation('CG127')

#Define which ComponentDefinition the SequenceAnnotation refers to
CG127_R_Primer.component = 'AF_SpisPink_Cassette'

#Define a Range object for where the SequenceAnnotation will go on its component
CG127_R_PrimerRange = Range('CG127_R_PrimerRange')
CG127_R_PrimerRange.start = 912
CG127_R_PrimerRange.end = 925

#Define a Location object which will hold the Range start/end information
CG127_R_PrimerLoc = Location('CG127_R_PrimerLoc')
CG127_R_PrimerLoc.Range = CG127_R_PrimerRange
#Define an orientation for the Location (top strand or bottom strand)
CG127_R_PrimerLoc.orientation = SBOL_ORIENTATION_REVERSE_COMPLEMENT

#Now define the location of the testPrimer sequence annotation, using testPrimerLoc
CG127_R_Primer.locations.add(CG127_R_PrimerLoc)

#And for reference purposes, define the sequence in the testPrimer SequenceAnnotation
CG127_R_Primer.sequence = 'AGCGTGAGACCGCG'
#And link to the Benchling file from which this design was originally derived
CG127_R_Primer.wasDerivedFrom = 'https://benchling.com/openbioeconomy/f/lib_RSHKnK2W-destination-vector/seq_MMuUpcqh-af_spispink_cassette/edit'
#I think this works as a way to add annotations?

AF_SpisPink_Cassette.sequenceAnnotations.add(CG127_R_Primer)
#Now add the other primers.

#Define the SequenceAnnotation object
sPinkF_Primer = SequenceAnnotation('sPinkF')

#Define which ComponentDefinition the SequenceAnnotation refers to
sPinkF_Primer.component = 'AF_SpisPink_Cassette'

#Define a Range object for where the SequenceAnnotation will go on its component
sPinkF_PrimerRange = Range('sPinkF_PrimerRange')
sPinkF_PrimerRange.start = 116
sPinkF_PrimerRange.end = 141

#Define a Location object which will hold the Range start/end information
sPinkF_PrimerLoc = Location('sPinkF_PrimerLoc')
sPinkF_PrimerLoc.Range = sPinkF_PrimerRange
#Define an orientation for the Location (top strand or bottom strand)
sPinkF_PrimerLoc.orientation = SBOL_ORIENTATION_INLINE

#Now define the location of the testPrimer sequence annotation, using testPrimerLoc
sPinkF_Primer.locations.add(sPinkF_PrimerLoc)

#And for reference purposes, define the sequence in the testPrimer SequenceAnnotation
sPinkF_Primer.sequence = 'AAGCTCTTCATCCAATGTCGCACTCAAAACAAGCACTGG' #Note that this primer has some extra non-complementary sequence on its 5' end
#And link to the Benchling file from which this design was originally derived
sPinkF_Primer.wasDerivedFrom = 'https://benchling.com/openbioeconomy/f/lib_RSHKnK2W-destination-vector/seq_MMuUpcqh-af_spispink_cassette/edit'
#I think this works as a way to add annotations?

AF_SpisPink_Cassette.sequenceAnnotations.add(sPinkF_Primer)
#Now add the other primers.

#Define the SequenceAnnotation object
sPinkR_Primer = SequenceAnnotation('sPinkR')

#Define which ComponentDefinition the SequenceAnnotation refers to
sPinkR_Primer.component = 'AF_SpisPink_Cassette'

#Define a Range object for where the SequenceAnnotation will go on its component
sPinkR_PrimerRange = Range('sPinkR_PrimerRange')
sPinkR_PrimerRange.start = 769
sPinkR_PrimerRange.end = 788

#Define a Location object which will hold the Range start/end information
sPinkR_PrimerLoc = Location('sPinkR_PrimerLoc')
sPinkR_PrimerLoc.Range = sPinkR_PrimerRange
#Define an orientation for the Location (top strand or bottom strand)
sPinkR_PrimerLoc.orientation = SBOL_ORIENTATION_REVERSE_COMPLEMENT

#Now define the location of the testPrimer sequence annotation, using testPrimerLoc
sPinkR_Primer.locations.add(sPinkR_PrimerLoc)

#And for reference purposes, define the sequence in the testPrimer SequenceAnnotation
sPinkR_Primer.sequence = 'TTGCTCTTCTTCGACCTCCCACTTCCAGCACACGGGCAA' #Note that this primer has some extra non-complementary sequence on its 5' end
#And link to the Benchling file from which this design was originally derived
sPinkR_Primer.wasDerivedFrom = 'https://benchling.com/openbioeconomy/f/lib_RSHKnK2W-destination-vector/seq_MMuUpcqh-af_spispink_cassette/edit'

AF_SpisPink_Cassette.sequenceAnnotations.add(sPinkR_Primer)
#Finally, add this HindIII site that I think might be relevant as a sequenceAnnotation
HindIII_site = SequenceAnnotation('HindIII')
HindIII_site.component = 'AF_SpisPink_Cassette'
HindIII_siteRange = Range('HindIII_siteRange')
HindIII_siteRange.start = 793
HindIII_siteRange.end = 798
#Define a Location object which will hold the Range start/end information
HindIII_siteLoc = Location('HindIII_siteLoc') #Use a unique ID here; 'sPinkR_PrimerLoc' is already taken above
HindIII_siteLoc.Range = HindIII_siteRange
#Don't need an orientation for a palindromic restriction enzyme site
HindIII_site.locations.add(HindIII_siteLoc)
HindIII_site.sequence = 'AAGCTT'
AF_SpisPink_Cassette.sequenceAnnotations.add(HindIII_site)
for i in AF_SpisPink_Cassette.sequenceAnnotations:
  print(i)
#Now, save the design and export it!
folderPath = '/content/drive/MyDrive/Frugal Science/Friendzymes/Wetware Design/pySBOL/'
result = doc.write(folderPath + 'AF_SpisPink_Cassette.xml')

doc.exportToFormat('GenBank', folderPath + 'AF_SpisPink_Cassette.gb')

bbartley commented 3 years ago

Hi @eyesmo, thanks for trying out pySBOL!

One thing you can try before calling doc.write is to call print(doc.validate()) to see whether you can identify problems with the Document.
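As a rough sketch of that "validate first, then write" habit, the helper below only writes when the validation report does not start with "Invalid" (the prefix seen in the report quoted later in this thread). The helper name write_if_valid, the stub class, and the "Valid." report string are assumptions made for illustration; only validate() and write() come from the thread.

```python
def write_if_valid(doc, path):
    """Write doc to path only if its validation report does not start with 'Invalid'.

    `doc` is any object exposing validate() -> str and write(path), which is
    how Document is used in this thread. Returns the validation report.
    """
    report = doc.validate()
    if report.strip().startswith("Invalid"):
        print("Not writing", path)
        print(report)
    else:
        doc.write(path)
    return report


class StubDocument:
    """Hypothetical stand-in for an SBOL Document so the sketch runs without pySBOL."""

    def __init__(self, report):
        self.report = report
        self.written = []

    def validate(self):
        return self.report

    def write(self, path):
        self.written.append(path)


# "Valid." is an assumed success message; the "Invalid." prefix is from this thread.
bad = StubDocument("Invalid. sbol-10902: Strong Validation Error")
good = StubDocument("Valid.")
write_if_valid(bad, "design.xml")
write_if_valid(good, "design.xml")
```

With a real Document in place of the stub, the same call would skip the write step whenever validation reports a problem.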

An easy fix I would suggest trying is to simply replace pySBOL with pySBOL2 (https://github.com/SynBioDex/pySBOL2). pySBOL2 is implemented natively in Python rather than C++, so it is more stable than pySBOL. The APIs are exactly the same, so your code should work as is.
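One practical detail when switching: pySBOL2 is imported as sbol2, while the code above imports sbol. A minimal sketch of picking whichever module is installed follows; the helper name load_sbol is hypothetical, and the rest is standard-library importlib.

```python
import importlib
import importlib.util


def load_sbol(candidates=("sbol2", "sbol")):
    """Import and return the first installed module in `candidates`, else None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return importlib.import_module(name)
    return None


sb = load_sbol()
if sb is None:
    print("Neither sbol2 nor sbol is installed; in Colab, try: !pip install sbol2")
else:
    print("Using", sb.__name__)
```

Preferring sbol2 first means the notebook picks up the pure-Python implementation when both packages happen to be present.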

eyesmo commented 3 years ago

Thanks for the suggestions! I've switched to using pySBOL2.

print(doc.validate()) throws the following strong validation error:

Invalid. sbol-10902: Strong Validation Error: The locations property of a SequenceAnnotation is REQUIRED and MUST contain a non-empty set of Location objects. Reference: SBOL Version 2.3.0 Section 7.7.4 on page 32: http://sys-bio.org/ComponentDefinition/AF_SpisPink_Cassette/sPinkF/1 Validation failed.
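For longer reports, it can help to pull out just the rule ID and the URI of the offending object. The sketch below does that with regular expressions on the message quoted above; the function name and the assumption that every report follows this "sbol-NNNNN ... http://..." shape are mine, not something the validator documents.

```python
import re


def summarize_report(report):
    """Return (rule_ids, uris) found in a validation report string."""
    rule_ids = re.findall(r"sbol-\d+", report)
    uris = re.findall(r"https?://\S+", report)
    return rule_ids, uris


report = (
    "Invalid. sbol-10902: Strong Validation Error: The locations property of a "
    "SequenceAnnotation is REQUIRED and MUST contain a non-empty set of Location "
    "objects. Reference: SBOL Version 2.3.0 Section 7.7.4 on page 32: "
    "http://sys-bio.org/ComponentDefinition/AF_SpisPink_Cassette/sPinkF/1 "
    "Validation failed."
)
rule_ids, uris = summarize_report(report)
print(rule_ids)
print(uris)
```

Here the URI points at the sPinkF annotation, which is what narrows the problem down to how its locations were attached.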

So I think the problem is arising when I attempt to add additional SequenceAnnotations with specific Locations to the ComponentDefinition for the compiled design. The Location (and Range?) objects I try to create and assign to the SequenceAnnotation aren't being added correctly.

Here's an example of how I'm trying to add an annotation for a primer sequence. Does anything immediately pop out as incorrectly done?

#Now add the other primers.

#Define the SequenceAnnotation object
sPinkF_Primer = SequenceAnnotation('sPinkF')

#Define which ComponentDefinition the SequenceAnnotation refers to
sPinkF_Primer.component = 'AF_SpisPink_Cassette'

#Define a Range object for where the SequenceAnnotation will go on its component
sPinkF_PrimerRange = Range('sPinkF')
sPinkF_PrimerRange.start = 116
sPinkF_PrimerRange.end = 141

#Define a Location object which will hold the Range start/end information
sPinkF_PrimerLoc = Location('sPinkF')
sPinkF_PrimerLoc.Range = sPinkF_PrimerRange
#Define an orientation for the Location (top strand or bottom strand)
sPinkF_PrimerLoc.orientation = SBOL_ORIENTATION_INLINE

#Now define the location of the testPrimer sequence annotation, using testPrimerLoc
sPinkF_Primer.locations.add(sPinkF_PrimerLoc)
print(sPinkF_PrimerLoc)
print(sPinkF_Primer.locations)
#And for reference purposes, define the sequence in the testPrimer SequenceAnnotation
sPinkF_Primer.sequence = 'AAGCTCTTCATCCAATGTCGCACTCAAAACAAGCACTGG' #Note that this primer has some extra non-complementary sequence on its 5' end
#And link to the Benchling file from which this design was originally derived
sPinkF_Primer.wasDerivedFrom = 'https://benchling.com/openbioeconomy/f/lib_RSHKnK2W-destination-vector/seq_MMuUpcqh-af_spispink_cassette/edit'
#I think this works as a way to add annotations?

AF_SpisPink_Cassette.sequenceAnnotations.add(sPinkF_Primer)

bbartley commented 3 years ago

These lines are potentially problematic:

#Define a Range object for where the SequenceAnnotation will go on its component
sPinkF_PrimerRange = Range('sPinkF')
sPinkF_PrimerRange.start = 116
sPinkF_PrimerRange.end = 141

#Define a Location object which will hold the Range start/end information
sPinkF_PrimerLoc = Location('sPinkF')
sPinkF_PrimerLoc.Range = sPinkF_PrimerRange
#Define an orientation for the Location (top strand or bottom strand)
sPinkF_PrimerLoc.orientation = SBOL_ORIENTATION_INLINE

#Now define the location of the testPrimer sequence annotation, using testPrimerLoc
sPinkF_Primer.locations.add(sPinkF_PrimerLoc)

These can be replaced with the following. Note that Range is a subclass of Location, so you only need to create one or the other type of object, not both:

#Define a Range object for where the SequenceAnnotation will go on its component
sPinkF_PrimerRange = Range('sPinkF')
sPinkF_PrimerRange.start = 116
sPinkF_PrimerRange.end = 141
sPinkF_PrimerRange.orientation = SBOL_ORIENTATION_INLINE

#Now add the Range directly to the locations of the sPinkF sequence annotation
sPinkF_Primer.locations.add(sPinkF_PrimerRange)
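To make the subclass relationship concrete, here is a toy model in plain Python. These are not the pySBOL classes: the class bodies, the add_location method, and the "inline" string are stand-ins invented for illustration. The only point is that a Range already is a Location, so it can go straight into an annotation's locations.

```python
class Location:
    """Toy base class: identifies where an annotation sits."""

    def __init__(self, display_id, orientation=None):
        self.display_id = display_id
        self.orientation = orientation


class Range(Location):
    """Toy subclass: a Location with 1-based, inclusive start and end."""

    def __init__(self, display_id, start, end, orientation=None):
        super().__init__(display_id, orientation)
        self.start = start
        self.end = end


class SequenceAnnotation:
    """Toy annotation that owns a list of Location objects."""

    def __init__(self, display_id):
        self.display_id = display_id
        self.locations = []

    def add_location(self, location):
        if not isinstance(location, Location):
            raise TypeError("expected a Location or one of its subclasses")
        self.locations.append(location)


primer = SequenceAnnotation("sPinkF")
primer_range = Range("sPinkF_range", 116, 141, orientation="inline")
primer.add_location(primer_range)  # no separate Location wrapper needed

print(isinstance(primer_range, Location))
print([(loc.start, loc.end) for loc in primer.locations])
```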

eyesmo commented 3 years ago

Thanks so much! That and a couple of other tweaks got the whole design to validate. One last question: how would you recommend visualizing a finished design, to check that the sequence is correct and the features are correctly positioned? I tried exporting to GenBank, but the conversion was quite lossy when I opened it in SnapGene Viewer: the full sequence showed up, but all the annotations were labeled only with their roles (promoter, CDS, etc.), and the name for each part was missing.

Screen Shot 2021-02-03 at 6 21 43 PM

I also found that SnapGene Viewer can't open the .xml file that contains the complete SBOL design description.

Are there sequence viewing applications you'd recommend that can open SBOL .xml design files (does SBOLDesigner 2 also have SnapGene/Benchling-like sequence browsing and editing capabilities)? Alternatively, is there an export format or command you'd recommend that generally recovers both the component type and the component name (and maybe even the component description) in the exported file format?

bbartley commented 3 years ago

Generally speaking, there isn't great interoperability between sequence editors and SBOL tools... there's definitely a need. As for visualization tools, there aren't any great ones in Python. There are the https://sbolcanvas.org/canvas/ and http://visbol.org/ web tools for visualization. Also, I believe there is a SnapGene plug-in for SynBioHub. For more about this, I'll refer you to @cjmyers. Chris, is there anything more you want to say about the SnapGene plug-in and how well conversion works between SnapGene and SBOL?
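Short of a viewer, feature positions can at least be sanity-checked in plain Python. The sketch below is not a pySBOL feature: it lays the first four parts from the design above end to end, computes 1-based inclusive coordinates for each, and checks that the CG126 primer sequence matches the range it was annotated with (1 to 19). The function names layout and covered are hypothetical.

```python
# First four parts of the design, in assembly order, with sequences from this thread.
PARTS = [
    ("OH_BB_1", "GGAG"),
    ("spacer1", "C"),
    ("BsaI_R", "GAGACC"),
    ("spacer2", "ATTGACAGCGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCG"),
]


def layout(parts):
    """Return (full_sequence, {name: (start, end)}) using 1-based inclusive ranges."""
    full = ""
    ranges = {}
    for name, seq in parts:
        start = len(full) + 1
        full += seq
        ranges[name] = (start, len(full))
    return full, ranges


def covered(full, start, end):
    """Return the bases covered by a 1-based inclusive range."""
    return full[start - 1:end]


full, ranges = layout(PARTS)
cg126 = "GGAGCGAGACCATTGACAG"  # CG126 forward primer, annotated at 1..19

print(ranges)
print(covered(full, 1, 19) == cg126)
```

The same comparison can be repeated for each annotation on the top strand; reverse-complement annotations would need the primer reverse-complemented before comparing.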

jakebeal commented 2 years ago

As there has been no activity on this issue for nearly a year, I'm going to close it. Please open a new issue if additional help is needed.