SynBioDex / SBOL-Validator

A web application to validate SBOL files
https://validator.sbolstandard.org
Apache License 2.0

HELP #121

Open geoffbaldwin opened 3 years ago

geoffbaldwin commented 3 years ago

I am trying to test the SBOL converter to convert GenBank files to SBOL. It keeps flagging the following error: Converting GenBank to SBOL Version 2 TopLevel https://synbiohub.org/user/gbaldwin/ not found

I have used a valid SynBioHub URL, so I have no idea what it is looking for here. I also don't know what to include in the URI prefix for converted objects. There is no documentation on this, and the video only covers import of SBOL files and conversion to other formats. I need help getting GenBank files into SBOL. Thanks, Geoff

cjmyers commented 3 years ago

You should not be providing a TopLevel URI. You do need to provide a URI prefix, but do not use an SBH URI prefix. Instead, use something like https://baldwin.org/ or really any domain. You might also use the converter rather than the validator, since it has fewer options that may lead to confusion:

https://converter.sbolstandard.org
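
To illustrate what the URI prefix is for, here is a minimal sketch of how a prefix, a display ID, and a version can combine into an SBOL 2 style URI. It is not the converter's code. The helper names `to_display_id` and `compliant_uri` are hypothetical, and the default version of "1" is an assumption.

```python
import re


def to_display_id(name: str) -> str:
    """Turn a free-form name into an SBOL-style displayId (hypothetical helper)."""
    # displayIds may contain only letters, digits, and underscores.
    display_id = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # A displayId may not begin with a digit.
    if display_id and display_id[0].isdigit():
        display_id = "_" + display_id
    return display_id


def compliant_uri(prefix: str, name: str, version: str = "1") -> str:
    """Join prefix, displayId, and version (version "1" is an assumption)."""
    return f"{prefix.rstrip('/')}/{to_display_id(name)}/{version}"


print(compliant_uri("https://baldwin.org/", "B-P1-J231119-F1"))
# prints https://baldwin.org/B_P1_J231119_F1/1
```

The prefix only needs to be a namespace you control; it does not have to resolve to an existing web page.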

jakebeal commented 3 years ago

Can I suggest https://www.imperial.ac.uk/baldwinlab/[projectname] ?

geoffbaldwin commented 3 years ago

Thanks for the useful input. That worked and it overcame one issue compared to directly importing .gb files to SynBioHub in that it preserved the name of the object. However the annotation of the imported object hasn't worked so well.

I am exporting parts from Benchling with the intention of pushing a fairly large library from Benchling to SBH. When the files are imported none of the annotations are correctly classified, so promoters, RBS etc are all just engineered regions and do not have the correct ontology. I was hoping that a file converter might deal with these issues, but apparently not.

Any suggestions how to do a better job on this? The engineered regions have labels e.g. Terminator; Promoter in the .gb - can these be converted into the correct sequence ontology so they display correctly in SBH?

Example .gb file attached (as .txt) B-P1-J231119-F1.txt


geoffbaldwin commented 3 years ago

This is what it looks like after import https://synbiohub.org/user/gbaldwin/BASIC/B_P1_J231119_F1/1/ddbbb21226203031a38224a0b9366a339432787f/share

cjmyers commented 3 years ago

The issue you are having is due to a really inconvenient feature of Benchling. Namely, the annotation type field is free text, so it does not restrict you to a limited set of types. SnapGene, on the other hand, limits you to a semi-standard set of GenBank annotation feature types. Without this restriction, converting to Sequence Ontology types is not possible in all cases. Any variation between your text string and the GenBank feature types makes it difficult or impossible to know what type you are referring to.

I say that the GenBank feature types are semi-standard because I have not been able to find a list of the standard GenBank feature types anywhere. Instead, there is a sort of community-sourced agreement on what they should be. I've collected these from various sources and mapped them to the Sequence Ontology. This is what the GenBank to SBOL converter does. Here is the list:

https://docs.google.com/spreadsheets/d/1X870i3NhO7xEhqhLXK4eravNd72x-O-xbrpmlT835nY/edit?usp=sharing

We could add to this list, but again, Benchling not restricting your types makes this a never-ending problem. I suggest that if you want good conversion, you restrict the types you use in Benchling to this list. This spreadsheet also lists the SBOL Visual glyph that you get for the specified GenBank feature type. Note that many GenBank features do not yet have SBOL Visual glyphs assigned to them. On the flip side, there are SBOL Visual glyphs without a corresponding GenBank feature type.

As for your specific example, nucleotide, spacer, and RiboJ are not GenBank feature types, while Terminator and Promoter should be terminator and promoter. I could potentially fix these latter two by making my converter case insensitive. I have hesitated to do this, though, in order to keep round-tripping consistent.

In summary, if you want your GenBank features to convert to specific Sequence Ontology features, you need to use the semi-standard list. If things are missing, you can suggest additions to our list. In any case, care needs to be taken when entering your types in Benchling, since mis-spellings will defeat the conversion.
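
One way to catch problem types before conversion is to scan the exported files for feature keys that are not on the list. The sketch below is an assumption-laden illustration: `KNOWN_TYPES` is a small illustrative subset and not the contents of the spreadsheet, `SAMPLE` is a made-up record, and the parser only handles single-word feature keys in a standard FEATURES table.

```python
import re
from collections import Counter

# Illustrative subset only; the real list is in the spreadsheet linked above.
KNOWN_TYPES = {"promoter", "terminator", "CDS", "rep_origin", "misc_feature"}

# Feature keys start after exactly five spaces; qualifier lines are indented further.
FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+\S")


def feature_keys(genbank_text: str) -> Counter:
    """Count the feature keys found in the FEATURES table of one record."""
    keys = Counter()
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
        elif line.startswith("ORIGIN"):
            in_features = False
        elif in_features:
            match = FEATURE_KEY.match(line)
            if match:
                keys[match.group(1)] += 1
    return keys


def unknown_keys(genbank_text: str, known=KNOWN_TYPES) -> list:
    """Return the feature keys that are not in the known set (case sensitive)."""
    return sorted(k for k in feature_keys(genbank_text) if k not in known)


# Made-up record for illustration; not the file attached to this issue.
SAMPLE = """\
LOCUS       example                 60 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     Promoter        1..20
                     /label="example promoter"
     RiboJ           21..40
                     /label="example insulator"
     terminator      41..60
                     /label="example terminator"
ORIGIN
        1 acgtacgtac
//
"""

print(unknown_keys(SAMPLE))
# prints ['Promoter', 'RiboJ']
```

For a library of exported files, the same function could be applied to the text of each file in a folder to produce one report before any upload.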

geoffbaldwin commented 3 years ago

Thanks Chris, that's really helpful. Working with a limited set of annotation types in Benchling is certainly a good way forward. Once the features are correctly defined, the auto-annotation will propagate them, which reduces errors and inconsistencies. Making conversion case-insensitive would seem sensible to me: using lower case for rbs, cds etc. looks wrong when viewing in Benchling and limits production of useful figures for reports and publications. Being able to use ORI instead of origin-of-replication would be good. Could multiple syntaxes map to the same SO term? [NB: confused by the term D-loop, I looked up the SO. Origin of replication should be SO:0000296, which is different from a D-loop, SO:0000297. http://www.sequenceontology.org/browser/current_release/term/SO:0000296 ]
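
As a rough illustration of case-insensitive matching with several spellings mapping to one term, here is a minimal sketch. It does not reflect how the converter is implemented. The SO accessions are standard Sequence Ontology terms, but the choice of keys in `SO_BY_TYPE` and the entries in `ALIASES` are hypothetical.

```python
from typing import Optional

# Canonical feature types (lower-cased) mapped to Sequence Ontology accessions.
# The selection of keys here is illustrative, not the converter's list.
SO_BY_TYPE = {
    "promoter": "SO:0000167",
    "terminator": "SO:0000141",
    "cds": "SO:0000316",
    "rbs": "SO:0000139",
    "rep_origin": "SO:0000296",
}

# Hypothetical alternative spellings that resolve to a canonical type.
ALIASES = {
    "ori": "rep_origin",
    "origin_of_replication": "rep_origin",
    "origin-of-replication": "rep_origin",
}


def so_term(feature_type: str) -> Optional[str]:
    """Look up an SO accession, ignoring case and resolving aliases."""
    key = feature_type.strip().casefold()
    key = ALIASES.get(key, key)
    return SO_BY_TYPE.get(key)


print(so_term("Promoter"))  # prints SO:0000167
print(so_term("ORI"))       # prints SO:0000296
print(so_term("RiboJ"))     # prints None
```

A many-to-one mapping like this loses the original spelling, so converting back from SBOL to GenBank would produce the canonical type rather than the text that was entered. That is the round-tripping trade-off mentioned earlier in the thread.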

I will take a look at some of the other features that we have that correspond to SBOL glyphs to see if there are other useful mappings that we could suggest.

jakebeal commented 3 years ago

@cjmyers What do you think of incorporating TYTO lookups to try to resolve unknown terms?

geoffbaldwin commented 3 years ago

TYTO? Lookups sound useful. Might be cumbersome if converting a long list of files. Does it handle multiple files?

geoffbaldwin commented 3 years ago

So here's a few more suggestions to add to the list:

GenBank | SO Term | SBOL Visual
ribozyme | SO:0000374 | RS
bidirectional_promoter | SO:0000568 |
TF_binding_site | SO:0000235 | bind
constitutive_promoter | SO:0002050 | Pro
inducible_promoter | SO:0002051 | Pro
core_promoter_element | SO:0002309 | Pro
sgRNA | SO:0001998 | gRNA
siRNA | SO:0000646 | gRNA
miRNA | SO:0000276 | gRNA

jakebeal commented 3 years ago

@geoffbaldwin https://github.com/SynBioDex/tyto is a Python library built by @bbartley, with functions that let one easily map between names and ontology terms.

cjmyers commented 3 years ago

Using tyto sounds like a plausible way to pull this out of the Java code and make it more extensible. However, I'm not exactly sure how to integrate the Python code into the Java library. I guess it could be done through a web service, but this would make offline conversion impossible.
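
One possible design that keeps offline conversion working is to consult a built-in table first and only fall back to an external lookup for unknown terms. The sketch below shows that idea in Python with hypothetical names throughout. `fake_ontology_lookup` is a stand-in for whatever ontology service or library wrapper would be used; it does not call tyto, and tyto's API is not shown here.

```python
from typing import Callable, Iterable, Optional

Resolver = Callable[[str], Optional[str]]

# Built-in table that works offline (illustrative subset).
LOCAL_TABLE = {"promoter": "SO:0000167", "terminator": "SO:0000141"}


def local_lookup(term: str) -> Optional[str]:
    """Resolve a term from the built-in table only."""
    return LOCAL_TABLE.get(term)


def fake_ontology_lookup(term: str) -> Optional[str]:
    """Stand-in for an external ontology lookup (hypothetical)."""
    return {"ribozyme": "SO:0000374"}.get(term)


def resolve(term: str, resolvers: Iterable[Resolver]) -> Optional[str]:
    """Try each resolver in order and return the first result found."""
    for resolver in resolvers:
        try:
            result = resolver(term)
        except Exception:
            # A failing resolver (for example, no network) is skipped.
            result = None
        if result:
            return result
    return None


print(resolve("promoter", [local_lookup, fake_ontology_lookup]))  # prints SO:0000167
print(resolve("ribozyme", [local_lookup, fake_ontology_lookup]))  # prints SO:0000374
print(resolve("ribozyme", [local_lookup]))                        # prints None
```

With only `local_lookup` in the list, the behaviour matches a purely offline converter; adding a second resolver extends coverage when a lookup service is available.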

jakebeal commented 3 years ago

@cjmyers Looks like there are simple solutions for calling Python from Java, as long as your Python is native (which TYTO is): https://stackoverflow.com/questions/8898765/calling-python-in-java/8899042