SynBioDex / libSBOLj

Java Library for Synthetic Biology Open Language (SBOL)
Apache License 2.0

Converter mangle-merges unrecognized GenBank features #594

Closed jakebeal closed 5 years ago

jakebeal commented 5 years ago

If the SBOL Converter (powered by libSBOLj) sees a feature that it doesn't recognize, then it fails to parse it correctly. Instead, it either a) mangles it together with the prior feature, or b) omits it entirely, when it's the first feature in the list.

Two examples: converter_failure_example.gz

cjmyers commented 5 years ago

So this example file was corrupt. Here is a fixed version: converter_failure_example.zip

This seems to be working now, but please double-check, @jakebeal.

jakebeal commented 5 years ago

In what way was this corrupt, and can we be smarter about having the library recognize mis-formatting? This wasn't hand-generated, but was handed to us as an export from a design tool (maybe Benchling?)

cjmyers commented 5 years ago

When I downloaded the file, it was not recognized as a zip file. I had to extract the GenBank files by hand.

The problem with Benchling is that it allows the user to enter free text for feature types, which is possibly why there are these errors. In general, tools do not restrict feature types to the ones GenBank accepts. As per an earlier email thread, there is no list that I can find anywhere of GenBank accepted features. libSBOLj's converter uses a list that Nathan Hillson created from an earlier GenBank converter. I pinged him about it, and he could not remember where he got the list. You were cc'ed on that email exchange (see Re: GenBank Features).

Not sure what you want us to do exactly. I'm not comfortable doing auto-correction of spelling errors; I think this is outside the scope of libSBOLj. If you have regular misspellings, it is better to correct them with a script before you send the files to the converter. This should be a semi-automated process in which there is manual confirmation.
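A minimal sketch of such a semi-automated pre-processing script, in Python. Everything here is an assumption for illustration: `KNOWN_KEYS` is a tiny hypothetical list (not the list libSBOLj uses), and `suggest_key`, `correct_keys`, and the `confirm` callback are made-up names. The point is only that a close-match suggestion is never applied without a human saying yes.

```python
import difflib

# Hypothetical, deliberately tiny list of accepted feature keys; a real script
# would load whichever list the team has agreed on.
KNOWN_KEYS = ["CDS", "gene", "misc_feature", "promoter", "terminator", "RBS"]


def suggest_key(raw_key, known_keys=KNOWN_KEYS, cutoff=0.8):
    """Return the closest known key for raw_key, or None if nothing is close."""
    if raw_key in known_keys:
        return raw_key
    # Compare case-insensitively, then map back to the canonical spelling.
    lowered = {k.lower(): k for k in known_keys}
    matches = difflib.get_close_matches(
        raw_key.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


def correct_keys(raw_keys, confirm):
    """Map each raw key to its (possibly corrected) replacement.

    confirm(raw, suggestion) must return True before a change is applied,
    so no key is rewritten without a manual decision.
    """
    result = {}
    for raw in raw_keys:
        suggestion = suggest_key(raw)
        if suggestion is not None and suggestion != raw and confirm(raw, suggestion):
            result[raw] = suggestion
        else:
            result[raw] = raw  # unknown or rejected: leave untouched
    return result
```

In interactive use, `confirm` could wrap `input()`; in a batch run it could consult a reviewed table of approved corrections.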

jakebeal commented 5 years ago

I don't know what's going on with your download, as my machine recognizes and auto-extracts.

But why can't you at least parse open named features, even if you don't know how to interpret them semantically? And a poorly formatted file shouldn't cause a silent failure like this; it should cause warnings or errors.
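To make the request concrete, here is a rough Python sketch of a tolerant feature-table reader: it keeps features with unknown keys and warns about anything it cannot parse, instead of dropping or merging lines silently. It is not how libSBOLj works; `KNOWN_KEYS`, `FEATURE_LINE`, and `parse_features` are hypothetical names, and the layout handling (key after five spaces, qualifier lines indented by 21 spaces) is a simplification of the GenBank flat-file format.

```python
import re
import warnings

# Hypothetical set of recognized keys, for illustration only.
KNOWN_KEYS = {"source", "CDS", "gene", "misc_feature", "regulatory"}

# A feature line: five spaces, the key, whitespace, then the location.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+(\S+)\s*$")


def parse_features(lines):
    """Return a list of (key, location, recognized) tuples, one per feature."""
    features = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(" " * 21):
            continue  # qualifier or continuation line; ignored in this sketch
        match = FEATURE_LINE.match(line)
        if match is None:
            if line.strip():
                warnings.warn(f"line {number}: cannot parse feature line {line!r}")
            continue
        key, location = match.groups()
        recognized = key in KNOWN_KEYS
        if not recognized:
            # Keep the feature, but tell the user it was not understood.
            warnings.warn(f"line {number}: unrecognized feature key {key!r}; kept as-is")
        features.append((key, location, recognized))
    return features
```

A free-text key containing spaces does not match `FEATURE_LINE`, so it produces a warning instead of being folded into the previous feature.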

cjmyers commented 5 years ago

They are being parsed. I believe the errors you mentioned above are fixed; see:

https://synbiohub.org/user/myers/GenBankFeatures/GenBankFeatures_collection/1/dc46c0d43f614f3fde48f6059f71a189326c74d4/share

They both have the recombination site. Though "specific_recombination_site" is not recognized by the converter, so it gets converted into a sequence feature, and this type is added as an annotation to the SequenceAnnotation.
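The fallback described here can be sketched in a few lines of Python. This is an illustration of the idea only, not libSBOLj's code: the mapping table is a small hypothetical subset, `convert_feature` is a made-up function, and the annotation name `genbank_feature_type` is an assumption. The SO identifiers are the standard ones for `sequence_feature`, `promoter`, `CDS`, and `terminator`.

```python
SEQUENCE_FEATURE = "SO:0000110"  # SO term "sequence_feature", the generic fallback

# Small illustrative mapping; not the list the converter actually uses.
GENBANK_TO_SO = {
    "promoter": "SO:0000167",
    "CDS": "SO:0000316",
    "terminator": "SO:0000141",
}


def convert_feature(key, location):
    """Convert one GenBank feature into a plain dict standing in for an annotation."""
    so_term = GENBANK_TO_SO.get(key)
    record = {
        "role": so_term if so_term is not None else SEQUENCE_FEATURE,
        "location": location,
        "annotations": {},
    }
    if so_term is None:
        # Preserve the original type so no information is lost in conversion.
        record["annotations"]["genbank_feature_type"] = key
    return record
```

With this shape, an unrecognized key still yields a feature at the right location, and the original type string can be recovered from the annotation later.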

The mangling you discovered was indeed an error, and libSBOLj has been updated to prevent that. What it is not able to do is to take free text GenBank features and figure out which SO term would be the closest term.

cjmyers commented 5 years ago

BTW: the zip file I posted includes the same GenBank files, but not the SBOL files. This zip file uploads to SBH just fine. Nothing is dropped as far as I can tell.

One thing that could be done is that we could possibly check unknown GenBank features against SO to see if there is the same term in SO. The thing that really concerns me is that GenBank feature types are not precisely defined anywhere. Please go back and look at the thread with Nathan about how he got his list for conversion. It is really unclear to me what is indeed a valid GenBank feature type.
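As a rough idea of what that check might look like, here is a Python sketch that looks up an unknown feature type by name in an index of SO term names. The index below is a hand-written three-entry stand-in (an assumption; a real one would be built from the SO release file), and `lookup_so_by_name` is a hypothetical helper. Only exact name matches after light normalization are accepted, so nothing is guessed.

```python
# Tiny stand-in for an index of SO term names to identifiers.
SO_NAME_INDEX = {
    "promoter": "SO:0000167",
    "terminator": "SO:0000141",
    "insulator": "SO:0000627",
}


def lookup_so_by_name(feature_type, index=SO_NAME_INDEX):
    """Return the SO identifier whose term name equals feature_type, else None.

    Normalization is limited to case, surrounding whitespace, and
    space/hyphen versus underscore; no fuzzy matching is attempted.
    """
    normalized = feature_type.strip().lower().replace(" ", "_").replace("-", "_")
    return index.get(normalized)
```

A `None` result would send the feature down the existing generic `sequence_feature` path.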

jakebeal commented 5 years ago

Good to hear that the mangling is fixed; that was not clear to me from your prior comments.

Doulix has a much more comprehensive feature list, which they have indicated that they are willing to share. I'll put you in touch.

cjmyers commented 5 years ago

Ok, thanks. Will close this one then.

DLDavide commented 5 years ago

Hi @jakebeal @cjmyers,

GenBank was conceived as a standard, and the acceptable "feature keys" are detailed here.

Unfortunately, the lack of tools to validate .gb files has turned gb into a set of loosely applied formatting guidelines :-(

For instance, most bio software now exports "promoter" as a feature key, whereas it should be a "regulatory" feature key with a "regulatory_class=promoter" qualifier.
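A small Python sketch of that normalization, assuming the qualifier is named `regulatory_class` as in the INSDC feature table definition. The mapping is an illustrative subset and `normalize_feature` is a hypothetical helper, not part of any existing tool.

```python
# Legacy feature keys and the regulatory_class value each maps to
# (illustrative subset only).
LEGACY_REGULATORY_KEYS = {
    "promoter": "promoter",
    "terminator": "terminator",
    "RBS": "ribosome_binding_site",
    "enhancer": "enhancer",
}


def normalize_feature(key, qualifiers):
    """Return (key, qualifiers) with legacy regulatory keys rewritten.

    Keys outside the mapping are returned unchanged; the input dict is
    never modified in place.
    """
    regulatory_class = LEGACY_REGULATORY_KEYS.get(key)
    if regulatory_class is None:
        return key, dict(qualifiers)
    updated = dict(qualifiers)
    # Do not overwrite a regulatory_class the file already provides.
    updated.setdefault("regulatory_class", regulatory_class)
    return "regulatory", updated
```

Running this before conversion lets a single downstream rule handle both the legacy and the current spelling.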

This kind of freedom has made gb challenging to import, as different software uses different rules. We have compiled a gb-2-so dictionary that should cover most of the software on the market right now.

We would be glad to share it if it helps.