Error if ~harmonize_rca used in GenBank

veghp commented 3 years ago

If the Genbank file has ~harmonize_rca(e_coli => h_sapiens), then running

from dnachisel import DnaOptimizationProblem
problem = DnaOptimizationProblem.from_record("example_sequence.gb")

returns

/path/to/DnaChisel/dnachisel/Specification/FeatureRepresentationMixin.py in from_label(cls, label, location, specifications_dict)
    127                 "Unknown parameter %s for specification %s "
    128                 "at location %s"
--> 129                 % (faulty_parameter, specification, kwargs["location"])
    130             )
    131 

TypeError: Unknown parameter e_coli  for specification harmonize_rca at location 1218-2308(+)

Same error for h_sapiens or h_sapiens_9606.

The function works with CodonOptimize(), e.g. CodonOptimize(species='h_sapiens', location=(0, 99), original_species='e_coli').

Zulko commented 3 years ago

Sorry for that, could have been caught with a test. What probably happens here is that HarmonizeRCA("e_coli => h_sapiens") should work, but the genbank interpreter reads the = sign as HarmonizeRCA(e_coli='>h_sapiens'). One fix would be to replace the => syntax by ->.

veghp commented 3 years ago

Thanks, unfortunately it doesn't work with quotes either as it splits by =, but changing the syntax to -> should be a good approach. I'll also look into how it handles spaces around the arrow, and that it handles two species given.

Also need to clarify the syntax for TaxID in the documentation (e.g. species=1423), for two species.

As a side note, I've come across a limitation of Genbank: feature names are max 50 character, due to the fixed width of the format (80 characters).

     misc_feature    1219..2308
                     /label=~harmonize_rca(e_coli => h_sapiens)
<----  28 characters   ----><---            max 50 characters             --->

So the syntax must be short enough to fit into this.

Zulko commented 3 years ago

Sorry for being unclear, I meant HarmonizeRCA("e_coli => h_sapiens") should work in a script. The genbank limitation is a bit annoying indeed. Some fields can have multiple lines, but not sure for labels.

Zulko commented 3 years ago

@veghp I believe this one can be fixed simply by replacing the Genbank syntax => by ->. Do you want help with that?

veghp commented 3 years ago

Thanks, so I did look into it at that time and the real issue is how the parameters are generated from a Biopython ~... record annotation and passed to the function. The issue mentioned in the first comment is due to an expected parameter=value pattern during parsing.

For example, using

from dnachisel import DnaOptimizationProblem
problem = DnaOptimizationProblem(
    sequence=random_dna_sequence(99),
    objectives=[CodonOptimize(species='h_sapiens', location=(0, 99), original_species='e_coli', method="harmonize_rca"), ]
)
problem.objectives

returns [HarmonizeRCA[0-99](h_sapiens)] which is correct.

problem.optimize_with_report(target="report_random_test.zip")

As is the corresponding report.

However, if I replace with -> in the code and use the attached example Genbank file, then it optimizes for E. coli:

# Single-line arrow, quoted
from dnachisel import DnaOptimizationProblem
problem = DnaOptimizationProblem.from_record("example_sequence.gb")
problem.optimize_with_report(target="report_example_sequence.zip")

In detail:

record = load_record("example_sequence.gb")
record.features[2].qualifiers["label"]
# ['~harmonize_rca(e_coli->h_sapiens)']
dnachisel.biotools.find_specification_label_in_feature(record.features[2])
# '~harmonize_rca(e_coli->h_sapiens)'
from dnachisel.Specification.Specification import Specification
specs = Specification.list_from_biopython_feature(
                record.features[2], specifications_dict="default",
            )
specs
# [('objective', HarmonizeRCA[1218-2310(+)](e_coli))]

This warrants some refactoring ( FeatureRepresentationMixin.from_label() ? ) but I didn't have time to get back to this, so advice/help is welcome.

example_sequence_gb.zip

veghp commented 3 years ago

Another issue, which I've noticed only now, is that loading a quoted feature returns:

/Bio/GenBank/init.py:1291: BiopythonParserWarning: The NCBI states double-quote characters like " should be escaped as "" (two double - quotes), but here it was not: '~harmonize_rca("e_coli -> h_sapiens")'
  BiopythonParserWarning,

Indeed Snapgene Viewer saves a quoted annotation with 2x2 quote; for example /label=~harmonize_rca(""e_coli => h_sapiens""), so I recommend implementing this feature without quotes.

Zulko commented 3 years ago

CodonOptimize(species='h_sapiens', location=(0, 99), original_species='e_coli', method="harmonize_rca") returns HarmonizeRCA[0-99](h_sapiens) which is correct.

That's not 100% correct, in an idea world it would return HarmonizeRCA[0-99](e_coli->h_sapiens). I think the feature->string conversion could be improved.

Indeed Snapgene Viewer saves a quoted annotation with 2x2 quote; for example /label=~harmonize_rca(""e_coli => h_sapiens""), so I recommend implementing this feature without quotes.

To be clear, the Genbank API should never use quotes, so you would use HarmonizeRCA("e_coli->h_sapiens") in a python script (equivalent to CodonOptimize(species='h_sapiens', original_species='e_coli', method="harmonize_rca")). ButHarmonizeRCA(e_coli -> h_sapiens)` in a genbank file.

However, if I replace with -> in the code and use the attached example Genbank file, then it optimizes for E. coli:

Hmm that's indeed a bug. ~harmonize_rca(e_coli -> h_sapiens) should set species to h_sapiens and origin_species to e_coli and it looks like it's failing at that :thinking:. I can probably help fix this over the weekend.

veghp commented 3 years ago

Or perhaps this line should be:

            original_species, species = species.split("->")

Zulko commented 3 years ago

Yes this is probably it

veghp commented 3 years ago

Thanks for the comments and feedback I made a commit that fixes this. Merge pending on some more tests, a test suit, and a new doc image. I'll also look into the string conversion.

veghp commented 3 years ago

Perhaps we can also add a little functionality to codon optimisation that checks for overlapping @cds annotations and somehow warns the user if it's missing.

Edinburgh-Genome-Foundry / DnaChisel

Error if ~harmonize_rca used in GenBank #55