Closed veghp closed 3 years ago
Sorry for that, could have been caught with a test. What probably happens here is that HarmonizeRCA("e_coli => h_sapiens") should work, but the genbank interpreter reads the = sign as HarmonizeRCA(e_coli='>h_sapiens'). One fix would be to replace the => syntax by ->.
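The misreading described above can be reproduced with a small stand-alone sketch. The parse_args function below is hypothetical and is not DnaChisel's actual parser; it only assumes that arguments are split on commas and that any argument containing = is treated as a keyword.

```python
def parse_args(arg_string):
    """Hypothetical parser: split on commas, treat 'name=value' as a keyword."""
    args, kwargs = [], {}
    for part in arg_string.split(","):
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            # Split on the first '=' only, as a naive keyword parser would.
            name, value = part.split("=", 1)
            kwargs[name.strip()] = value.strip()
        else:
            args.append(part)
    return args, kwargs


# The '=' inside '=>' is mistaken for a keyword assignment.
print(parse_args("e_coli => h_sapiens"))  # ([], {'e_coli': '> h_sapiens'})
# The '->' arrow contains no '=', so the text stays one positional argument.
print(parse_args("e_coli -> h_sapiens"))  # (['e_coli -> h_sapiens'], {})
```

Under these assumptions, switching the arrow to -> sidesteps the keyword detection entirely, which is why it looks like the simpler fix.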
Thanks, unfortunately it doesn't work with quotes either as it splits by =, but changing the syntax to -> should be a good approach. I'll also look into how it handles spaces around the arrow, and that it handles two species given.
Also need to clarify the syntax for TaxID in the documentation (e.g. species=1423), for two species.
As a side note, I've come across a limitation of Genbank: feature names can be at most 50 characters long, due to the fixed width of the format (80 characters).
     misc_feature    1219..2308
                     /label=~harmonize_rca(e_coli => h_sapiens)
<------ 28 characters -----><-------------- max 50 characters --------------->
So the syntax must be short enough to fit into this.
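A quick way to check whether a given label fits is sketched below. The 50-character limit is the figure quoted in this thread, not a value taken from the GenBank specification, and label_fits is a hypothetical helper.

```python
# Limit as quoted in this thread (assumption, not from the GenBank spec).
MAX_LABEL_LENGTH = 50


def label_fits(label, max_length=MAX_LABEL_LENGTH):
    """Return True if the label text fits on a single feature line."""
    return len(label) <= max_length


for label in [
    "~harmonize_rca(e_coli => h_sapiens)",
    "~harmonize_rca(e_coli->h_sapiens)",
]:
    print(len(label), label_fits(label), label)
```

Dropping the spaces around the arrow saves two characters, which may matter for longer species names or TaxIDs.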
Sorry for being unclear, I meant HarmonizeRCA("e_coli => h_sapiens") should work in a script. The genbank limitation is a bit annoying indeed. Some fields can have multiple lines, but not sure for labels.
@veghp I believe this one can be fixed simply by replacing the Genbank syntax => by ->. Do you want help with that?
Thanks. I did look into it at that time, and the real issue is how the parameters are generated from a Biopython ~... record annotation and passed to the function. The issue mentioned in the first comment is due to an expected parameter=value pattern during parsing.
For example, using
from dnachisel import CodonOptimize, DnaOptimizationProblem, random_dna_sequence
problem = DnaOptimizationProblem(
    sequence=random_dna_sequence(99),
    objectives=[
        CodonOptimize(species='h_sapiens', location=(0, 99), original_species='e_coli', method="harmonize_rca"),
    ],
)
problem.objectives
returns [HarmonizeRCA[0-99](h_sapiens)], which is correct.
problem.optimize_with_report(target="report_random_test.zip")
As is the corresponding report.
However, if I replace => with -> in the code and use the attached example Genbank file, then it optimizes for E. coli:
# Single-line arrow, quoted
from dnachisel import DnaOptimizationProblem
problem = DnaOptimizationProblem.from_record("example_sequence.gb")
problem.optimize_with_report(target="report_example_sequence.zip")
In detail:
import dnachisel
from dnachisel.biotools import load_record
record = load_record("example_sequence.gb")
record.features[2].qualifiers["label"]
# ['~harmonize_rca(e_coli->h_sapiens)']
dnachisel.biotools.find_specification_label_in_feature(record.features[2])
# '~harmonize_rca(e_coli->h_sapiens)'
from dnachisel.Specification.Specification import Specification
specs = Specification.list_from_biopython_feature(
    record.features[2], specifications_dict="default",
)
specs
# [('objective', HarmonizeRCA[1218-2310(+)](e_coli))]
This warrants some refactoring (FeatureRepresentationMixin.from_label()?), but I didn't have time to get back to this, so advice/help is welcome.
Another issue, which I've noticed only now, is that loading a quoted feature returns:
/Bio/GenBank/__init__.py:1291: BiopythonParserWarning: The NCBI states double-quote characters like " should be escaped as "" (two double - quotes), but here it was not: '~harmonize_rca("e_coli -> h_sapiens")'
BiopythonParserWarning,
Indeed Snapgene Viewer saves a quoted annotation with doubled double quotes, for example /label=~harmonize_rca(""e_coli => h_sapiens""), so I recommend implementing this feature without quotes.
CodonOptimize(species='h_sapiens', location=(0, 99), original_species='e_coli', method="harmonize_rca") returns HarmonizeRCA[0-99](h_sapiens) which is correct.
That's not 100% correct: in an ideal world it would return HarmonizeRCA[0-99](e_coli->h_sapiens). I think the feature->string conversion could be improved.
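As a rough illustration of the string conversion suggested here, the sketch below builds a label that includes both species when an original species is given. The harmonize_label function is hypothetical and does not reflect how DnaChisel generates its labels internally.

```python
def harmonize_label(start, end, species, original_species=None):
    """Hypothetical formatter: show 'original->target' when both are known."""
    if original_species is None:
        target = species
    else:
        target = f"{original_species}->{species}"
    return f"HarmonizeRCA[{start}-{end}]({target})"


print(harmonize_label(0, 99, "h_sapiens", "e_coli"))
# HarmonizeRCA[0-99](e_coli->h_sapiens)
print(harmonize_label(0, 99, "h_sapiens"))
# HarmonizeRCA[0-99](h_sapiens)
```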
Indeed Snapgene Viewer saves a quoted annotation with 2x2 quote; for example /label=~harmonize_rca(""e_coli => h_sapiens""), so I recommend implementing this feature without quotes.
To be clear, the Genbank API should never use quotes, so you would use HarmonizeRCA("e_coli->h_sapiens") in a python script (equivalent to CodonOptimize(species='h_sapiens', original_species='e_coli', method="harmonize_rca")). But HarmonizeRCA(e_coli -> h_sapiens) in a genbank file.
However, if I replace with -> in the code and use the attached example Genbank file, then it optimizes for E. coli:
Hmm, that's indeed a bug. ~harmonize_rca(e_coli -> h_sapiens) should set species to h_sapiens and original_species to e_coli, and it looks like it's failing at that :thinking:. I can probably help fix this over the weekend.
Or perhaps this line should be:
original_species, species = species.split("->")
Yes this is probably it
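For reference, a minimal stand-alone sketch of that split, which also tolerates spaces around the arrow and a single species without an arrow. The split_species_arrow helper is hypothetical and is not the code from the actual commit.

```python
def split_species_arrow(text):
    """Hypothetical helper: return (original_species, species) from 'a -> b'.

    When no arrow is present, the whole text is taken as the target species.
    """
    if "->" not in text:
        return None, text.strip()
    # Order matters: the left side is the origin, the right side the target.
    original_species, species = (part.strip() for part in text.split("->", 1))
    return original_species, species


print(split_species_arrow("e_coli -> h_sapiens"))  # ('e_coli', 'h_sapiens')
print(split_species_arrow("e_coli->h_sapiens"))    # ('e_coli', 'h_sapiens')
print(split_species_arrow("h_sapiens"))            # (None, 'h_sapiens')
```

Swapping the two names on the left-hand side of the assignment would reproduce the behaviour reported above, where the optimization targets E. coli instead of H. sapiens.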
Thanks for the comments and feedback. I made a commit that fixes this. Merge pending on some more tests, a test suite, and a new doc image. I'll also look into the string conversion.
Perhaps we can also add a little functionality to codon optimisation that checks for overlapping @cds annotations and somehow warns the user if one is missing.
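One possible shape for such a check is sketched below, using plain (start, end) tuples rather than Biopython features so that it runs on its own. Both function names are hypothetical, and the half-open interval convention is an assumption.

```python
import warnings


def overlaps(a, b):
    """True if half-open intervals (start, end) share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def warn_if_no_cds(objective_location, cds_locations):
    """Warn when no @cds region overlaps the codon optimization region."""
    if any(overlaps(objective_location, cds) for cds in cds_locations):
        return True
    warnings.warn(
        f"No @cds annotation overlaps codon optimization region {objective_location}"
    )
    return False


# Region covered by a @cds annotation: no warning is raised.
warn_if_no_cds((1218, 2310), [(1218, 2310)])
```

A stricter variant could require the @cds region to fully contain the optimized region; which behaviour is wanted is left open here.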
If the Genbank file has ~harmonize_rca(e_coli => h_sapiens), then running
returns
Same error for h_sapiens or h_sapiens_9606. The function works with CodonOptimize(), e.g. CodonOptimize(species='h_sapiens', location=(0, 99), original_species='e_coli').