czbiohub-sf / orpheum

Orpheum (previously called sencha, and published under that name) is a Python package for directly translating RNA-seq reads into coding protein sequences.
MIT License

Remove Biopython dependency #82

Closed olgabot closed 4 years ago

olgabot commented 4 years ago

This PR removes the Biopython dependency. A lot of time is spent converting Python strings to Biopython Seq objects and back, which makes sencha translate take forever.

olgabot commented 4 years ago

Old way:

import timeit
timeit.timeit(
    'TranslateSingleSeq.three_frame_translation(Seq("CGCTTGCTTAATACTGACATCAATAATATTAGGAAAATCGCAATATAACTGTAAATCCTGTTCTGTC"))',
    setup='from Bio.Seq import Seq\nfrom sencha.translate_single_seq import TranslateSingleSeq',
    number=int(1e6))
Out[13]: 0.6314636569999834

New way:

from sencha.constants_translate import STANDARD_CODON_TABLE_MAPPING
timeit.timeit(
    'TranslateSingleSeq.three_frame_translation(Seq("CGCTTGCTTAATACTGACATCAATAATATTAGGAAAATCGCAATATAACTGTAAATCCTGTTCTGTC"))',
    setup='from Bio.Seq import Seq\nfrom sencha.translate_single_seq import TranslateSingleSeq',
    number=int(1e6))
Out[18]: 0.5854893400000094

Hmm, only 8% faster??

0.5854893400000094/0.6314636569999834
Out[19]: 0.9271940411925002
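
For reference, the ratio above can be read two ways, and a short sketch makes the difference explicit. It uses only the two timings reported in this thread; the variable names are illustrative and not part of sencha.

```python
# Timings copied from the two timeit runs above (seconds for 1e6 calls).
old_seconds = 0.6314636569999834  # Out[13], Biopython-based run
new_seconds = 0.5854893400000094  # Out[18], run after this change

# Fraction of the old runtime that the new code still needs.
ratio = new_seconds / old_seconds

# Percent reduction in wall-clock time.
time_saved_pct = (1 - ratio) * 100

# Percent increase in throughput (calls per second).
speedup_pct = (old_seconds / new_seconds - 1) * 100

print(f"ratio={ratio:.4f}")
print(f"time saved={time_saved_pct:.1f}%")
print(f"speedup={speedup_pct:.1f}%")
```

Either way, the gain is a bit under 8%.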
olgabot commented 4 years ago

Now the reads just stay as pure Python strings! No Biopython backend necessary. The translation happens in translate_single_seq.py using the STANDARD_CODON_TABLE_MAPPING specified in constants_translate.py.
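
As a rough illustration of the idea (not the actual code in translate_single_seq.py), translating a plain string with a dictionary lookup might look like the sketch below. The names CODON_TABLE, translate, and three_frame_translation_sketch are hypothetical, and the table is rebuilt here from the standard genetic code rather than imported from constants_translate.py. Mapping unknown codons to "X" is also an assumption.

```python
from itertools import product

# Standard genetic code in TCAG order (third base varies fastest).
# Hypothetical stand-in for the mapping in constants_translate.py.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, unknown="X"):
    """Translate a DNA string codon by codon; trailing 1-2 bases are dropped."""
    seq = seq.upper()
    end = len(seq) - len(seq) % 3  # last index that completes a codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], unknown)  # assumption: ambiguous codons -> "X"
        for i in range(0, end, 3)
    )


def three_frame_translation_sketch(seq):
    """Return the translations of the three forward reading frames."""
    return [translate(seq[frame:]) for frame in range(3)]


print(three_frame_translation_sketch("ATGGCC"))  # ['MA', 'W', 'G']
```

Because the input and output are both ordinary str objects, no Seq objects are constructed or unwrapped along the way.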