biocore / deblur

Deblur is a greedy deconvolution algorithm based on known read error profiles.
BSD 3-Clause "New" or "Revised" License

Improve execution time #167

Closed karlrl closed 6 years ago

karlrl commented 6 years ago

Using Pysam (which wraps htslib) speeds up the sequence trimming step roughly 50-fold. This changeset switches from skbio.read() to pysam.FastxFile() and makes a few other changes to accommodate FastxFile()'s requirement that the input be passed as a path (or stdin), not as a Python file-like object.

As an example of the speedup, processing a ~50k FASTA file took 65s with skbio.read() and 1.4s with pysam.FastxFile().
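
For readers who have not used pysam, here is a minimal sketch of the kind of change described above. It is not the code from this changeset. The helper names `as_path` and `trim_fastx` are hypothetical, and the trimming rule (keep the first `trim_length` bases, skip shorter reads) is an assumption made for illustration. `as_path` shows one way to bridge a text-mode file-like object to the path that pysam.FastxFile() needs, by spooling it to a temporary file.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def as_path(source):
    """Yield a filesystem path for `source` (hypothetical helper).

    Paths are passed through unchanged. A text-mode file-like object is
    copied to a temporary file, which is removed on exit.
    """
    if isinstance(source, (str, os.PathLike)):
        yield os.fspath(source)
        return
    with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
        shutil.copyfileobj(source, tmp)
    try:
        yield tmp.name
    finally:
        os.remove(tmp.name)


def trim_fastx(path, trim_length):
    """Yield (name, trimmed_sequence) pairs (hypothetical helper).

    Assumed rule: reads shorter than `trim_length` are skipped.
    """
    import pysam  # third-party; imported lazily so as_path works without it

    with pysam.FastxFile(path) as records:
        for record in records:
            if len(record.sequence) >= trim_length:
                yield record.name, record.sequence[:trim_length]
```

A caller holding an open handle could then write `with as_path(handle) as p: trimmed = list(trim_fastx(p, 150))`. The extra copy to disk costs something for large inputs, so passing a path directly is preferable where one is available.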

karlrl commented 6 years ago

The flake8 problems are addressed as part of #166. I'm happy to cherry-pick that here if the other PR is rejected.

wasade commented 6 years ago

Sorry for the delayed response on this. While this could improve execution time for one aspect of the pipeline, the trimming component is not a dominant piece of the runtime. I'm concerned about adopting a complex dependency without stronger evidence of its need (and note trimming can be performed prior to deblur execution).