how does syrah relate to trim-low-abund.py in khmer?

phiweger commented 7 years ago

A recommendation in https://github.com/dib-lab/sourmash/issues/283 was to trim k-mers to avoid collecting sequencing errors. How are the above two approaches related? From a quick glance at the code, there seems to be some overlap. Are they basically doing the same thing?

Thx!

ctb commented 7 years ago

Same in spirit, but there are a few differences:

syrah splits reads at Ns and errors and keeps the pieces, whereas trim-low-abund truncates each read at the first error, so it drops more data. This is fixable!
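To make the split vs. truncate difference concrete, here is a minimal toy sketch. It is not syrah or khmer code: it counts k-mers exactly with a plain `Counter` rather than khmer's probabilistic counting, and the names `count_kmers`, `truncate_read`, `split_read`, `K` and `CUTOFF` are hypothetical, chosen only for illustration.

```python
from collections import Counter

K = 4        # toy k-mer size (assumption; real tools use a much larger k)
CUTOFF = 2   # k-mers seen fewer than CUTOFF times are treated as likely errors


def count_kmers(reads, k=K):
    """Exact k-mer counts over all reads (stand-in for a real counting structure)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def is_solid(kmer, counts, cutoff=CUTOFF):
    """A k-mer is 'solid' if it has no N and is abundant enough."""
    return "N" not in kmer and counts[kmer] >= cutoff


def truncate_read(read, counts, k=K, cutoff=CUTOFF):
    """Truncate style: keep only the prefix before the first non-solid k-mer."""
    if len(read) < k:
        return ""
    for i in range(len(read) - k + 1):
        if not is_solid(read[i:i + k], counts, cutoff):
            # Solid k-mers 0..i-1 cover read[0 : i + k - 1].
            return read[:i + k - 1] if i > 0 else ""
    return read


def split_read(read, counts, k=K, cutoff=CUTOFF):
    """Split style: keep every maximal stretch covered only by solid k-mers."""
    pieces = []
    start = None
    for i in range(len(read) - k + 1):
        if is_solid(read[i:i + k], counts, cutoff):
            if start is None:
                start = i
        elif start is not None:
            pieces.append(read[start:i + k - 1])
            start = None
    if start is not None:
        pieces.append(read[start:])
    return pieces


clean = "ACGTTGCAAGGCTA"
noisy = "ACGTTGTAAGGCTA"   # same read with one substitution in the middle
counts = count_kmers([clean, clean, clean, noisy])

print(truncate_read(noisy, counts))  # ACGTTG
print(split_read(noisy, counts))     # ['ACGTTG', 'AAGGCTA']
```

In this toy case truncation keeps 6 of 14 bases, while splitting keeps 13, because the error-free sequence after the bad base is retained as a second piece.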

syrah is much less configurable, and so doesn't work for as many situations yet (while trim-low-abund is super flexible and therefore also very confusing).

trim-low-abund permits both streaming and semi-streaming (and the latter uses some amount of disk space for large/low-coverage data sets). syrah is pure streaming and uses no disk space.

syrah was built for this project (ivory.idyll.org/blog/2017-sourmash-sra-microbial-wgs.html), and it's not clear how general it is. trim-low-abund (t-l-a) is my current recommendation.

phiweger commented 7 years ago

very clear explanation, thanks

dib-lab / syrah

how does syrah relate to trim-low-abund.py in khmer? #10