esteinig / nanoq

Minimal but speedy quality control for nanopore reads in Rust :bear:
MIT License

[JOSS Review] Differences from filtlong #20

Closed bovee closed 2 years ago

bovee commented 3 years ago

I was comparing nanoq to filtlong and the timing/performance seem comparable to what you've documented, but I see some differences in the output files I get.

For example, I pulled the first four million reads of the Zymo even data and ran them through both filtlong and nanoq with the settings -p 80 -b 500000000, and got somewhat different sets of output reads. Comparing the two, 42,994 reads were only in filtlong's output, 20,795 were shared, and 91,994 were only in nanoq's output. This might be because filtlong imposes a 5 kbp minimum read length filter and nanoq doesn't, but I'm not sure how to set the same length prefilter in nanoq to compare the results.
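
A read-level comparison of this kind can be reproduced by collecting the read identifiers from each output file and intersecting the two sets. The following is a minimal sketch, not the method actually used in this review: the function names `read_ids` and `compare` are hypothetical, and it assumes strict four-line FASTQ records whose identifier is the first whitespace-delimited token after `@`.

```rust
use std::collections::HashSet;
use std::io::{BufRead, BufReader, Read};

/// Collect read identifiers from strict four-line FASTQ input.
/// Assumption: sequences are not line-wrapped, so headers sit on every fourth line.
fn read_ids<R: Read>(input: R) -> HashSet<String> {
    BufReader::new(input)
        .lines()
        .map(|line| line.expect("failed to read line"))
        .step_by(4) // header lines only
        .filter_map(|header| {
            header
                .strip_prefix('@')
                .and_then(|rest| rest.split_whitespace().next())
                .map(str::to_owned)
        })
        .collect()
}

/// Return (only in a, shared, only in b) read counts.
fn compare(a: &HashSet<String>, b: &HashSet<String>) -> (usize, usize, usize) {
    let shared = a.intersection(b).count();
    (a.len() - shared, shared, b.len() - shared)
}

fn main() {
    // Tiny in-memory stand-ins for the two output files.
    let first = "@r1 desc\nACGT\n+\nIIII\n@r2\nAC\n+\nII\n";
    let second = "@r2\nAC\n+\nII\n@r3\nA\n+\nI\n";

    let (only_first, shared, only_second) =
        compare(&read_ids(first.as_bytes()), &read_ids(second.as_bytes()));
    println!(
        "only first: {}, shared: {}, only second: {}",
        only_first, shared, only_second
    );
}
```

For real files, `std::fs::File::open` can be passed to `read_ids` in place of the byte slices. Holding all identifiers in memory is fine at this scale (on the order of 100,000 reads per output).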

| Read summary | Filtlong | Nanoq |
| --- | ---: | ---: |
| Number of reads | 63,789 | 112,789 |
| Number of bases | 500,002,622 | 499,995,330 |
| N50 read length | 7,674 | 4,889 |
| Longest read | 48,626 | 22,047 |
| Shortest read | 5,000 | 173 |
| Mean read length | 7,838 | 4,433 |
| Median read length | 7,400 | 4,211 |
| Mean read quality | 8.61 | 9.04 |
| Median read quality | 8.65 | 9.00 |
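
For readers unfamiliar with the length statistics in the summary, here is a small sketch of how N50, mean and median read length are commonly computed from a list of read lengths. It is an illustration under stated assumptions, not the code of either tool: the function names are hypothetical, N50 is taken as the longest length L such that reads of length at least L hold at least half of all bases, and the median of an even number of reads is taken as the average of the two middle values (a tool may round or pick differently).

```rust
/// N50: walking from the longest read down, the length at which the
/// cumulative base count first reaches half of the total.
fn n50(lengths: &[u64]) -> Option<u64> {
    let total: u64 = lengths.iter().sum();
    let mut sorted = lengths.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // longest first
    let mut cumulative = 0u64;
    for len in sorted {
        cumulative += len;
        if cumulative * 2 >= total {
            return Some(len);
        }
    }
    None // empty input
}

/// Arithmetic mean of the read lengths.
fn mean(lengths: &[u64]) -> Option<f64> {
    if lengths.is_empty() {
        return None;
    }
    Some(lengths.iter().sum::<u64>() as f64 / lengths.len() as f64)
}

/// Median read length; averages the two middle values for an even count.
fn median(lengths: &[u64]) -> Option<f64> {
    if lengths.is_empty() {
        return None;
    }
    let mut sorted = lengths.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        Some((sorted[mid - 1] + sorted[mid]) as f64 / 2.0)
    } else {
        Some(sorted[mid] as f64)
    }
}

fn main() {
    let lengths = [2u64, 3, 4, 5, 6];
    println!("N50: {:?}", n50(&lengths));
    println!("mean: {:?}", mean(&lengths));
    println!("median: {:?}", median(&lengths));
}
```

Because N50 weights reads by their base count, removing all reads below a length cutoff while keeping the total base count fixed raises it, which is consistent with the higher N50 in the Filtlong column.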

I think either the filtering algorithm should be explained in more detail than "extended two-pass filtering analogous to Filtlong", or I should be able to set parameter combinations that get results closer to filtlong, or both.
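
One way to make the two outputs more comparable would be to apply the same minimum length cutoff to the reads before either tool's remaining filters run. The sketch below shows such a prefilter in isolation. It is a hypothetical illustration, not nanoq's or filtlong's implementation: `FastqRecord`, `parse_fastq` and `min_length_prefilter` are made-up names, the parser assumes strict four-line FASTQ records, and the cutoff in `main` is shrunk from 5,000 bp so the toy data fits on a line.

```rust
use std::io::{BufRead, BufReader, Read};

/// One FASTQ record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct FastqRecord {
    header: String, // header line without the leading '@'
    seq: String,
    qual: String,
}

/// Parse strict four-line FASTQ records (assumes no line-wrapped sequences).
/// Trailing lines that do not form a full record are ignored.
fn parse_fastq<R: Read>(input: R) -> Vec<FastqRecord> {
    let lines: Vec<String> = BufReader::new(input)
        .lines()
        .map(|line| line.expect("failed to read line"))
        .collect();
    lines
        .chunks_exact(4)
        .map(|rec| FastqRecord {
            header: rec[0]
                .strip_prefix('@')
                .unwrap_or(rec[0].as_str())
                .to_owned(),
            seq: rec[1].clone(),
            qual: rec[3].clone(),
        })
        .collect()
}

/// Keep only reads whose sequence is at least `min_len` bases long.
fn min_length_prefilter(records: Vec<FastqRecord>, min_len: usize) -> Vec<FastqRecord> {
    records
        .into_iter()
        .filter(|record| record.seq.len() >= min_len)
        .collect()
}

fn main() {
    let fastq = "@short\nACGT\n+\nIIII\n@long\nACGTACGT\n+\nIIIIIIII\n";
    // Toy cutoff of 5 bases; the cutoff discussed in this thread is 5,000.
    let kept = min_length_prefilter(parse_fastq(fastq.as_bytes()), 5);
    for record in &kept {
        println!("@{}\n{}\n+\n{}", record.header, record.seq, record.qual);
    }
}
```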

luizirber commented 3 years ago

(referencing back to https://github.com/openjournals/joss-reviews/issues/2991 so it shows up over there)

esteinig commented 3 years ago

@bovee thanks a ton, yes, the length filter seems to be the default in filtlong. I thought it might be clearer to separate those filters into two individual steps, but I can see how they should probably be implemented in one.

I will work on this tonight on a new branch. I appreciate the review over the weekend :)

esteinig commented 2 years ago

@bovee thanks so much for your patience with this, it has been a bit of a wild year.

I have removed the more complex (two-pass) filters to keep in line with the design philosophy of speed and minimal quality control. Not sure about your experience, but I have never really used the more interesting filtlong filters, and they seem to me to be geared more towards research than towards speed and stability in production.

Rewrote the code base (it was a bit of a mess before) and added better tests, documentation, benchmarks and continuous integration. It seems like needletail is around twice as fast as rust-bio-tools sequence-stats in fast mode, which ignores the quality scores :tada: Also confirmed that the output matches between all programs in the benchmarks --> #17 and #18

esteinig commented 2 years ago

Closing this for now; it is addressed in the new version and the latest paper iteration.