NVIDIA-Genomics-Research / GenomeWorks

SDK for GPU accelerated genome assembly and analysis
https://clara-parabricks.github.io/GenomeWorks/
Apache License 2.0

[cudamapper] Filtering parameters are too stringent for very small read sets #504

Closed edawson closed 4 years ago

edawson commented 4 years ago

While testing cases related to #503, it became apparent that for very small read sets (e.g., two reads) the default value of the filtering parameter -F is too stringent. Values of -F smaller than approximately 0.001 produce no overlaps.

The right way to fix this is to properly handle repetitive minimizers. We could do this with a fixed mask, a weighting function like the one used in WinnowMap, or by rearchitecting the sketch handling in cudamapper to function like MashMap. As a temporary fix, it might make sense to use a filtering parameter value scaled by the number of reads in the input data (probably scaling as 1 / (number of reads)^2, with a minimum of 2e-4).
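
A rough sketch of the temporary fix, to make the scaling concrete. This is only one reading of the proposal above, not code from cudamapper: the function name `scaled_filtering_parameter` and its arguments are hypothetical, and it assumes the value passed to -F would be exactly 1 / (number of reads)^2, clamped from below at 2e-4.

```python
def scaled_filtering_parameter(num_reads: int, floor: float = 2e-4) -> float:
    """Return a filtering parameter that shrinks as 1 / num_reads^2.

    Hypothetical helper (not part of GenomeWorks). The 1 / n^2 form and
    the 2e-4 floor are taken from the proposal in this issue; everything
    else is an assumption made for illustration.
    """
    if num_reads < 1:
        raise ValueError("num_reads must be at least 1")
    # Small read sets get a large (permissive) value; large read sets
    # fall back to the fixed floor.
    return max(floor, 1.0 / (num_reads ** 2))


if __name__ == "__main__":
    for n in (2, 10, 100, 10000):
        print(n, scaled_filtering_parameter(n))
```

Under these assumptions, two reads would give 0.25, which is well above the roughly 0.001 threshold below which no overlaps were produced, and any input with more than about 70 reads would end up at the 2e-4 floor.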