chanzuckerberg / idseq-workflows

Portable WDL workflows for IDseq production pipelines
https://idseq.net/
MIT License

add test cases for maximum e-value filter on alignment results #7

Open katrinakalantar opened 4 years ago

katrinakalantar commented 4 years ago

Assertion: The maximum e-value for alignments in IDseq is 1.

Implementation Details: The maximum e-value threshold filter is applied in two different locations within the code base:

We expect that there may be alignments with e-values > 1 in the initial alignment files (gsnap.m8, rapsearch2.m8, gsnap.blast.m8, rapsearch2.blast.m8). The filter is then applied to the raw .m8 results when parsing for the top hits. There should never be e-values > 1 in the following files:

This was implemented as part of https://github.com/chanzuckerberg/idseq-dag/pull/309
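The filtering step described above could be sketched roughly as follows. This is only an illustration, not the code from the linked pull request: the function name `filter_m8_lines`, the constant names, and the assumption that the e-value sits in the 11th tab-separated column as a plain number (the standard 12-column BLAST tabular layout) are all hypothetical, and the sample rows are synthetic.

```python
MAX_EVALUE = 1.0    # threshold stated in this issue
EVALUE_COLUMN = 10  # assumption: 0-based index of the e-value in 12-column BLAST tabular output


def filter_m8_lines(lines, max_evalue=MAX_EVALUE):
    """Yield only the alignment rows whose e-value is <= max_evalue (hypothetical helper)."""
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = stripped.split("\t")
        if float(fields[EVALUE_COLUMN]) <= max_evalue:
            yield stripped


# Synthetic rows for illustration only; these are not taken from any real sample.
raw_rows = [
    "read1\tNC_000001\t98.0\t100\t2\t0\t1\t100\t5\t104\t1e-30\t180.0",
    "read2\tNC_000002\t80.0\t30\t6\t0\t1\t30\t9\t38\t4.2\t25.0",
]
kept_rows = list(filter_m8_lines(raw_rows))
```

Here `read2` is dropped because its e-value of 4.2 exceeds the threshold, while `read1` is kept.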

Test Sample: This was tested on staging using benchmark sample UnAmbiguouslyMapped_ds.gut. In particular, staging sample ID 19379 was run prior to the fix, and staging sample ID 19361 was run after the fix.

For example, in sample 19361, gsnap.m8 has 32 rows with e-value > 1, but gsnap.deduped.m8 has zero. rapsearch2.m8 has 45 rows with e-value > 1, but rapsearch2.deduped.m8 has zero. rapsearch2.blast.m8 has 5172 rows with e-value > 1, but rapsearch2.blast.top.m8 has zero.
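A test case for this assertion might count the rows above the threshold in each output file and require that count to be zero for the filtered files. The sketch below is a minimal version under stated assumptions: `count_rows_above` is a hypothetical helper, the e-value is assumed to be a plain number in the 11th tab-separated column, and the rows are synthetic stand-ins rather than real pipeline output.

```python
def count_rows_above(lines, max_evalue=1.0, evalue_column=10):
    """Count alignment rows whose e-value is strictly greater than max_evalue.

    Hypothetical helper; evalue_column=10 assumes the 12-column BLAST tabular layout.
    """
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= evalue_column:
            continue  # skip blank or malformed lines
        if float(fields[evalue_column]) > max_evalue:
            count += 1
    return count


# Synthetic stand-ins for a raw file and its filtered counterpart.
raw_m8 = [
    "r1\tacc1\t99.0\t90\t1\t0\t1\t90\t1\t90\t1e-20\t150.0",
    "r2\tacc2\t75.0\t25\t6\t0\t1\t25\t1\t25\t1.0\t22.0",  # exactly 1: allowed
    "r3\tacc3\t70.0\t20\t6\t0\t1\t20\t1\t20\t7.5\t18.0",  # above the maximum
]
filtered_m8 = raw_m8[:2]

raw_count = count_rows_above(raw_m8)
filtered_count = count_rows_above(filtered_m8)
```

Because an open file is an iterable of lines, the same helper could be applied to real outputs, for instance by opening `gsnap.deduped.m8` and asserting that the returned count is zero.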