biocore / deblur

Deblur is a greedy deconvolution algorithm based on known read error profiles.
BSD 3-Clause "New" or "Revised" License
92 stars 41 forks source link

CRITICAL: similarity and coverage thresholds are not used in positive mode #139

Closed wasade closed 7 years ago

wasade commented 7 years ago

Despite the docs, they are not used.

Even had they been, a coverage threshold of 30% is massively permissive and allows obviously non-reference sequences to filter through.

Issuing a PR in a moment.

amnona commented 7 years ago

Yes, this is a remanent of the old threshold we used and the doc was not updated. Nice catch :) The current version uses the values supplied if it is negative mode, and instead just looks at the e-score if positive filtering (it is permissive since we are looking for any 16S sequences when comparing to gg 88%). Should update the docs. Didn't see the PR.

On Tue, Feb 21, 2017 at 6:28 PM, Daniel McDonald notifications@github.com wrote:

Despite the docs https://github.com/biocore/deblur/blob/master/deblur/workflow.py#L391-L400, they are not used https://github.com/biocore/deblur/blob/master/deblur/workflow.py#L464-L466 .

Even had they been, a coverage threshold of 30% is massively permissive.

Issuing a PR in a moment.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/biocore/deblur/issues/139, or mute the thread https://github.com/notifications/unsubscribe-auth/AFkA8tOr2BROOi-kg5XCxCBm88FC8Itaks5re50zgaJpZM4MIIL_ .

wasade commented 7 years ago

It is overly permissive. This is a bug not a feature.

wasade commented 7 years ago

I haven't issued a PR yet as I'm still evaluating it

amnona commented 7 years ago

Note the 30% is not used, as we are using e-value<=10 instead, as recommended by @ekopylova

wasade commented 7 years ago

Yes, I know.

The e-value filter alone is insufficient.

wasade commented 7 years ago

To determine rational parameters, did a quick analysis using Greengenes 88% OTUs (contains 10,544 records) to assess what coverage and similarity thresholds make sense for recruitment of EMP V4 reads. The strategy was as follows:

Analysis results are here.

At each trim level, a small number of reads fail to recruit which appears to decrease with query length. Information on what record, the number of times it failed to recruit, the number of times it was used as a query, and its lineage information according to GG 13_8 is included. Some are surprising, such as 2278795, since it is part of a well sampled lineage suggesting that it may be an artifactual sequence in the database (as the database is not perfect). Others are not so surprising, such as the archaeon 1129716.

Interestingly, we see that the number of mishits decreases with read length.

Examining the coverage and similarity thresholds for those records that do recruit, we see that a minimum % ID of 65% seems to be the lowest observed, with the vast bulk of the queries having much higher identity.

Coverage is a bit more difficult to resolve especially as the minimum coverage is very low; a histogram can be found here. If we look at all queries, over all iterations, we had 0.2% fall below 50% coverage. On closer inspection, one such record 3779572, it turned out the original record and query are packed full of Ns -- a definite surprise as that record should not exist in Greengenes. We can also see that many of the records with an low coverage below 50% are likely to be artifacts of the database as many are from well sampled lineages (e.g., the one with lowest mean coverage) [n.b. MEAN-COVERAGE here is mean coverage for the queries below 50% coverage]. However, there are some notable examples of candidate phyla being missed, so even with a 50% coverage requirement, there is still a potential for false positives. However, across all cases, there appears to only be a few instances where a query was always below coverage (e.g., 3779572 at 100nt and an artifact as noted above, 833140, etc) suggesting that while some lineages may be sensitive, a 50% threshold is probably a reasonable balance especially when paired with a permissive sequence identify threshold.

It does appear that the default e-value of 1 is fine, and could probably be made more stringent. Note: SortMeRNA's default is 1 too so we do not expect values > 1 in that plot.

Given the above, it seems that a 65% sequence identity threshold and a 50% coverage threshold for positive filtering is probably reasonable.

antgonza commented 7 years ago

Thanks for checking that, I think the 65% sequence identity and a 50% coverage thresholds make sense based on this data.