Open sbresnahan opened 1 year ago
Hi Sean, if I'm understanding, you are looking for an LCA type algorithm to be implemented? In my experience, I've never been super happy with the level of added confidence provided by such approaches given the added complexity. I wrote this code to be pretty straightforward with minimal bells and whistles, relying predominately on percent identity at high query coverage. I've played around with other features, such as distance between top-hit and second-hit taxon (https://github.com/RTRichar/glmmSeq) but getting worthwhile predictive power out of these has never been as straightforward as it would seem in my hands.
It doesn't need to be that fancy - you could just set --maxhits
to e.g., 50, pull taxonomy information for the hits, and filter reads with matches above e.g., 98% identity to multiple genera. You're right though, I checked through a few dozen low-throughput libraries and no more than 2% of the reads in any sample matched multiple genera at that threshold; not as much of an issue as I had assumed it would be.
If I understand the vsearch documentation correctly,
vsearch --usesearch_global --maxhits 1
will only output the top hit (highest sequence identity) for each read. Therefore, any reads with multiple hits at the same % sequence identity will be assigned to one of those sequences at random. Given the downstream application of these read assignments, I think this setting is problematic. Users may wish to filter reads with multiple equally-good hits. Can you add a step between alignment with vsearch and 3_VsearchToMetaxa2.py (or an option w/in the python script) to filter multi-mapping reads?