RTRichar / SimpleSequenceClassification

Code and commands used for metagenetic sequence classification using Vsearch semi-global alignment
0 stars 0 forks source link

Multi-mapping reads are an issue for metabarcoding #1

Open sbresnahan opened 1 year ago

sbresnahan commented 1 year ago

If I understand the vsearch documentation correctly, vsearch --usesearch_global --maxhits 1 will only output the top hit (highest sequence identity) for each read. Therefore, any reads with multiple hits at the same % sequence identity will be assigned to one of those sequences at random. Given the downstream application of these read assignments, I think this setting is problematic. Users may wish to filter reads with multiple equally-good hits. Can you add a step between alignment with vsearch and 3_VsearchToMetaxa2.py (or an option w/in the python script) to filter multi-mapping reads?

RTRichar commented 1 year ago

Hi Sean, if I'm understanding, you are looking for an LCA type algorithm to be implemented? In my experience, I've never been super happy with the level of added confidence provided by such approaches given the added complexity. I wrote this code to be pretty straightforward with minimal bells and whistles, relying predominately on percent identity at high query coverage. I've played around with other features, such as distance between top-hit and second-hit taxon (https://github.com/RTRichar/glmmSeq) but getting worthwhile predictive power out of these has never been as straightforward as it would seem in my hands.

sbresnahan commented 1 year ago

It doesn't need to be that fancy - you could just set --maxhits to e.g., 50, pull taxonomy information for the hits, and filter reads with matches above e.g., 98% identity to multiple genera. You're right though, I checked through a few dozen low-throughput libraries and no more than 2% of the reads in any sample matched multiple genera at that threshold; not as much of an issue as I had assumed it would be.

p2