DiltheyLab / MetaMaps

Long-read metagenomic analysis

Misclassification when close relatives are present #63

Open ana-re opened 2 years ago

ana-re commented 2 years ago

Hi @AlexanderDilthey

We spotted an issue with misclassification, particularly within the Brucella genus. We generated FASTQ files containing simulated ONT reads for 4 Brucella species and analysed them with MetaMaps, using a genus-level database constructed from all available RefSeq genomes for all Brucella species.

The results looked great when MetaMaps was run on FASTQ files containing just one species: the percentage of correctly classified reads was 100% for 3 of the species and 99.95% for the fourth. However, when we concatenated the 4 FASTQ files so that the input file contained all 4 of our Brucella species, the percentage of correctly classified reads dropped to as low as 1.18% for one of the species, with 39.9%, 46.93%, and 99.94% for the others.

Could you please investigate this and let us know how we can improve the classification in our analysis pipeline, which incorporates MetaMaps?

AlexanderDilthey commented 1 year ago

Hi @ana-re,

My assumption would be that, when running MetaMaps on the concatenated input, the misclassified reads end up classified against one of the other 3 (in this case incorrect) species?
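One way to check this is to tabulate, for each true species, which species its reads were assigned to. The following is only a minimal sketch: it assumes you can already produce one (true species, assigned species) pair per read, for example by tracking which simulated FASTQ each read came from. The function names, the `species_A`-style labels, and the toy data are hypothetical, and parsing the MetaMaps output files is deliberately left out.

```python
from collections import Counter, defaultdict


def confusion_table(assignments):
    """Count reads per (true species, assigned species) combination.

    assignments: iterable of (true_species, assigned_species) pairs,
    one pair per read.
    """
    table = defaultdict(Counter)
    for true_species, assigned_species in assignments:
        table[true_species][assigned_species] += 1
    return table


def percent_correct(table):
    """Percentage of reads per true species assigned to that same species."""
    return {
        true_species: 100.0 * counts[true_species] / sum(counts.values())
        for true_species, counts in table.items()
    }


# Made-up toy data for illustration only, not results from this issue.
toy_assignments = [
    ("species_A", "species_A"),
    ("species_A", "species_B"),
    ("species_B", "species_B"),
    ("species_C", "species_B"),
]

table = confusion_table(toy_assignments)
for true_species, counts in table.items():
    print(true_species, dict(counts))
print(percent_correct(table))
```

If most off-diagonal counts land in the other Brucella species from the same input, that would be consistent with the assumption above.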

I assume the issues you observe are related to sequence homologies between the Brucella genomes. For the initial mapping step, you could try to reduce the window size (parameter -w) or the k-mer size (parameter -k), which may increase the sensitivity of the mapping with respect to small-scale differences between the input genomes.
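As a rough intuition for why the k-mer size interacts with small differences, a single substitution changes at most k overlapping k-mers. The toy sketch below counts this on a random sequence. It is not MetaMaps' mapping algorithm, and whether this effect is what matters for the Brucella case is an assumption; the function names and the random test sequence are hypothetical.

```python
import random


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmers_lost_to_snp(seq, pos, k):
    """Number of distinct k-mers of seq that are absent after substituting the base at pos."""
    replacement = "A" if seq[pos] != "A" else "C"
    variant = seq[:pos] + replacement + seq[pos + 1:]
    return len(kmers(seq, k) - kmers(variant, k))


# Random toy sequence with a fixed seed; not a real genome.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(2000))

for k in (8, 12, 16, 20):
    total = len(kmers(genome, k))
    lost = kmers_lost_to_snp(genome, 1000, k)
    print(f"k={k}: {lost} of {total} distinct k-mers change (at most {k})")
```

Treat this as a way to think about the parameter rather than a prediction of how classification accuracy will respond; the effect of changing -k or -w on real data needs to be measured.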

Best wishes

Alex