DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Centrifuge multiple matches of the same read to the same Tax ID #98

Open asulit08 opened 6 years ago

asulit08 commented 6 years ago

I've been reviewing the classification (-S) file output of the centrifuge and there are times I see something like this:

qreadID | seqID | taxID | score | 2ndBestScore | hitLength | queryLength | numMatches

7001326F:118:CBRP4ANXX:4:2210:3376:74453 | kraken:taxid|1352 | 1352 | 1225 | 1225 | 50 | 156 | 2
7001326F:118:CBRP4ANXX:4:2210:3376:74453 | kraken:taxid|1352 | 1352 | 1225 | 1225 | 50 | 156 | 2

It would seem that the same read matched the same taxon twice, with the same scores, hit length, and query length. This would be a problem for the counts of unique reads, wouldn't it?

mourisl commented 6 years ago

Can you share the sequence of that read?

jsh58 commented 6 years ago

@asulit08 This occurs when multiple reference sequences of the same taxon share a subsequence (or when a single reference sequence has a repeat). A read that matches multiple references/locations is not considered unique, so it does affect the count.

mourisl commented 6 years ago

In the case where a single reference sequence has a repeat, Centrifuge will report it only once. I agree that this is more likely caused by duplicated reference sequences in the database.
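To see how many reads are affected, a rough sketch like the following can list reads that appear on several rows but always with the same taxID. It assumes the -S output is tab-separated with a header row and that the read ID and taxID are the first and third columns, as in the header quoted above; the function name is hypothetical and not part of Centrifuge.

```python
import csv
import io
from collections import defaultdict


def multi_row_single_taxon_reads(tsv_text):
    """Return read IDs reported on several rows that all carry the same taxID."""
    taxa = defaultdict(set)   # read ID -> distinct taxIDs seen for that read
    rows = defaultdict(int)   # read ID -> number of rows for that read
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    for fields in reader:
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        read_id, tax_id = fields[0], fields[2]  # assumed column positions
        taxa[read_id].add(tax_id)
        rows[read_id] += 1
    return sorted(r for r in taxa if rows[r] > 1 and len(taxa[r]) == 1)
```

This only flags such reads for inspection. Whether they should be counted as unique is a separate decision.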

mourisl commented 6 years ago

@asulit08 Have you been able to resolve this issue?

asulit08 commented 6 years ago

Well, I built the database by downloading sequences from the RefSeq databases as per the instructions in the manual, so if there are multiple reference sequences there that are tagged to one taxon, the read really won't be classified as unique, will it? Furthermore, if there are sequences whose tax IDs are not found in the tree (i.e. those with warnings during the database build), then the read also won't be classified as unique. Is there a workaround for that?

mourisl commented 6 years ago

1) Yes. If a read gets assigned at the reference sequence level and numMatches is 1, it is unique, even if there are multiple reference sequences with the same taxonomy ID.

2) If you find tax IDs or accession IDs (seqID, taken from the header of the FASTA file) that do not exist, you can add those entries to nodes.dmp, names.dmp, and the accession-ID-to-tax-ID conversion table file, and then run centrifuge-build directly.
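As an illustration of point 2, here is a minimal sketch that formats the lines to append to each file. It assumes the standard NCBI dump layout (fields separated by tab, pipe, tab, with a trailing tab and pipe) and a two-column tab-separated conversion table. Only the first three nodes.dmp columns are written here; real rows carry more columns, and whether a shortened row is accepted has not been confirmed in this thread, so copy the remaining columns from an existing row if needed. The function names and the example IDs are hypothetical.

```python
def nodes_dmp_line(tax_id, parent_tax_id, rank):
    """Format a shortened nodes.dmp row: tax ID, parent tax ID, rank."""
    return "\t|\t".join([str(tax_id), str(parent_tax_id), rank]) + "\t|\n"


def names_dmp_line(tax_id, name):
    """Format a names.dmp row: tax ID, name, unique name (empty), name class."""
    return "\t|\t".join([str(tax_id), name, "", "scientific name"]) + "\t|\n"


def conversion_table_line(seq_id, tax_id):
    """Format one row of the sequence-ID-to-tax-ID table (assumed tab-separated)."""
    return f"{seq_id}\t{tax_id}\n"


if __name__ == "__main__":
    # Hypothetical entry: tax ID 9999999 placed under parent 1352 (the taxon from this thread).
    print(nodes_dmp_line(9999999, 1352, "species"), end="")
    print(names_dmp_line(9999999, "Example organism"), end="")
    print(conversion_table_line("EXAMPLE_ACC.1", 9999999), end="")
```

Appending the generated lines to copies of the files, rather than to the originals, makes it easy to revert if the build still reports warnings.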

asulit08 commented 6 years ago

From my results, I get one read that is matched to the exact same taxon, hit, and lengths more than once, and numMatches reflects that number. So even if that read is matched to just one taxon, it is flagged as not unique?

And thank you for the advice on centrifuge-build.

mourisl commented 6 years ago

Is this the read you gave at the beginning of this issue post? It seems there are multiple reference genomes with the header ">kraken:taxid...." in the FASTA files, so centrifuge-build failed to parse the headers and ended up with multiple sequences sharing the accession ID "kraken:taxid". This accession ID may correspond to some taxonomy ID that I'm not sure of. So I would still regard it as not a unique hit.
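One way to check for this before building is to look for sequence IDs that occur more than once across the FASTA headers. The sketch below assumes the ID is the first whitespace-delimited token after ">"; how centrifuge-build actually parses headers is not shown in this thread, so treat that rule as an assumption. The function name is hypothetical.

```python
from collections import Counter


def repeated_header_ids(fasta_lines):
    """Return {sequence ID: count} for IDs that appear in more than one header."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if tokens:
            counts[tokens[0]] += 1  # assumed ID: first token of the header
    return {seq_id: n for seq_id, n in counts.items() if n > 1}
```

For a file on disk, the open file object can be passed directly, for example `repeated_header_ids(open("library.fna"))`, where `library.fna` is a placeholder path.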