DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0
240 stars 73 forks source link

Discrepancies in centrifuge output #96

Closed asulit08 closed 6 years ago

asulit08 commented 6 years ago

I am having some issues with the output of my centrifuge runs.
1) For instance in my kreport, if you add the number of unclassified reads and the root, they will not add up to the total number of input reads I used. 2) The numbers in the report.tsv files and the kreport files do not match as well – for example, for tax ID 562 (Escherichia coli) in kreport I’d have 20139 and 19030 reads but in report.tsv I’d have 26712 and 14968 reads for total number, unique number respectively. 3) There are some taxIDs that are in kreport which I cannot find in the report.tsv files such as taxID 83334 (Escherichia coli O157:H7). Notably, I am able to find this in the main classification output file. 4) I have several seqIDs with taxIDs as “0”, seqIDs with ‘kACTACACCAGGGGGT’, and seqID’s with no rank. This may be a result of discrepancies between the sequences, nodes and names dmp I used in building the indices?

I am hoping someone could give light to these problems. Thank you

asulit08 commented 6 years ago

I've also just checked. I am getting for instance "Warning: taxomony id doesn't exists for kraken:taxid|49899" even though I can find it in my seq2taxid map file:

kraken:taxid|49899 49899

GabrieleNocchi commented 6 years ago

I just run into a similar problem. I have used 2 files to test centrifuge. These 2 files are the miseq_accuracy.fa and hiseq_accuracy.fa that have been used previously to test kraken. They contain 10,000 sequences each. Available at: http://ccb.jhu.edu/software/kraken/dl/accuracy.tgz

When I run centrifuge (with the setting -k 1, to report only 1 hit per sequence) some sequences are not analyzed. If you count the sequences in the output file (produced with the command -S) 2 reads are not analyzed from hiseq_accuracy.fa and 9 reads are not analyzed from miseq_accuracy.fa.

You can simply check it by counting the number of lines in the -S output file (keep in mind that in the -S output file one line is the column name and one empty line at the end). This file also contains the unclassified reads (with TaxID 0) so the sequences should add up to the initial number.

I also had similar ID issues to asulit08 when importing the data into krona to visualize it.

asulit08 commented 6 years ago

So far, I've tracked down that the classification and report.tsv files coincide with one another (too bad report.tsv doesn't output unclassified read numbers). For the kreport output, there are really some counts that I cannot understand where they're coming from. For example, for the "root" taxon, the number of unique reads mapping to it indicate '0' but if you take a look at the classification (-S) file, there are reads with numMatches indicating "1" but with taxID "1" (which is the root tax ID)...

I hope someone can clarify this?

(@GabrieleNocchi thanks for the tip on -k 1, it made comparisons easier)

mourisl commented 6 years ago

@GabrieleNocchi What do you mean by not analysis? Do you mean the read is not in the raw output of centrifuge at all? Which version of Centrifuge did you use?

@asulit08 The seqId of kACTACACCAGGGGGT looks very strange. Most likely the database you downloaded is damaged. Can you try the database from our webpage just to make sure the issue is not from the sequences?

asulit08 commented 6 years ago

Thanks for the reply. This was from a database I built that included complete genomes, chromosomes, and scaffolds from bacteria, fungi, archaea, and viruses (it was big). I didn't find it when the database I use is only from complete genomes and chromosomes. I still do not understand how the counts in the kreport file come about though. The unclassified indicated in the kreport file does not match the taxID "0" from the -S classification file (however it does match when I set -k to 1). The total number of classified reads in kreport also does not match the number of reads in report.tsv and the classification file

mourisl commented 6 years ago

@asulit08 For each read in the input, it can be classified to x (up to k) species. So in kreport, in the x species, this read will contribute 1/x each. Those number will be round up to integers and this result in inconsistency number of reads.
The numbers of report.tsv are explained in https://ccb.jhu.edu/software/centrifuge/manual.shtml#centrifuge-classification-output

Does the taxID "0" in your classification file contain entries other than "unclassified"?

asulit08 commented 6 years ago

Yes, I get kraken:taxid| as well as no rank

How are unclassified reads counted as well? I get reads that have marked unclassified but also matched to other taxa

mourisl commented 6 years ago

If it is unclassified, then it won't have any more classification information. I guess then you might have duplicated read ids in your sequence file. Did you run the paired-end file with -1,-2 option? Otherwise, the taxonomy tree might have some problems.

asulit08 commented 6 years ago

yes I ran the paired-end file with -1, -2 option...I did add singletons but when I count all unique reads they match the sum of the number of pairs and the singletons I used as input so I'm doubtful this is the problem

what I meant was these reads were identified with a seqID (e.g. kraken:taxid| ) but with a TaxID corresponding to "0" so they weren't labeled as unclassified but that TaxID was 0... then some of them would have matches to a classified taxa

mourisl commented 6 years ago

The seqID of kraken:taxid| and the kACGT... stuff are problematic and could not find a place in the taxonomy tree (you saw the warning when building the index). So Centrifuge just put a taxonomy id 0 there. In other words, they are classified, but due to the problems of the database, Centrifuge could not find the associated taxonomy id.

asulit08 commented 6 years ago

yes, so will they be counted among the classified or will they be counted towards unclassified reads? I'm trying to reconcile the numbers I have to add up to the total input I used and I can't get them (unless I am using -k 1)

mourisl commented 6 years ago

I'm not sure about this. If you saw those "kraken:taxid|" in the centrifuge_report.tsv, then the report file does not consider them as unclassified. But I think centrifuge-kreport will regard them as unclassified. You can check those.

GabrieleNocchi commented 6 years ago

@mourisl I use 1.0.3-beta.

Yes I mean they are not in the centrifuge output at all. I think they are skipped or I do not know what happens. I run centrifuge with:

centrifuge -x (index) -k 1 -f -U (fasta file) --report-file (report file name) -S (classification output file name)

The -S file has a line for each sequence in your input file with classification results. It also include unclassified sequences which are classed as seqID= unclassified taxID= 0; if I count the lines of this file (excluding of course the first line which has the cokumns names and the last line which is empty) the number of sequences do not reflect the one I sent in the input file. I also wrote a perl script to sum the column "numReads" in the report file and it reflect the -S file. some sequences are missing. (2 for hiseq and 9 for miseq).

I use the index downloadable from the centrifuge website: p+h+v

GabrieleNocchi commented 6 years ago

@mourisl I have noticed that you closed an older issue pointing out that there is a non beta version of 1.0.3. Where can I download that? I can not see it a: ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads

Also by downloading via conda with: conda install -c bioconda centrifuge I still get the beta version.

mourisl commented 6 years ago

@GabrieleNocchi The newest version now only can be downnloaded from the github. And you have to compile the source code.

jsh58 commented 6 years ago

@GabrieleNocchi

Yes I mean they are not in the centrifuge output at all. I think they are skipped or I do not know what happens.

The reads probably have a large number of ambiguous bases (Ns). Centrifuge does not analyze those.

GabrieleNocchi commented 6 years ago

@jsh58 Yes thank you. I just double checked this using a Biopython script called sequence_cleaner.py ; I think centrifuge automatically removes sequences which have more than 10% N. Using sequence_cleaner.py to filter the hiseq_accuracy.fa with threshold set to 10% N, 2 sequences are filtered out. Exactly the number of sequences I missed in my centrifuge output.

jsh58 commented 6 years ago

@asulit08

For instance in my kreport, if you add the number of unclassified reads and the root, they will not add up to the total number of input reads I used.

This due to a bug in the centrifuge-kreport script that was fixed as part of this pull request.

(I didn't like the second part of that PR, so I have my own debugged version here.)

asulit08 commented 6 years ago

Thank you all for your answers. I'm still having trouble reconciling the kreport to my other classification files (-S and report.tsv), although I've been able to reconcile the -S and report.tsv files already. I think the problems I'm encountering are due to the warning errors given by the database build of centrifuge. I wrote a script getting all reads with tax ID = 0 (includes unclassified and those with unrecognized seq IDs), then removing any read with multiple matches to other taxa (i.e. with classification but has multiple matches to sequences whose tax ID or seq ID are unrecognized) and getting all unique reads from these. I also got unique reads that were classified. Adding the two sets sum up to the number of reads of the input so that answers the questions for the report.tsv files.

However, I still do get problems with my kreport as I cannot identify how it gets the numbers it outputs. For instance, the number of unclassified reads given are not the same as the number I get from the -S classification report. The root row still does not give any counts for the unique reads column although the classification file reports reads with number of matches = 1 that matched only to root. I tried @jsh58 's script and I got the same results as before (I'm using the latest centrifuge version in the github repository). There are also some with names (e.g. name =12, tax ID=12) that is recognized by report.tsv but isn't recognized by kreport. Granted when I searched this tax ID, it has been merged with another taxID in NCBI taxonomy. I'm thinking these reads not being counted may be one reason for the discrepancies with the other report files? Add to that the rounding off happening with the read counts with multiple matches?

I've also tried running centrifuge with option -k1 and the kreports here are much better. I am able to obtain the right number of unclassified reads and I can also pinpoint the dicrepancies with report.tsv are due to the 1) the reads with recognized tax IDs but no names (which are not found in kreport) and 2) the number of unique reads matching to root (tax ID =1)

mourisl commented 6 years ago

We've made a change to centrifuge-kreport, such that it by default use the lowest common ancestor of the assignments if a read is assigned to different taxonomy ids. So the behavior should be like using (-k 1).

Hope this change woule make the result more interpretable.