DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Why is my abundance estimation value zero for all classifications? How can this be turned on? #158

Open harisankar991 opened 5 years ago

chilltrout commented 5 years ago

Also very intrigued by this. I went over the docs and I don't understand how to turn it on. I'm assuming abundance is always on, because I have gotten numbers populated with a very high coverage genome on PacBio data, but never on my nanopore data.

harisankarsadasivan commented 5 years ago

Yes, makes sense. I faced the same with nanopore data, minion v9.4.1.

shashibioinfo commented 5 years ago

Hi, I have the same issue. I analyzed my MinION nanopore data with Centrifuge, and the output files show abundance as zero.

How can I resolve this? Any suggestions would be appreciated.

Thank you

shashibioinfo commented 5 years ago

> Yes, makes sense. I faced the same with nanopore data, minion v9.4.1.

I have the same issue. If you have solved this, could you please help me resolve it? Thank you.

ExplodingCabbage commented 5 years ago

I've seen the same thing. One species has >80% of all reads assigned to it, according to the output TSV, yet its abundance is still listed as 0.0, just like every other row.

guokai8 commented 4 years ago

I have the same issue. I don't know why.

mourisl commented 4 years ago

Can you check whether these are unique assignments or not? Thanks.

guokai8 commented 4 years ago

Yes. I am sure there are unique reads here.

Aiswarya-prasad commented 4 years ago

Has anyone been able to resolve this? I am having the same issue with nanopore reads.

mourisl commented 4 years ago

I'm checking on this issue. The abundance estimation is on by default. Are any of the reads assigned at the subspecies (leaf) level? Can you show me a few lines of the report file? Thanks.

jmaricb commented 4 years ago

Hi, are there any updates with this issue? I seem to be getting zero abundances for every species. Here are the commands I have been using:

centrifuge \
  -x data/classifiers-DB/centrifuge/p_compressed+h+v \
  -p 8 \
  -f data/reads-fastq/ONT/communities-synthetic/integration_dataset.fasta \
  -S out \
  --report-file report

After that I also used this command to get a Kraken-style report:

centrifuge-kreport \
  -x data/classifiers-DB/centrifuge/p_compressed+h+v \
  out > kraken_report

You can download the output here: https://www.dropbox.com/s/a5j415ixyts9lox/Archive.zip?dl=0

You can see that there are species in the kraken_report file that have high abundance, and also in the report file you can see that there are species with a high number of unique reads, but the abundance is still zero for all the rows.
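To make that symptom easy to check, here is a minimal sketch that scans a Centrifuge report for rows that have unique reads but a zero abundance. It assumes the report is tab-separated with the column names seen in reports posted in this thread (name, taxID, taxRank, genomeSize, numReads, numUniqueReads, abundance). The function name is made up for illustration and is not part of Centrifuge.

```python
import csv
import io


def rows_with_reads_but_zero_abundance(report_text):
    """Return (name, numUniqueReads) for rows with unique reads but abundance 0."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    flagged = []
    for row in reader:
        unique_reads = int(row["numUniqueReads"])
        if unique_reads > 0 and float(row["abundance"]) == 0.0:
            flagged.append((row["name"], unique_reads))
    return flagged


# Two rows in the report format, taken from a sample posted in this thread.
sample_report = (
    "name\ttaxID\ttaxRank\tgenomeSize\tnumReads\tnumUniqueReads\tabundance\n"
    "Myxococcales\t29\torder\t9697933\t1\t0\t0.0\n"
    "Cystobacter fuscus\t43\tspecies\t12349744\t1\t1\t0.0\n"
)

flagged_rows = rows_with_reads_but_zero_abundance(sample_report)
print(flagged_rows)
```

For a real run, the report text could be read from the file passed to --report-file instead of the inline sample.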

mourisl commented 4 years ago

Thanks for sharing the files. I'll look into this.

mourisl commented 4 years ago

You are using the p_compressed+h+v index, however the seqId column from the output is not in the form of cid|XXX from the compression. I guess the index you are using is actually p+h+v. Could you please check whether the index is correct?

jmaricb commented 4 years ago

Hi, thank you for the response. Sorry, I sent you data that I classified with a custom database that I built from Bacteria and Archaea genomes. The commands I used to build the database are:

centrifuge-download -o taxonomy taxonomy
centrifuge-download -o library -m -d "archaea,bacteria" refseq > seqid2taxid.map
cat library/*/*.fna > input-sequences.fna
centrifuge-build -p 10 --conversion-table seqid2taxid.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp input-sequences.fna abv

I think that succeeded, because as you can see the kraken report gives reasonable classification.

I am now sending new data: https://www.dropbox.com/s/cefkjfz0a4kq1ig/Data.zip?dl=0

Here is the dataset I've been using. It's quite large so I am sending it separately: https://www.dropbox.com/s/jeuaho0slc45p9w/silico.fastq.zip?dl=0

jmaricb commented 4 years ago

@mourisl Hi, one more thing. I don't know if it helps, but my integration.fastq dataset also works with the custom index that I created, so the problem might not be in the indexes.

Here are the results of the classification with the custom index: https://www.dropbox.com/s/iojc2br7q17ru1m/integration_custom.zip?dl=0

jmaricb commented 4 years ago

@mourisl Hi, could you help me calculate the abundances myself? I would like to do that, but in the Centrifuge output, reads are classified to multiple species. How can I determine which species each read should be assigned to? Is there a way for Centrifuge to pick a single species for each read?

Thank You.

mourisl commented 4 years ago

@jmaricb You can directly use the abundance from kreport. For reads with multiple assignments, the count is added to their lowest common ancestor in the taxonomy tree. You can also use "--no-lca" in kreport, which adds a fraction of the count to each assigned strain, according to the number of assignments.

jmaricb commented 4 years ago

@mourisl Sorry for bothering you, but just one more question. If I have a read that is mapped to three tax IDs, like this:

SRR5891470.22869 species 106654 676 676 41 2302 3
SRR5891470.22869 species 470 676 676 41 2302 3
SRR5891470.22869 NZ_CP033858.1 2420300 676 676 41 2302 3

In the report (let's say kreport), will this read be assigned to the lowest common ancestor of these three tax IDs (106654, 470, 2420300), which is Acinetobacter (tax ID 469)? Am I right?

Does this mean that only reads that map to single species will be assigned to that species?

Thank You.

mourisl commented 4 years ago

@jmaricb Yes, that is the default behavior of kreport. You can use "--no-lca" in centrifuge-kreport to assign a fraction of a read to each species. Note that Centrifuge already assigns a read to its lowest common ancestor if it is assigned to too many species (-k option).
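As a rough illustration of the lowest-common-ancestor step described above, the following sketch walks each tax ID up a parent map and returns the deepest shared node. The parent map is a hypothetical, simplified stand-in for the real NCBI taxonomy in nodes.dmp (only the three tax IDs and genus 469 come from this thread), and the function names are invented for this example. It is not centrifuge-kreport's actual implementation.

```python
def lineage(taxid, parent):
    """Return the path from taxid up to the root, inclusive."""
    path = [taxid]
    # The root is either its own parent or simply missing from the map.
    while parent.get(taxid, taxid) != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lowest_common_ancestor(taxids, parent):
    """Return the deepest node shared by the lineages of all given tax IDs."""
    common = set(lineage(taxids[0], parent))
    for taxid in taxids[1:]:
        common &= set(lineage(taxid, parent))
    # Walking up from the first taxon, the first shared node is the deepest.
    for node in lineage(taxids[0], parent):
        if node in common:
            return node
    return None


# Hypothetical toy tree: every assigned taxon hangs directly under genus 469.
# The real taxonomy has more intermediate ranks.
toy_parent = {106654: 469, 470: 469, 2420300: 469, 469: 468, 468: 1, 1: 1}

lca = lowest_common_ancestor([106654, 470, 2420300], toy_parent)
print(lca)
```

With this toy tree the read from the question above lands on 469, matching the answer given in the thread.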

jmaricb commented 4 years ago

@mourisl Thank you very much. I think I got everything I need to calculate the abundances.

May I just know one last thing. "--no-lca Do not report the LCA of multiple assignments, but report count fractions at the taxa."

How do you calculate count fractions for each species from multiple assignments when you use --no-lca?

mourisl commented 4 years ago

@jmaricb If a read is assigned to 4 species, each of the four species' abundances is increased by 0.25.
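The even split described above can be sketched as follows. This assumes the classification output is tab-separated with readID in the first column and taxID in the third, as in the samples posted in this thread. The function name and the extra read "read_unique_example" are hypothetical, and this is only an approximation of the idea, not the code behind --no-lca.

```python
from collections import defaultdict


def fractional_counts(classification_lines):
    """Split each read evenly across the taxa it was assigned to."""
    taxa_per_read = defaultdict(set)
    for line in classification_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header line and anything too short to hold a taxID.
        if len(fields) < 3 or fields[0] == "readID":
            continue
        taxa_per_read[fields[0]].add(fields[2])

    counts = defaultdict(float)
    for taxa in taxa_per_read.values():
        share = 1.0 / len(taxa)  # e.g. 4 assignments give 0.25 each
        for taxid in taxa:
            counts[taxid] += share
    return dict(counts)


example_lines = [
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches",
    # The multiply-assigned read quoted earlier in this thread.
    "SRR5891470.22869\tspecies\t106654\t676\t676\t41\t2302\t3",
    "SRR5891470.22869\tspecies\t470\t676\t676\t41\t2302\t3",
    "SRR5891470.22869\tNZ_CP033858.1\t2420300\t676\t676\t41\t2302\t3",
    # Hypothetical uniquely assigned read, added for contrast.
    "read_unique_example\tspecies\t470\t400\t0\t35\t1500\t1",
]

counts = fractional_counts(example_lines)
for taxid in sorted(counts):
    print(taxid, round(counts[taxid], 3))
```

Here the three-way read contributes one third to each of its taxa, and the uniquely assigned read contributes a full count to tax ID 470.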

jmaricb commented 4 years ago

Thank you for your help.

Adoni5 commented 4 years ago

@mourisl

I am also using a compressed index (p_compressed, hosted on the site) with nanopore reads, and am getting an abundance of 0. I am building a custom index of bacteria from RefSeq to test whether the compressed indexes are the problem, but was wondering if there is anything else you would recommend trying?

Sample output -

readID	seqID	taxID	score	2ndBestScore	hitLength	queryLength	numMatches
2bef9c72-eeab-4b54-b7a0-4f4696866878	NC_018695.1	1229205	225	225	30	215	2
a6d6c54d-b1e2-45ee-858f-0cb61d0fc2f5	NZ_CP016077.1	1612551	121	121	26	439	2

Sample report -

name	taxID	taxRank	genomeSize	numReads	numUniqueReads	abundance
Myxococcales	29	order	9697933	1	0	0.0
Cystobacter fuscus	43	species	12349744	1	1	0.0

xiechangxiao commented 4 years ago

I have the same issue. The abundance value is always 0 when I use the latest version of Centrifuge and the h+p+v+c database to analyze nanopore data. Could you help me fix it? Thank you. Here is my command:

centrifuge -x database/centrifuge_databases/hpvc/hpvc -U BC_25.fq.gz --report-file BC_25.report -S BC_25.output

tanushrin commented 3 years ago

I am having an issue with the abundance estimation: I get an abundance of 0 for most species, except for one species (with abundance value 1). In centrifuge_report.txt there are species with high abundance; however, centrifuge_report.tsv shows abundance as 0. I created a custom database: archaea, bacteria, protozoa, fungi, plant, algae.

Here are the centrifuge commands I have been using:

centrifuge-build -p 24 --conversion-table $REF_SEQ_DIR/accession2taxid_cent.map --taxonomy-tree $REF_SEQ_DIR/nodes.dmp --name-table $REF_SEQ_DIR/names.dmp $DB.fa $DB > $DB.log

centrifuge -p 24 -x $DB -q in.fq > out.txt

centrifuge-kreport -x $DB out.txt > centrifuge_report.txt

How can I get proper (non-zero) abundance values? I would appreciate any help.

Thank you!

Kumereng commented 3 years ago

Hi, I have exactly the same issue, which has not been resolved. The abundance is also zero.

lixiaopi1985 commented 3 years ago

Same issue with the latest Centrifuge.

mourisl commented 3 years ago

I just fixed an issue with estimating average genome sizes, which was also related to the abundance estimation procedure. Could you please try the new version and check whether the abundance values become normal? You don't need to rebuild the index.

BaylorLyu commented 3 years ago

> I just fixed an issue with estimating average genome sizes, which was also related to the abundance estimation procedure. Could you please try the new version and check whether the abundance values become normal? You don't need to rebuild the index.

The problem still exists in the current version; only a few entries have an abundance value.

sybrohee commented 1 year ago

Unfortunately, still having the same issue. All abundances stay equal to 0.0 and no iteration was performed.

mourisl commented 1 year ago

I can reproduce the zero abundance issue on one of the data sets. I'm working on it now, and it seems more complex than I thought.

sybrohee commented 1 year ago

@mourisl Thank you for considering the issue (and all your nice work with centrifuge)