DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

incorrect genome sizes #13

Open igordot opened 8 years ago

igordot commented 8 years ago

It looks like the summary report may be reporting wrong genome sizes.

For human (taxID 9606):
From report: 6,339,524,059 (2X bigger than expected)
From NCBI: median total length (Mb): 2996.43

For gorilla (taxID 9593):
From report: 19,140,263 (100X smaller than expected)
From NCBI: median total length (Mb): 3058.03

For Picea glauca (taxID 3330):
From report: 26,852,969 (1,000X smaller than expected)
From NCBI: median total length (Mb): 25784.7
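
For anyone who wants to reproduce the comparison, here is a minimal sketch that divides each reported size by the NCBI median total length. It uses only the figures quoted in this comment; the dictionary layout and the `size_ratio` helper are illustrative names, not part of Centrifuge.

```python
# Figures copied from the comment above:
# (size from the summary report in bp, NCBI median total length in Mb)
sizes = {
    "human (9606)": (6_339_524_059, 2996.43),
    "gorilla (9593)": (19_140_263, 3058.03),
    "Picea glauca (3330)": (26_852_969, 25784.7),
}


def size_ratio(report_bp, ncbi_mb):
    """Return the reported size divided by the NCBI size, both in base pairs."""
    return report_bp / (ncbi_mb * 1_000_000)


# Hypothetical helper output: ratio > 1 means the report is larger than NCBI.
ratios = {name: size_ratio(bp, mb) for name, (bp, mb) in sizes.items()}

for name, ratio in ratios.items():
    print(f"{name}: reported/NCBI = {ratio:.4f}")
```

With these inputs the ratios work out to roughly 2.1 for human, about 1/160 for gorilla, and about 1/960 for Picea glauca, in line with the rough factors given above.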

infphilo commented 8 years ago

We'll fix this issue in the next version. Thank you for bringing this problem to our attention.

igordot commented 8 years ago

Great! On a related note, I would also suggest having both normalized and non-normalized abundance.
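
To make the distinction concrete, below is a rough sketch of one common way to compute the two kinds of abundance. This is an assumption for illustration, not necessarily how Centrifuge calculates its report: "non-normalized" is taken to be the fraction of classified reads per taxon, and "normalized" is reads divided by genome length, rescaled to sum to 1. The function name, taxon names, counts, and genome lengths are all hypothetical.

```python
def abundances(read_counts, genome_sizes):
    """Return (raw, normalized) relative abundances per taxon.

    raw:        fraction of all reads assigned to the taxon
    normalized: reads per base of genome, rescaled so the values sum to 1
    """
    total_reads = sum(read_counts.values())
    raw = {taxon: n / total_reads for taxon, n in read_counts.items()}

    # Divide by genome length so that large genomes are not over-represented.
    per_base = {taxon: n / genome_sizes[taxon] for taxon, n in read_counts.items()}
    total_per_base = sum(per_base.values())
    normalized = {taxon: v / total_per_base for taxon, v in per_base.items()}
    return raw, normalized


# Hypothetical read counts and genome lengths (bp), for illustration only.
read_counts = {"taxon_a": 900, "taxon_b": 100}
genome_sizes = {"taxon_a": 9_000_000, "taxon_b": 1_000_000}

raw, normalized = abundances(read_counts, genome_sizes)
print(raw)
print(normalized)
```

In this made-up example taxon_a gets nine times as many reads only because its genome is nine times longer, so the two taxa end up equal after normalization. Under this scheme a wrong genome size would shift the normalized value directly, which is why the sizes in the report matter.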

osris commented 7 years ago

This doesn't seem to have been fixed yet. E. coli seems to have a length of 12319210. It's not the total length of all the complete reference genomes. Maybe it's the length after masking? I assume the abundance calculation is based on the genome length?

fconstancias commented 7 years ago

I am also interested in genome size normalized abundances. Do you still plan to fix this?