DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

NR database not building properly #476

Open Yvain-Desplat opened 3 years ago

Yvain-Desplat commented 3 years ago

I downloaded the nr DB properly, parsed the sequences, and masked the low-complexity regions, all good. But when building the database I get the following:

kraken2-build --build --db nr-ncbi/ --threads 80

Creating sequence ID to taxonomy ID map (step 1)...
Found 7912/713878084 targets, searched through 812094487 accession IDs, search complete.
lookup_accession_numbers: 713870172/713878084 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [1h4m43.835s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 48272 bytes
Capacity estimation complete. [8m28.801s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 11 bits reserved for taxid.
Completed processing of 5491 sequences, 1647979 bp
Writing data to disk... complete.
Database files completed. [12m43.872s]
Database construction complete. [Total: 1h25m56.524s]

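This indicates the build didn't work: only 7912 of 713878084 sequences were mapped to a taxonomy ID, and only 5491 sequences ended up in the database. I also tried kraken2-build --build --protein --db nr-ncbi/ --threads 80, but got the same output.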

Does anyone know where the issue comes from and how to fix it?
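A quick way to quantify the problem is to pull the mapped/total counts out of the "Found X/Y targets" line of the step 1 output. The following is only a minimal sketch: mapped_fraction is a hypothetical helper name, not part of Kraken2, and it assumes the log line has exactly the format shown above.

```shell
# Hypothetical helper (not part of Kraken2): reads build log text on stdin and
# reports how many sequences were mapped to a taxonomy ID in step 1.
# Assumes the log contains a line of the form "Found X/Y targets, ...".
mapped_fraction() {
  sed -n 's|.*Found \([0-9]*\)/\([0-9]*\) targets.*|\1 \2|p' |
    awk '$2 > 0 { printf "%s of %s sequences mapped (%.4f%%)\n", $1, $2, 100 * $1 / $2 }'
}

# Example with the line from the log above.
echo "Found 7912/713878084 targets, searched through 812094487 accession IDs, search complete." |
  mapped_fraction
```

A mapped fraction this close to zero suggests the accession-to-taxid lookup is the step to investigate, rather than the hash table construction itself.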

jhayer commented 3 years ago

Hi,

I get exactly the same problem. The nr protein database builds without any problem, and even very fast compared to nt, but it seems to have lost most of the sequences along the way! My initial nr library contained 425,967,191 amino acid sequences, but I get this:

Completed processing of 5494 sequences, 1654257 aa
Writing data to disk... complete.
Database files completed. [14m45.464s]
Database construction complete. [Total: 22m57.504s]

And when I test this DB with a dataset, I get 99.8% unclassified. The same dataset, run against the nt database built 2 days ago, gets 12% unclassified. Very strange.

My build commands are the following (I use 12 CPUs with 24 GB of RAM per CPU, and Kraken2 version 2.1.1):
kraken2-build --download-taxonomy --db nr
kraken2-build --download-library nr --protein --db nr
kraken2-build --build --threads 12 --protein --db nr

I had done this in the past (probably with an older version of Kraken2) and it worked fine, with all sequences represented in the nr protein database. Do you have any idea how to fix this?
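To compare the number of sequences in the library with the number that actually made it into the database, something like the following might help. It is a rough sketch: both function names are hypothetical, the FASTA path is whatever file your library step produced (not shown in this thread), and the demo uses a tiny made-up file rather than real nr data.

```shell
# Hypothetical helpers (not part of Kraken2).

# Count FASTA records in a file: every record starts with a ">" header line.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Read build log text on stdin and print N from
# "Completed processing of N sequences".
processed_sequences() {
  sed -n 's/.*Completed processing of \([0-9]*\) sequences.*/\1/p'
}

# Demo: a made-up three-record FASTA file and the log line quoted above.
tmp=$(mktemp)
printf '>seq1\nMKV\n>seq2\nMLA\n>seq3\nMQT\n' > "$tmp"
echo "records in demo FASTA: $(count_fasta_records "$tmp")"
echo "processed by build:    $(echo 'Completed processing of 5494 sequences, 1654257 aa' | processed_sequences)"
rm -f "$tmp"
```

On a real library the first number should be close to the second; a gap of several orders of magnitude, as reported here, means most sequences were dropped before the hash table was built.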

andreott commented 3 years ago

Hi, try using the --protein flag for the --download-taxonomy step as well, as this will fetch the correct sequence ID to taxonomy ID mapping file from NCBI. My DB is still being built (it has been running for quite some time), but I hope it will finally work.
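Put together, the suggestion amounts to passing --protein on all three steps. Below is a sketch of that sequence, reusing the database name and thread count from the commands earlier in this thread. The run wrapper and the DRY_RUN variable are hypothetical conveniences added here so the commands can be inspected before launching a long download; they are not Kraken2 features.

```shell
#!/bin/sh
# Sketch of the protein workflow with --protein on every step.
# DB and THREADS mirror the values used earlier in this thread.
# DRY_RUN is a hypothetical switch: 1 prints the commands, 0 runs them.
DB="${DB:-nr}"
THREADS="${THREADS:-12}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

build_protein_db() {
  # Each step only starts if the previous one succeeded.
  run kraken2-build --download-taxonomy --protein --db "$DB" &&
  run kraken2-build --download-library nr --protein --db "$DB" &&
  run kraken2-build --build --threads "$THREADS" --protein --db "$DB"
}

build_protein_db
```

If the taxonomy was already downloaded without --protein, it may be safest to start from a fresh database directory, on the assumption that an existing taxonomy download is not replaced automatically.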