Open Yvain-Desplat opened 3 years ago
Hi,
I get exactly the same problem. The nr protein database builds without any errors, and much faster than nt, but it seems to have lost most of the sequences along the way! My initial nr library contained 425,967,191 amino acid sequences, but I get this:
Completed processing of 5494 sequences, 1654257 aa
Writing data to disk... complete.
Database files completed. [14m45.464s]
Database construction complete. [Total: 22m57.504s]
When I test this database with a dataset, I get 99.8% unclassified. The same dataset, run against the nt database I built 2 days ago, gets 12% unclassified. Very strange.
My build commands are the following (I use 12 CPUs with 24G of RAM per CPU, and kraken2 version 2.1.1):
kraken2-build --download-taxonomy --db nr
kraken2-build --download-library nr --protein --db nr
kraken2-build --build --threads 12 --protein --db nr
I had done this in the past and it worked fine (probably with an older version of Kraken2), with all sequences represented in the nr protein database. Do you have any idea how to fix this?
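A quick way to quantify the loss described above is to compare the count reported in the build log with the number of FASTA records in the library. This is only a sketch: the function names are hypothetical, and both file paths are arguments you have to supply yourself (the build output must have been saved to a file).

```shell
# Hypothetical helpers, not part of Kraken2.

# Extract N from a log line like "Completed processing of N sequences, ..."
processed_count() {
  sed -n 's/.*Completed processing of \([0-9][0-9]*\) sequences.*/\1/p' "$1" | tail -n 1
}

# Count FASTA records (header lines start with '>').
fasta_count() {
  grep -c '^>' "$1"
}

# Print how many library records made it into the build.
# $1 = saved build log, $2 = library FASTA file
coverage_report() {
  processed=$(processed_count "$1")
  total=$(fasta_count "$2")
  awk -v p="$processed" -v t="$total" 'BEGIN {
    printf "%d of %d sequences in the build (%.4f%%)\n", p, t, (t > 0 ? 100 * p / t : 0)
  }'
}
```

Calling coverage_report with the saved log and the library FASTA prints one summary line. A value far below 100% would match the symptom reported here.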
Hi,
Try using the --protein flag for the --download-taxonomy step as well, as this will fetch the correct ID-to-taxonomy mapping file from NCBI. My DB is still being built (it has been running for quite some time), but I hope it will finally work.
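For reference, here is a minimal sketch of the whole sequence with --protein passed at every step, including --download-taxonomy. The kraken2-build commands are the ones already posted in this thread. The run and build_protein_nr helpers and the DRY_RUN variable are hypothetical names added for illustration. Whether this actually resolves the problem is not confirmed here.

```shell
# Hypothetical helper: print each command, and execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Build a protein nr database, passing --protein to every step.
# $1 = database directory, $2 = number of threads
build_protein_nr() {
  run kraken2-build --download-taxonomy --protein --db "$1" &&
  run kraken2-build --download-library nr --protein --db "$1" &&
  run kraken2-build --build --threads "$2" --protein --db "$1"
}
```

Setting DRY_RUN=1 before calling build_protein_nr nr 12 only prints the three commands, which is handy for checking the flags before starting a download of this size.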
I downloaded the nr DB properly, parsed the sequences, and masked the low-complexity regions, all good, but when building the database I get the following:
kraken2-build --build --db nr-ncbi/ --threads 80
Creating sequence ID to taxonomy ID map (step 1)...
Found 7912/713878084 targets, searched through 812094487 accession IDs, search complete.
lookup_accession_numbers: 713870172/713878084 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [1h4m43.835s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 48272 bytes
Capacity estimation complete. [8m28.801s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 11 bits reserved for taxid.
Completed processing of 5491 sequences, 1647979 bp
Writing data to disk... complete.
Database files completed. [12m43.872s]
Database construction complete. [Total: 1h25m56.524s]
which indicates the build didn't work: almost none of the accession numbers were mapped to a taxonomy ID. I also tried kraken2-build --build --protein --db nr-ncbi/ --threads 80, but got the same output.
Does anyone know where the issue comes from, or how to fix it?
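To catch this kind of failure early, the step 1 line of the log can be checked before waiting for the rest of the build. The sketch below assumes the build output was saved to a file. mapping_stats is a hypothetical helper name, and it relies only on the "Found X/Y targets" wording shown in the log above.

```shell
# Hypothetical helper, not part of Kraken2.
# Parse "Found X/Y targets" from a saved build log and report how many
# accession numbers were mapped to a taxonomy ID.
# $1 = saved build log
mapping_stats() {
  sed -n 's#.*Found \([0-9][0-9]*\)/\([0-9][0-9]*\) targets.*#\1 \2#p' "$1" |
    tail -n 1 |
    awk '{
      printf "mapped=%d total=%d unmapped=%d (%.4f%% mapped)\n",
             $1, $2, $2 - $1, ($2 > 0 ? 100 * $1 / $2 : 0)
    }'
}
```

A mapped share close to zero, as in the log above, suggests that the accession-to-taxid files in the taxonomy directory do not match the library that was downloaded.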