Lack of taxo.k2d in database

Yeahji9721 commented 2 months ago

I built a customised database using the following commands:

kraken2-build --download-taxonomy --threads ${NSLOTS} --db kraken2_db2024
kraken2-build --download-library nt --threads ${NSLOTS} --db kraken2_db2024
kraken2-build --download-library viral --threads ${NSLOTS} --db kraken2_db2024

kraken2-build --build --threads ${NSLOTS} --db kraken2_db2024

But the database directory doesn't contain taxo.k2d. How can I solve this problem?

Also, if I use Kraken 2.1.2 instead of 2.1.3, does that affect which version of nt or viral gets downloaded for the database? Our university cluster currently has 2.1.2. (I can use the latest version by prefixing commands with apptainer exec kraken2_latest.sif, but I would prefer to use the installed one for stability.)

jenniferlu717 commented 2 months ago

The nt/viral version depends on when it is downloaded, not on the Kraken version. Kraken will just pull the most recent one.

Did the kraken2-build --download-taxonomy script work correctly? Based on the commands, it should be able to build the taxonomy file. What files do you have in your taxonomy/ folder?

Yeahji9721 commented 2 months ago

Yes, I think the download worked correctly, but the job halted partway through. I've attached the job log below.

Downloading nucleotide gb accession to taxon map... done.
Downloading nucleotide wgs accession to taxon map... done.
Downloaded accession to taxon map(s)
Downloading taxonomy tree data... done.
Uncompressing taxonomy data... done.
Untarring taxonomy tree data... done.
Downloading nt database from server... done.
Uncompressing nt database...done.
Parsing nt FASTA file...done.
Masking low-complexity regions of downloaded library... done.
Step 1/2: Performing rsync file transfer of requested files
Rsync file transfer complete.
Step 2/2: Assigning taxonomic IDs to sequences
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
Creating sequence ID to taxonomy ID map (step 1)...
Found 112295042/112651381 targets, searched through 1026753790 accession IDs, search complete.
lookup_accession_numbers: 356339/112651381 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [11m58.305s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 953400694488 bytes
Capacity estimation complete. [5h22m38.584s]
Building database files (step 3)...
Taxonomy parsed and converted.
Failed attempt to allocate 953400694488bytes; you may not have enough free memory to build this database.
Perhaps increasing the k-mer length, or reducing memory usage from other programs could help you build this database?
build_db: unable to allocate hash table memory
xargs: cat: terminated by signal 13

Yeahji9721 commented 2 months ago

So I guess it failed because of insufficient RAM. How much RAM should I request to download and build the database?

jenniferlu717 commented 2 months ago

Ah, it stopped because the database build itself failed. The taxonomy is not the issue here.
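To see at a glance which output files a build actually produced, a small check like the one below may help. It is only a sketch and not part of Kraken 2: the function name check_k2d is made up for illustration, and it assumes that a finished database directory holds hash.k2d, opts.k2d and taxo.k2d, as the Kraken 2 manual describes.

```shell
# Hypothetical helper (not a Kraken 2 command): report which of the three
# database files are present and non-empty in the given DB directory.
# Returns 0 if all three exist, 1 if any is missing.
check_k2d() {
    db_dir=$1
    missing=0
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ -s "$db_dir/$f" ]; then
            echo "present: $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

For the database in this thread it would be called as check_k2d kraken2_db2024. If all three files are reported missing, the build step did not finish, which points at the build rather than the taxonomy download.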

Do you need the full nt database? What kind of samples are you trying to run? The estimated hash table size is about 953 GB.
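For reference, the figure comes straight from the "Estimated hash table requirement" line in the log. A rough conversion, assuming the build needs at least that much free memory for the hash table alone, could look like this:

```shell
# Value copied from the "Estimated hash table requirement" log line.
bytes=953400694488

# Round up so a memory request based on these numbers is not undersized.
gb=$(( (bytes + 999999999) / 1000000000 ))     # decimal gigabytes
gib=$(( (bytes + 1073741823) / 1073741824 ))   # binary gibibytes

echo "hash table: ~${gb} GB (~${gib} GiB)"
```

Any memory request would have to sit somewhat above this, since the build uses memory for more than the hash table. If that much RAM is not available, kraken2-build also has a --max-db-size option (a size in bytes) that downsamples the library to fit, likely at some cost in sensitivity.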

Yeahji9721 commented 2 months ago

I don't need human or other higher animals. I mainly need insects, viruses, fungi, bacteria, etc. But it needs to be as up to date as possible.

Yeahji9721 commented 2 months ago

Is there any way I can build a database from nt with only what I need, then?

Ayala-Ruan-CesarM commented 2 months ago

@Yeahji9721 you should probably curate the nt data once it is downloaded. By that I mean removing the higher eukaryote sequences and then building the database with only the data you need.

Yeahji9721 commented 2 months ago

Thank you for the comment. Could you specify what I should delete from the library/nt directory? Also, do you mean that after downloading nt and viral, I should delete the higher eukaryotes and then run the build command?

Ayala-Ruan-CesarM commented 2 months ago

@Yeahji9721 Hi! That is exactly what I had in mind. Once everything is downloaded, there should be one (or several) files that record which accession ID corresponds to which species. I have not personally downloaded the nt database, so I don't know exactly what it downloads. You can write to me at cesar.ayala@ibt.unam.mx for a more direct conversation.
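A minimal sketch of that curation idea, under several assumptions that nobody in this thread has verified against a real nt download: that a tab-separated map of sequence ID to taxid is available (for example the seqid2taxid.map written during build step 1), that taxonomy/nodes.dmp is in the usual NCBI format, and that each FASTA header starts with the same sequence ID used in the map. The function names list_kept_seqids and filter_fasta are hypothetical.

```shell
# list_kept_seqids NODES_DMP SEQID_MAP EXCLUDED_TAXID
# Prints the sequence IDs whose taxon is NOT the excluded taxid or one of
# its descendants. Taxids absent from nodes.dmp are kept.
list_kept_seqids() {
    awk -v bad="$3" '
        BEGIN { FS = "\t" }
        # First file (nodes.dmp): "taxid<TAB>|<TAB>parent<TAB>|..." so with
        # a tab separator the taxid is field 1 and its parent is field 3.
        FNR == NR { parent[$1] = $3; next }
        # Second file: "seqid<TAB>taxid". Walk up the tree from the taxid.
        {
            t = $2
            drop = 0
            while (t in parent) {
                if (t == bad) { drop = 1; break }
                if (parent[t] == t) break   # root is its own parent
                t = parent[t]
            }
            if (!drop) print $1
        }
    ' "$1" "$2"
}

# filter_fasta KEEP_IDS FASTA
# Prints only the FASTA records whose header ID (first word after ">")
# appears in KEEP_IDS.
filter_fasta() {
    awk '
        FNR == NR { keep[$1] = 1; next }
        /^>/ { id = substr($1, 2); printing = (id in keep) }
        printing
    ' "$1" "$2"
}
```

As an illustration, excluding vertebrates (NCBI taxid 7742) would mean running list_kept_seqids taxonomy/nodes.dmp seqid2taxid.map 7742 > keep.txt and then filter_fasta keep.txt on the nt FASTA file. The filtered file could then be added to a fresh database with kraken2-build --add-to-library before running --build, though that workflow has not been tested here.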

DerrickWood / kraken2

Lack of taxo.k2d in database #826