DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Custom database build only partially complete #295

Closed idoerg closed 4 years ago

idoerg commented 4 years ago

Hi,

I tried to build the bacteria database using

kraken2-build --threads 24 --download-library bacteria --db /work/idoerg/db/k2bac

The process seems to have ended midway without reporting any errors, but also without a completely built database.

ls -lrt /work/idoerg/db/k2bac/library/bacteria/
total 85653782
-rw-r--r--. 1 idoerg its-hpc-nova-idoerg    59034890 Aug 18 20:11 assembly_summary.txt
-rw-r--r--. 1 idoerg its-hpc-nova-idoerg     1928376 Aug 18 20:11 manifest.txt
-rw-r--r--. 1 idoerg its-hpc-nova-idoerg     2104958 Aug 18 21:16 prelim_map.txt
-rw-r--r--. 1 idoerg its-hpc-nova-idoerg 87653817104 Aug 19 00:34 library.fna
-rw-r--r--. 1 idoerg its-hpc-nova-idoerg           0 Aug 19 00:34 library.fna.masked

Any ideas on how to continue the process / salvage the database without redoing everything? Seems like the masking failed.

idoerg commented 4 years ago

Also, it seems like the taxonomy file from NCBI is not there? This seems to have cropped up in the past: https://github.com/DerrickWood/kraken/issues/132 (it seems not to have been fixed yet in the Conda version of kraken2).

$ kraken2-build --db /home/idoerg/work/oxymice/db2 --download-taxonomy
Downloading nucleotide est accession to taxon map...rsync: link_stat "/taxonomy/accession2taxid/nucl_est.accession2taxid.gz" (in pub) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1668) [Receiver=3.1.2]

The file nucl_est.accession2taxid.gz does not seem to be in /pub/taxonomy/accession2taxid/


Index of /pub/taxonomy/accession2taxid/
Name | Size | Date Modified
-- | -- | --
README | 3.0 kB | 8/9/20, 11:21:00 AM
dead_nucl.accession2taxid.gz | 167 MB | 8/9/20, 11:21:00 AM
dead_nucl.accession2taxid.gz.md5 | 63 B | 8/9/20, 11:21:00 AM
dead_prot.accession2taxid.gz | 685 MB | 8/9/20, 11:21:00 AM
dead_prot.accession2taxid.gz.md5 | 63 B | 8/9/20, 11:21:00 AM
dead_wgs.accession2taxid.gz | 471 MB | 8/9/20, 11:21:00 AM
dead_wgs.accession2taxid.gz.md5 | 62 B | 8/9/20, 11:21:00 AM
nucl_gb.accession2taxid.gz | 1.8 GB | 8/9/20, 11:22:00 AM
nucl_gb.accession2taxid.gz.md5 | 61 B | 8/9/20, 11:22:00 AM
nucl_wgs.accession2taxid.gz | 3.4 GB | 8/9/20, 11:23:00 AM
nucl_wgs.accession2taxid.gz.md5 | 62 B | 8/9/20, 11:23:00 AM
pdb.accession2taxid.gz | 3.4 MB | 8/9/20, 11:23:00 AM
pdb.accession2taxid.gz.md5 | 57 B | 8/9/20, 11:23:00 AM
prot.accession2taxid.gz | 5.8 GB | 8/9/20, 11:25:00 AM
prot.accession2taxid.gz.md5 | 58 B | 8/9/20, 11:25:00 AM

jenniferlu717 commented 4 years ago

How do you know that the masking didn't complete correctly? Normally the .masked file isn't generated until the masking is finished (it's supposed to be an empty file).

Otherwise, you can run ./mask_low_complexity.sh library/human/ to redo the masking.
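The check described above can be sketched as a small shell function. This is only an illustration: `masking_state` is a hypothetical helper name, and it assumes the marker is called `library.fna.masked` and sits next to `library.fna`, as in the directory listing earlier in this thread.

```shell
#!/bin/sh
# Sketch: classify a library directory by the state of its masking marker.
# Assumption: the marker is "$lib/library.fna.masked" and is an empty file
# that only appears once masking has finished.
masking_state() {
    lib="$1"
    if [ ! -e "$lib/library.fna.masked" ]; then
        echo "not-masked"                  # marker absent: masking did not finish
    elif [ -s "$lib/library.fna.masked" ]; then
        echo "unexpected-nonempty-marker"  # marker exists but is not empty
    else
        echo "masked"                      # empty marker present
    fi
}
```

For the listing in the first post, `masking_state /work/idoerg/db/k2bac/library/bacteria` would report `masked`, since the marker exists and has size 0.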

Regarding the missing taxonomy files, you only need the nucl_gb.accession2taxid.gz and nucl_wgs.accession2taxid.gz files from https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/ (and just gunzip both).

Then download https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz and run tar zxf taxdump.tar.gz
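The manual download steps above could be scripted roughly as follows. This is a sketch under assumptions, not the official procedure: `fetch_taxonomy`, `DB`, and `RUN` are hypothetical names, the database path is a placeholder, and placing the files in the database's `taxonomy/` subdirectory is an assumption about where `kraken2-build` expects them. The script only prints the commands unless `RUN` is set to an empty value.

```shell
#!/bin/sh
# Sketch: fetch the NCBI taxonomy files by hand.
# Assumptions: DB is a placeholder database path, and the files belong in
# "$DB/taxonomy".
DB="${DB:-krakendb}"
BASE="https://ftp.ncbi.nih.gov/pub/taxonomy"
# Dry run by default: commands are printed. Export RUN= (empty) to execute them.
RUN="${RUN-echo}"

fetch_taxonomy() {
    $RUN mkdir -p "$DB/taxonomy"
    # Only the nucl_gb and nucl_wgs accession maps are needed.
    for name in nucl_gb nucl_wgs; do
        $RUN wget -P "$DB/taxonomy" "$BASE/accession2taxid/$name.accession2taxid.gz"
        $RUN gunzip "$DB/taxonomy/$name.accession2taxid.gz"
    done
    # The taxonomy dump itself.
    $RUN wget -P "$DB/taxonomy" "$BASE/taxdump.tar.gz"
    $RUN tar zxf "$DB/taxonomy/taxdump.tar.gz" -C "$DB/taxonomy"
}

fetch_taxonomy
```

Review the printed commands first, then rerun with `RUN=` exported as an empty value to perform the downloads.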

idoerg commented 4 years ago

Thanks!

The bacteria custom db is now built. Also, for the standard database I had to modify the code as per: https://github.com/DerrickWood/kraken/issues/114#issuecomment-610912961 (the "na" bug from 2 years ago) and also download the missing taxonomy files. Since neither of these fixes made it to the Conda version of Kraken2, perhaps it would be a good idea to add a comment in the manual?

Puumanamana commented 4 years ago

I also have the same issue when trying to download the viral database:

$ kraken2-build --download-library viral --db krakendb-viral --threads 10

Step 1/2: Performing ftp file transfer of requested files
Step 2/2: Assigning taxonomic IDs to sequences
Processed 10388 projects (13011 sequences, 387.10 Mbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.

And then if I try to use kraken2 directly:

$ kraken2 --db krakendb-viral --threads 20 --output taxonomy.txt assembly.fasta

kraken2: database ("./krakendb-viral") does not contain necessary file taxo.k2d

jenniferlu717 commented 4 years ago

@idoerg The Kraken2 authors (including myself) are not the ones keeping the conda version up to date, so we don't include any information about that in our manual. Hopefully we will have a stable version of Kraken 2 out that has the fix for 'na', and then our extremely helpful friends who do keep the conda version up to date can fix that as well.

@Puumanamana your issue is more that you did not download the taxonomy as well. Simply run kraken2-build --download-taxonomy --db krakendb-viral and then you can build the database using kraken2-build --build --db krakendb-viral --threads 10 before running kraken2 itself.
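A quick way to tell whether a database directory has been through the build step is to look for its `.k2d` files before running `kraken2`. The sketch below is illustrative only: `missing_k2d` is a hypothetical helper, and the file list assumes that a built Kraken 2 database consists of `hash.k2d`, `opts.k2d`, and `taxo.k2d` (the last being the file named in the error above).

```shell
#!/bin/sh
# Sketch: list which of the expected .k2d files are absent from a database.
# Assumption: a built database contains hash.k2d, opts.k2d and taxo.k2d.
missing_k2d() {
    dir="$1"
    for f in hash.k2d opts.k2d taxo.k2d; do
        [ -f "$dir/$f" ] || echo "$f"
    done
}

db="krakendb-viral"
if [ -n "$(missing_k2d "$db")" ]; then
    # Print the two steps from the comment above rather than running them.
    echo "database not built yet; run:"
    echo "  kraken2-build --download-taxonomy --db $db"
    echo "  kraken2-build --build --db $db --threads 10"
fi
```

An empty result from `missing_k2d` means all three files are present, so classification with `kraken2 --db` should at least get past the file check.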

Puumanamana commented 4 years ago

@jenniferlu717: Thank you, I missed this part in the documentation!

jenniferlu717 commented 4 years ago

I'm going to close this issue for now. If you continue to have problems, please open a new issue.