bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

Problem downloading reference set data #72

Closed lakhujanivijay closed 6 years ago

lakhujanivijay commented 6 years ago

Hi

I am trying to download the reference set data for standalone KAIJU by following the steps on this link

My command is

> [root@headnode kaiju-db]# sh /opt/app/kaiju-master/bin/makeDB.sh -e

However, after waiting a long time, I get the error below:

> Downloading file taxdump.tar.gz
> 2018-05-31 16:12:36 URL: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz [3107] -> ".listing" [1]
> 2018-05-31 16:26:19 URL: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz [43320110] -> "taxdump.tar.gz" [1]
> Extracting file taxdump.tar.gz
> Downloading file nr.gz
> 2018-05-31 16:26:34 URL: ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz [3570] -> ".listing" [1]
> 2018-05-31 16:38:36 (69388554 GB/s) - Control connection closed.
> 2018-05-31 17:17:33 (209711105 GB/s) - Control connection closed.
> 2018-05-31 17:23:48 (32534227 GB/s) - Control connection closed.
> 2018-05-31 17:29:45 (36804900 GB/s) - Control connection closed.
> 2018-05-31 17:39:57 (49856305 GB/s) - Control connection closed.
> 2018-05-31 17:42:12 (7106736 GB/s) - Control connection closed.
> 2018-05-31 17:43:28 (4528448 GB/s) - Control connection closed.
> 2018-05-31 17:53:59 (50762311 GB/s) - Control connection closed.
> 2018-05-31 17:57:51 (18521041 GB/s) - Control connection closed.
> 2018-05-31 18:05:18 (41948155 GB/s) - Control connection closed.
> 2018-05-31 18:55:49 (354117081 GB/s) - Control connection closed.
> 2018-05-31 18:59:26 (12266457 GB/s) - Control connection closed.
> 2018-06-01 00:52:03 (2632570729 GB/s) - Control connection closed.
> 2018-06-01 09:54:51 (5611940354 GB/s) - Control connection closed.
> 2018-06-01 10:12:20 (97619206 GB/s) - Control connection closed.
> 2018-06-01 10:23:20 (59838086 GB/s) - Control connection closed.
> 2018-06-01 10:34:41 (69431707 GB/s) - Control connection closed.
> 2018-06-01 10:41:54 (41902825 GB/s) - Control connection closed.
> 2018-06-01 11:06:42 (148759112 GB/s) - Control connection closed.
> 2018-06-01 11:08:03 (3135666 GB/s) - Control connection closed.
> Downloading file prot.accession2taxid.gz
> 2018-06-01 11:08:06 URL: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz [1625] -> ".listing" [1]
> 2018-06-01 11:09:19 (2830356 GB/s) - Control connection closed.
> 2018-06-01 11:10:30 (2239928 GB/s) - Control connection closed.
> 2018-06-01 11:13:34 (13415545 GB/s) - Control connection closed.
> 2018-06-01 11:14:48 (4304588 GB/s) - Control connection closed.
> 2018-06-01 11:17:06 (8913070 GB/s) - Control connection closed.
> 2018-06-01 11:25:01 (41147426 GB/s) - Control connection closed.
> 2018-06-01 11:30:19 (27693927 GB/s) - Control connection closed.
> 2018-06-01 11:34:12 (17502218 GB/s) - Control connection closed.
> 2018-06-01 11:41:44 (41670352 GB/s) - Control connection closed.
> 2018-06-01 11:46:38 (22663817 GB/s) - Control connection closed.
> 2018-06-01 11:57:17 (54141790 GB/s) - Control connection closed.
> 2018-06-01 11:58:39 (3584459 GB/s) - Control connection closed.
> 2018-06-01 11:59:58 (3198773 GB/s) - Control connection closed.
> 2018-06-01 12:01:21 (3492758 GB/s) - Control connection closed.
> 2018-06-01 12:02:47 (3387570 GB/s) - Control connection closed.
> 2018-06-01 12:04:11 (3808320 GB/s) - Control connection closed.
> 2018-06-01 12:05:32 (2875119 GB/s) - Control connection closed.
> 2018-06-01 12:06:54 (2732173 GB/s) - Control connection closed.
> 2018-06-01 12:08:16 (2980307 GB/s) - Control connection closed.
> 2018-06-01 12:09:36 (2616197 GB/s) - Control connection closed.
> Unpacking prot.accession2taxid.gz
>
> gzip: prot.accession2taxid.gz: unexpected end of file

pmenzel commented 6 years ago

Looks like the downloads of the files from the NCBI FTP server were interrupted, thus they cannot be unpacked properly. The file sizes should be around 3.6G for prot.accession2taxid.gz and 36G for nr.gz.
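
One way to confirm a truncated download without unpacking anything is `gzip -t`, which tests the integrity of a compressed file and fails if the stream ends early. The following is a minimal sketch; the helper name `check_gz` is hypothetical and not part of Kaiju.

```shell
#!/bin/sh
# check_gz FILE
# Hypothetical helper (not part of Kaiju): report whether FILE is a
# complete gzip stream. gzip -t reads the whole file, so expect it to
# take a while on something as large as nr.gz.
check_gz() {
    if gzip -t "$1" 2>/dev/null; then
        echo "OK: $1"
    else
        echo "truncated or corrupt: $1"
        return 1
    fi
}
```

For example, `check_gz nr.gz && check_gz prot.accession2taxid.gz` succeeds only if both files are intact.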

lakhujanivijay commented 6 years ago

Does that mean a "network hiccup"?

pmenzel commented 6 years ago

I guess you could call it that if you want to. :) I have never seen lines like this before:

2018-05-31 16:38:36 (69388554 GB/s) - Control connection closed.

So I assume it has something to do with your network connectivity.
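
If the connection keeps dropping, one possible workaround is to fetch the files by hand and resume after each interruption instead of starting over. The sketch below assumes `wget` is available and that the server supports resuming; `wget -c` continues a partially downloaded file. The helper name `fetch_resumable` and the default of five attempts are assumptions for illustration, not something makeDB.sh does.

```shell
#!/bin/sh
# fetch_resumable URL [MAX_ATTEMPTS]
# Hypothetical helper (not part of Kaiju): download a .gz file into the
# current directory, resuming with "wget -c" after each interruption,
# and stop once the file passes a gzip integrity test.
# Note: resuming cannot repair a file whose bytes are already damaged;
# in that case delete the file and start again.
fetch_resumable() {
    url=$1
    max=${2:-5}
    file=$(basename "$url")
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        # -c continues from the end of an existing partial file
        if wget -c "$url" && gzip -t "$file" 2>/dev/null; then
            echo "complete: $file"
            return 0
        fi
        echo "attempt $attempt failed, retrying" >&2
        attempt=$((attempt + 1))
    done
    echo "giving up on $file after $max attempts" >&2
    return 1
}
```

A call such as `fetch_resumable ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz 20` would retry up to 20 times using the URL from the log above.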

lakhujanivijay commented 6 years ago

HAHAHA :+1: I liked that comment ...!!

I will check my internet. Thank you so much Peter! By the way, I liked the tool

pmenzel commented 6 years ago

Nice to hear!

lakhujanivijay commented 6 years ago

Hi Peter

It's me once again to bug you! I did not want to open a new issue, so I am posting here.

I was finally able to download the data; however, I cannot find the .fmi file. My folder looks like this:


citations.dmp
convert_mar_to_kaiju.py
delnodes.dmp
division.dmp
gbk2faa.pl
gc.prt
gencode.dmp
LICENSE
makeDB.sh
merged.dmp
names.dmp
nodes.dmp
nr.gz
prot.accession2taxid
prot.accession2taxid.gz
README.md
readme.txt
taxdump.tar.gz
taxonlist.tsv

The file sizes match what you mentioned above:


[corona]$ du -sh prot.accession2taxid.gz
3.6G    prot.accession2taxid.gz
[corona]$ du -sh nr.gz
36G     nr.gz

Appreciate your time! Let me know if this is not the appropriate place to ask such questions; do suggest an alternative in that case.

Regards Vijay

pmenzel commented 6 years ago

OK, so you ran makeDB.sh -e and it aborted, so you didn't get the kaiju_db_nr_euk.fmi file? Then you probably need more RAM to complete it. It's a very big database, so it probably needs about 100 GB of RAM.
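
Before starting a long build, it can help to check how much memory the machine actually has. This is a minimal sketch for Linux, where total RAM is listed in /proc/meminfo; the helper name `has_ram_gb` is hypothetical, and the 100 GB threshold is only the rough estimate given above.

```shell
#!/bin/sh
# has_ram_gb REQUIRED_GB [MEMINFO_FILE]
# Hypothetical helper (not part of Kaiju): succeed if total RAM is at
# least REQUIRED_GB. Reads the MemTotal line (in kB) from /proc/meminfo,
# which exists on Linux only; the second argument allows another file
# in the same format to be used instead.
has_ram_gb() {
    required=$1
    meminfo=${2:-/proc/meminfo}
    total_kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
    total_gb=$((total_kb / 1024 / 1024))
    echo "total RAM: ${total_gb} GB, required: ${required} GB"
    [ "$total_gb" -ge "$required" ]
}
```

For example, `has_ram_gb 100 && sh makeDB.sh -e` would start the build only on a machine that reports at least 100 GB.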

You can also download it from here: http://kaiju.binf.ku.dk/server (blue box)

lakhujanivijay commented 6 years ago

Hi Peter

I hope you are doing well. I have downloaded the file you suggested above from the link.

Here are the sizes of both files, compressed and extracted:

[headnode new]$ du -sh *
49G kaiju_index_nr_euk
28G kaiju_index_nr_euk.gz

After extraction, I get a single binary file; however, I expected a folder with a few files (.dmp and .fmi). Could you please help?

Regards Vijay Lakhujani

pmenzel commented 6 years ago

It looks like you unpacked the file with the wrong command. You need to use:

tar xzf kaiju_index_nr_euk.tgz

which will give you the files

kaiju_index_nr_euk.fmi
names.dmp
nodes.dmp
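
A .tgz file has two layers, a tar archive inside gzip compression. Decompressing it with gunzip alone removes only the outer layer and leaves one large tar file, which is presumably the single binary file described above. As a minimal sketch, the archive can be listed with `tar tzf` before it is extracted; the helper name `unpack_tgz` is hypothetical.

```shell
#!/bin/sh
# unpack_tgz ARCHIVE [DEST_DIR]
# Hypothetical helper (not part of Kaiju): list the members of a
# gzip-compressed tar archive, then extract them into DEST_DIR
# (default: current directory). The listing step fails early if the
# file is not a gzip-compressed tar archive.
unpack_tgz() {
    archive=$1
    dest=${2:-.}
    tar tzf "$archive" || return 1
    tar xzf "$archive" -C "$dest"
}
```

Running `unpack_tgz kaiju_index_nr_euk.tgz` should print the three file names listed above and then extract them.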

lakhujanivijay commented 6 years ago

Oh! How foolish of me. Indeed I made a mistake!

Thanks man, you are great..!! Let me try running it now.