`krakenuniq-download` is stochastic #174

Open xapple opened 3 months ago

xapple commented 3 months ago

Running the following command from the manual:

$ krakenuniq-download --db DBDIR refseq/viral/Any viral-neighbors

Produces inconsistent results: it sometimes fails with an error and sometimes completes. I had to relaunch the exact same command three times before it finished successfully. I believe the behavior is stochastic because of the large number of HTTP connections the script makes. A small fraction of those connections may fail because of proxies or network congestion, and the script does not retry failed requests. This is the error message:

(krkn) user@cluster test $ krakenuniq-download --db DBDIR refseq/viral/Any viral-neighbors
Environment contains multiple differing definitions for 'cluster'.
Using value from 'CLUSTER' (xxxx) and ignoring 'cluster' (xxxx) at ~/miniconda3/envs/krkn/lib/perl5/site_perl/LWP/UserAgent.pm line 1134.
Environment contains multiple differing definitions for 'site'.
Using value from 'SITE' (xxxx) and ignoring 'site' (xxxx) at ~/miniconda3/envs/krkn/lib/perl5/site_perl/LWP/UserAgent.pm line 1134.
Downloading assembly summary file for viral genomes, and filtering to assembly level Any.
 Downloading viral genomes:  12254/14992 ... Error fetching https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/856/685/GCF_000856685.1_ViralProj15059/GCF_000856685.1_ViralProj15059_genomic.fna.gz. Is curl installed?
 Downloading viral genomes:  14992/14992 ...   Found 14992 files.
Downloading viral neighbors.
Downloading DBDIR/taxonomy/nucl_gb.accession2taxid.gz [curl -g 'https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz' -o 'DBDIR/taxonomy/nucl_gb.accession2taxid.gz'] ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2301M  100 2301M    0     0  48.5M      0  0:00:47  0:00:47 --:--:-- 49.0M
 done (48s)
DBDIR/taxonomy/nucl_gb.accession2taxid.gz          check [2.25 GB]
Sorting maping file (will take some time) [gunzip -c DBDIR/taxonomy/nucl_gb.accession2taxid.gz | cut -f 1,3 > DBDIR/taxonomy/nucl_gb.accession2taxid.sorted.tmp && sort --parallel 5 -T DBDIR/taxonomy DBDIR/taxonomy/nucl_gb.accession2taxid.sorted.tmp > DBDIR/taxonomy/nucl_gb.accession2taxid.sorted && rm DBDIR/taxonomy/nucl_gb.accession2taxid.sorted.tmp] ... done (4m54s)
DBDIR/taxonomy/nucl_gb.accession2taxid.sorted      check [4.81 GB]
Reading names file ...
Downloading DBDIR/taxonomy/taxdump.tar.gz [curl -g 'https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz' -o 'DBDIR/taxonomy/taxdump.tar.gz'] ...
Download taxdump.tar.gz  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 62.2M  100 62.2M    0     0  10.6M      0  0:00:05  0:00:05 --:--:-- 13.4M
 done (6s)
DBDIR/taxonomy/taxdump.tar.gz                      check [62.24 MB]
Storing taxonomy timestamp [date > DBDIR/taxonomy/timestamp] ... done (0s)
Extracting nodes file [tar -C DBDIR/taxonomy -zxvf DBDIR/taxonomy/taxdump.tar.gz nodes.dmp > /dev/null] ... done (2s)
DBDIR/taxonomy/nodes.dmp                           check [186.48 MB]
Extracting names file [tar -C DBDIR/taxonomy -zxvf DBDIR/taxonomy/taxdump.tar.gz names.dmp > /dev/null] ... done (3s)
DBDIR/taxonomy/names.dmp                           check [234.57 MB]
DBDIR/library/viral/Neighbors/esearch_res.jsonDownloading 188670 sequences into DBDIR/library/viral/Neighbors.
  Downloading sequences 1 to 10000 of 188670 ... done
  Downloading sequences 10001 to 20000 of 188670 ... done
  Downloading sequences 20001 to 30000 of 188670 ...https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=taxonomy&id=AC_000192
Error fetching https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=taxonomy&id=AC_000192. Is curl installed?
(krkn) user@cluster test $
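
Until the script retries failed requests itself, a possible workaround is to wrap the whole command in a retry loop. The following is only a minimal sketch: the `retry` function is a hypothetical helper (not part of KrakenUniq), and it assumes that `krakenuniq-download` exits with a non-zero status when a fetch fails, which I have not verified. It would also not catch cases like the first error in the log above, where one genome failed to download but the run continued.

```shell
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it exits with status 0, doubling the delay after
# each failure, and gives up after MAX_ATTEMPTS attempts.
retry() {
    local max_attempts
    local delay
    local attempt
    max_attempts=$1
    delay=$2
    attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "Attempt $attempt failed; retrying in ${delay}s ..." >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done
}

# Example usage (assumes a non-zero exit status on download errors):
# retry 5 10 krakenuniq-download --db DBDIR refseq/viral/Any viral-neighbors
```

This restarts the entire command on every failure, so it only saves time if files fetched in an earlier attempt are reused. Retrying each individual HTTP request inside the script would be the more robust fix.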