fbreitwieser / krakenuniq

🐙 KrakenUniq: Metagenomics classifier with unique k-mer counting for more specific results
GNU General Public License v3.0

Error for viral-neighbors DB download #87

Open jzrapp opened 2 years ago

jzrapp commented 2 years ago

Hi, I'm trying to download the viral-neighbors DB from NCBI, but the download halts with the message below. Could you help with that?

krakenuniq-download --db $DBDIR viral-neighbors  --threads 10
Downloading viral neighbors.
$DBDIR/taxonomy/nucl_gb.accession2taxid.gz          check [2.00 GB]
$DBDIR/taxonomy/nucl_gb.accession2taxid.sorted      check [4.24 GB]
Reading names file ...
Downloading {"header":{"type":"esearch sequences into DBDIR/library/viral/Neighbors.
Use of uninitialized value $querykey in concatenation (.) or string at /opt/miniconda3/envs/anvio-7.1/bin/krakenuniq-download line 570.
Use of uninitialized value $webenv in concatenation (.) or string at /opt/miniconda3/envs/anvio-7.1/bin/krakenuniq-download line 570.
query_key=&webenv=
Argument "{"header":{"type":"esearch" isn't numeric in numeric lt (<) at /opt/miniconda3/envs/anvio-7.1/bin/krakenuniq-download

Thanks a lot!

clescoat commented 2 years ago

TL;DR: I attached the modified script krakenuniq-download.

The problem seems to be with the format of the JSON file returned by NCBI esearch. The krakenuniq-download script wants three pieces of information from it: the number of returned hits ($n_res), and the $querykey and $webenv parameters needed to download the results with NCBI efetch.

The grep commands used to get these three parameters don't work, hence $n_res = "{"header":{"type":"esearch" and $querykey and $webenv are left uninitialized.

So I modified the script to extract the right information. Here are the modified lines of code. I have absolutely no experience with Perl, so my hack may seem clumsy; if you have a more elegant solution, please correct it!

# Pull the total hit count from the first "count" field of the esearch JSON
my $n_res=`grep -Po '"count":"[0-9]+' $esearch_file | head -1 | grep -o '[[:digit:]]*'`;
chomp $n_res;

print STDERR "Downloading $n_res sequences into $nbr_dir.\n"; #not modified
return if (!defined $n_res || $n_res eq 0); #not modified

# Step 2: Download FASTAs (or other formats), 10k at a time
my $querykey=`grep -Po '"querykey":"[[:digit:]]*"' $esearch_file | grep -o "[[:digit:]]*"`;
chomp $querykey;

my $webenv =`grep -Po '(?<=webenv":")(.*)(?=","idlist)' $esearch_file`;
chomp $webenv;
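
If you'd rather avoid shelling out to grep entirely, here is a minimal grep-free sketch of the same extraction, assuming the esearch result is valid JSON and using JSON::PP from core Perl; the variable names match the ones above, but I haven't tested this against the rest of the script:

use JSON::PP qw(decode_json);

# Sketch only: slurp the esearch result and decode it
my $json = do { open(my $fh, "<", $esearch_file) or die "Cannot open $esearch_file: $!"; local $/; <$fh> };
my $res  = decode_json($json)->{esearchresult};

# count, querykey and webenv all live directly under "esearchresult"
my ($n_res, $querykey, $webenv) = @{$res}{qw(count querykey webenv)};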

The esearch file ($esearch_file) looks like this:

{"header":{"type":"esearch","version":"0.3"},"esearchresult":{"count":"191228","retmax":"1","retstart":"0","querykey":"1","webenv":"MCID_62698c86771ad064f637512f","idlist":["2131322834"],"translationset":[{"from":"viruses[organism]","to":"\"Viruses\"[Organism]"},{"from":"cellular organisms[orgn]","to":"\"cellular organisms\"[Organism]"}],"translationstack":[{"term":"\"Viruses\"[Organism]","field":"Organism","count":"8537507","explode":"Y"},{"term":"\"cellular organisms\"[Organism]","field":"Organism","count":"440430254","explode":"Y"},"NOT",{"term":"wgs[prop]","field":"prop","count":"114077876","explode":"N"},"NOT",{"term":"gbdiv syn[prop]","field":"prop","count":"565473","explode":"N"},"NOT",{"term":"nuccore genome samespecies[filter]","field":"filter","count":"233336","explode":"N"},{"term":"complete[title]","field":"title","count":"9205767","explode":"N"},"AND",{"term":"unverified[title]","field":"title","count":"425833","explode":"N"},"NOT","GROUP","AND"],"querytranslation":"\"Viruses\"[Organism] NOT \"cellular organisms\"[Organism] NOT wgs[prop] NOT gbdiv syn[prop] AND (nuccore genome samespecies[filter] AND complete[title] NOT unverified[title])","warninglist":{"phrasesignored":["and"],"quotedphrasesnotfound":[],"outputmessages":[]}}}

Here are my logs with this correction:

(env) $ krakenuniq-download --threads 10 --db DBDIR refseq/viral/Any viral-neighbors

Downloading assembly summary file for viral genomes, and filtering to assembly level Any.
WARNING: DBDIR/library/viral already exists - potentially overwriting files.
 Downloading viral genomes:  11854/11854 ...   Found 11854 files. Skipped download of 11854 files that already existed.
Downloading viral neighbors.
Not fetching https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz - file DBDIR/taxonomy/nucl_gb.accession2taxid.gz exists.
DBDIR/taxonomy/nucl_gb.accession2taxid.gz          check [2.03 GB]
DBDIR/taxonomy/nucl_gb.accession2taxid.sorted      check [4.32 GB]
Reading names file ...
Downloading 191228 sequences into DBDIR/library/viral/Neighbors.
query_key=1&webenv=MCID_626a95838302021f8e7861e6
  Downloading sequences 1 to 10000 of 191228 ...
 done 
  Downloading sequences 10001 to 20000 of 191228 ...
 done 
  Downloading sequences 30001 to 40000 of 191228 ...
 done 
  Downloading sequences 70001 to 80000 of 191228 ...
 done 
  Downloading sequences 150001 to 160000 of 191228 ...
 done 
(env)$ 

But wait, there is more! We download the FASTA files in 10k batches, but I didn't understand why we jumped from batch 10k -> 20k straight to 30k -> 40k, etc., so I modified the following line as well:

Before (the increment doubles each iteration: 10,000, 20,000, 40,000, 80,000...): $retstart += $curr_retmax;

After (adds a fixed 10,000 at the end of each iteration): $retstart += $retmax;
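
For context, a hedged sketch of the paging loop this line sits in; the efetch URL and parameter names below come from the standard NCBI E-utilities interface rather than from the script itself, so treat it as an illustration, not as the script's actual code:

my $retmax   = 10000;   # batch size requested from efetch
my $retstart = 0;       # offset of the first record of the current batch
while ($retstart < $n_res) {
    # Sketch only: standard efetch call using the query_key/WebEnv pair from esearch
    my $url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
            . "?db=nuccore&rettype=fasta&retmode=text"
            . "&query_key=$querykey&WebEnv=$webenv"
            . "&retstart=$retstart&retmax=$retmax";
    # ... fetch $url and append the sequences to the library directory ...
    $retstart += $retmax;   # advance by a fixed 10,000, not by a growing $curr_retmax
}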

And the resulting logs:

Downloading assembly summary file for viral genomes, and filtering to assembly level Any.
WARNING: DBDIR/library/viral already exists - potentially overwriting files.
 Downloading viral genomes:  11854/11854 ...   Found 11854 files. Skipped download of 11854 files that already existed.
Downloading viral neighbors.
Not fetching https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz - file DBDIR/taxonomy/nucl_gb.accession2taxid.gz exists.
DBDIR/taxonomy/nucl_gb.accession2taxid.gz          check [2.03 GB]
DBDIR/taxonomy/nucl_gb.accession2taxid.sorted      check [4.32 GB]
Reading names file ...
DBDIR/library/viral/Neighbors/esearch_res.jsonDownloading 191228 sequences into DBDIR/library/viral/Neighbors.
query_key=1&webenv=MCID_626a95838302021f8e7861e6
  Downloading sequences 1 to 10000 of 191228 ...
 done 
  Downloading sequences 10001 to 20000 of 191228 ...
 done 
  Downloading sequences 20001 to 30000 of 191228 ...
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=taxonomy&id=AC_000192
AC_000192: 11144 [/tmp/VL2eJBOTdq]
 done 
  Downloading sequences 30001 to 40000 of 191228 ...
 done 
  Downloading sequences 40001 to 50000 of 191228 ...
 done 
  Downloading sequences 50001 to 60000 of 191228 ...
 done 
  Downloading sequences 60001 to 70000 of 191228 ...
 done 
  Downloading sequences 70001 to 80000 of 191228 ...
 done 
  Downloading sequences 80001 to 90000 of 191228 ...
 done 
  Downloading sequences 90001 to 100000 of 191228 ...
 done 
  Downloading sequences 100001 to 110000 of 191228 ...
 done 
  Downloading sequences 110001 to 120000 of 191228 ...
 done 
  Downloading sequences 120001 to 130000 of 191228 ...
 done 
  Downloading sequences 130001 to 140000 of 191228 ...
 done 
  Downloading sequences 140001 to 150000 of 191228 ...
 done 
  Downloading sequences 150001 to 160000 of 191228 ...
 done 
  Downloading sequences 160001 to 170000 of 191228 ...
 done 
  Downloading sequences 170001 to 180000 of 191228 ...
 done 
  Downloading sequences 180001 to 190000 of 191228 ...
 done 
  Downloading sequences 190001 to 191228 of 191228 ...
 done 

krakenuniq-download.zip

jsonProgram commented 2 years ago

Thanks clescoat!

luigra commented 2 years ago

You saved my day!!!

Thanks a lot, @clescoat!

Why don't you push this important upgrade to the current version of the script, or ask @fbreitwieser or @DerrickWood to do it?

salzberg commented 2 years ago

Thanks for this help, @clescoat! Just FYI to everyone on this thread: Florian Breitwieser left my lab several years ago and no longer maintains KrakenUniq, but we've had many recent improvements led by Christopher Pockrandt and Aleksey Zimin. Aleksey is still an active member of the lab (a senior scientist), and Christopher has moved back to Germany but continues to help maintain the code. The recent major upgrade (allowing the use of small-memory computers despite a very large DB) was developed by Christopher.


alekseyzimin commented 2 years ago

I will integrate the update now. Thank you, @clescoat