bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

kaiju-makedb for mar database #223

Open ThijsSt opened 2 years ago

ThijsSt commented 2 years ago

Hi, I've been trying to set up the mar database for a metagenomics project, but I've been running into two odd issues:

  1. Sometimes, when installing the database (I've found that this happens with all the databases), you get the following error:

         \033[0;32mDownloading taxdump.tar.gz\033[0m
         2022-05-16 12:16:55 URL: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz [1800] -> ".listing" [1]
         2022-05-16 12:17:03 URL: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz [58436660] -> "taxdump.tar.gz" [1]
         \033[0;32mExtracting taxdump.tar.gz\033[0m

         gzip: stdin: invalid compressed data--format violated
         tar: Unexpected EOF in archive
         tar: Error is not recoverable: exiting now

     This does not always happen; it seems to be random, and I'm not sure whether anything can be done.

  2. When the download works, something odd happens and I get the following error message:

         \033[0;32mExtracting taxdump.tar.gz\033[0m
         \033[0;32mDownloading MarRef metadata from MMP (databasesapi.sfb.uit.no)\033[0m
         \033[0;32mCurrent MarRef version is: 1.7\033[0m
           % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                          Dload  Upload   Total   Spent    Left  Speed
           0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
         100 75801    0 75801    0     0  89177      0 --:--:-- --:--:-- --:--:--  277k
         \033[0;32mDownloading MarRef reference genomes from the Marine Metagenomics Portal using 5 threads\033[0m
         mv: cannot stat ‘mar/source/public.sfb.uit.no/MarRef/genomes/*’: No such file or directory

I've looked at the kaiju-makedb script, and I think the jq step silently fails, but can you maybe help me figure out how to bypass this error?

Thanks

Thijs

pmenzel commented 2 years ago

It looks like your downloaded files are corrupted or were not downloaded properly, so they are not found at the `mv` step, hence the message `cannot stat ‘mar/source/public.sfb.uit.no/MarRef/genomes/*’`.
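A truncated `taxdump.tar.gz` can be detected before extraction with gzip's built-in integrity test. The following is only a minimal sketch, not part of kaiju-makedb: the helper names `archive_ok` and `download_verified` are hypothetical, and the retry count is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper (not part of kaiju-makedb): succeed only if the file
# exists, is non-empty, and passes gzip's integrity test.
archive_ok() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}

# Hypothetical helper: run a download command up to $1 times until the
# archive named in $2 verifies. All remaining arguments form the command.
download_verified() {
  tries=$1
  archive=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    # Remove any partial file left behind by a previous attempt.
    rm -f "$archive"
    if "$@" && archive_ok "$archive"; then
      return 0
    fi
    echo "Attempt $attempt: $archive missing or corrupt" >&2
    attempt=$((attempt + 1))
  done
  return 1
}
```

It could be called as, for example, `download_verified 3 taxdump.tar.gz wget -nv -O taxdump.tar.gz ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz`, and only if that succeeds would the archive be passed on to `tar`.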

ThijsSt commented 2 years ago

Yes, I've been going over the code in the kaiju-makedb script with my admittedly limited experience in bioIT, and it seems that the download from the MarRef database somehow does not work.

    if [ "$DB" = "mar" -o "$DB" = "mar_ref" -o "$DB" = "mar_db" ]
    then
      mkdir -p $DB/source
      if [ $index_only -eq 0 ]
      then
        if [ $DL -eq 1 ]
        then
          if [ "$DB" = "mar" -o "$DB" = "mar_ref" ]
          then
            echo "${GREEN}Downloading MarRef metadata from MMP (databasesapi.sfb.uit.no)${NC}"
            MARREF_VERSION=$(curl -Ls -o /dev/null -w %{url_effective} https://databasesapi.sfb.uit.no/rest/v1/MarRef/records | grep -Po 'ver=\K\d+\.\d+')
            echo "${GREEN}Current MarRef version is: ${MARREF_VERSION}${NC}"
            curl "https://databasesapi.sfb.uit.no/rpc/v1/MarRef/graphs?x%5Basmbl%3Asequences%5D=each&y_yName%5Btax%3Aorganism%5D=setR" -o $DB/MarRef.json -L
            [ -r $DB/MarRef.json ] || { echo -e "${RED}Missing file MarRef.json${NC}"; exit 1; }
            MARREF_COUNT=$(jq .graph[].x $DB/MarRef.json | wc -l)

All works fine up to this point, but then when I get to

    jq .graph[].x $DB/MarRef.json | tr -d '"' | xargs -I{} -P $parallelDL wget -P $DB/source -q -np --recursive https://public.sfb.uit.no/MarRef/genomes/{}/protein.faa || true

something weird happens. This part works fine:

    jq .graph[].x $DB/MarRef.json | tr -d '"'

so I started to suspect the rest of the pipeline:

    xargs -I{} -P $parallelDL wget -P $DB/source -q -np --recursive https://public.sfb.uit.no/MarRef/genomes/{}/protein.faa || true

So instead of `true` I entered `echo ERROR`:

    xargs -I{} -P $parallelDL wget -P $DB/source -q -np --recursive https://public.sfb.uit.no/MarRef/genomes/{}/protein.faa || echo ERROR

Running the whole command does indeed only print ERROR, meaning that the command somehow fails. I'll try to figure out where it goes wrong, but any thoughts are much appreciated, as this is all very new to me.
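Because `wget` runs with `-q` and the `|| true` discards the exit status of `xargs`, the pipeline above gives no hint about which genome download fails or why. One way to narrow it down, sketched here under assumptions, is to process the IDs one at a time and print each failing ID with its exit status. The helpers `report_failures` and `fetch_one` are hypothetical and not part of kaiju-makedb; the URL and `wget` options are copied from the command quoted above, except that `-q` is swapped for `-nv` so that errors are printed.

```shell
#!/bin/sh
# Hypothetical helper (not part of kaiju-makedb): read one ID per line from
# stdin, run the given command with the ID appended as its last argument,
# and print every ID whose command failed together with its exit status.
# Returns non-zero if at least one ID failed.
report_failures() {
  failed=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue
    status=0
    # stdin is redirected so the command cannot swallow the ID list.
    "$@" "$id" </dev/null || status=$?
    if [ "$status" -ne 0 ]; then
      echo "FAILED id=$id exit=$status"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Hypothetical wrapper around the wget call from the script excerpt;
# assumes $DB is set as in kaiju-makedb. -nv replaces -q to show errors.
fetch_one() {
  wget -nv -np --recursive -P "$DB/source" \
    "https://public.sfb.uit.no/MarRef/genomes/$1/protein.faa"
}

# Possible usage (performs network access, so it is left commented out):
# jq .graph[].x "$DB/MarRef.json" | tr -d '"' | report_failures fetch_one
```

This trades the parallel download for readable diagnostics, so it is meant for debugging only. The printed exit status can then be looked up in the exit status section of the wget manual to see whether the failure is a network problem or an error response from the server.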

spencerlong1 commented 1 year ago

I'm having this exact issue. Did it get solved in the end?