DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

centrifuge-download error extra operand '.gz' #201

Open oatesa opened 3 years ago

oatesa commented 3 years ago

We recently decided to update our index, so we started from scratch (deleting the old, dated index, etc.).

We ran centrifuge-download -o library -m -d "archaea,bacteria,viral,fungi" refseq >> seqid2taxid.map. Archaea downloaded successfully, but we received errors with bacteria:

4247/19206
basename: extra operand '.gz'
Try 'basename --help' for more information.

Error downloading na/562 na_genomic.fna.gz!
basename: extra operand '.gz'
Try 'basename --help' for more information.

Overall, this affected 5 genomes (the download stopped 5 short of the total) and did not progress to the viral or fungi downloads. I have run those as a separate job (currently running), but I wondered what this error could relate to and how to correct it.

Thanks in advance
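
One possible reading of the message "Error downloading na/562 na_genomic.fna.gz!" is that the affected rows carry na instead of a URL in the path column. That is an assumption, not something confirmed in this thread. The sketch below applies the filename-appending sed expression quoted from centrifuge-download later in this thread to such a row; the function name append_genomic_filename and the sample row are hypothetical.

```shell
# Hypothetical wrapper around the sed expression quoted from centrifuge-download.
# It appends "<last path component>_genomic.fna.gz" to the end of each line.
append_genomic_filename() {
  sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#'
}

# Assumed input row: taxid 562, then "na" where a URL was expected.
# With no "/" on the line, the pattern captures the whole line, tab included.
printf '562\tna\n' | append_genomic_filename
# prints: 562<TAB>na/562<TAB>na_genomic.fna.gz
```

If the script then passes that value unquoted to basename together with a .gz suffix (also an assumption), GNU basename would receive three operands and report extra operand '.gz', which matches the error above.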

mourisl commented 3 years ago

It feels like the file assembly_summary.txt or assembly_summary_filtered.txt is wrong (missing some columns, or some tabs have become spaces). Does the same issue happen in your separate job?
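
A quick way to test this hypothesis might look like the sketch below. It assumes the summary file is tab-separated with comment lines starting with #, and that field 6 holds the taxid and field 20 the path (the field numbers used with cut later in this thread). Both function names are hypothetical and not part of centrifuge.

```shell
# Hypothetical helper: for non-comment rows, print "<field count><TAB><rows>".
# A clean file should show a single field count.
field_count_histogram() {
  awk -F'\t' '!/^#/ { n[NF]++ } END { for (k in n) printf "%d\t%d\n", k, n[k] }' "$1" | sort -n
}

# Hypothetical helper: print taxid and path for rows whose path column
# (assumed to be field 20) is not an ftp://, http:// or https:// URL.
rows_without_url() {
  awk -F'\t' '!/^#/ && $20 !~ /^(ftp|https?):\/\// { print $6 "\t" $20 }' "$1"
}
```

For example, field_count_histogram library/bacteria/assembly_summary_filtered.txt, where the path is a guess based on the -o library option used above. Rows listed by rows_without_url would be candidates for the failing downloads.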

oatesa commented 3 years ago

The same issue occurs if run separately, but it only occurs with bacteria and only with 5 genomes. It works fine with viral, fungi and archaea.

oatesa commented 3 years ago

Any updates on this? Colleagues are having the same issue when trying to download bacterial genomes.

mourisl commented 3 years ago

I could not reproduce this error on our server. What is the bash version on your system?

stephaniepillay commented 3 years ago

@mourisl Hi, I have the exact same issue. It works for archaea but not for bacteria. The bash version I am using is 4.2.46. @oatesa did you manage to solve this issue?

oatesa commented 3 years ago

@stephaniepillay @mourisl No, we didn't solve the issue. The workaround was to change the order of the download so that bacteria is last in the list; the job then runs, but we accept that those few sequences won't download. For me it was 5 sequences, which didn't seem too much of an issue in the grander scheme of the bacterial sequences, but others had around 50 that failed. Those individuals repeated the download step for bacteria several times, and this number reduced.

oatesa commented 3 years ago

> @mourisl Hi, i have the exact same issue. it works for archaea but not for bacteria. the bash version i am using is version 4.2.46. @oatesa did you manage to solve this issue?

@mourisl bash, version 4.2.46

oatesa commented 3 years ago

> I could not reproduce this error on our server. What is the bash version on your system?

@mourisl bash, version 4.2.46

afkoeppeleri commented 2 years ago

I'm getting this exact same issue with make p+h+v. A handful of the bacterial downloads fail with:

"Error downloading na/654 na_genomic.fna.gz!" "extra operand ‘.gz’ Try 'basename --help' for more information."

This then crashes the rest of the build.

Bash version: 4.2.46(2)-release
Linux version: 4.14.248-189.473.amzn2.x86_64

Did anyone ever find a solution? If not, is there a recommended workaround?

xiaoyunguo commented 2 years ago

I have the same error and am looking for a solution.

gbikpi commented 2 years ago

Hi everyone,

In case this is still an issue for some of you, the problem seems to be similar to #221 which has been solved by @mourisl in commit a5c09bb29a3a828d88be49c55353cd84b6b9bbad but only for the viral database. So I solved this issue by downloading the updated centrifuge-download and changing if [[ "$DOMAIN" == "viral" ]]; then into if [[ "$DOMAIN" == "viral" || "$DOMAIN" == "bacteria" ]]; then.

@mourisl It seems that the patch actually works for all domains since it handles both cases (field 20 or 21) so the "if" condition seems unnecessary to me. By the way, the line echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2 should be placed outside (before) the "if" statement.

That's all, I hope this will be helpful.

mourisl commented 2 years ago

@gbikpi Thanks for testing! I will update the script and merge it to the master.

mourisl commented 2 years ago

The patch is merged to the master branch. Now all the domains will use the (maybe) more robust parsing strategy.

poursalavati commented 2 years ago

> The patch is merged to the master branch. Now all the domains will use the (maybe) more robust parsing strategy.

Thanks for updating, but unfortunately there is still something wrong with centrifuge-download. I tried building it from master again, but I got this for viral (bacteria works fine):

basename: extra operand ‘_genomic.fna.gz’
Try 'basename --help' for more information.
cat: ./viral/: Is a directory

domenico-simone commented 2 years ago

Hello,

I can confirm there's still the same error for viral genomes.

oatesa commented 2 years ago

We recently went through downloading/building an index again for a new student, and a few of the bacterial genomes failed (20 didn't download). This time we also had the issue everyone else was having with the viral genomes, with that download failing completely.

CuypersBart commented 2 years ago

I am having exactly the same issue as @oatesa describes. Is there a workaround possible?

CuypersBart commented 2 years ago

Note: no error message is displayed when the last 20 bacterial genomes fail to download.
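
Because failures like this can be silent, a rough after-the-fact check may help. The sketch below assumes the downloaded genomes end up as uncompressed *.fna files somewhere under the output directory, which is not confirmed in this thread. The function name check_fasta_dir is hypothetical, and the check only looks at whether each file is non-empty and starts with a FASTA header character.

```shell
# Hypothetical helper, not part of centrifuge.
# Lists every *.fna file under directory $1 that is empty or does not
# begin with ">". File names containing newlines are not handled.
check_fasta_dir() {
  find "$1" -type f -name '*.fna' | sort | while IFS= read -r f; do
    if [ ! -s "$f" ] || [ "$(head -c 1 "$f")" != ">" ]; then
      printf 'suspect: %s\n' "$f"
    fi
  done
}
```

For example, check_fasta_dir library/bacteria (an assumed path). No output means no file was flagged; it does not prove that every expected genome is present, so comparing the number of files against the expected count reported by the script is still worthwhile.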

omrctnr commented 2 years ago

> Hi everyone,
>
> In case this is still an issue for some of you, the problem seems to be similar to #221 which has been solved by @mourisl in commit a5c09bb29a3a828d88be49c55353cd84b6b9bbad but only for the viral database. So I solved this issue by downloading the updated centrifuge-download and changing if [[ "$DOMAIN" == "viral" ]]; then into if [[ "$DOMAIN" == "viral" || "$DOMAIN" == "bacteria" ]]; then.
>
> @mourisl It seems that the patch actually works for all domains since it handles both cases (field 20 or 21) so the "if" condition seems unnecessary to me. By the way, the line echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2 should be placed outside (before) the "if" statement.
>
> That's all, I hope this will be helpful.

Hello,

I also encountered the same error, especially while downloading the viral genomes. As mentioned above, I replaced centrifuge-download with the version at https://raw.githubusercontent.com/DaehwanKimLab/centrifuge/viral_download/centrifuge-download. Now it runs correctly.

virocamp commented 2 years ago

Hello, I am also running into the same problem (Error downloading....basename: extra operand '_genomic.fna.gz') as others with the virus genomes, with any of these commands:

make v
make p
make p_compressed+h+v
centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map
centrifuge-download -o library -m -d "viral" refseq > seqid2taxid.map

etc.

I tried the fix listed by others using the updated centrifuge-download linked above by @mourisl which apparently recently worked for @omrctnr, and also changing the line in question. In my summary files, the domain is in field 20.

curl v 7.82.0
bash v 4.4.19

Hope this is solvable.

josemunozc commented 1 year ago

I'm having the same problem when running:

cd indices
make p_compressed+h+v
...
mkdir -p reference-sequences
[[ -d tmp_p_compressed+h+v ]] && rm -rf tmp_p_compressed+h+v; mkdir -p tmp_p_compressed+h+v
Downloading and dust-masking viral
centrifuge-download -o tmp_p_compressed+h+v  -m -a "Any" -d "viral" -P 1 refseq > \
    tmp_p_compressed+h+v/all-viral-any_level.map
Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt ...
basename: extra operand ‘_genomic.fna.gz’
Try 'basename --help' for more information.
gzip: tmp_p_compressed+h+v/viral/.gz: unknown suffix -- ignored
....

I'm using centrifuge/1.0.4. Looking at the script centrifuge-download, I can see this section:

    if [[ "$DOMAIN" == "viral" ]]; then
      ## Wrong columns in viral assembly summary files - the path is sometimes in field 20, sometimes 21
      cut -f "$TAXID_FIELD,$FTP_PATH_FIELD,$FTP_PATH_FIELD2" "$ASSEMBLY_SUMMARY_FILE" | \
       sed 's/^\(.*\)\t\(ftp:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(ftp:.*\)/\1\t\2/' | \
      sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#' |\
         tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail "$@"' _ | count $N_EXPECTED
    else
      echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2
      cut -f "$TAXID_FIELD,$FTP_PATH_FIELD" "$ASSEMBLY_SUMMARY_FILE" | sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#' |\
         tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail "$@"' _ | count $N_EXPECTED
    fi
    echo >&2

I think the problem is in this sed command:

sed 's/^\(.*\)\t\(ftp:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(ftp:.*\)/\1\t\2/'

It is looking for a string with ftp: in the output of cut -f "$TAXID_FIELD,$FTP_PATH_FIELD,$FTP_PATH_FIELD2" "$ASSEMBLY_SUMMARY_FILE", which for me looks something like:

$ head -n 1 tmp_p_compressed+h+v/viral/assembly_summary_filtered.txt | cut -f 6,20,21
10243   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174

So I changed the command to search for https: instead, sed 's/^\(.*\)\t\(https:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(https:.*\)/\1\t\2/', and it seems to work. But I'm not sure whether this would break anything else. Is there a way to sanity-check that the files were downloaded correctly?
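
A variant that does not depend on the URL scheme might be more tolerant if the summary files switch between ftp: and https: paths. The following is a minimal sketch, not the project's fix: it assumes GNU sed (for \t), the function name pick_path is hypothetical, and the example.org URLs are placeholders.

```shell
# Hypothetical helper, not part of centrifuge-download.
# Input:  taxid <TAB> field A <TAB> field B, with the URL in either A or B.
# Output: taxid <TAB> URL.
# Matches any lowercase scheme followed by "://" instead of a fixed "ftp:".
pick_path() {
  sed 's|^\(.*\)\t\([a-z]*://.*\)\t.*|\1\t\2|;s|^\(.*\)\t.*\t\([a-z]*://.*\)|\1\t\2|'
}

# URL in the first candidate field:
printf '10243\thttps://example.org/all/GCF_1\tna\n' | pick_path
# URL in the second candidate field:
printf '10243\tna\tftp://example.org/all/GCF_1\n' | pick_path
```

Both calls print the taxid followed by a tab and the URL. Rows in which neither candidate field contains a URL pass through unchanged, so they would still need to be filtered out or reported separately.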