oatesa opened this issue 3 years ago
It feels like the file assembly_summary.txt or assembly_summary_filtered.txt is wrong (missing some columns, or some tabs became spaces). Does the same issue happen in your separate job?
Same issue if run separately, but it's only occurring with bacteria and only with 5 genomes; it works fine with viral, fungi and archaea.
Any updates on this? Colleagues are having the same issue when trying to download bacterial genomes.
I could not reproduce this error on our server. What is the bash version on your system?
@mourisl Hi, I have the exact same issue. It works for archaea but not for bacteria. The bash version I am using is 4.2.46. @oatesa did you manage to solve this issue?
@stephaniepillay @mourisl No, we didn't solve the issue. The workaround was to change the order of the download so that bacteria was last on the list; the job would then run, but we accepted that those few sequences would not download. For me it was 5 sequences, which didn't seem too much of an issue in the grander scheme of the bacterial sequences, but others had around 50 that failed. Those individuals repeated the download step for bacteria several times and the number reduced.
@mourisl bash, version 4.2.46
I'm getting this exact same issue with make p+h+v. A handful of the bacterial downloads fail with:
Error downloading na/654 na_genomic.fna.gz!
basename: extra operand ‘.gz’
Try 'basename --help' for more information.
This then crashes the rest of the build.
Bash version: 4.2.46(2)-release
Linux version: 4.14.248-189.473.amzn2.x86_64
Did anyone ever find a solution? If not, is there a recommended workaround?
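As a side note on the error text itself: GNU basename prints "extra operand" when it is given more than two operands, which happens in a shell script when an unquoted variable containing whitespace gets word-split. A minimal sketch (the file name below is made up to resemble the na/654 message above):

```shell
# Unquoted expansion splits on the space, so basename receives three
# operands and refuses with "extra operand '.gz'".
name="na/654 na_genomic.fna.gz"   # hypothetical malformed entry
basename $name .gz                # error: extra operand
# Quoting keeps it as a single path operand; the .gz suffix is stripped.
basename "$name" .gz              # prints: 654 na_genomic.fna
```

This does not prove that is what centrifuge-download hits, but the symptom is consistent with a summary-file row whose path field is "na" producing a mangled name.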
I have the same error and am looking for a solution.
Hi everyone,
In case this is still an issue for some of you, the problem seems to be similar to #221, which was solved by @mourisl in commit a5c09bb29a3a828d88be49c55353cd84b6b9bbad, but only for the viral database. So I solved this issue by downloading the updated centrifuge-download and changing
if [[ "$DOMAIN" == "viral" ]]; then
into
if [[ "$DOMAIN" == "viral" || "$DOMAIN" == "bacteria" ]]; then
@mourisl It seems that the patch actually works for all domains, since it handles both cases (field 20 or 21), so the "if" condition seems unnecessary to me. By the way, the line
echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2
should be placed outside (before) the "if" statement.
That's all, I hope this will be helpful.
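To illustrate the point that the viral branch's parsing already handles both layouts, here is a sketch of it applied to two made-up rows (taxid plus the two candidate path fields, with the URL in either position; host names and accessions are invented — the sed expressions themselves are the ones from the script quoted later in this thread):

```shell
# Sketch: the first sed keeps the URL whichever of the two candidate
# fields it sits in; the second appends the *_genomic.fna.gz file name
# by duplicating the last path component.
parse() {
  sed 's/^\(.*\)\t\(ftp:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(ftp:.*\)/\1\t\2/' |
    sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#'
}
printf '562\tftp://host/g/GCF_A\tna\n562\tna\tftp://host/g/GCF_B\n' | parse
# Expected output (tab-separated):
#   562   ftp://host/g/GCF_A/GCF_A_genomic.fna.gz
#   562   ftp://host/g/GCF_B/GCF_B_genomic.fna.gz
```

Since it degrades gracefully to the two-column case, dropping the "if" and using this path for every domain looks safe, as @gbikpi suggests.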
@gbikpi Thanks for testing! I will update the script and merge it to the master.
The patch is merged to the master branch. Now all the domains will use the (maybe) more robust parsing strategy.
The patch is merged to the master branch. Now all the domains will use the (maybe) more robust parsing strategy.
Thanks for updating, but unfortunately there is still something wrong with centrifuge-download. I tried make from master again, but I got this for viral (bacteria works fine):
basename: extra operand ‘_genomic.fna.gz’
Try 'basename --help' for more information.
cat: ./viral/: Is a directory
Hello,
I can confirm there's still the same error for viral genomes.
We recently went through downloading/building an index again for a new student, and a few of the bacterial genomes failed (20 didn't download). This time we also had the issue everyone else was having with the viral genomes, with that download completely failing.
I am having exactly the same issue as @oatesa describes. Is there a workaround possible?
Note: no error message is displayed for the last 20 bacterial genomes that fail to download.
Hello,
I also encountered the same error, especially while downloading the viral genomes. As mentioned above, I replaced centrifuge-download with the version from https://raw.githubusercontent.com/DaehwanKimLab/centrifuge/viral_download/centrifuge-download. Now it runs correctly.
Hello,
I am also running into the same problem (Error downloading ... basename: extra operand '_genomic.fna.gz') as others with the viral genomes, with any of these commands:
make v
make p
make p_compressed+h+v
centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map
centrifuge-download -o library -m -d "viral" refseq > seqid2taxid.map
etc.
I tried the fix listed by others, using the updated centrifuge-download linked above by @mourisl (which apparently recently worked for @omrctnr), and also tried changing the line in question. In my summary files, the FTP path is in field 20.
curl v7.82.0, bash v4.4.19
Hope this is solvable.
I'm having the same problem when running:
cd indices
make p_compressed+h+v
...
mkdir -p reference-sequences
[[ -d tmp_p_compressed+h+v ]] && rm -rf tmp_p_compressed+h+v; mkdir -p tmp_p_compressed+h+v
Downloading and dust-masking viral
centrifuge-download -o tmp_p_compressed+h+v -m -a "Any" -d "viral" -P 1 refseq > \
tmp_p_compressed+h+v/all-viral-any_level.map
Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt ...
basename: extra operand ‘_genomic.fna.gz’
Try 'basename --help' for more information.
gzip: tmp_p_compressed+h+v/viral/.gz: unknown suffix -- ignored
....
I'm using centrifuge/1.0.4. Looking at the script centrifuge-download, I can see this section:
if [[ "$DOMAIN" == "viral" ]]; then
## Wrong columns in viral assembly summary files - the path is sometimes in field 20, sometimes 21
cut -f "$TAXID_FIELD,$FTP_PATH_FIELD,$FTP_PATH_FIELD2" "$ASSEMBLY_SUMMARY_FILE" | \
sed 's/^\(.*\)\t\(ftp:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(ftp:.*\)/\1\t\2/' | \
sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#' |\
tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail "$@"' _ | count $N_EXPECTED
else
echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2
cut -f "$TAXID_FIELD,$FTP_PATH_FIELD" "$ASSEMBLY_SUMMARY_FILE" | sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#' |\
tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail "$@"' _ | count $N_EXPECTED
fi
echo >&2
I think the problem is in this sed command:
sed 's/^\(.*\)\t\(ftp:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(ftp:.*\)/\1\t\2/'
It is looking for a string starting with ftp: in the output of
cut -f "$TAXID_FIELD,$FTP_PATH_FIELD,$FTP_PATH_FIELD2" "$ASSEMBLY_SUMMARY_FILE"
which for me looks something like:
$ head -n 1 tmp_p_compressed+h+v/viral/assembly_summary_filtered.txt | cut -f 6,20,21
10243 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174
So I changed the command to search for https: instead:
sed 's/^\(.*\)\t\(https:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(https:.*\)/\1\t\2/'
and it seems to work. But I'm not sure if this would break anything else. Is there a way to sanity-check that the files were downloaded correctly?
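One cheap sanity check for the changed pattern, without touching the real downloads, is to feed it two made-up rows covering both field layouts (host and accession invented for the sketch):

```shell
# Hypothetical check: the https: variant of the sed should keep the URL
# whether it sits in the 2nd or the 3rd tab-separated field.
pick_url() {
  sed 's/^\(.*\)\t\(https:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(https:.*\)/\1\t\2/'
}
printf '10243\thttps://host/g/GCF_X\tna\n' | pick_url   # URL in field 2
printf '10243\tna\thttps://host/g/GCF_X\n' | pick_url   # URL in field 3
# Both print (tab-separated): 10243   https://host/g/GCF_X
```

Since NCBI now serves https:// paths in assembly_summary.txt, this change looks reasonable, but rows that still carry ftp:// URLs would no longer match, so matching both schemes may be safer.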
We recently decided to update our index, so we started from scratch (deleting the old/dated index etc.).
We ran centrifuge-download -o library -m -d "archaea,bacteria,viral,fungi" refseq >> seqid2taxid.map. Archaea was successful, but we received errors with bacteria:
4247/19206 basename: extra operand '.gz'
Try 'basename --help' for more information.
Error downloading na/562 na_genomic.fna.gz!
basename: extra operand '.gz'
Try 'basename --help' for more information.
Overall this related to 5 genomes (the download stopped 5 short of the total) and did not progress to the viral or fungi downloads. I have run these as a separate job (currently running), but I wondered what this error could relate to and how to correct it.
Thanks in advance
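Judging by the na/562 paths in the error messages, one guess (an assumption, not verified against the real files) is that a few rows in assembly_summary_filtered.txt have na in the path column, so no valid URL can be built for them. A toy sketch of how one might spot such rows, using the field numbers mentioned elsewhere in this thread (taxid in field 6, path in field 20) on fabricated 20-column rows:

```shell
# mkrow fabricates a 20-column summary row: taxid in field 6,
# download path in field 20, "x" everywhere else.
mkrow() {
  awk -v tax="$1" -v path="$2" 'BEGIN {
    for (i = 1; i <= 20; i++) {
      f = (i == 6 ? tax : (i == 20 ? path : "x"))
      printf "%s%s", f, (i < 20 ? "\t" : "\n")
    }
  }'
}
# Rows whose path column is literally "na" cannot yield a download URL.
{ mkrow 562 "https://host/g/GCF_A"; mkrow 654 "na"; } |
  awk -F'\t' '$20 == "na" { print $6 }'
# -> 654
```

Running the same awk filter over the real assembly_summary_filtered.txt (with the field number adjusted to its actual layout) would show whether the handful of missing genomes are exactly these rows.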