Open kapsakcj opened 5 years ago
Update - I haven't dug into this issue any deeper, but I did encounter the same issue with the E. coli data set as well. All the checksums for the reads (fastq.gz) matched OK, but for the one .gbk file, the checksum did not match.
$ GenFSGopher.pl -o e-coli-O121-H19-sample-data/ --layout onedir --numcpus 4 Downloads/datasets/datasets/Escherichia_coli_1405WAEXK-1.tsv
make: Entering directory '/home/staphb/e-coli-O121-H19-sample-data'
wget -O tree.dnd 'http://api.opentreeoflife.org/v2/study/ot_301/tree/tree3.tre'
esearch -db assembly -query 'GCA_000703365.1 NOT refseq[filter]' | elink -target nuccore -name assembly_nuccore_insdc | efetch -format gbwithparts > 2011C-3609.gbk
Downloading 2014C-3598_1.fastq.gz SRR1609861
Downloading 2014C-3599_1.fastq.gz SRR1609862
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1609861
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1609862
--2019-01-22 19:21:04-- http://api.opentreeoflife.org/v2/study/ot_301/tree/tree3.tre
Resolving api.opentreeoflife.org (api.opentreeoflife.org)... 34.221.232.64
Connecting to api.opentreeoflife.org (api.opentreeoflife.org)|34.221.232.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://api.opentreeoflife.org/v2/study/ot_301/tree/tree3.tre [following]
--2019-01-22 19:21:05-- https://api.opentreeoflife.org/v2/study/ot_301/tree/tree3.tre
Connecting to api.opentreeoflife.org (api.opentreeoflife.org)|34.221.232.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 280 [text/plain]
Saving to: ‘tree.dnd’
tree.dnd 100%[============================================================>] 280 --.-KB/s in 0s
2019-01-22 19:21:05 (17.9 MB/s) - ‘tree.dnd’ saved [280/280]
Downloading 2014C-3600_1.fastq.gz SRR1609871
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1609871
Downloading 2014C-3656_1.fastq.gz SRR1610029
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610029
Read 519740 spots for SRR1609861
Written 519740 spots for SRR1609861
mv ./SRR1609861_1.fastq.gz 2014C-3598_1.fastq.gz
Downloading 2014C-3655_1.fastq.gz SRR1610028
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610028
Read 704756 spots for SRR1609862
Written 704756 spots for SRR1609862
mv ./SRR1609862_1.fastq.gz 2014C-3599_1.fastq.gz
Downloading 2014C-3840_1.fastq.gz SRR1610031
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610031
Read 685446 spots for SRR1610028
Written 685446 spots for SRR1610028
mv ./SRR1610028_1.fastq.gz 2014C-3655_1.fastq.gz
Downloading 2014C-3857_1.fastq.gz SRR1610033
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610033
Read 1847324 spots for SRR1610029
Written 1847324 spots for SRR1610029
mv ./SRR1610029_1.fastq.gz 2014C-3656_1.fastq.gz
Downloading 2014C-3907_1.fastq.gz SRR1610034
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610034
Read 1182546 spots for SRR1610031
Written 1182546 spots for SRR1610031
mv ./SRR1610031_1.fastq.gz 2014C-3840_1.fastq.gz
Downloading 2014C-3850_1.fastq.gz SRR1610032
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610032
Read 1597142 spots for SRR1609871
Written 1597142 spots for SRR1609871
mv ./SRR1609871_1.fastq.gz 2014C-3600_1.fastq.gz
esearch -db assembly -query 'GCA_000703365.1 NOT refseq[filter]' | elink -target nuccore -name assembly_nuccore_insdc | efetch -format fasta > 2011C-3609.fasta
mv ./SRR1609861_2.fastq.gz 2014C-3598_2.fastq.gz
mv ./SRR1609862_2.fastq.gz 2014C-3599_2.fastq.gz
mv ./SRR1609871_2.fastq.gz 2014C-3600_2.fastq.gz
mv ./SRR1610029_2.fastq.gz 2014C-3656_2.fastq.gz
mv ./SRR1610028_2.fastq.gz 2014C-3655_2.fastq.gz
mv ./SRR1610031_2.fastq.gz 2014C-3840_2.fastq.gz
Read 748141 spots for SRR1610034
Written 748141 spots for SRR1610034
mv ./SRR1610034_1.fastq.gz 2014C-3907_1.fastq.gz
mv ./SRR1610034_2.fastq.gz 2014C-3907_2.fastq.gz
Read 1143685 spots for SRR1610033
Written 1143685 spots for SRR1610033
mv ./SRR1610033_1.fastq.gz 2014C-3857_1.fastq.gz
mv ./SRR1610033_2.fastq.gz 2014C-3857_2.fastq.gz
Read 1124002 spots for SRR1610032
Written 1124002 spots for SRR1610032
mv ./SRR1610032_1.fastq.gz 2014C-3850_1.fastq.gz
mv ./SRR1610032_2.fastq.gz 2014C-3850_2.fastq.gz
rm -f sha256sum.txt
echo "342ea035eb6c14b230c6c31ebda5594fa22628e6a871e6cfc19c156a595e3dfc 2011C-3609.gbk" >> sha256sum.txt
echo "0264dc57015100fe41a9a9167145439eb383ea8216b49785a09ce5157293e080 2014C-3598_1.fastq.gz" >> sha256sum.txt
echo "f3d4eb398736b36babe38659a7b616a6938b0be3837d418d3097aace6ac32e5f 2014C-3598_2.fastq.gz" >> sha256sum.txt
echo "e703810bb7a488e99cfc7f72e31379cebd69f73fd660188723c3369bd0075aba 2014C-3599_1.fastq.gz" >> sha256sum.txt
echo "96a69d312a10798e0f28c4de3f17ca8050078f8f03d82238bd48600309941b74 2014C-3599_2.fastq.gz" >> sha256sum.txt
echo "5cb588797c9a3494d37dab3e2a20c4a82d78bccffd33fec8e38b6d122b03d058 2014C-3600_1.fastq.gz" >> sha256sum.txt
echo "9f4532d45c3cb1325af69bda5162f0ba36d3559c282855578c9ffd9206e3b474 2014C-3600_2.fastq.gz" >> sha256sum.txt
echo "fdbb2fdf7a294fb7831d2a75953c304c4da766816f51f8695b5eff2149176780 2014C-3656_1.fastq.gz" >> sha256sum.txt
echo "bfff096d4119bb7a69f9e6e70aac45eb9763437cdbeeb968aafd53bd955d8a40 2014C-3656_2.fastq.gz" >> sha256sum.txt
echo "af41ef9c5a6d397a6a8921e889395b1fd7d339e3f87336e31ec440db9fe041df 2014C-3655_1.fastq.gz" >> sha256sum.txt
echo "b92844a406681174a0457010f67862acc40718c70167e22d37d23cf6fefad353 2014C-3655_2.fastq.gz" >> sha256sum.txt
echo "c4958e7a3f0c541831fffcaa8c75aafbda4ddf90a6d886bc44d236318dd2abcb 2014C-3840_1.fastq.gz" >> sha256sum.txt
echo "85b8f96a6e68e52ffd4a71715a36372ee3556e912c29d384adb9f1daae8cba92 2014C-3840_2.fastq.gz" >> sha256sum.txt
echo "15a2ce983032b83665dff32390f7cd08459f82433ebf180d376f12c03d867c0a 2014C-3857_1.fastq.gz" >> sha256sum.txt
echo "e729ba4a6fdfd2e4c9928e34a095865ffe1af79056c739fd83f404551f2d7e2e 2014C-3857_2.fastq.gz" >> sha256sum.txt
echo "dcb4efeadf7a70021fc5302787769dbda28967ff10c8441ddd9ae0b11cf3a368 2014C-3907_1.fastq.gz" >> sha256sum.txt
echo "0442d6d2d03c69570caab7a6c68f80d8843c668c19f4fd025eddf900d96a5f3d 2014C-3907_2.fastq.gz" >> sha256sum.txt
echo "be970c3123b93b3f583b8e21687bc12b56e3a83a8d1bea913a2533395fc0558e 2014C-3850_1.fastq.gz" >> sha256sum.txt
echo "90e14abb9ecdc9fbb8fcc395f6a9f446afd3610de6395e90eb3ff5cab09b93a2 2014C-3850_2.fastq.gz" >> sha256sum.txt
sha256sum -c sha256sum.txt
2011C-3609.gbk: FAILED
2014C-3598_1.fastq.gz: OK
2014C-3598_2.fastq.gz: OK
2014C-3599_1.fastq.gz: OK
2014C-3599_2.fastq.gz: OK
2014C-3600_1.fastq.gz: OK
2014C-3600_2.fastq.gz: OK
2014C-3656_1.fastq.gz: OK
2014C-3656_2.fastq.gz: OK
2014C-3655_1.fastq.gz: OK
2014C-3655_2.fastq.gz: OK
2014C-3840_1.fastq.gz: OK
2014C-3840_2.fastq.gz: OK
2014C-3857_1.fastq.gz: OK
2014C-3857_2.fastq.gz: OK
2014C-3907_1.fastq.gz: OK
2014C-3907_2.fastq.gz: OK
2014C-3850_1.fastq.gz: OK
2014C-3850_2.fastq.gz: OK
sha256sum: WARNING: 1 computed checksum did NOT match
Makefile:92: recipe for target 'sha256sum.txt' failed
make: *** [sha256sum.txt] Error 1
make: Leaving directory '/home/staphb/e-coli-O121-H19-sample-data'
GenFSGopher.pl: ERROR: `make` failed. Please address all errors and then run the make command again:
nice make all --directory=e-coli-O121-H19-sample-data/ --jobs=4
Died at /home/staphb/Downloads/datasets/scripts/GenFSGopher.pl line 332.
@kapsakcj I just wanted to let you know I haven't forgotten! I'll get back to it in the next couple of weeks. I think it might have to do with NCBI Assembly versioning.
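One rough way to probe the versioning idea is to compare the accession.version values recorded in two downloads of the same .gbk file. The sketch below assumes the file is a standard GenBank flat file with one VERSION line per record; the function name gbk_versions and the file names in the usage note are hypothetical and not part of GenFSGopher.pl.

```shell
# Hypothetical helper (not part of GenFSGopher.pl): print the accession.version
# of every record in a GenBank flat file, sorted so that two downloads of the
# same assembly can be compared line by line.
gbk_versions() {
  # Each GenBank record carries a line starting with the VERSION keyword;
  # the second whitespace-separated field is the accession.version.
  awk '/^VERSION/ { print $2 }' "$1" | sort
}
```

In bash, something like `diff <(gbk_versions old/2011C-3609.gbk) <(gbk_versions new/2011C-3609.gbk)` would then show whether any record version changed. If the versions are identical and the checksum still differs, the change is more likely in record metadata (for example the modification date on the LOCUS line) than in the sequence version.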
Thanks! It's not a high priority or anything. Kevin was using it for the AMD academy workshop, and he only wanted the reads anyways. Just thought I'd start an issue in case others encounter this issue in the future.
I think I may have addressed this issue in pull request #5(?)
I'll leave this open just in case, for now
I tried downloading the datasets again, using the latest commits/PR from @dfornika and ran into the same issue with the E. coli dataset, and a weird issue with the Salmonella dataset.
E. coli. Ran this command:
GenFSGopher.pl --layout onedir -o ecoli datasets/Escherichia_coli_1405WAEXK-1.tsv --numcpus 12
Saw this in the output:
sha256sum -c sha256sum.txt
2011C-3609.gbk: FAILED
2014C-3598_1.fastq.gz: OK
2014C-3598_2.fastq.gz: OK
...
sha256sum: WARNING: 1 computed checksum did NOT match
Makefile:92: recipe for target 'sha256sum.txt' failed
make: *** [sha256sum.txt] Error 1
If I run sha256sum on the GenBank file manually, I get the following output, which differs from the hash Dan put in his PR.
sha256sum 2011C-3609.gbk
6911f9e255f7fe48a79a9fb157bd844ab87970bd312a62fee7b05c85102c1bb3 2011C-3609.gbk
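Since `sha256sum -c` only prints FAILED, it can help to see the expected and computed hashes side by side. The following is a minimal sketch, assuming a checksum list in the `HASH  FILENAME` text-mode layout that the Makefile writes to sha256sum.txt, with file names relative to the current directory and containing no spaces; the function name show_mismatches is hypothetical.

```shell
# Hypothetical helper: for each "HASH  FILENAME" entry in a checksum list,
# report the expected and the computed SHA-256 when they differ.
# Returns non-zero if any file is missing or mismatched.
show_mismatches() {
  list="$1"
  status=0
  while read -r expected name; do
    # Skip blank or malformed lines.
    [ -n "$name" ] || continue
    if [ ! -f "$name" ]; then
      echo "$name: MISSING"
      status=1
      continue
    fi
    # sha256sum prints "HASH  FILENAME"; keep only the hash.
    actual=$(sha256sum "$name" | cut -d' ' -f1)
    if [ "$actual" != "$expected" ]; then
      echo "$name: expected $expected, got $actual"
      status=1
    fi
  done < "$list"
  return "$status"
}
```

Run from inside the output directory as `show_mismatches sha256sum.txt`. For the E. coli run above, it would be expected to print a single line for 2011C-3609.gbk showing the hash from the list next to the one computed locally.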
Salmonella is a different issue. When I ran this:
GenFSGopher.pl --layout onedir -o tuna-scrape-sample-data-take3/ datasets/Salmonella_enterica_1203NYJAP-1.tsv
It fails on this command in the script:
esearch -db assembly -query 'GCA_000698635.1 NOT refseq[filter]' | elink -target nuccore -name assembly_nuccore_insdc | efetch -format gb -style withparts > CFSAN000191.gbk
ERROR in acheck test: Empty result - nothing to do
ERROR in fetch input: Empty result - nothing to do
Makefile:28: recipe for target 'CFSAN000191.gbk' failed
make: *** [CFSAN000191.gbk] Error 255
@lskatz and I did a little testing, and when we manually run that command with the elink step adjusted to remove the -name assembly_nuccore_insdc portion, it returns sequences:
esearch -db assembly -query 'GCA_000698635.1 NOT refseq[filter]' | elink -target nuccore | efetch -format fasta | grep ">"
>NZ_JMMH01000061.1 Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000191 CFSAN000191_contig0060, whole genome shotgun sequence
>NZ_JMMH01000060.1 Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000191 CFSAN000191_contig0059, whole genome shotgun sequence
>NZ_JMMH01000059.1 Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000191 CFSAN000191_contig0058, whole genome shotgun sequence
...
...
Hope this helps! Happy to continue testing to get these things resolved.
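To compare the two elink variants above without scrolling through FASTA headers, one option is to count the records each pipeline returns. This is only a sketch: the function name count_fasta_records is hypothetical, and the commented usage reuses the commands from this thread unchanged.

```shell
# Hypothetical helper: count FASTA records (lines starting with ">") read
# from standard input or from the files given as arguments.
count_fasta_records() {
  # grep -c exits non-zero when the count is 0 but still prints "0";
  # "|| true" keeps that case from aborting scripts that use "set -e".
  grep -c '^>' "$@" || true
}

# Example usage with the commands from this thread (requires network access
# and NCBI EDirect, so it is left commented out here):
#   esearch -db assembly -query 'GCA_000698635.1 NOT refseq[filter]' \
#     | elink -target nuccore -name assembly_nuccore_insdc \
#     | efetch -format fasta | count_fasta_records
#   esearch -db assembly -query 'GCA_000698635.1 NOT refseq[filter]' \
#     | elink -target nuccore \
#     | efetch -format fasta | count_fasta_records
```

A count of 0 for the first variant and a non-zero count for the second would match the "Empty result" errors reported above.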
Hi, this is my first time using this repo, so I was wondering if anyone has encountered this issue before. I'd like to download the Salmonella_enterica_1203NYJAP-1 set of reads and assemblies for use in a bioinfo workshop. I believe I have all dependencies:
WGS-standards-and-analysis/datasets repo this morning
OS - Ubuntu 16.04
Then I ran:
Output looked good for a while, downloading reads for each isolate...
Then when it got around to checking the checksums, it failed for every .gbk file.
I've tried this a couple of times now, and can't figure out why all the checksums are not matching. Any ideas or potential solutions? CC @kevinlibuit