WGS-standards-and-analysis / datasets

Benchmark datasets for WGS analysis

checksums for reads are OK, but not for assemblies #12

Open kapsakcj opened 5 years ago

kapsakcj commented 5 years ago

Hi, this is my first time using this repo, so I was wondering if anyone has encountered this issue before. I'd like to download the Salmonella_enterica_1203NYJAP-1 set of reads and assemblies for use in a bioinfo workshop.

I believe I have all dependencies:

OS - Ubuntu 16.04

Then I ran:

GenFSGopher.pl -o tuna-scrape-sample-data-take2/ --layout onedir ~/Downloads/datasets/datasets/Salmonella_enterica_1203NYJAP-1.tsv

Output looked good for a while, downloading reads for each isolate...

...
2019-01-15 20:17:26 (25.2 MB/s) - ‘tree.dnd’ saved [407/407]

Downloading CFSAN000189_1.fastq.gz SRR498276
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR498276
Read 1832760 spots for SRR498276
Written 1832760 spots for SRR498276
mv ./SRR498276_1.fastq.gz CFSAN000189_1.fastq.gz
mv ./SRR498276_2.fastq.gz CFSAN000189_2.fastq.gz
esearch -db assembly -query 'GCA_000439415.1 NOT refseq[filter]' | elink -target nuccore -name assembly_nuccore_insdc | efetch -format gbwithparts > CFSAN000189.gbk
...

Then when it got around to checking the checksums, it failed for every .gbk file.

...
echo "0b589587529572591da23ed6ba6640875df2bfbcb488a2470f3463d4bdb85aef  CFSAN001140_1.fastq.gz" >> sha256sum.txt
echo "8a262325936ae5731857d49a6388143b4683785f91e9185a2f9a8e54de82b2e4  CFSAN001140_2.fastq.gz" >> sha256sum.txt
echo "4a9eaaccaf2ab67798d929079ae632c86050db53ed0f958bf7f228f07ed78d73  CFSAN001140.gbk" >> sha256sum.txt
sha256sum -c sha256sum.txt
CFSAN000189_1.fastq.gz: OK
CFSAN000189_2.fastq.gz: OK
CFSAN000189.gbk: FAILED
CFSAN000191_1.fastq.gz: OK
CFSAN000191_2.fastq.gz: OK
CFSAN000191.gbk: FAILED
CFSAN000211_1.fastq.gz: OK
CFSAN000211_2.fastq.gz: OK
CFSAN000211.gbk: FAILED
CFSAN000212_1.fastq.gz: OK
CFSAN000212_2.fastq.gz: OK
CFSAN000212.gbk: FAILED
...
sha256sum: WARNING: 23 computed checksums did NOT match
Makefile:336: recipe for target 'sha256sum.txt' failed
make: *** [sha256sum.txt] Error 1
make: Leaving directory '/home/staphb/tuna-scrape-sample-data-take2'
GenFSGopher.pl: ERROR: `make` failed.  Please address all errors and then run the make command again:
  nice make all --directory=tuna-scrape-sample-data-take2/ --jobs=1
Died at /home/staphb/Downloads/datasets/scripts/GenFSGopher.pl line 332.

I've tried this a couple of times now, and can't figure out why the checksums for the .gbk files do not match while the ones for the reads do. Any ideas or potential solutions? CC @kevinlibuit

kapsakcj commented 5 years ago

Update - I haven't dug into this issue any deeper, but I did encounter the same issue with the E. coli data set as well. All the checksums for the reads (fastq.gz) matched OK, but for the one .gbk file, the checksum did not match.

$ GenFSGopher.pl -o e-coli-O121-H19-sample-data/ --layout onedir --numcpus 4 Downloads/datasets/datasets/Escherichia_coli_1405WAEXK-1.tsv
make: Entering directory '/home/staphb/e-coli-O121-H19-sample-data'
wget -O tree.dnd 'http://api.opentreeoflife.org/v2/study/ot_301/tree/tree3.tre'
esearch -db assembly -query 'GCA_000703365.1 NOT refseq[filter]' | elink -target nuccore -name assembly_nuccore_insdc | efetch -format gbwithparts > 2011C-3609.gbk
Downloading 2014C-3598_1.fastq.gz SRR1609861
Downloading 2014C-3599_1.fastq.gz SRR1609862
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1609861
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1609862
--2019-01-22 19:21:04--  http://api.opentreeoflife.org/v2/study/ot_301/tree/tree3.tre
Resolving api.opentreeoflife.org (api.opentreeoflife.org)... 34.221.232.64
Connecting to api.opentreeoflife.org (api.opentreeoflife.org)|34.221.232.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://api.opentreeoflife.org/v2/study/ot_301/tree/tree3.tre [following]
--2019-01-22 19:21:05--  https://api.opentreeoflife.org/v2/study/ot_301/tree/tree3.tre
Connecting to api.opentreeoflife.org (api.opentreeoflife.org)|34.221.232.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 280 [text/plain]
Saving to: ‘tree.dnd’

tree.dnd                          100%[============================================================>]     280  --.-KB/s    in 0s

2019-01-22 19:21:05 (17.9 MB/s) - ‘tree.dnd’ saved [280/280]

Downloading 2014C-3600_1.fastq.gz SRR1609871
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1609871
Downloading 2014C-3656_1.fastq.gz SRR1610029
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610029
Read 519740 spots for SRR1609861
Written 519740 spots for SRR1609861
mv ./SRR1609861_1.fastq.gz 2014C-3598_1.fastq.gz
Downloading 2014C-3655_1.fastq.gz SRR1610028
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610028
Read 704756 spots for SRR1609862
Written 704756 spots for SRR1609862
mv ./SRR1609862_1.fastq.gz 2014C-3599_1.fastq.gz
Downloading 2014C-3840_1.fastq.gz SRR1610031
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610031
Read 685446 spots for SRR1610028
Written 685446 spots for SRR1610028
mv ./SRR1610028_1.fastq.gz 2014C-3655_1.fastq.gz
Downloading 2014C-3857_1.fastq.gz SRR1610033
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610033
Read 1847324 spots for SRR1610029
Written 1847324 spots for SRR1610029
mv ./SRR1610029_1.fastq.gz 2014C-3656_1.fastq.gz
Downloading 2014C-3907_1.fastq.gz SRR1610034
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610034
Read 1182546 spots for SRR1610031
Written 1182546 spots for SRR1610031
mv ./SRR1610031_1.fastq.gz 2014C-3840_1.fastq.gz
Downloading 2014C-3850_1.fastq.gz SRR1610032
fastq-dump --defline-seq '@$ac_$sn[_$rn]/$ri' --defline-qual '+' --split-files -O . --gzip SRR1610032
Read 1597142 spots for SRR1609871
Written 1597142 spots for SRR1609871
mv ./SRR1609871_1.fastq.gz 2014C-3600_1.fastq.gz
esearch -db assembly -query 'GCA_000703365.1 NOT refseq[filter]' | elink -target nuccore -name assembly_nuccore_insdc | efetch -format fasta > 2011C-3609.fasta
mv ./SRR1609861_2.fastq.gz 2014C-3598_2.fastq.gz
mv ./SRR1609862_2.fastq.gz 2014C-3599_2.fastq.gz
mv ./SRR1609871_2.fastq.gz 2014C-3600_2.fastq.gz
mv ./SRR1610029_2.fastq.gz 2014C-3656_2.fastq.gz
mv ./SRR1610028_2.fastq.gz 2014C-3655_2.fastq.gz
mv ./SRR1610031_2.fastq.gz 2014C-3840_2.fastq.gz
Read 748141 spots for SRR1610034
Written 748141 spots for SRR1610034
mv ./SRR1610034_1.fastq.gz 2014C-3907_1.fastq.gz
mv ./SRR1610034_2.fastq.gz 2014C-3907_2.fastq.gz
Read 1143685 spots for SRR1610033
Written 1143685 spots for SRR1610033
mv ./SRR1610033_1.fastq.gz 2014C-3857_1.fastq.gz
mv ./SRR1610033_2.fastq.gz 2014C-3857_2.fastq.gz
Read 1124002 spots for SRR1610032
Written 1124002 spots for SRR1610032
mv ./SRR1610032_1.fastq.gz 2014C-3850_1.fastq.gz
mv ./SRR1610032_2.fastq.gz 2014C-3850_2.fastq.gz
rm -f sha256sum.txt
echo "342ea035eb6c14b230c6c31ebda5594fa22628e6a871e6cfc19c156a595e3dfc  2011C-3609.gbk" >> sha256sum.txt
echo "0264dc57015100fe41a9a9167145439eb383ea8216b49785a09ce5157293e080  2014C-3598_1.fastq.gz" >> sha256sum.txt
echo "f3d4eb398736b36babe38659a7b616a6938b0be3837d418d3097aace6ac32e5f  2014C-3598_2.fastq.gz" >> sha256sum.txt
echo "e703810bb7a488e99cfc7f72e31379cebd69f73fd660188723c3369bd0075aba  2014C-3599_1.fastq.gz" >> sha256sum.txt
echo "96a69d312a10798e0f28c4de3f17ca8050078f8f03d82238bd48600309941b74  2014C-3599_2.fastq.gz" >> sha256sum.txt
echo "5cb588797c9a3494d37dab3e2a20c4a82d78bccffd33fec8e38b6d122b03d058  2014C-3600_1.fastq.gz" >> sha256sum.txt
echo "9f4532d45c3cb1325af69bda5162f0ba36d3559c282855578c9ffd9206e3b474  2014C-3600_2.fastq.gz" >> sha256sum.txt
echo "fdbb2fdf7a294fb7831d2a75953c304c4da766816f51f8695b5eff2149176780  2014C-3656_1.fastq.gz" >> sha256sum.txt
echo "bfff096d4119bb7a69f9e6e70aac45eb9763437cdbeeb968aafd53bd955d8a40  2014C-3656_2.fastq.gz" >> sha256sum.txt
echo "af41ef9c5a6d397a6a8921e889395b1fd7d339e3f87336e31ec440db9fe041df  2014C-3655_1.fastq.gz" >> sha256sum.txt
echo "b92844a406681174a0457010f67862acc40718c70167e22d37d23cf6fefad353  2014C-3655_2.fastq.gz" >> sha256sum.txt
echo "c4958e7a3f0c541831fffcaa8c75aafbda4ddf90a6d886bc44d236318dd2abcb  2014C-3840_1.fastq.gz" >> sha256sum.txt
echo "85b8f96a6e68e52ffd4a71715a36372ee3556e912c29d384adb9f1daae8cba92  2014C-3840_2.fastq.gz" >> sha256sum.txt
echo "15a2ce983032b83665dff32390f7cd08459f82433ebf180d376f12c03d867c0a  2014C-3857_1.fastq.gz" >> sha256sum.txt
echo "e729ba4a6fdfd2e4c9928e34a095865ffe1af79056c739fd83f404551f2d7e2e  2014C-3857_2.fastq.gz" >> sha256sum.txt
echo "dcb4efeadf7a70021fc5302787769dbda28967ff10c8441ddd9ae0b11cf3a368  2014C-3907_1.fastq.gz" >> sha256sum.txt
echo "0442d6d2d03c69570caab7a6c68f80d8843c668c19f4fd025eddf900d96a5f3d  2014C-3907_2.fastq.gz" >> sha256sum.txt
echo "be970c3123b93b3f583b8e21687bc12b56e3a83a8d1bea913a2533395fc0558e  2014C-3850_1.fastq.gz" >> sha256sum.txt
echo "90e14abb9ecdc9fbb8fcc395f6a9f446afd3610de6395e90eb3ff5cab09b93a2  2014C-3850_2.fastq.gz" >> sha256sum.txt
sha256sum -c sha256sum.txt
2011C-3609.gbk: FAILED
2014C-3598_1.fastq.gz: OK
2014C-3598_2.fastq.gz: OK
2014C-3599_1.fastq.gz: OK
2014C-3599_2.fastq.gz: OK
2014C-3600_1.fastq.gz: OK
2014C-3600_2.fastq.gz: OK
2014C-3656_1.fastq.gz: OK
2014C-3656_2.fastq.gz: OK
2014C-3655_1.fastq.gz: OK
2014C-3655_2.fastq.gz: OK
2014C-3840_1.fastq.gz: OK
2014C-3840_2.fastq.gz: OK
2014C-3857_1.fastq.gz: OK
2014C-3857_2.fastq.gz: OK
2014C-3907_1.fastq.gz: OK
2014C-3907_2.fastq.gz: OK
2014C-3850_1.fastq.gz: OK
2014C-3850_2.fastq.gz: OK
sha256sum: WARNING: 1 computed checksum did NOT match
Makefile:92: recipe for target 'sha256sum.txt' failed
make: *** [sha256sum.txt] Error 1
make: Leaving directory '/home/staphb/e-coli-O121-H19-sample-data'
GenFSGopher.pl: ERROR: `make` failed.  Please address all errors and then run the make command again:
  nice make all --directory=e-coli-O121-H19-sample-data/ --jobs=4
Died at /home/staphb/Downloads/datasets/scripts/GenFSGopher.pl line 332.

lskatz commented 5 years ago

@kapsakcj I just wanted to let you know I haven't forgotten! I'll get back to it in the next couple of weeks. I think it might have to do with NCBI Assembly versioning.

kapsakcj commented 5 years ago

Thanks! It's not a high priority or anything. Kevin was using it for the AMD academy workshop, and he only wanted the reads anyway. Just thought I'd start an issue in case others run into this in the future.
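
For anyone else who only needs the reads, one possible workaround is to verify just the read entries and skip the .gbk lines. This is a minimal sketch, not something GenFSGopher.pl does itself. It uses throwaway stand-in files (demo_1.fastq.gz and demo.gbk are made-up names) and assumes sha256sum.txt has the same "hash, two spaces, filename" layout shown in the logs above.

```shell
#!/usr/bin/env bash
export LC_ALL=C   # keep the "OK"/"FAILED" messages unlocalized

# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Throwaway stand-ins for a read file and an assembly file (hypothetical names).
printf '@r1\nACGT\n+\nIIII\n' > demo_1.fastq.gz
printf 'LOCUS demo\n' > demo.gbk

# Build a sha256sum.txt in the same layout as the logs:
# a correct entry for the reads and a deliberately wrong one for the assembly.
sha256sum demo_1.fastq.gz > sha256sum.txt
printf '%064d  demo.gbk\n' 0 >> sha256sum.txt

# Check only the entries that do not end in .gbk; "-" makes sha256sum read the list from stdin.
reads_report=$(grep -v '\.gbk$' sha256sum.txt | sha256sum -c -)
reads_status=$?

echo "$reads_report"
echo "exit status for reads only: $reads_status"
```

On a real download, the same grep filter could be run inside the output directory against the sha256sum.txt that make wrote, assuming that file is still present after the failed run.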

dfornika commented 5 years ago

I think I may have addressed this issue in pull request #5(?)

lskatz commented 5 years ago

I'll leave this open just in case, for now

kapsakcj commented 5 years ago

I tried downloading the datasets again, using the latest commits/PR from @dfornika and ran into the same issue with the E. coli dataset, and a weird issue with the Salmonella dataset.

E. coli. Ran this command:

GenFSGopher.pl --layout onedir -o ecoli datasets/Escherichia_coli_1405WAEXK-1.tsv --numcpus 12

Saw this in the output:

sha256sum -c sha256sum.txt
2011C-3609.gbk: FAILED
2014C-3598_1.fastq.gz: OK
2014C-3598_2.fastq.gz: OK
...
sha256sum: WARNING: 1 computed checksum did NOT match
Makefile:92: recipe for target 'sha256sum.txt' failed
make: *** [sha256sum.txt] Error 1

If I run sha256sum on the GenBank file manually, this is the output, which differs from the checksum Dan put in his PR.

sha256sum 2011C-3609.gbk
6911f9e255f7fe48a79a9fb157bd844ab87970bd312a62fee7b05c85102c1bb3  2011C-3609.gbk
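
One way to test the idea that the record changed on the NCBI side while the sequence stayed the same would be to hash only the bases and compare that with the whole-file hash. The sketch below is an illustration under that assumption, not a confirmed diagnosis. The two records are made-up GenBank-like files that differ only in the LOCUS date, and seq_hash is a hypothetical helper name.

```shell
#!/usr/bin/env bash
export LC_ALL=C
workdir=$(mktemp -d)

# Two made-up GenBank-like records: same sequence, different LOCUS date.
cat > "$workdir/old.gbk" <<'EOF'
LOCUS       DEMO0001                  12 bp    DNA     linear   BCT 01-JAN-2015
ORIGIN
        1 acgtacgtac gt
//
EOF
cat > "$workdir/new.gbk" <<'EOF'
LOCUS       DEMO0001                  12 bp    DNA     linear   BCT 01-JAN-2019
ORIGIN
        1 acgtacgtac gt
//
EOF

# Hash only the bases between ORIGIN and //, dropping coordinates, spaces, and newlines.
seq_hash() {
  awk '/^ORIGIN/ {inseq=1; next} /^\/\// {inseq=0} inseq' "$1" \
    | tr -d ' 0-9\n' | sha256sum | cut -d' ' -f1
}

whole_old=$(sha256sum "$workdir/old.gbk" | cut -d' ' -f1)
whole_new=$(sha256sum "$workdir/new.gbk" | cut -d' ' -f1)
seq_old=$(seq_hash "$workdir/old.gbk")
seq_new=$(seq_hash "$workdir/new.gbk")

if [ "$whole_old" = "$whole_new" ]; then echo "whole file: same"; else echo "whole file: differs"; fi
if [ "$seq_old" = "$seq_new" ]; then echo "sequence: same"; else echo "sequence: differs"; fi
```

If a freshly downloaded .gbk and an older copy gave the same sequence hash but different whole-file hashes, that would point at metadata or annotation changes rather than a broken download. For multi-record files the helper concatenates the sequences of all records in file order.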

Salmonella is a different issue. When I ran this:

GenFSGopher.pl --layout onedir -o tuna-scrape-sample-data-take3/ datasets/Salmonella_enterica_1203NYJAP-1.tsv

It hits an error on this command in the script:

esearch -db assembly -query 'GCA_000698635.1 NOT refseq[filter]' | elink -target nuccore -name assembly_nuccore_insdc | efetch -format gb -style withparts > CFSAN000191.gbk
ERROR in acheck test: Empty result - nothing to do
ERROR in fetch input: Empty result - nothing to do

Makefile:28: recipe for target 'CFSAN000191.gbk' failed
make: *** [CFSAN000191.gbk] Error 255

@lskatz and I did a little testing. When we manually run that command with the -name assembly_nuccore_insdc portion removed from the elink step, it does return records:

esearch -db assembly -query 'GCA_000698635.1 NOT refseq[filter]' | elink -target nuccore | efetch -format fasta | grep ">"
>NZ_JMMH01000061.1 Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000191 CFSAN000191_contig0060, whole genome shotgun sequence
>NZ_JMMH01000060.1 Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000191 CFSAN000191_contig0059, whole genome shotgun sequence
>NZ_JMMH01000059.1 Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000191 CFSAN000191_contig0058, whole genome shotgun sequence
...
...

Hope this helps! Happy to continue testing to get these things resolved.