Closed lskatz closed 2 years ago
Sure thing, we can add an additional test to the test layer of the Dockerfile for 0.6.3.
Could add another RUN layer below this one:
Thanks @kapsakcj that would be perfect!
Ah, I was getting a checksum mismatch error with v0.6.3 on line 87 of the Dockerfile (linked above), but I'm trying out v0.7 now.
I see that the checksums for some of the samples in the vocvoi dataset were updated here: https://github.com/CDCgov/datasets-sars-cov-2/commit/bd8d3ccf17fdbdc056618e3e2df2fcf5f13a2535
Hopefully the new version will resolve the issue, but either way I will start a PR with your suggested addition.
OK, so upgrading to the v0.7 code did fix the checksum mismatches for the vocvoi dataset, but now I'm getting checksum mismatch errors with the SNF-A dataset. I'm thinking those checksums in the TSV need to be updated?
I should probably open an issue over at https://github.com/CDCgov/datasets-sars-cov-2/issues, but just to provide a little context:
When running this in the test layer of the Docker image:
GenFSGopher.pl -o SNF-A-output /home/user/datasets-sars-cov-2/datasets/sars-cov-2-SNF-A.tsv --numcpus $(nproc --all) --layout onedir --compressed
I'm getting errors like this:
MA_MGH_00229_1.fastq: sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
85.3% -- replaced with MA_MGH_00229_1.fastq.gz
MA_MGH_00229_2.fastq: sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
I started a branch with the v0.7 dockerfile here: https://github.com/StaPH-B/docker-builds/blob/cjk-sarscov2-datasets-add-test/datasets-sars-cov-2/0.7.0/Dockerfile
It fails to build on the last test layer, due to the checksums mismatch with the SNF-A dataset
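For context, the "computed checksum did NOT match" warning is what sha256sum prints in check mode when a file's digest differs from the recorded one. Here is a minimal sketch of the same comparison done by hand, which can help tell a stale checksum in the TSV from a truncated download. The function name verify_sha256 is made up for illustration and is not part of GenFSGopher.pl; it assumes sha256sum and cut are on the PATH.

```sh
#!/bin/sh
# Hypothetical helper (not from the datasets repo): compare a file's
# SHA-256 digest with an expected value.
# Returns 0 on a match, 1 on a mismatch, 2 if the file does not exist.
verify_sha256() {
    file=$1
    expected=$2
    [ -f "$file" ] || return 2
    # Reading from stdin makes sha256sum print "<digest>  -",
    # so the first space-separated field is the digest itself.
    actual=$(sha256sum < "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}
```

Calling it as verify_sha256 MA_MGH_00229_1.fastq.gz followed by the digest from the TSV would show whether that one file matches, without the rest of the pipeline running.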
Can you try again? I just uncovered several checksum mistakes and I think I ironed it all out in v0.7.1.
Sure, I upgraded to v0.7.1 and am still seeing some checksum mismatch warnings with the SNF-A dataset.
Here's the line that throws the warnings: https://github.com/StaPH-B/docker-builds/blob/17a0264d223f575a7991f005b8c052a0d371915a/datasets-sars-cov-2/0.7.1/Dockerfile#L95
The checksum warnings start here:
MA_MGH_00229_1.fastq: sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
85.3% -- replaced with MA_MGH_00229_1.fastq.gz
MA_MGH_00229_2.fastq: sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
83.9% -- replaced with MA_MGH_00229_2.fastq.gz
MA_MGH_00230_1.fastq: sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
85.0% -- replaced with MA_MGH_00230_1.fastq.gz
MA_MGH_00230_2.fastq: sha256sum: WARNING: 1 computed checksum did NOT match
Might be related to some wget warnings that occur earlier on in the script?
fastq-dump --defline-seq '@$ac_$sn/$ri' --defline-qual '+' --split-3 -O . SRR11953686
ERROR: wget command failed ( Thu Jun 9 14:30:11 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
HTTP/1.1 400 Bad Request
ERROR: FAILURE ( Thu Jun 9 14:30:11 UTC 2022 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -db nucleotide -id Unable,to,locate,xtract,executable.,Please,execute,the,following,nquire,dwn,ftp.ncbi.nlm.nih.gov,entrez,entrezdirect,xtract.Linux.gz,gunzip,f,xtract.Linux.gz,chmod,x,xtract.Linux -rettype fasta -retmode text -tool edirect -edirect 17.1 -edirect_os Linux -email root@0e6c1f813b2a
EMPTY RESULT
QUERY FAILURE
ERROR: wget command failed ( Thu Jun 9 14:30:13 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
HTTP/1.1 400 Bad Request
esearch -db nucleotide -query 'MT520505.1' | efetch -format fasta > MA_MGH_00304.fna
ERROR: wget command failed ( Thu Jun 9 14:30:15 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
HTTP/1.1 400 Bad Request
WARNING: FAILURE ( Thu Jun 9 14:30:14 UTC 2022 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -db nucleotide -id Unable,to,locate,xtract,executable.,Please,execute,the,following,nquire,dwn,ftp.ncbi.nlm.nih.gov,entrez,entrezdirect,xtract.Linux.gz,gunzip,f,xtract.Linux.gz,chmod,x,xtract.Linux -rettype fasta -retmode text -tool edirect -edirect 17.1 -edirect_os Linux -email root@0e6c1f813b2a
EMPTY RESULT
SECOND ATTEMPT
ERROR: wget command failed ( Thu Jun 9 14:30:16 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
HTTP/1.1 400 Bad Request
WARNING: FAILURE ( Thu Jun 9 14:30:16 UTC 2022 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -db nucleotide -id Unable,to,locate,xtract,executable.,Please,execute,the,following,nquire,dwn,ftp.ncbi.nlm.nih.gov,entrez,entrezdirect,xtract.Linux.gz,gunzip,f,xtract.Linux.gz,chmod,x,xtract.Linux -rettype fasta -retmode text -tool edirect -edirect 17.1 -edirect_os Linux -email root@0e6c1f813b2a
EMPTY RESULT
LAST ATTEMPT
ERROR: wget command failed ( Thu Jun 9 14:30:18 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
HTTP/1.1 400 Bad Request
ERROR: FAILURE ( Thu Jun 9 14:30:18 UTC 2022 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -db nucleotide -id Unable,to,locate,xtract,executable.,Please,execute,the,following,nquire,dwn,ftp.ncbi.nlm.nih.gov,entrez,entrezdirect,xtract.Linux.gz,gunzip,f,xtract.Linux.gz,chmod,x,xtract.Linux -rettype fasta -retmode text -tool edirect -edirect 17.1 -edirect_os Linux -email root@0e6c1f813b2a
EMPTY RESULT
QUERY FAILURE
ERROR: wget command failed ( Thu Jun 9 14:30:20 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
HTTP/1.1 400 Bad Request
Downloading MA_MGH_00305_1.fastq SRR11954295
fastq-dump --defline-seq '@$ac_$sn/$ri' --defline-qual '+' --split-3 -O . SRR11954295
Read 1125995 spots for SRR11953686
Written 1125995 spots for SRR11953686
if [ ! -f ./SRR11953686_1.fastq ]; then mv ./SRR11953686.fastq ./SRR11953686_1.fastq; elif [ -f ./SRR11953686_1.fastq -a -f ./SRR11953686_2.fastq ]; then rm -f ./SRR11953686.fastq; fi
touch ./SRR11953686_2.fastq
mv ./SRR11953686_1.fastq 'MA_MGH_00304_1.fastq'
esearch -db nucleotide -query 'MT520507.1' | efetch -format fasta > MA_MGH_00305.fna
Read 725329 spots for SRR11954295
Written 725329 spots for SRR11954295
Yes it's possible it's related to the wget errors. I actually had to restart the GitHub Actions CI multiple times (without committing any new changes to the code) because it was being run in parallel and I think it was getting blocked by NCBI. I can either reduce the parallelism or see if you want to try again.
I tried again but was hit with the same checksum mismatch errors.
This time I reduced GenFSGopher.pl to use --numcpus 1 and even used my NCBI_API_KEY to increase the rate limit. I have a feeling it's not related to rate-limiting/NCBI blocking. https://github.com/StaPH-B/docker-builds/commit/becbd6ff4add3fb9c6601b82b487a038d8695385
So that's 1 issue.
Another (I think) separate issue is the wget errors, which occur when fetching the assemblies. For example, when the script runs esearch -db nucleotide -query 'MT520263.1' | efetch -format fasta > MA_MGH_00229.fna, it returns the wget error:
ERROR: wget command failed ( Thu Jun 9 21:30:27 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&api_key=<SCRUBBED-BY-CURTIS>&tool=edirect&edirect=17.1&edirect_os=Linux&email=curtis.kapsak%40theiagen.com https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
HTTP/1.1 400 Bad Request
WARNING: FAILURE ( Thu Jun 9 21:30:27 UTC 2022 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -db nucleotide -id Unable,to,locate,xtract,executable.,Please,execute,the,following,nquire,dwn,ftp.ncbi.nlm.nih.gov,entrez,entrezdirect,xtract.Linux.gz,gunzip,f,xtract.Linux.gz,chmod,x,xtract.Linux -rettype fasta -retmode text -api_key <SCRUBBED-BY-CURTIS> -tool edirect -edirect 17.1 -edirect_os Linux -email curtis.kapsak@theiagen.com
EMPTY RESULT
SECOND ATTEMPT
Looks like we're missing the xtract executable, based on the cryptic, hard-to-read error message. I'm trying to install it manually, but it's a PITA trying to run nquire in the docker image build. I can't figure out why it won't download the xtract binary.
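To make the error easier to read: the -id value in the failing request is not an accession. It looks like an error message whose words were joined with commas and then percent-encoded (%2C is an encoded comma). A small sketch for turning it back into words follows; decode_id is a hypothetical helper written for this thread, not part of EDirect.

```sh
#!/bin/sh
# Hypothetical helper: replace encoded (%2C) and literal commas with spaces
# so the message hidden in the -id parameter becomes readable.
decode_id() {
    printf '%s\n' "$1" | sed -e 's/%2C/ /g' -e 's/,/ /g'
}

decode_id 'Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.'
# prints: Unable to locate xtract executable.
```

Run on the full value from the log, the remainder reads as instructions to fetch xtract.Linux.gz with nquire, gunzip it, and chmod it, with punctuation such as the dashes and plus sign already stripped.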
I'm trying to add this to the Dockerfile, but it keeps failing since the nquire command doesn't actually download the xtract binary:
RUN /edirect/nquire -dwn ftp.ncbi.nlm.nih.gov entrez/entrezdirect xtract.Linux.gz 2>&1 && ls -a && find / -name "xtract*" && \
gunzip -f xtract.Linux.gz && \
mv -v xtract.Linux /usr/local/bin/xtract && \
chmod +x /usr/local/bin/xtract
docker build output:
RUN /edirect/nquire -dwn ftp.ncbi.nlm.nih.gov entrez/entrezdirect xtract.Linux.gz 2>&1 && ls -a && find / -name "xtract*" && gunzip -f xtract.Linux.gz && mv -v xtract.Linux /usr/local/bin/xtract && chmod +x /usr/local/bin/xtract
---> Running in 12d6fb7ae634
.
..
vocvoi-output
/edirect/xtract
/edirect/cmd/xtract.go
/edirect/help/xtract-unix.txt
/edirect/help/xtract-examples.txt
/edirect/help/xtract-help.txt
/edirect/help/xtract-keys.txt
/edirect/help/xtract-internal.txt
gzip: xtract.Linux.gz: No such file or directory
The command '/bin/sh -c /edirect/nquire -dwn ftp.ncbi.nlm.nih.gov entrez/entrezdirect xtract.Linux.gz 2>&1 && ls -a && find / -name "xtract*" && gunzip -f xtract.Linux.gz && mv -v xtract.Linux /usr/local/bin/xtract && chmod +x /usr/local/bin/xtract' returned a non-zero code: 1
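In the output above the build only stops at gunzip, one step after the download silently produced nothing. As a sketch, a guard right after the download step would make the failure point at the real cause. The function name require_download is an assumption for illustration; only the file name xtract.Linux.gz comes from the log.

```sh
#!/bin/sh
# Hypothetical guard: fail with a clear message when an expected
# download is missing or empty, instead of letting a later step trip on it.
require_download() {
    if [ ! -s "$1" ]; then
        echo "download step did not produce a non-empty file: $1" >&2
        return 1
    fi
}
```

Inside a single RUN chain, where defining a function is awkward, the inline equivalent would be a test -s xtract.Linux.gz placed between the nquire call and gunzip.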
Thanks for humoring me. I spent some time going over it again and again for a new version, v0.7.2, and I think it is all fixed. Can you try again? https://github.com/CDCgov/datasets-sars-cov-2/releases/tag/v0.7.2
yes, will try again when I have some time early next week. Thanks for working on the checksums! I really want to get this working as intended!
ok, just opened a PR with my latest Dockerfile for 0.7.2. The upgrade to v0.7.2, together with the addition of the NCBI xtract tool, resolved both the wget errors and the checksum mismatch errors.
The SNF-A dataset will take a long time to finish downloading, compressing, and checksum verification, but we should see everything pass here 🤞 https://github.com/StaPH-B/docker-builds/runs/6865408463?check_suite_focus=true
Caught by the new sra toolkit. I'll have to update versions.
😩 which means updating the sratoolkit container....
I don't want to go down that road again. It's not fun or easy ☹️
I would try NCBI's docker image or try putting a pre-compiled binary in a docker image to keep things simple.
https://github.com/ncbi/sra-tools/wiki/SRA-tools-docker
https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit#the-current-binaries-for
Addressed via #387. Tests pass when building the docker image locally, but fail on GH Actions runners due to the sheer size of the fastq files from the SNF-A dataset. The hard drive fills up and runs out of space before the test finishes.
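One way to make that failure mode obvious is to check free disk space before starting the large download. The following is only a sketch: has_free_kb is a hypothetical helper, and how much space the SNF-A test actually needs is not stated in this thread, so any threshold passed to it would be a guess.

```sh
#!/bin/sh
# Hypothetical helper: succeed only if the filesystem holding DIR
# has at least NEED_KB kilobytes available.
has_free_kb() {
    dir=$1
    need_kb=$2
    # df -P prints the fixed POSIX layout and -k reports 1024-byte blocks;
    # row 2, column 4 is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$need_kb" ]
}
```

A CI step could call has_free_kb . with a chosen threshold and skip or fail early with a readable message, rather than dying partway through compression.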
The Docker image for 0.7.2 is now deployed to Docker Hub and Quay under the latest and 0.7.2 image tags:
https://hub.docker.com/r/staphb/datasets-sars-cov-2/tags
https://quay.io/repository/staphb/datasets-sars-cov-2?tab=tags
Contact Details
gzu2@cdc.gov
What container needs an update?
Hi! I was wondering if you could add one more test onto the sars-cov-2 dataset to include a dataset that has assemblies too? For example, https://github.com/CDCgov/datasets-sars-cov-2/blob/master/datasets/sars-cov-2-SNF-A.tsv
Currently it only tests the smallest dataset, which is smart, but that means it does not test the assembly download. Thank you for your consideration on that!