StaPH-B / docker-builds

:package: :whale: Dockerfiles and documentation on tools for public health bioinformatics
GNU General Public License v3.0

[Request An Update]: sars-cov-2 dataset #384

Closed lskatz closed 2 years ago

lskatz commented 2 years ago

Contact Details

gzu2@cdc.gov

What container needs an update?

Hi! I was wondering if you could add one more test onto the sars-cov-2 dataset to include a dataset that has assemblies too? For example, https://github.com/CDCgov/datasets-sars-cov-2/blob/master/datasets/sars-cov-2-SNF-A.tsv

Currently it only tests the smallest dataset, which is smart, but that means it does not test the assembly download. Thank you for your consideration on that!

kapsakcj commented 2 years ago

Sure thing, we can add in an additional test in the test layer of the dockerfile for 0.6.3.

Could add another RUN layer below this one:

https://github.com/StaPH-B/docker-builds/blob/38c55edbd9db64347b47aada57199a9217bb05fa/datasets-sars-cov-2/0.6.3/Dockerfile#L87

lskatz commented 2 years ago

Thanks @kapsakcj that would be perfect!

kapsakcj commented 2 years ago

Ah, I was getting a checksum mismatch error with v0.6.3 on line 87 of the Dockerfile (linked above), but I'm trying out v0.7 now.

I see that some of the checksums for some of the samples in the vocvoi dataset were updated here https://github.com/CDCgov/datasets-sars-cov-2/commit/bd8d3ccf17fdbdc056618e3e2df2fcf5f13a2535

Hopefully the new version will resolve the issue, but either way I will start a PR with your suggested addition

kapsakcj commented 2 years ago

OK, so upgrading to v0.7 code did fix the checksums mismatches for the vocvoi dataset, but now I'm getting checksum mismatch errors with the SNF-A dataset. I'm thinking those checksums in the TSV need to be updated?

Probably should open an issue over on https://github.com/CDCgov/datasets-sars-cov-2/issues

but just to provide a little context...

upon running in the docker image testing layer:

GenFSGopher.pl -o SNF-A-output /home/user/datasets-sars-cov-2/datasets/sars-cov-2-SNF-A.tsv --numcpus $(nproc --all) --layout onedir --compressed

I'm getting errors like this:

MA_MGH_00229_1.fastq:   sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
 85.3% -- replaced with MA_MGH_00229_1.fastq.gz
MA_MGH_00229_2.fastq:   sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match

kapsakcj commented 2 years ago

I started a branch with the v0.7 dockerfile here: https://github.com/StaPH-B/docker-builds/blob/cjk-sarscov2-datasets-add-test/datasets-sars-cov-2/0.7.0/Dockerfile

It fails to build on the last test layer, due to the checksums mismatch with the SNF-A dataset
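To narrow down which side of a mismatch is wrong, one option is to recompute a file's digest by hand and compare it with the value recorded in the TSV. The following is a minimal sketch, not part of GenFSGopher.pl; the function name `verify_sha256` is hypothetical, and it assumes `sha256sum` and `awk` are available in the image.

```shell
# verify_sha256 <file> <expected_hex_digest>
# Hypothetical helper: recompute the SHA-256 of <file> and compare it
# with the digest copied from the dataset TSV.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256sum "$file" | awk '{print $1}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (expected $expected, got $actual)" >&2
    return 1
  fi
}

# Example call (file name taken from the log above; the digest is a placeholder):
#   verify_sha256 MA_MGH_00229_1.fastq "<digest from the TSV>"
```

Printing both digests makes it easier to tell whether the TSV entry is stale or the download itself changed.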

lskatz commented 2 years ago

Can you try again? I just uncovered several checksum mistakes and I think I ironed it all out in v0.7.1.

kapsakcj commented 2 years ago

Sure, I upgraded to v0.7.1 and am still seeing some checksum mismatch warnings with the SNF-A dataset

Here's the line that is throwing the warnings: https://github.com/StaPH-B/docker-builds/blob/17a0264d223f575a7991f005b8c052a0d371915a/datasets-sars-cov-2/0.7.1/Dockerfile#L95

the checksum warnings start here:

MA_MGH_00229_1.fastq:   sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
 85.3% -- replaced with MA_MGH_00229_1.fastq.gz
MA_MGH_00229_2.fastq:   sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
 83.9% -- replaced with MA_MGH_00229_2.fastq.gz
MA_MGH_00230_1.fastq:   sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
sha256sum: WARNING: 1 computed checksum did NOT match
 85.0% -- replaced with MA_MGH_00230_1.fastq.gz
MA_MGH_00230_2.fastq:   sha256sum: WARNING: 1 computed checksum did NOT match

Might be related to some wget warnings that occur earlier on in the script?

fastq-dump --defline-seq '@$ac_$sn/$ri' --defline-qual '+' --split-3 -O . SRR11953686
 ERROR:  wget command failed ( Thu Jun  9 14:30:11 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
  HTTP/1.1 400 Bad Request
 ERROR:  FAILURE ( Thu Jun  9 14:30:11 UTC 2022 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -db nucleotide -id Unable,to,locate,xtract,executable.,Please,execute,the,following,nquire,dwn,ftp.ncbi.nlm.nih.gov,entrez,entrezdirect,xtract.Linux.gz,gunzip,f,xtract.Linux.gz,chmod,x,xtract.Linux -rettype fasta -retmode text -tool edirect -edirect 17.1 -edirect_os Linux -email root@0e6c1f813b2a
EMPTY RESULT
QUERY FAILURE
 ERROR:  wget command failed ( Thu Jun  9 14:30:13 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
  HTTP/1.1 400 Bad Request
esearch -db nucleotide -query 'MT520505.1' | efetch -format fasta > MA_MGH_00304.fna
 ERROR:  wget command failed ( Thu Jun  9 14:30:15 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
  HTTP/1.1 400 Bad Request
 WARNING:  FAILURE ( Thu Jun  9 14:30:14 UTC 2022 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -db nucleotide -id Unable,to,locate,xtract,executable.,Please,execute,the,following,nquire,dwn,ftp.ncbi.nlm.nih.gov,entrez,entrezdirect,xtract.Linux.gz,gunzip,f,xtract.Linux.gz,chmod,x,xtract.Linux -rettype fasta -retmode text -tool edirect -edirect 17.1 -edirect_os Linux -email root@0e6c1f813b2a
EMPTY RESULT
SECOND ATTEMPT
 ERROR:  wget command failed ( Thu Jun  9 14:30:16 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
  HTTP/1.1 400 Bad Request
 WARNING:  FAILURE ( Thu Jun  9 14:30:16 UTC 2022 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -db nucleotide -id Unable,to,locate,xtract,executable.,Please,execute,the,following,nquire,dwn,ftp.ncbi.nlm.nih.gov,entrez,entrezdirect,xtract.Linux.gz,gunzip,f,xtract.Linux.gz,chmod,x,xtract.Linux -rettype fasta -retmode text -tool edirect -edirect 17.1 -edirect_os Linux -email root@0e6c1f813b2a
EMPTY RESULT
LAST ATTEMPT
 ERROR:  wget command failed ( Thu Jun  9 14:30:18 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
  HTTP/1.1 400 Bad Request
 ERROR:  FAILURE ( Thu Jun  9 14:30:18 UTC 2022 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -db nucleotide -id Unable,to,locate,xtract,executable.,Please,execute,the,following,nquire,dwn,ftp.ncbi.nlm.nih.gov,entrez,entrezdirect,xtract.Linux.gz,gunzip,f,xtract.Linux.gz,chmod,x,xtract.Linux -rettype fasta -retmode text -tool edirect -edirect 17.1 -edirect_os Linux -email root@0e6c1f813b2a
EMPTY RESULT
QUERY FAILURE
 ERROR:  wget command failed ( Thu Jun  9 14:30:20 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&tool=edirect&edirect=17.1&edirect_os=Linux&email=root%400e6c1f813b2a https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
  HTTP/1.1 400 Bad Request
Downloading MA_MGH_00305_1.fastq SRR11954295
fastq-dump --defline-seq '@$ac_$sn/$ri' --defline-qual '+' --split-3 -O . SRR11954295
Read 1125995 spots for SRR11953686
Written 1125995 spots for SRR11953686
if [ ! -f ./SRR11953686_1.fastq ]; then mv ./SRR11953686.fastq ./SRR11953686_1.fastq;  elif [ -f ./SRR11953686_1.fastq -a -f ./SRR11953686_2.fastq ]; then rm -f ./SRR11953686.fastq; fi
touch ./SRR11953686_2.fastq
mv ./SRR11953686_1.fastq 'MA_MGH_00304_1.fastq'
esearch -db nucleotide -query 'MT520507.1' | efetch -format fasta > MA_MGH_00305.fna
Read 725329 spots for SRR11954295
Written 725329 spots for SRR11954295

lskatz commented 2 years ago

Yes it's possible it's related to the wget errors. I actually had to restart the GitHub Actions CI multiple times (without committing any new changes to the code) because it was being run in parallel and I think it was getting blocked by NCBI. I can either reduce the parallelism or see if you want to try again.
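If the failures really were transient blocking, restarting by hand could be replaced by an automatic retry with a growing delay. This is only a sketch under that assumption; the `retry` function is hypothetical and is not something the datasets repository provides.

```shell
# retry <max_attempts> <initial_delay_seconds> <command...>
# Hypothetical helper: re-run a command until it succeeds, doubling the
# wait between attempts, and give up after <max_attempts> tries.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example call (arguments are illustrative):
#   retry 3 30 GenFSGopher.pl -o SNF-A-output sars-cov-2-SNF-A.tsv --numcpus 1
```

A retry only helps with transient errors; a deterministic failure such as a wrong checksum in the TSV will fail on every attempt.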

kapsakcj commented 2 years ago

I tried again but was hit with the same checksum mismatch errors.

This time I reduced GenFSGopher.pl to use --numcpus 1 and even used my NCBI_API_KEY to increase the rate limit. I have a feeling it's not related to rate-limiting/NCBI blocking. https://github.com/StaPH-B/docker-builds/commit/becbd6ff4add3fb9c6601b82b487a038d8695385
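For reference, the serial run described above could be wrapped roughly as follows. This is a sketch, not the contents of the linked commit: the function name `run_gopher_serial` and the `GOPHER_CMD` override are hypothetical, while the GenFSGopher.pl flags are the ones used earlier in this thread. EDirect reads the key from the `NCBI_API_KEY` environment variable.

```shell
# run_gopher_serial <tsv> <outdir>
# Hypothetical wrapper: run GenFSGopher.pl with a single CPU so requests
# to NCBI are not issued in parallel.
run_gopher_serial() {
  local tsv="$1" outdir="$2"
  if [ -z "${NCBI_API_KEY:-}" ]; then
    echo "NCBI_API_KEY is not set; requests use the lower anonymous rate limit" >&2
  fi
  # GOPHER_CMD is a hypothetical override so the command can be swapped
  # out (for example with 'echo') for a dry run.
  ${GOPHER_CMD:-GenFSGopher.pl} -o "$outdir" "$tsv" --numcpus 1 --layout onedir --compressed
}

# Example call:
#   export NCBI_API_KEY="<your key>"
#   run_gopher_serial /home/user/datasets-sars-cov-2/datasets/sars-cov-2-SNF-A.tsv SNF-A-output
```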

So that's 1 issue.

Another (I think) separate issue is the wget errors which are occurring when fetching the assemblies. For example, when the script runs esearch -db nucleotide -query 'MT520263.1' | efetch -format fasta > MA_MGH_00229.fna, it returns with the wget error:

ERROR:  wget command failed ( Thu Jun  9 21:30:27 UTC 2022 ) with: 8
--post-data=db=nucleotide&id=Unable%2Cto%2Clocate%2Cxtract%2Cexecutable.%2CPlease%2Cexecute%2Cthe%2Cfollowing%2Cnquire%2Cdwn%2Cftp.ncbi.nlm.nih.gov%2Centrez%2Centrezdirect%2Cxtract.Linux.gz%2Cgunzip%2Cf%2Cxtract.Linux.gz%2Cchmod%2Cx%2Cxtract.Linux&rettype=fasta&retmode=text&api_key=<SCRUBBED-BY-CURTIS>&tool=edirect&edirect=17.1&edirect_os=Linux&email=curtis.kapsak%40theiagen.com https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
  HTTP/1.1 400 Bad Request
 WARNING:  FAILURE ( Thu Jun  9 21:30:27 UTC 2022 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -db nucleotide -id Unable,to,locate,xtract,executable.,Please,execute,the,following,nquire,dwn,ftp.ncbi.nlm.nih.gov,entrez,entrezdirect,xtract.Linux.gz,gunzip,f,xtract.Linux.gz,chmod,x,xtract.Linux -rettype fasta -retmode text -api_key <SCRUBBED-BY-CURTIS> -tool edirect -edirect 17.1 -edirect_os Linux -email curtis.kapsak@theiagen.com
EMPTY RESULT
SECOND ATTEMPT

Looks like we're missing the xtract executable: the hard-to-read error above shows the text "Unable to locate xtract executable. Please execute the following ..." being passed as the -id value to efetch. I'm trying to install it manually, but it's a PITA trying to run nquire in the docker image build. I can't figure out why it won't download the xtract binary.

I'm trying to add this to the Dockerfile, but it keeps failing because the nquire command doesn't actually download the xtract binary:

RUN /edirect/nquire -dwn ftp.ncbi.nlm.nih.gov entrez/entrezdirect xtract.Linux.gz 2>&1 && ls -a && find / -name "xtract*" && \
 gunzip -f xtract.Linux.gz && \
 mv -v xtract.Linux /usr/local/bin/xtract && \
 chmod +x /usr/local/bin/xtract

docker build output:

RUN /edirect/nquire -dwn ftp.ncbi.nlm.nih.gov entrez/entrezdirect xtract.Linux.gz 2>&1 && ls -a && find / -name "xtract*" &&  gunzip -f xtract.Linux.gz &&  mv -v xtract.Linux /usr/local/bin/xtract &&  chmod +x /usr/local/bin/xtract
 ---> Running in 12d6fb7ae634
.
..
vocvoi-output
/edirect/xtract
/edirect/cmd/xtract.go
/edirect/help/xtract-unix.txt
/edirect/help/xtract-examples.txt
/edirect/help/xtract-help.txt
/edirect/help/xtract-keys.txt
/edirect/help/xtract-internal.txt
gzip: xtract.Linux.gz: No such file or directory
The command '/bin/sh -c /edirect/nquire -dwn ftp.ncbi.nlm.nih.gov entrez/entrezdirect xtract.Linux.gz 2>&1 && ls -a && find / -name "xtract*" &&  gunzip -f xtract.Linux.gz &&  mv -v xtract.Linux /usr/local/bin/xtract &&  chmod +x /usr/local/bin/xtract' returned a non-zero code: 1
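The find output above lists /edirect/xtract but no xtract.Linux, which fits the "Unable to locate xtract executable" message. A quick check for the platform binary could look like the sketch below. It assumes the binary is expected next to the wrapper and is named xtract.<platform>, as the suggested download name xtract.Linux.gz implies; the function name `find_xtract_binary` is hypothetical.

```shell
# find_xtract_binary <edirect_dir>
# Hypothetical check: report whether an executable platform binary such
# as xtract.Linux exists in the given directory.
find_xtract_binary() {
  local dir="$1" platform
  platform="$(uname -s)"   # "Linux" inside the container
  if [ -x "$dir/xtract.$platform" ]; then
    echo "$dir/xtract.$platform"
  else
    echo "no executable xtract.$platform in $dir" >&2
    return 1
  fi
}

# Example call:
#   find_xtract_binary /edirect
```

Running this in a RUN layer before the dataset test would make the build fail early with a readable message instead of the 400 Bad Request errors.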

lskatz commented 2 years ago

Thanks for humoring me. I spent some time going over and over again for a new version v0.7.2 and I think it is all fixed? Can you try again? https://github.com/CDCgov/datasets-sars-cov-2/releases/tag/v0.7.2

kapsakcj commented 2 years ago

yes, will try again when I have some time early next week. Thanks for working on the checksums! I really want to get this working as intended!

kapsakcj commented 2 years ago

ok, just opened a PR with my latest dockerfile for 0.7.2. The upgrade to v0.7.2 as well as the addition of the NCBI xtract tool resolved the wget errors as well as the checksum mismatch errors.

The SNF-A dataset will take a long time to finish downloading/compressing/checksum verification, but we should see everything pass here 🤞 https://github.com/StaPH-B/docker-builds/runs/6865408463?check_suite_focus=true

lskatz commented 2 years ago

Caught by the new sra toolkit. I'll have to update versions.

lskatz commented 2 years ago

😩 which means updating the sratoolkit container....

kapsakcj commented 2 years ago

which means updating the sratoolkit container

I don't want to go down that road again. It's not fun or easy ☹️

I would try NCBI's docker image or try putting a pre-compiled binary in a docker image to keep things simple.

https://github.com/ncbi/sra-tools/wiki/SRA-tools-docker

https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit#the-current-binaries-for

kapsakcj commented 2 years ago

Addressed via #387. Tests pass when building the docker image locally, but fail on GH Actions runners due to the sheer size of the fastq files from the SNF-A dataset. The hard drive fills up and runs out of space before the test finishes.
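One way to make this failure mode explicit is to check free disk space before starting the large download. The sketch below is an illustration only: the helper names `free_kb` and `require_free_gb` are hypothetical, and the threshold in the example call is a placeholder because the thread does not say how much space the SNF-A dataset needs.

```shell
# free_kb <dir>: available space, in KiB, on the filesystem holding <dir>.
free_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# require_free_gb <dir> <min_gib>
# Hypothetical guard: fail early if fewer than <min_gib> GiB are free.
require_free_gb() {
  local dir="$1" min_gb="$2" avail_kb
  avail_kb="$(free_kb "$dir")"
  if [ "$avail_kb" -lt $((min_gb * 1024 * 1024)) ]; then
    echo "only $((avail_kb / 1024 / 1024)) GiB free in $dir; need $min_gb" >&2
    return 1
  fi
}

# Example call (the 50 GiB threshold is a placeholder, not a measured value):
#   require_free_gb . 50 && GenFSGopher.pl -o SNF-A-output sars-cov-2-SNF-A.tsv
```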

The docker image for 0.7.2 is now deployed to Docker Hub and Quay under the latest and 0.7.2 tags: https://hub.docker.com/r/staphb/datasets-sars-cov-2/tags https://quay.io/repository/staphb/datasets-sars-cov-2?tab=tags