CDCgov / datasets-sars-cov-2

Benchmark datasets for WGS analysis of SARS-CoV-2. (https://peerj.com/articles/13821/)
Apache License 2.0

Issue with downloading datasets 2 and 3 #8

Closed eskinner closed 2 years ago

eskinner commented 2 years ago

Hello,

I was able to install and run the script outlined here to download all 6 datasets.

An odd thing is happening for datasets 'sars-cov-2-coronahit-rapid.tsv' and 'sars-cov-2-coronahit-routine.tsv' however.

All of the forward read fastq.gz files are downloading (e.g. NORW-F0A6F_CoronaHiT-ONT_1.fastq.gz), but only some of the corresponding reverse read files are downloading; most of them are just empty files. Also, all of the .fna files are empty.

An example of output:

[image: screenshot of the download output]

I am wondering if this is normal behavior (if those empty “_2.fastq.gz” files really don’t exist and therefore can’t be downloaded for some samples), or if there is some error.

The forward and reverse files for the other 4 datasets all downloaded fine.

Note: I have Mac OS and am running this while logged into one of our BCM-HGSC login nodes.

Any help would be appreciated!

Thanks, Evette

lskatz commented 2 years ago

Yes, I believe the empty _2 files are only going to be from the nanopore reads, so you can safely ignore those.
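To check which downloads came back empty, something like the following may help. It is only a rough sketch: the function name `find_empty_files` is made up for illustration and is not part of this repository, and it assumes "empty" means a zero-byte file on disk (a gzip file that decompresses to nothing would not be caught).

```python
from pathlib import Path


def find_empty_files(root, suffixes=("_2.fastq.gz", ".fna")):
    """Return zero-byte files under `root` whose names end with one of `suffixes`.

    Hypothetical helper for illustration; the default suffixes are the
    file types mentioned in this issue.
    """
    root = Path(root)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and p.name.endswith(suffixes) and p.stat().st_size == 0
    )


if __name__ == "__main__":
    # Assumption: run from the directory that holds the downloaded dataset.
    for path in find_empty_files("."):
        print(path)
```

If the list contains only `_2.fastq.gz` files for nanopore samples, that matches the behavior described above.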

Thank you for bringing the assemblies to our attention. It looks like some are only available on ENA and it slipped past us. Some help from NCBI says:

ENA is releasing this data both as part of the analysis package (ERZ) and as a traditional GenBank record. The accession for ERZ1690836 is LR963198. If you wish to fetch their sequence accession from an ERZ number in the future, use their xrefs API, e.g.: https://www.ebi.ac.uk/ena/xref/rest/json/search?source=assembly&source_accession=ERZ1690836
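As a rough illustration of the lookup NCBI describes, the sketch below builds the same xrefs URL for a given ERZ accession and fetches the raw JSON. The function names are hypothetical and are not taken from the GenFS Gopher script. No assumption is made here about the structure of the response, so it is returned as-is for inspection.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the example URL quoted above.
XREF_SEARCH = "https://www.ebi.ac.uk/ena/xref/rest/json/search"


def build_xref_url(erz_accession, source="assembly"):
    """Build the ENA xrefs search URL for an ERZ analysis accession."""
    query = urllib.parse.urlencode(
        {"source": source, "source_accession": erz_accession}
    )
    return f"{XREF_SEARCH}?{query}"


def fetch_xrefs(erz_accession, timeout=30):
    """Fetch and decode the JSON response (requires network access)."""
    url = build_xref_url(erz_accession)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the URL only; call fetch_xrefs("ERZ1690836") to query ENA.
    print(build_xref_url("ERZ1690836"))
```

Printing the result of `fetch_xrefs` for one accession first is probably the safest way to see which field carries the sequence accession before scripting anything around it.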

Thank you NCBI for this help. I am going to keep this ticket open until we either bring in the appropriate GenBank records or add the API to the GenFS Gopher script.

lskatz commented 2 years ago

Fixed in v0.5.3 and with #10