Closed: GavinHuttley closed this issue 9 months ago
@fredjaya can you please check on downloading data given a file containing a list of GenBank accession IDs?
Record the instructions and post them in here.
To download data given a file containing a list of genbank accession IDs:
datasets download genome accession --inputfile accessions.txt
Overall, quick and easy to use, but a bit under-documented.
To install with conda:
conda create -n ncbi_datasets -c conda-forge ncbi-datasets-cli
Alternatively, you can download pre-compiled binaries via curl, wget, or the download link on the webpage.
You can get a file of accession numbers for a taxon via datasets summary.
For example, datasets summary genome taxon ecoli prints the metadata for all E. coli genomes as JSON to stdout (the default), which can then be filtered using jq. More information is in this Biostars post.
When downloading a lot of genome data (>1000 genomes, or >15 GB), the recommendation is to use the --dehydrated flag to download a small .zip file first, then get the full genome data package from it.
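The dehydrated route could be sketched roughly as below. This is only an outline, assuming the datasets rehydrate subcommand and its --directory option behave as described in the NCBI docs; the function name fetch_dehydrated, its arguments, and the default output name are made up for illustration.

```shell
# Sketch only: download a dehydrated package, unpack it, then rehydrate it.
# fetch_dehydrated and its arguments are hypothetical names, not part of the CLI.
fetch_dehydrated() {
    local input_file="$1"           # one genome accession per line
    local out_dir="${2:-genomes}"   # hypothetical default output directory
    local zip_file="${out_dir}.zip"

    # 1. Fetch the small package (metadata and file locations, no sequence data).
    datasets download genome accession --inputfile "$input_file" \
        --dehydrated --filename "$zip_file" || return 1

    # 2. Unpack it into the output directory.
    unzip -q "$zip_file" -d "$out_dir" || return 1

    # 3. Fetch the actual data files referenced by the package.
    datasets rehydrate --directory "$out_dir"
}
```

Usage would be something like fetch_dehydrated accessions.txt ecoli_genomes. Each step returns early on failure so a broken download is not silently unpacked.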
More information:
The CLI only works for data on NCBI Datasets. Therefore, accessions from e.g. nuccore are not compatible.
Thanks for this @fredjaya! Just to clarify, for
$ datasets download genome accession --inputfile accessions.txt
to work, accessions.txt must contain NCBI Datasets IDs?
That's correct. Specifically,
datasets download genome accession --inputfile accessions.txt
is compatible with only NCBI Datasets Genome IDs.
well that's crippling.
What does it do if some IDs are NCBI Datasets Genome IDs and others are not? (Does it skip the errors gracefully by downloading the valid IDs and reporting the invalid IDs?)
Given cat accessions.txt:
EU244602.1 (nuccore)
GCA_000005845.2 (datasets genome)
$ datasets download genome accession --inputfile accessions.txt
Error: invalid or unsupported assembly accession: EU244602.1
Use datasets download genome accession <command> --help for detailed help about a command.
Looks like it terminates at any non-dataset genome IDs
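One way around the early termination might be to split the file before calling the CLI. The sketch below assumes one accession per line and assumes that genome assembly accessions always look like GCA_ or GCF_ followed by nine digits and an optional version; split_accessions and the two output file names are hypothetical.

```shell
# Sketch only: separate assembly-style accessions from everything else.
# The pattern is an assumption about the accession format, not something
# taken from the datasets CLI itself.
split_accessions() {
    local input_file="$1"
    local pattern='^GC[AF]_[0-9]{9}(\.[0-9]+)?$'

    # Lines that look like genome assembly accessions.
    grep -E "$pattern" "$input_file" > assembly_accessions.txt
    # Everything else (e.g. nuccore accessions), kept for reporting.
    grep -Ev "$pattern" "$input_file" > other_accessions.txt

    # grep exits non-zero when nothing matches; that is not an error here.
    return 0
}
```

After split_accessions accessions.txt, assembly_accessions.txt can be passed to --inputfile and other_accessions.txt lists what was left out.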
Alternatively, looping through each accession works fine:
for i in $(cat accessions.txt); do
    datasets download genome accession "$i"
done
Just need to include a dynamic output file name so it doesn't overwrite each download (defaults to ncbi_dataset.zip), e.g. add the flag --filename $i.zip
So there's a workaround, but it's work for the user. The above is very useful info which we need to include in the section on grabbing content from NCBI.
And in general, masking errors is a bad idea (the bash loop only works because it ignores return codes).
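A loop that does not mask errors could look something like this. It is a sketch under assumptions: download_each and failed_accessions.txt are hypothetical names, and it relies on datasets exiting non-zero when an accession is rejected, which the error above suggests but which has not been checked for every failure mode.

```shell
# Sketch only: download each accession separately, record the failures,
# and report an overall non-zero status if any download failed.
download_each() {
    local input_file="$1"
    local failed=0
    local acc

    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines.
        [ -z "$acc" ] && continue

        # </dev/null stops the command from consuming the loop's input file.
        if datasets download genome accession "$acc" \
                --filename "${acc}.zip" < /dev/null; then
            echo "ok: $acc"
        else
            echo "failed: $acc" >&2
            echo "$acc" >> failed_accessions.txt
            failed=1
        fi
    done < "$input_file"

    return "$failed"
}
```

Calling download_each accessions.txt downloads what it can, lists the rejected IDs in failed_accessions.txt, and still returns a failing status so a calling script can react.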
A summary of this content needs to go into #25. Reference this issue there and then close this issue.
@khiron for installing the Linux binary in the Docker image
$ wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
$ chmod +x ./datasets
https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/