cogent3 / Cogent3Workshop

Materials for the Phylomania workshop
BSD 3-Clause "New" or "Revised" License

LO - 2 - explore the ncbi command line tools #26

Closed GavinHuttley closed 9 months ago

GavinHuttley commented 10 months ago

https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/

GavinHuttley commented 10 months ago

@fredjaya can you please check on downloading data given a file containing a list of genbank accession IDs

record instructions and post them in here

fredjaya commented 10 months ago

tl;dr

To download data given a file containing a list of genbank accession IDs:

datasets download genome accession --inputfile accessions.txt

Overall, quick and easy to use, but a bit under-documented.

Long way

Installation

conda create -n ncbi_datasets -c conda-forge ncbi-datasets-cli

Alternatively, you can download pre-compiled binaries via curl, wget, or the download link on the webpage.

Example data

You can get a file of accession numbers for a taxon via datasets summary.

For example, datasets summary genome taxon ecoli prints a JSON of the metadata for all E. coli genomes to stdout (the default), which can then be filtered using jq. More information on this biostars post.

Downloading "dehydrated" data

When downloading a lot of genome data (>1000 genomes, or >15GB), the recommendation is to use the --dehydrated flag to download a small .zip file first, then "rehydrate" it to fetch the full genome data package.
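
The two-step download described above might look roughly like the sketch below. It assumes datasets and unzip are on PATH; the function name download_dehydrated and the default directory name genomes are made up for illustration.

```shell
# Fetch a large set of genomes in two steps: a small "dehydrated" archive
# first, then the actual sequence files.
# Assumption: `datasets` and `unzip` are installed; names here are illustrative.
download_dehydrated() {
    infile=$1                 # file with one genome accession per line
    outdir=${2:-genomes}      # hypothetical output directory name

    # Step 1: download only the small dehydrated package
    datasets download genome accession --inputfile "$infile" \
        --dehydrated --filename "$outdir.zip" || return 1

    # Step 2: unpack it, then let the CLI fetch the real data files
    unzip -q "$outdir.zip" -d "$outdir" || return 1
    datasets rehydrate --directory "$outdir"
}
```

Usage would be something like download_dehydrated accessions.txt my_genomes. Each step stops the function on failure, so a broken download is not silently followed by a rehydrate.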

More information:

Notes

The CLI only works for data on NCBI Datasets. Therefore, accessions on e.g. nuccore are not compatible.

GavinHuttley commented 10 months ago

Thanks for this @fredjaya ! Just to clarify, for

$ datasets download genome accession --inputfile accessions.txt

to work, accessions.txt must contain NCBI Datasets IDs?

fredjaya commented 10 months ago

That's correct. Specifically, datasets download genome accession --inputfile accessions.txt is compatible with only NCBI Datasets Genome IDs.

GavinHuttley commented 10 months ago

well that's crippling.

GavinHuttley commented 10 months ago

What does it do if some IDs are NCBI Datasets Genome IDs and others are not? (Does it skip the errors gracefully by downloading the valid IDs and reporting the invalid IDs?)

fredjaya commented 10 months ago

Given cat accessions.txt:

EU244602.1 (nuccore)
GCA_000005845.2 (datasets genome)

$ datasets download genome accession --inputfile accessions.txt

Error: invalid or unsupported assembly accession: EU244602.1

Use datasets download genome accession <command> --help for detailed help about a command.

Looks like it terminates as soon as it hits any ID that is not a Datasets genome ID, without downloading the valid ones.
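
Since one bad ID aborts the whole run, one option is to split the list before calling datasets. The sketch below is only an approximation: it assumes genome assembly accessions have the form GCA_ or GCF_ followed by nine digits and a version, and the function and output file names are made up for illustration.

```shell
# Split an accession list into assembly-style IDs and everything else.
# Assumption: assembly accessions look like GCA_/GCF_ + 9 digits + ".version".
split_accessions() {
    infile=$1
    pattern='^GC[AF]_[0-9]{9}\.[0-9]+$'

    # IDs that look like genome assemblies (safe to pass to --inputfile)
    grep -E "$pattern" "$infile" > genome_accessions.txt || true

    # Everything else (e.g. nuccore IDs), kept for reporting; blank lines dropped
    grep -Ev "$pattern" "$infile" | grep -v '^$' > other_accessions.txt || true
}
```

After split_accessions accessions.txt, the command datasets download genome accession --inputfile genome_accessions.txt only sees IDs of the expected shape, and other_accessions.txt lists what was left out. A matching pattern does not guarantee the accession exists, so the download can still fail.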

fredjaya commented 10 months ago

Alternatively looping through each accession works fine

for i in $(cat accessions.txt); do
    datasets download genome accession "$i"
done

Just need to include a dynamic output file name so it doesn't overwrite each download (defaults to ncbi_dataset.zip). e.g. add flag --filename $i

GavinHuttley commented 10 months ago

so there's a workaround, but it's work for the user. The above is very useful info which we need to include in the section on grabbing content from NCBI.

GavinHuttley commented 10 months ago

And in general, masking errors is a bad idea (the bash loop only works because it ignores return codes).
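
A version of the loop that checks return codes instead of ignoring them could look like the sketch below. It assumes datasets is on PATH; the function name download_each and the log file failed.txt are made up for illustration.

```shell
# Download each accession separately and record the ones that fail.
# Assumption: `datasets` is installed; failed.txt is a hypothetical log name.
download_each() {
    infile=$1
    failed=${2:-failed.txt}
    : > "$failed"                        # start with an empty failure log

    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue        # skip blank lines
        # One archive per accession so downloads do not overwrite each other;
        # stdin is redirected so the command cannot consume the input file.
        if ! datasets download genome accession "$acc" \
                --filename "$acc.zip" < /dev/null; then
            echo "$acc" >> "$failed"
        fi
    done < "$infile"

    # Succeed only if the failure log is empty
    [ ! -s "$failed" ]
}
```

Calling download_each accessions.txt downloads what it can, lists the rejected IDs in failed.txt, and returns a non-zero status if any download failed, so the problem is reported rather than hidden.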

GavinHuttley commented 10 months ago

A summary of this content needs to go into #25. Reference this issue there and then close this issue.

fredjaya commented 9 months ago

@khiron for installing the Linux binary in the Docker image

$ wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
$ chmod +x ./datasets