cogent3 / Cogent3Workshop

Materials for the Phylomania workshop
BSD 3-Clause "New" or "Revised" License

LO - 2 - Getting Data #43

Closed KatherineCaley closed 8 months ago

KatherineCaley commented 8 months ago

Draft content for block 2 of the workshop.

superset of issues #24, #25, #26.

Getting data from:

KatherineCaley commented 8 months ago

Getting data 🫴💾

In comparative genomics and phylogenetics studies, we are often required to use previously published data. However, obtaining the data needed, such as the sequence alignments used in a published paper, may not be straightforward.

In this section, you will use the RefSoil reference database (Choi et al., 2016) as an example to explore the various obstacles to reproducibility and data sharing.

Lastly, you will circumvent these obstacles using command-line tools developed by the cogent3 dev team, including:

  1. A Python script to efficiently download hundreds of GenBank files from NCBI.
  2. EnsemblLite, a tool to retrieve valuable data, such as alignments, from Ensembl.

📃 The GenBank file

Imagine you are... 💭

...part of a research team investigating soil microbial communities and their roles in nutrient cycling and plant growth. The RefSoil database, with its comprehensive collection of soil microbial genomes, becomes a crucial resource for your study.

🏁 Goal
Understand the sequence and annotation components of a GenBank file, and how they can serve your own research.

RefSoil consists of genome sequences and annotations for numerous soil bacteria and archaea. Annotations, called "features" in GenBank files, indicate which parts of the sequences are protein-coding genes, non-coding regions, and so on.

This information makes GenBank files an invaluable asset for your research, such as for identifying and retrieving the same genes across microbes that may contribute to plant growth.

Let's have a brief look at the GenBank file format.

The GenBank file format 📃

We will step away from RefSoil for a moment and look at an example GenBank file for a tardigrade 18S rRNA sequence.

The key components of a GenBank file include:

image

For more information about the specific components of a GenBank file, see this GenBank Sample Record.
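To make the split between annotation and sequence concrete, here is a minimal sketch that pulls the feature table and the sequence out of a GenBank-style record using only the Python standard library. The record is a made-up toy entry (not a real accession), and `parse_features` and `parse_sequence` are hypothetical helper names. The parsing is deliberately simplified: it ignores locations and qualifiers that wrap over several lines, so for real files a dedicated parser is the safer choice.

```python
# A made-up, heavily truncated GenBank-style record, for illustration only.
RECORD = """\
LOCUS       TOY000001                 60 bp    DNA     linear   INV 01-JAN-2000
DEFINITION  Hypothetical example record, not a real accession.
FEATURES             Location/Qualifiers
     source          1..60
                     /organism="Example organism"
     rRNA            1..60
                     /product="18S ribosomal RNA"
ORIGIN
        1 acgtacgtac gtacgtacgt acgtacgtac gtacgtacgt acgtacgtac gtacgtacgt
//
"""


def parse_features(text):
    """Return the FEATURES table as a list of dicts (simplified parser)."""
    features = []
    in_table = False
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if not in_table:
            continue
        if not line.startswith(" "):
            break  # the next top-level section (e.g. ORIGIN) ends the table
        stripped = line.strip()
        if line[:5] == "     " and line[5:6].strip():
            # Feature keys start in column 6, followed by the location.
            key, location = stripped.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
        elif stripped.startswith("/") and features:
            # Qualifier lines look like /name="value" and belong to the last feature.
            name, _, value = stripped[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
    return features


def parse_sequence(text):
    """Return the sequence between ORIGIN and // with numbering and spaces removed."""
    chunks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin:
            chunks.append("".join(line.split()[1:]))  # drop the leading position number
    return "".join(chunks)


if __name__ == "__main__":
    for feature in parse_features(RECORD):
        print(feature["key"], feature["location"], feature["qualifiers"])
    print(len(parse_sequence(RECORD)), "bases")
```

Each feature pairs a key (such as `rRNA`) with a location on the sequence, which is what lets you retrieve the same gene from many genomes.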

🧬 Obtaining the RefSoil GenBank sequences

Shifting back to RefSoil, we will now explore the various ways of obtaining the reference sequence data. The authors have suggested several options to access the reference sequences from GenBank including:

The following sections include screencasts that show the process of retrieving the data using the suggested methods (plot twist: they don't work as expected).

🏁 Goal
Be familiar with the potential hurdles in retrieving published data

Downloading sequences via figshare

[figshare.mp4] One approach is to access the raw sequencing data via the figshare data repository. This requires downloading the supplementary information (.docx) from the paper, following the link in that document to the GitHub page, then following the link on the GitHub page to figshare... a bit convoluted. In the screencast, the compressed sequence file is then downloaded and extraction is attempted. Although the video is cut short, you will find that the decompression step hangs.

Downloading sequences via Python script 🐍

[script.mp4] This approach involves cloning the repository and running the `fetch_genbank.py` script with the list of IDs in `refsoil_id.txt`. In the screencast, you will see me (Fred) attempt to debug the (mainly Python 2-related) issues that arise. Instead of continuing to debug the script, we will use `refsoil_id.txt` to rapidly download the files in the next section.

Downloading GenBank files with a list of accession IDs (slow) 🌐

NCBI provide some options for how to [download a large, custom set of records from NCBI](https://www.ncbi.nlm.nih.gov/guide/howto/dwn-records/). These include browser-based and command-line options. This last segment introduces Batch Entrez for downloading GenBank files from a browser.

##### 🛒 NCBI datasets

Before jumping into Batch Entrez, it is worth mentioning the NCBI [`datasets`](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/) command line tool. The goal of `datasets` is to rapidly download (as you guessed) NCBI data, given a list of accession numbers. An example command is:

```
datasets download genome accession --inputfile refsoil_id.txt
```

Although promising and easy to use, it is restricted to downloading data hosted on "[NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/)". So if your data is on GenBank alone, as the RefSoil sequences are, it will not work.

##### 🌐 Batch Entrez (browser download)

[batch.mp4]

The best point-and-click option is the browser-based [Batch Entrez](https://www.ncbi.nlm.nih.gov/sites/batchentrez) tool. You can simply upload a list of accession IDs (i.e. `refsoil_id.txt`) and download the GenBank files in bulk. The pros are that it is easy to use and does not require scripting. However, downloading from the browser can be time-consuming: it retrieves a single, large file, and any interruption (such as a network issue) means downloading the file again from scratch.

🚀 Efficient Strategies for Data Retrieval from Sequence Data Sources

How to download GenBank sequences using a list of IDs 🏎️

Here, we provide a script download_genbank.py to efficiently download GenBank files given a list of accession IDs. The script overcomes some of the previous hurdles and speeds up the process by downloading files in parallel.

Have a go at running this yourself.

Note
Ensure the command is executed from the same directory as download_genbank.py. You may need to run cd genbank_downloader first.

Example usage
- For help and options

```bash
$ python download_genbank.py
Usage: download_genbank.py [OPTIONS]

  downloads limit accessions from GenBank

Options:
  -p, --path PATH      path to genbank accession file  [required]
  -o, --outdir PATH    path to write genbank formatted sequence files  [required]
  -l, --limit INTEGER  number of accessions to download  [default: 10]
  --help               Show this message and exit.
```

- To download GenBank (.gb.gz) files for the first 10 accessions (default limit)

```bash
$ python download_genbank.py -p refsoil_id.txt -o refsoil
```

- To download GenBank (.gb.gz) files for the first 20 accessions

```bash
$ python download_genbank.py -p refsoil_id.txt -o refsoil -l 20
```
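For readers curious how a parallel downloader of this kind can work, the following is a rough sketch and not the workshop's `download_genbank.py`. It swaps in the standard library's `concurrent.futures` thread pool in place of `unsync`, and it assumes the accessions are nucleotide records retrievable through NCBI's E-utilities `efetch` endpoint. The function names (`read_accessions`, `fetch_genbank`, `download_all`) and the `.part` temporary-file convention are illustrative choices only.

```python
import gzip
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# NCBI E-utilities efetch endpoint; this sketch assumes nucleotide accessions.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_accessions(path, limit=10):
    """Return up to `limit` non-empty accession IDs, one per line of `path`."""
    lines = Path(path).read_text().splitlines()
    ids = [line.strip() for line in lines if line.strip()]
    return ids[:limit]


def fetch_genbank(accession):
    """Fetch one record as GenBank flat-file text (requires network access)."""
    query = urllib.parse.urlencode(
        {"db": "nuccore", "id": accession, "rettype": "gbwithparts", "retmode": "text"}
    )
    with urllib.request.urlopen(f"{EFETCH}?{query}", timeout=60) as response:
        return response.read().decode("utf-8")


def download_all(accessions, outdir, fetch=fetch_genbank, workers=3):
    """Fetch records concurrently, writing each to <outdir>/<accession>.gb.gz."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    def download_one(accession):
        target = outdir / f"{accession}.gb.gz"
        if target.exists():
            return target  # already downloaded, so a rerun resumes where it stopped
        text = fetch(accession)
        partial = target.with_name(target.name + ".part")
        with gzip.open(partial, "wt") as out:
            out.write(text)
        partial.replace(target)  # rename only once the file is complete
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input accessions
        return list(pool.map(download_one, accessions))
```

With network access, `download_all(read_accessions("refsoil_id.txt", limit=10), "refsoil")` would mirror the first example command above. Writing one file per accession means an interrupted run only repeats the unfinished records, in contrast to the single large file from Batch Entrez. The small default worker count is a cautious choice, since NCBI asks clients without an API key to stay at or below about three requests per second.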

🇪❗ Retrieving data from Ensembl with EnsemblLite

GavinHuttley commented 8 months ago

Ensembl

This will introduce the (currently under development) command line interface to Ensembl data.

Cannot cover most of the below as not implemented.

Installation

Within the VS Code terminal, with your virtual environment activated, install EnsemblLite as

pip install git+https://github.com/cogent3/EnsemblLite.git

Notes

We will cover:

fredjaya commented 8 months ago

Downloading GenBank files from a list of accession numbers

download_genbank.py takes a file with a list of NCBI accession numbers and downloads the associated GenBank file. Files are downloaded asynchronously for speed.

Installation

Create venv

$ mkdir venv
$ python3 -m venv venv/
$ source venv/bin/activate
$ pip install cogent3 rich click unsync

Example usage

Options:

    -p, --path PATH      path to genbank accession file  [required]
    -o, --outdir PATH    path to write genbank formatted sequence files  [required]
    -l, --limit INTEGER  number of accessions to download  [default: 10]
    --help               Show this message and exit.


- To download GenBank (.gb.gz) files for the first 10 accessions (default limit)

$ python3 download_genbank.py -p refsoil.txt -o refsoil



### Notes and attachments

I commented out the use of `rich.progress.track()` because the progress bar goes straight to 100% when using unsync.

[download_genbank.tar.gz](https://github.com/cogent3/Cogent3Workshop/files/13342880/download_genbank.tar.gz)

@rmcar17 
rmcar17 commented 8 months ago

Slightly revamped genbank downloader with modified README based on the above by Fred. genbank_downloader.zip

Important: @khiron, click needs to be added to the Python pip installation in the Dockerfile.

Instructions for wiki (taken from the README and based on Fred's):

Downloading GenBank files from a list of accession numbers

download_genbank.py takes a file with a list of NCBI accession numbers and downloads the associated GenBank file. Files are downloaded asynchronously for speed.

Example usage

Options:

    -p, --path PATH      path to genbank accession file  [required]
    -o, --outdir PATH    path to write genbank formatted sequence files  [required]
    -l, --limit INTEGER  number of accessions to download  [default: 10]
    --help               Show this message and exit.


- To download GenBank (.gb.gz) files for the first 10 accessions (default limit)

```bash
$ python download_genbank.py -p refsoil_id.txt -o refsoil
```

Note: Ensure the command is executed from the same directory as download_genbank.py. You may need to run cd genbank_downloader first.

fredjaya commented 8 months ago

I'll do the screen grab of trying to get the refsoil data

fredjaya commented 8 months ago

Screen casts for the figshare and python script methods on Dropbox.

GavinHuttley commented 8 months ago

@rmcar17 and @fredjaya rather than removing track(), change the usage to track(..., transient=True) so the progress bar disappears when completed. Why do this? If someone is on a slow network, that progress bar is useful!

rmcar17 commented 8 months ago

@GavinHuttley below is a modified version of the genbank downloader using click's progressbar in a thread safe manner. Our initial experience was that rich's track did not work with unsync.

genbank_downloader.zip
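For anyone following the progress-bar discussion, here is a minimal standard-library sketch of one thread-safe pattern. It is not the code in the attached zip and uses neither click nor rich. A likely reason a bar jumps to 100% is that the loop being tracked only submits background jobs, which return immediately. The sketch instead advances the counter as each job finishes, and only from the main thread. `run_with_progress` is a hypothetical helper name.

```python
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_progress(func, items, workers=4, stream=sys.stderr):
    """Apply `func` to each item in worker threads, reporting progress as jobs finish."""
    items = list(items)
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the position of its input item.
        futures = {pool.submit(func, item): i for i, item in enumerate(items)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            # Only the main thread writes here, so updates cannot interleave.
            stream.write(f"\r{done}/{len(items)} done")
            stream.flush()
    stream.write("\n")
    return results
```

Because the worker threads never touch the display, no lock is needed, and the same idea should carry over to a click or rich progress bar by replacing the `stream.write` call with that library's update call.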