Closed KatherineCaley closed 1 year ago
In comparative genomics and phylogenetics studies, we are often required to use previously published data. However, obtaining the data needed, such as the sequence alignments used in a published paper, may not be straightforward.
In this section, you will use the RefSoil reference database (Choi et al., 2016) as an example to explore the various obstacles to reproducibility and data sharing.
Lastly, you will circumvent these obstacles using command-line tools developed by the cogent3 dev team, including:
...part of a research team investigating soil microbial communities and their roles in nutrient cycling and plant growth. The RefSoil database, with its comprehensive collection of soil microbial genomes, becomes a crucial resource for your study.
Goal
Understand the sequence and annotation components of a GenBank file, and how they can serve your own research.
RefSoil consists of genome sequences and annotations for numerous soil bacteria and archaea. Annotations, called "features" in GenBank files, indicate which parts of the sequences are protein-coding genes, non-coding regions, and so on.
This information makes GenBank files an invaluable asset for your research, such as for identifying and retrieving the same genes across microbes that may contribute to plant growth.
Let's have a brief look at the GenBank file format.
We will diverge from RefSoil and look at an example GenBank file for a tardigrade 18S rRNA sequence.
The key components of a GenBank file include:
For more information about the specific components of a GenBank file, see this GenBank Sample Record.
Shifting back to RefSoil, we will now explore the various ways of obtaining the reference sequence data. The authors have suggested several options to access the reference sequences from GenBank including:
The following sections include screencasts showing the process of retrieving the data using the suggested methods (plot twist: they don't work as expected).
Goal
Be familiar with the potential hurdles in retrieving published data.
Here, we provide a script, `download_genbank.py`, to efficiently download GenBank files given a list of accession IDs. The script overcomes some of the previous hurdles and speeds up the process by downloading files in parallel.
Have a go at running this yourself.
Note
Ensure the command is executed from the same directory as `download_genbank.py`. You may need to run `cd genbank_downloader` first.
This will introduce the (currently under development) command line interface to Ensembl data.
Cannot cover most of the below as not implemented.
Within the VS Code terminal, with your virtual environment activated, install `EnsemblLite` with:
pip install git+https://github.com/cogent3/EnsemblLite.git
We will cover:
`download_genbank.py` takes a file with a list of NCBI accession numbers and downloads the associated GenBank files. Files are downloaded asynchronously for speed.
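For readers curious how such a downloader can work, the following is a minimal standard-library sketch, not the code in `download_genbank.py` (which uses cogent3, click and unsync). It assumes the accession file holds one accession per line, fetches records through NCBI's public E-utilities `efetch` endpoint, and uses a thread pool in place of unsync. The function names (`efetch_url`, `fetch_text`, `read_accessions`, `download_all`) and the `max_workers` default are illustrative choices.

```python
import gzip
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """URL asking NCBI for one nucleotide record as a GenBank flat file."""
    query = urllib.parse.urlencode(
        {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def fetch_text(accession):
    """Download one record over HTTPS (requires network access)."""
    with urllib.request.urlopen(efetch_url(accession), timeout=60) as response:
        return response.read().decode("utf-8")


def read_accessions(path, limit=10):
    """Return the first `limit` non-empty lines (assumed: one accession per line)."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()][:limit]


def download_all(accessions, outdir, fetch=fetch_text, max_workers=3):
    """Fetch records in a thread pool and write each as <accession>.gb.gz."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    def download_one(accession):
        target = outdir / f"{accession}.gb.gz"
        with gzip.open(target, "wt") as handle:
            handle.write(fetch(accession))
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the input accessions
        return list(pool.map(download_one, accessions))


# Example call (commented out because it contacts NCBI):
# download_all(read_accessions("refsoil_id.txt", limit=10), "refsoil")
```

The `fetch` argument is injectable so the writing and threading logic can be exercised without network access. The worker count is kept small because NCBI throttles clients that send many requests per second.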
Create venv
$ mkdir venv
$ python3 -m venv venv/
$ source venv/bin/activate
$ pip install cogent3 rich click unsync
For help and options
$ python3 download_genbank.py
Usage: download_genbank.py [OPTIONS]

  downloads limit accessions from GenBank

Options:
  -p, --path PATH      path to genbank accession file  [required]
  -o, --outdir PATH    path to write genbank formatted sequence files  [required]
  -l, --limit INTEGER  number of accessions to download  [default: 10]
  --help               Show this message and exit.
- To download GenBank (.gb.gz) files for the first 10 accessions (default limit)
$ python3 download_genbank.py -p refsoil.txt -o refsoil
### Notes and attachments
I commented out the use of `rich.track()` because the progress bar goes straight to 100% when using unsync.
[download_genbank.tar.gz](https://github.com/cogent3/Cogent3Workshop/files/13342880/download_genbank.tar.gz)
@rmcar17
Slightly revamped genbank downloader with modified README based on the above by Fred. genbank_downloader.zip
Important: @khiron, `click` needs to be added to the Python pip installation in the Dockerfile.
Instructions for wiki (taken from the README and based on Fred's):
`download_genbank.py` takes a file with a list of NCBI accession numbers and downloads the associated GenBank files. Files are downloaded asynchronously for speed.
For help and options
$ python download_genbank.py
Usage: download_genbank.py [OPTIONS]

  downloads limit accessions from GenBank

Options:
  -p, --path PATH      path to genbank accession file  [required]
  -o, --outdir PATH    path to write genbank formatted sequence files  [required]
  -l, --limit INTEGER  number of accessions to download  [default: 10]
  --help               Show this message and exit.
- To download GenBank (.gb.gz) files for the first 10 accessions (default limit)
```bash
$ python download_genbank.py -p refsoil_id.txt -o refsoil
# as above, but for the first 20 accessions
$ python download_genbank.py -p refsoil_id.txt -o refsoil -l 20
```
Note
Ensure the command is executed from the same directory as `download_genbank.py`. You may need to run `cd genbank_downloader` first.
I'll do the screen grab of trying to get the refsoil data
@rmcar17 and @fredjaya rather than removing `track()`, change the usage to `track(..., transient=True)` and the progress bar disappears when completed. Why do this? If someone is on a slow network, that progress bar is useful!
@GavinHuttley below is a modified version of the GenBank downloader using click's progressbar in a thread-safe manner. Our initial experience was that rich's `track` did not work with unsync.
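For anyone reading along without the attachment, here is one generic way to keep a progress bar thread-safe. This is a sketch of the idea only, not the modified downloader mentioned above: the work runs in a thread pool, but the bar is advanced solely from the main thread as each job finishes. `run_with_progress`, `CountingBar` and `slow_square` are hypothetical names, and `CountingBar` merely stands in for any bar object that exposes an `update(n)` method.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_progress(jobs, worker, advance, max_workers=4):
    """Run worker(job) in threads; call advance(1) from the main thread only."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, job): job for job in jobs}
        # as_completed yields in the calling thread, so the progress bar is
        # never touched by two threads at once and needs no lock.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
            advance(1)
    return results


class CountingBar:
    """Stand-in for a real progress bar; it only records how far it advanced."""

    def __init__(self):
        self.done = 0

    def update(self, n):
        self.done += n


def slow_square(x):
    time.sleep(0.01)  # pretend this is a slow download
    return x * x


bar = CountingBar()
squares = run_with_progress(range(5), slow_square, bar.update)
print(bar.done, squares)
```

With click, the same function should work by opening `click.progressbar(length=len(jobs))` as a context manager and passing its `update` method as `advance`.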
Draft content for block 2 of the workshop.
Superset of issues #24, #25, #26.
Getting data from: