This repository contains the Python scripts used to download sequence data for organisms (e.g., Eukaryotes, Prokaryotes) or viruses from KEGG via its APIs. The shell script main.sh
summarizes the steps to download the KEGG sequence data.
The requirements.txt
file lists the packages required to run the code in this repo.
You can install them via:
conda create -y --name KEGG_env python=3.8
conda install -y --name KEGG_env -c conda-forge -c bioconda --file requirements.txt
conda activate KEGG_env
You can simply git clone
this repo to your local computer, and then run:
bash main.sh
Alternatively, you can run individual Python scripts as needed. There are three Python scripts in the ./python_scripts
folder:
extract_kegg_organism_data.py downloads the organism table and the associated organisms' RefSeq and GenBank genomes based on KEGG information. It has the following two parameters, --organisms (the organism groups to download) and --outdir (the output directory):
Example: python ${your_current_path}/python_scripts/extract_kegg_organism_data.py --organisms 'Archaea' 'Bacteria' 'Fungi' --outdir ${your_current_path}/out_results/kegg_organisms
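To illustrate how group names such as 'Archaea', 'Bacteria' and 'Fungi' can select organisms, here is a minimal sketch. It is not the code of extract_kegg_organism_data.py. It assumes the KEGG REST endpoint https://rest.kegg.jp/list/organism, which returns one tab-separated line per organism (T number, organism code, name, semicolon-separated lineage). The function names fetch_organism_list, parse_organism_list and filter_by_group are hypothetical, and the sample lines are illustrative rather than fetched from KEGG.

```python
import urllib.request

# Assumed endpoint of the KEGG REST API (not taken from this repo's scripts).
KEGG_ORGANISM_URL = "https://rest.kegg.jp/list/organism"


def fetch_organism_list(url=KEGG_ORGANISM_URL):
    """Download the raw organism list as text (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_organism_list(text):
    """Turn tab-separated lines into dictionaries, skipping malformed lines."""
    records = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 4:
            continue
        t_number, code, name, lineage = fields
        records.append({
            "t_number": t_number,
            "code": code,
            "name": name,
            "lineage": lineage.split(";"),
        })
    return records


def filter_by_group(records, groups):
    """Keep records whose lineage contains any of the requested group names."""
    wanted = set(groups)
    return [r for r in records if wanted.intersection(r["lineage"])]


if __name__ == "__main__":
    # Illustrative sample lines; call fetch_organism_list() for live data.
    sample = "\n".join([
        "T01001\thsa\tHomo sapiens (human)\tEukaryotes;Animals;Vertebrates;Mammals",
        "T00007\teco\tEscherichia coli K-12 MG1655\tProkaryotes;Bacteria;Gammaproteobacteria - Enterobacteria;Escherichia",
        "T00005\tsce\tSaccharomyces cerevisiae\tEukaryotes;Fungi;Ascomycetes;Saccharomycetes",
    ])
    selected = filter_by_group(parse_organism_list(sample), ["Archaea", "Bacteria", "Fungi"])
    for record in selected:
        print(record["code"], record["name"])
```

Matching on whole lineage components (rather than substrings) keeps a group name such as 'Fungi' from accidentally matching unrelated text in an organism name.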
extract_kegg_virus_data.py downloads the virus table and the viruses' associated RefSeq and GenBank genomes based on KEGG information. It has only one parameter, --outdir (the output directory):
Example: python ${your_current_path}/python_scripts/extract_kegg_virus_data.py --outdir ${your_current_path}/out_results/kegg_viruses
download_seq_fasta.py must be run after either or both of the two scripts above have finished. It downloads the gene sequences into a FASTA-format file. The examples below pass four options (--table, --col, --organisms, --outfile); --col names the table column holding the sequence IDs you want to use, either RefSeq
(e.g., "rs_ncbi_seq_ids") or GenBank
(e.g., "gb_ncbi_seq_id").
Examples:
python ${your_current_path}/python_scripts/download_seq_fasta.py --table ${your_current_path}/out_results/kegg_organisms/organism_table.txt --col 'rs_ncbi_seq_ids' --organisms 'Archaea' 'Bacteria' 'Fungi' --outfile ${your_current_path}/out_results/kegg_organisms/rs_ncbi_organism.fasta
python ${your_current_path}/python_scripts/download_seq_fasta.py --table ${your_current_path}/out_results/kegg_organisms/organism_table.txt --col 'gb_ncbi_seq_id' --organisms 'Archaea' 'Bacteria' 'Fungi' --outfile ${your_current_path}/out_results/kegg_organisms/gb_ncbi_organism.fasta
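To make the role of --col more concrete, the sketch below collects the sequence IDs from one named column of a table. It is only an illustration, not the code of download_seq_fasta.py: it assumes the table is tab-separated with a header row, the function name read_id_column is hypothetical, and the "organism" column and the IDs in the sample are made up for the example.

```python
import csv
import io


def read_id_column(table_text, column):
    """Collect the non-empty values of one column from a tab-separated table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in table header")
    ids = []
    for row in reader:
        # Short rows yield None for missing cells, so normalise to "".
        value = (row[column] or "").strip()
        if value:
            ids.append(value)
    return ids


if __name__ == "__main__":
    # Hypothetical table contents; the real organism_table.txt may differ.
    sample = (
        "organism\trs_ncbi_seq_ids\tgb_ncbi_seq_id\n"
        "eco\tNC_000913\tU00096\n"
        "xyz\t\tAB000001\n"
    )
    print(read_id_column(sample, "rs_ncbi_seq_ids"))
    print(read_id_column(sample, "gb_ncbi_seq_id"))
```

Rows with an empty cell in the chosen column are skipped, which is one plausible way to handle organisms that have a GenBank entry but no RefSeq entry (or the reverse).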
The data I have already downloaded (only for 'Archaea', 'Bacteria', 'Fungi' and 'Viruses') is available on our GPU server, under /data/shared_data/KEGG_data.
The script convert_table_to_fasta.py converts the *.txt
records of gene sequences into FASTA-formatted amino acid and nucleotide sequences.
These are stored at /data/shared_data/KEGG_data/kegg_genes.faa
and /data/shared_data/KEGG_data/kegg_genes.fna.
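As a rough illustration of the conversion step, the sketch below writes (identifier, sequence) pairs as FASTA records. It is not the code of convert_table_to_fasta.py: the layout of the *.txt records is not described here, so the input is assumed to be already parsed into pairs. The names format_fasta_record and records_to_fasta, the 60-character line width, and the sample identifier are all hypothetical.

```python
def format_fasta_record(identifier, sequence, width=60):
    """Return one FASTA record: a '>' header line plus wrapped sequence lines."""
    cleaned = "".join(sequence.split())  # drop any whitespace inside the sequence
    if not cleaned:
        raise ValueError(f"empty sequence for {identifier!r}")
    # Slice the sequence into fixed-width lines.
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + identifier + "\n" + "\n".join(lines) + "\n"


def records_to_fasta(records, width=60):
    """Concatenate FASTA records for an iterable of (identifier, sequence) pairs."""
    return "".join(format_fasta_record(i, s, width) for i, s in records)


if __name__ == "__main__":
    # Hypothetical gene identifier and sequence, for illustration only.
    print(records_to_fasta([("gene_1", "ATGC" * 20)]), end="")
```

Under this sketch, amino acid and nucleotide sequences would be passed through records_to_fasta separately, producing one .faa and one .fna file.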
To see how the format conversion works on test data, run the following from the python_scripts
directory:
./convert_table_to_fasta.py --gene_dir ../test_data/input/ --out_dir ../test_data/output/
and you will find the output in ../test_data/output/ (the --out_dir given above).