A workflow for getting information for genes based on RNA-seq data. This workflow was built using Snakemake, a pythonic workflow system, and uses Python, Shell and R scripts for its various tasks. The workflow gathers the following information for genes and returns it in a human-readable report:
Snakemake can be run on both Linux and Windows machines, though we find it more convenient to work with a Linux machine. If you are on a Windows platform, don't worry, there are multiple solutions:
In this readme we'll cover the setup for both a full Linux OS and a Vagrant Linux VM.
Note: we are currently experiencing some difficulties running shell scripts in a Vagrant environment.
Download the right Vagrant package for your operating system.
Run the installer and check that Vagrant installed correctly:
> vagrant
> git clone https://github.com/DaanJG98/RNA-Seq-Workflow.git
> vagrant init hashicorp/precise64
> vagrant up
To get into the virtual machine, type:
> vagrant ssh
Make sure you are inside your Vagrant VM when performing the next steps.
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
You have to open a new terminal in order to use conda. Make sure you are logged in to your virtual environment again.
$ conda env create --name {your-environment-name} --file /vagrant/environment.yaml
To activate your environment:
$ source activate {your-environment-name}
To deactivate simply:
$ source deactivate
Run the workflow with the snakemake command:
$ cd /vagrant
$ snakemake {optional parameters}
Before running the complete workflow, first run:
$ snakemake data/RNA-seq-ids.txt
To see the DAG of the workflow, use:
$ snakemake --dag
To save the DAG to a file, use:
$ snakemake --dag | dot -Tpdf > dag.pdf
For more information about Snakemake, check the Snakemake docs.
> git clone https://github.com/DaanJG98/RNA-Seq-Workflow.git
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
$ conda env create --name {your-environment-name} --file {path-to-file}/environment.yaml
To activate your environment:
$ source activate {your-environment-name}
To deactivate simply:
$ source deactivate
Run the workflow with the snakemake command:
$ snakemake {optional parameters}
Before running the complete workflow, first run:
$ snakemake data/RNA-seq-ids.txt
To see the DAG of the workflow, use:
$ snakemake --dag
To save the DAG to a file, use:
$ snakemake --dag | dot -Tpdf > dag.pdf
For more information about Snakemake, check the Snakemake docs.
RNA-seq data should be provided in the following tab-separated format:
ID | Condition 1 | Condition ... |
---|---|---|
gene A | {value} | ... |
gene B | {value} | ... |
... | ... | ... |
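
As an illustration of the input format above, the following is a minimal Python sketch of how such a tab-separated file could be read. It is not taken from the workflow itself: the function name `read_counts`, the gene IDs, the condition names and the values are all made up for the example, and it assumes every value column is numeric.

```python
import csv
import io


def read_counts(handle):
    """Parse a tab-separated counts table into {gene_id: {condition: value}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)      # first row: "ID", then one column per condition
    conditions = header[1:]
    counts = {}
    for row in reader:
        if not row or not row[0].strip():
            continue           # skip blank lines
        gene_id, values = row[0], row[1:]
        # Assumption: all condition values can be read as floats.
        counts[gene_id] = {c: float(v) for c, v in zip(conditions, values)}
    return counts


# Hypothetical two-gene input; IDs, condition names and values are made up.
example = "ID\tcond1\tcond2\ngeneA\t10\t12.5\ngeneB\t0\t3\n"
table = read_counts(io.StringIO(example))
print(table["geneA"]["cond2"])
```

With a real file, the same function could be called as `read_counts(open(path, newline=""))`, where `path` points to your own counts file.
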
conv_kegg_ids | |
---|---|
Input | RNA-seq-ncbi-ids.txt |
Output | RNA-seq-conv-kegg.txt |
Script | conv_kegg.sh |
Description | Pass NCBI IDs to KEGG REST API and return corresponding KEGG IDs. |
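
The workflow performs this step in the shell script conv_kegg.sh. Purely as an illustration of the idea, here is a Python sketch of how a request to the KEGG REST `conv` operation could be built and how its tab-separated response could be parsed. The helper names and the sample IDs are hypothetical, the sample response is made up, and no network call is made here.

```python
# Base URL of the KEGG REST "conv" operation, targeting KEGG gene IDs.
KEGG_CONV_URL = "https://rest.kegg.jp/conv/genes/"


def build_conv_url(ncbi_ids):
    """Build a conv URL for a batch of NCBI Gene IDs (entries joined with '+')."""
    entries = "+".join(f"ncbi-geneid:{gene_id}" for gene_id in ncbi_ids)
    return KEGG_CONV_URL + entries


def parse_conv_response(text):
    """Map NCBI Gene ID -> KEGG gene ID from a tab-separated conv response."""
    mapping = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue  # ignore empty or malformed lines
        source, target = line.split("\t", 1)
        mapping[source.replace("ncbi-geneid:", "", 1)] = target.strip()
    return mapping


# Made-up IDs and response text, only to show the expected shape.
url = build_conv_url(["111", "222"])
sample = "ncbi-geneid:111\torg:gene_a\nncbi-geneid:222\torg:gene_b\n"
print(url)
print(parse_conv_response(sample))
```
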
create_gc_graphs | |
---|---|
Input | RNA-seq-sequences.txt, RNA-seq-ids.txt |
Output | {gene}.png |
Script | create_graph.R |
Description | Create graph showing GC% and AT% for each gene. |
filter_ids | |
---|---|
Input | RNA-Seq-counts.txt |
Output | RNA-seq-ids.txt |
Script | In-rule Python script |
Description | Filter IDs out of input file. |
get_gene_info | |
---|---|
Input | RNA-seq-ncbi-ids.txt |
Output | RNA-seq-gene-info.txt |
Script | In-rule Python script |
Description | Fetch gene data from Entrez, make selection of specific attributes and return these values. |
get_genes_per_pubmed | |
---|---|
Input | RNA-seq-gene-info.txt, RNA-seq-ids.txt |
Output | RNA-seq-genes-per-pubmed.txt |
Script | In-rule Python script |
Description | Get per PubMed article the corresponding genes. |
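
Grouping genes per PubMed article amounts to inverting a gene-to-articles mapping. The sketch below shows one way this could be done in Python. It is not the in-rule script itself: the function name, the gene IDs and the PubMed IDs are placeholders, and it assumes the gene info has already been reduced to a dictionary of gene ID to PubMed IDs.

```python
from collections import defaultdict


def genes_per_pubmed(gene_to_pmids):
    """Invert {gene: [pmid, ...]} into {pmid: [gene, ...]} with sorted genes."""
    grouped = defaultdict(set)
    for gene, pmids in gene_to_pmids.items():
        for pmid in pmids:
            grouped[pmid].add(gene)  # a set avoids listing a gene twice
    return {pmid: sorted(genes) for pmid, genes in sorted(grouped.items())}


# Placeholder IDs, not real genes or articles.
example = {"geneA": ["1001", "1002"], "geneB": ["1002"]}
print(genes_per_pubmed(example))
```
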
get_kegg_ids | |
---|---|
Input | RNA-seq-conv-kegg.txt |
Output | RNA-seq-kegg-ids.txt |
Script | In-rule Python script |
Description | Pass KEGG IDs to Bio.KEGG REST and return corresponding pathway IDs. |
get_ncbi_ids | |
---|---|
Input | RNA-seq-ids.txt |
Output | RNA-seq-ncbi-ids.txt |
Script | get_ncbi_ids.sh |
Description | Pass IDs from input to Entrez and return corresponding NCBI IDs. |
get_orthologs | |
---|---|
Input | RNA-seq-ids.txt |
Output | RNA-seq-orthologs.txt |
Script | get_omadb_orthologs.sh |
Description | Pass IDs from input to OMA Browser API and return IDs of corresponding orthologs. |
get_sequence | |
---|---|
Input | RNA-seq-gene-info.txt |
Output | RNA-seq-sequences.txt |
Script | In-rule Python script |
Description | Fetch specific sequence in genome from Entrez, calculate GC-percentage and return these values. |
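
The GC-percentage mentioned above is the share of G and C bases in a sequence. The following is a minimal sketch of that calculation, assuming the sequence has already been fetched as a plain string. The function name is hypothetical, and the choice to ignore ambiguous bases such as N is an assumption of this example, not necessarily what the workflow does.

```python
def gc_percentage(sequence):
    """Return the percentage of G and C among the A, C, G and T bases."""
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        return 0.0  # avoid division by zero for empty or all-ambiguous input
    gc_count = sum(base in "GC" for base in bases)
    return 100.0 * gc_count / len(bases)


gc = gc_percentage("ATGC")
at = 100.0 - gc  # AT% follows directly once ambiguous bases are excluded
print(gc, at)
```
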
report | |
---|---|
Input | RNA-seq-ids.txt, RNA-seq-ncbi-ids.txt, RNA-seq-gene-info.txt, RNA-seq-sequences.txt, RNA-seq-kegg-ids.txt, RNA-seq-genes-per-pubmed.txt, RNA-seq-orthologs.txt |
Output | report.html |
Script | report_parser.py |
Description | Parse all output data from previous rules into a single report. |