RNA-Seq-Workflow

A workflow for getting information for genes based on RNA-seq data. This workflow was built using Snakemake, a pythonic workflow system, and uses Python, Shell and R scripts for its various tasks. The workflow gathers the following information for genes and returns it in a human-readable report:

NCBI ID
Nucleotide sequence and GC-percentage
Literature (PubMed)
KEGG pathways
Orthologs

Get started

Snakemake can be run on both Linux and Windows machines, though we find it more convenient to work with a Linux machine. If you are on a Windows platform, don't worry there are multiple solutions:

Setting up a Linux virtual machine with for example VirtualBox
Using Vagrant to house your virtual machine. ~~We recommend this option if you don't need Linux for other purposes~~

In this readme we'll cover the setup for both a full Linux OS and a Vagrant Linux VM

Setup using Vagrant

Currently we are experiencing some difficulties trying to run shell scripts in a vagrant environment

First you need to download and install Vagrant

Download the right package for your operating system
Run the installer and check if vagrant installed correctly
- > vagrant
Next checkout this git repo in a conveniet directory

> git clone https://github.com/DaanJG98/RNA-Seq-Workflow.git

cd into this directory

Create a 64-bit Ubuntu environment in this directory with:

> vagrant init hashicorp/precise64

> vagrant up

To get into the virtual machine type:

> vagrant ssh

Make sure you are into your Vagrant VM when performing the next steps

Installing Miniconda 3

$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

$ bash Miniconda3-latest-Linux-x86_64.sh

You have to open a new terminal inorder to use conda, make sure to logged in to your virtual environment.

Creating the environment using the provided environment file

$ conda env create --name {your-environment-name} --file /vagrant/environment.yaml

To activate your environment:

$ source activate {your-environment-name}

To deactivate simply:

$ source deactivate

Now the workflow can be run by using the snakemake command

$ cd /vagrant

$ snakemake {optional parameters}

In order to run the complete workflow, $ snakemake data/RNA-seq-ids.txt must be ran previously

If you like to see the DAG of the workflow, use the $snakemake --dag command

For saving the DAG to a file use: $snakemake --dag | dot -Tpdf > dag.pdf

For more info about snakemake check the docs at: snakemake docs

Setup using a full Linux machine

Start by checking out this git repo in a new directory

> git clone https://github.com/DaanJG98/RNA-Seq-Workflow.git

Next install Miniconda 3

$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

$ bash Miniconda3-latest-Linux-x86_64.sh

Create the environment with the provided environment file

$ conda env create --name {your-environment-name} --file {path-to-file}/environment.yaml

To activate your environment:

$ source activate {your-environment-name}

To deactivate simply:

$ source deactivate

Now the workflow can be run by using the `snakemake` command

$ snakemake {optional parameters}

In order to run the complete workflow, $ snakemake data/RNA-seq-ids.txt must be ran previously

If you like to see the DAG of the workflow, use the $snakemake --dag command

For saving the DAG to a file use: $snakemake --dag | dot -Tpdf > dag.pdf

For more info about snakemake check the docs at: snakemake docs

Data format

RNA-seq data should be provided in the following, tab separated, format:

ID	Condition 1	Condition ...
gene A	{value}	...
gene B	{value}	...
...	...	...

Snakefile rules documentation

conv_kegg_ids
Input	RNA-seq-ncbi-ids.txt
Output	RNA-seq-conv-kegg.txt
Script	conv_kegg.sh
Description	Pass NCBI IDs to KEGG REST API and return corresponding KEGG IDs.

create_gc_graphs
Input	RNA-seq-sequences.txt, RNA-seq-ids.txt
Output	{gene}.png
Script	create_graph.R
Description	Create graph showing GC% and AT% for each gene.

filter_ids
Input	RNA-Seq-counts.txt
Output	RNA-seq-ids.txt
Script	In-rule Python script
Description	Filter IDs out of input file.

get_gene_info
Input	RNA-seq-ncbi-ids.txt
Output	RNA-seq-gene-info.txt
Script	In-rule Python script
Description	Fetch gene data from Entrez, make selection of specific attributes and return these values.

get_genes_per_pubmed
Input	RNA-seq-gene-info.txt, RNA-seq-ids.txt
Output	RNA-seq-ids.txt
Script	In-rule Python script
Description	Get per PubMed article the corresponding genes.

get_kegg_ids
Input	RNA-seq-conv-kegg.txt
Output	RNA-seq-kegg-ids.txt
Script	In-rule Python script
Description	Pass KEGG IDs to Bio.KEGG REST and return corresponding pathway IDs.

get_ncbi_ids
Input	RNA-seq-ids.txt
Output	RNA-seq-ncbi-ids.txt
Script	get_ncbi_ids.sh
Description	Pass IDs from input to Entrez and return corresponding NCBI IDs.

get_orthologs
Input	RNA-seq-ids.txt
Output	RNA-seq-orthologs.txt
Script	get_omadb_orthologs.sh
Description	Pass IDs from input to OMA Browser API and return IDs of corresponding orthologs.

get_sequence
Input	RNA-seq-gene-info.txt
Output	RNA-seq-sequences.txt
Script	In-rule Python script
Description	Fetch specific sequence in genome from Entrez, calculate GC-percentage and return these values.

report
Input	RNA-seq-ids.txt, RNA-seq-ncbi-ids.txt, RNA-seq-gene-info.txt, RNA-seq-sequences.txt, RNA-seq-kegg-ids.txt, RNA-seq-genes-per-pubmed.txt, RNA-seq-orthologs.txt
Output	report.html
Script	report_parser.py
Description	Parse all output data from previous rules into a single report.

DaanJG98 / RNA-Seq-Workflow

readme

RNA-Seq-Workflow

Get started

Setup using Vagrant

First you need to download and install Vagrant

Next checkout this git repo in a conveniet directory

cd into this directory

Create a 64-bit Ubuntu environment in this directory with:

Installing Miniconda 3

Creating the environment using the provided environment file

Now the workflow can be run by using the snakemake command

Setup using a full Linux machine

Start by checking out this git repo in a new directory

Next install Miniconda 3

Create the environment with the provided environment file

Now the workflow can be run by using the snakemake command

Data format

Snakefile rules documentation

Now the workflow can be run by using the `snakemake` command

Now the workflow can be run by using the `snakemake` command