R-ODAF / R-ODAF_Health_Canada

Health Canada's version of the R-ODAF pipeline, which includes additional visualization and pipelining features
MIT License
2 stars 3 forks source link

Omics Data Analysis Frameworks for Regulatory application (R-ODAF)

Issues Pull Requests Repository Size Languages License Tests



Python R Tests

What is the R-ODAF?

It is an analysis framework geared toward the use of 'omics data in toxicology experiments. It aims to use commonly available open-source tools for analysis of transcriptomic data, specifically using DESeq2 for the determination of differentially expressed genes (DEGs).

This R-ODAF repository, which was forked from the main branch, is under development by researchers at Health Canada and has been used primarily in the context of regulatory toxicology. The intended use is for processing high-throughput sequencing data from FASTQ files to DEG lists, and to generate exploratory analyses for use in toxicology research.

What does this repository do?

This repository can be used to analyze transcriptomic datasets generated by either TempO-seq or RNA-seq. It generates:

What does this repository NOT do?

This repository provides no warranty or guarantees on the quality of your analysis. It is used actively in research projects and is not intended to provide any type of authoritative assessment of your data.

There may be other tools that are more applicable to your goals. Indeed, if you are not undertaking toxicogenomics experiments, then that is likely the case.

How can I use this repository?

In its current state of development, this repository should be used by individuals (or teams) who are experienced with running bioinformatic analysis in a linux environment. Proficiency in R is also very helpful. It is necessary to understand scripting and how to process data. We encourage users to learn if they are interested in pursuing this type of data analysis; to that end, we provide a users guide below detailing the steps required to process data, and best practices for using this analysis pipeline.

Annotation versions

For reasons of reproducibility, this respository uses the following annotation versions:

org.Hs.eg.db 3.14.0 (https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html)

org.Mm.eg.db 3.14.0 (https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html)

Installation and getting started

Cloning the repository

You should start by cloning the repository locally, to a computer that can handle analysis of your data and where your raw data is locally accessible.

You can clone the repository using this command:

git clone https://github.com/R-ODAF/R-ODAF_Health_Canada.git

Note: Working in RStudio

Typically, when undertaking a new analysis, it is a good idea to have a separate folder for each project. This makes it easier to manage, and also makes your code more portable. Working in RStudio facilitates these tasks.

Therefore, it is a good idea to open the cloned repository in RStudio as a project (there is already a project file by default in this repository, so RStudio should recognize it). This tells RStudio where to look for files, which is better than setting a working directory.

Note that R Studio Server has a built-in terminal in which unix commands can be run. However, you should use the command line in a separate terminal rather than using the one in R Studio Server. This is because Rstudio will try to use packages installed in user directories rather than those installed in conda environments, causing the pipeline run to abort.

Dependencies

We recommend installing snakemake within its own conda environment to avoid conflicts.

NOTE on Snakemake versions : Snakemake v8.0 and above introduced changes to the snakemake command used to run the workflow. The commands given in this README apply to v8.0 and above. If you are using an earlier version of Snakemake, replace --software-deployment-method conda with --use-conda For more information: https://snakemake.readthedocs.io/en/stable/project_info/history.html#detailed-breaking-changes

If you do not yet have mamba, install it.

conda install -n base -c conda-forge mamba

Follow the installation steps provided in the snakemake documentation https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#full-installation

conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake

To activate this environment:

conda activate snakemake

Other dependencies are managed by snakemake via the creation of conda environments. The conda environments needed for various steps of the workflow are defined in a series of .yml files in workflow/envs.

Installing environments

Environments must be installed prior to running the workflow because of additional dependencies that must be added to the R-ODAF_reports environment via the install.R script included in this repo. If you use snakemake to run the full workflow without first installing and fixing the environments, the workflow will not run to completion. This onlly needs to be done once per project, or once ever if you install environments in a persistent location.

Note on Snakemake versions :

Snakemake v8.0 and above introduced changes to the snakemake command used to run the workflow. The commands given in this README apply to v8.0 and above. If you are using an earlier version of Snakemake, replace --software-deployment-method conda with --use-conda

From inside the repo directory, run:

snakemake --cores 32 --software-deployment-method conda --conda-create-envs-only

Choose the number of cores according to what is available in your system.

Then install additional software in the reports environment:

conda run -p $(grep -rl "R-ODAF_reports" .snakemake/conda/*.yaml | sed s/\.yaml//) Rscript install.R

Persistent environments

If you intend to run this analysis on multiple datasets, it can be advantageous to instead install the environments in a persistent location, and point to this location with snakemake's --conda-prefix flag. This avoids having to install the environments multiple times, which saves both time and disk place.

For example, the code below installs the environments in ~/snakemake_envs. Using this method, every time the analysis is run with snakemake, the command must include --conda-prefix ~/snakemake_envs

mkdir ~/snakemake_envs #Place and name this directory as you wish, and use correctly later with --conda-prefix

snakemake --cores 32 --software-deployment-method conda --conda-create-envs-only --conda-prefix ~/snakemake_envs/

conda run -p $(grep -rl "R-ODAF_reports" ~/snakemake_envs/*.yaml | sed s/\.yaml//) Rscript install.R

When running the workflow, remember to use the --conda-prefix flag.

Set up inputs

More details as well as template files can be found in the inputs subdirectories.

Config.yaml

The config file is the control centre for the workflow. An up-to-date default config is provided in the repo, and explanations of some of the commonly-adjusted parameters are provided in the readme in the config directory.

Reference genome files

You must provide a reference sequence for alignment of the raw reads. For TempO-Seq experiments, these are provided by BioSpyder; for RNA-seq experiments they can be obtained through your favorite database. This is currently beyond the scope of this guide. Ensure you have some type of annotation file (GTF format) available as well, to dictate which sequences in the FASTA file correspond to which genes or probes. These will be used to create a STAR index within the directory where you store your reference genome.

Sample names

A column labeled sample_ID must be included in the metadata table, containing a unique identifier for each sample. This identifier must match the fastq.gz file names in the following way:

For single-end sequencing: {sample_ID}.fastq.gz

For paired-end sequencing: {sample_ID}.R1.fastq.gz and {sample_ID}.R2.fastq.gz

Defining samples to remove from analysis

If there are samples in your inputs/raw/ directory that you want to exclude from QC and differential expression analysis (ex. samples that were exposed to cytotoxic doses based on laboratory tests), create a file named "remove.txt" in the inputs/metadata/ directory. In this file, give the sample names, one per line, matching the sample_ID column in the metadata file. These samples will still go through preprocessing (trimming and alignment) and appear in the count table, but will be removed during QC. A table of manually removed samples will appear in the QC report.

Run the workflow

Performing a dry run

Note on Snakemake versions

We strongly recommend using snakemake's dry run functionality to check for errors before beginning an actual run. (See https://snakemake.readthedocs.io/en/v7.32.2/executing/cli.html).

Use the flags --dry-run, or -n

For example, snakemake -s ./workflow/Snakefile --cores 32 --software-deployment-method conda -n

Running the workflow

If you installed the conda environments in a persistent location, remember to use the --conda-prefix flag.

To run the whole pipeline:

snakemake --software-deployment-method conda --cores 32

Choose the number of cores according to what is available in your system.

pipeline_image

Running workflow sections

This workflow is modularized, so you can run (or re-run) some sections independently if needed. This can be beneficial if, for example:

Note that you need to run the pipeline steps in order; for example, you cannot do the quality control step if outputs from the preprocessing step do not yet exist.

Preprocessing only

To run only the preprocessing steps and get a count table for TempO-Seq datasets:

snakemake -s ./workflow/modules/4-preprocess_temposeq.smk --software-deployment-method conda --cores 32

To run preprocessing steps and get a count table for RNA-Seq datasets:

snakemake -s ./workflow/modules/4-preprocess_rnaseq.smk --software-deployment-method conda --cores 32

Quality control

Once the preprocessing steps are complete, you can run (or re-run) the quality control steps. These apply the R-ODAF filters on all samples and produce a multiQC report, a study-wide report on quality control filters according to R-ODAF criteria, and a filtered metadata file.

snakemake -s ./workflow/modules/5-qc.smk --software-deployment-method conda --cores 32

Differential expression analysis

Once the preprocessing and quality control steps are complete, you can run (or re-run) differential analysis via DESeq2 and production of DEG reports. This will produce DEG lists, exploratory reports, and interactive gene expression reports that incorporate DEG filtering according to R-ODAF criteria.

snakemake -s ./workflow/modules/6-diffexp.smk --software-deployment-method conda --cores 32

Note that you will have to delete dummy output file "reports_complete" to re-run the differential expression analysis step.

Output files

Quality control

Differentially expressed genes

Each analysis folder contains the following subdirectories: reports, DEG_lists, RData, Pipeline_record, (optional) BMD_and_biomarker_files, (optional) pathway_analysis.

reports contains html reports:

__DEG_lists__ contains lists of differentially expressed genes, as well as DEG information formatted to be used as input for further analysis tools (ex. IPA). If you specified DEseq_facets in the config file, subdirectories will be made for each facet, each containing:

RData contains an .RData file that can be used to re-import results to R for further analysis if needed. It also contains records of the contrasts and config file used.

__Pipeline_record__ contains copies of the config file, contrasts file, and reports environment used in the analysis. This is helpful in keeping track of the parameters used when running multiple analyses.

__BMD_and_biomarker_files__ contains files formatted for use in bmdexpress and various biomarker analysis (ex. TGx-DDI, TGx-HDACi)

__pathway_analysis__ contains reports on pathway analysis

Tips and tricks

Troubleshooting

Questions?

I'm not surprised if you have questions. This is very much a draft of the workflow, and is incomplete except for serving highly experienced users. Please reach out if you have questions or issues, and feel free to submit an issue or PR with any suggestions you might have.