UBC-MDS/Taracyc_Ocean_Virus_Analysis

Viral Voyager: Taracyc Ocean Virus Analysis

Authors

Name	CWL
Harjyot Kaur	HarjyotKaur
Heather Van Tassel	heathervant

Overview

The oceans are among the most promising places to sequester carbon. They play a dominant role in oxygen production, weather patterns, climate, and the global carbon cycle. Cyanobacteria in the oceans fix carbon through photosynthesis, and when the bacteria die, some of this carbon sinks to the bottom of the ocean, where it is sequestered from the atmosphere. Some viruses infect these bacteria and alter their chances of survival.

Motivation for research

In 2009, a three-year voyage around the world began with the aim of collecting more information about the oceans. The expedition was run by the Tara Oceans project, involved over 150 scientists interested in the biodiversity and distribution of micro-organisms in the oceans, and collected 300 water samples. The Hallam lab at UBC took the genetic sequences from the viruses and bacteria in these samples and developed an algorithm that classifies the DNA sequences into the biological pathways their genes may be involved in regulating. At hackseq 2018, held at the University of British Columbia, a team of students and researchers used this dataset to build a Shiny app that helps the public interact with and explore the data. Many questions remain to be explored with this dataset, both to characterize the genetic diversity of the ocean and to make inferences about how bacteria and viruses interact and how they might be affected by a changing climate.

Research Question

Does the mean abundance of viral DNA sequences differ across biological pathways?
Does the mean abundance of viral DNA sequences differ across ocean depth levels?
Does the mean abundance of viral DNA sequences of the biological pathways differ across ocean depth levels?

Analysis Overview

The goal is to carry out a Two-Way ANOVA (Factorial Analysis) to test the main effects of biological pathway and ocean depth level, and their interaction effect, on the abundance of viral DNA sequences.

Variable Name	Type	Description
RKPM	Continuous	Reads per kilobase of transcript per million mapped reads
LEVEL1	Categorical	Biological Pathways
Depth	Categorical	Levels of ocean depth
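
For readers who want the model behind the three research questions written out, the following is a sketch of the standard two-way ANOVA model with interaction, using the variables in the table above. The symbols are introduced here for illustration and are assumptions, not notation from the report; the report itself may use a different parameterization.

```latex
% y_{ijk}: abundance (RKPM) of the k-th observation in pathway i (LEVEL1) at depth level j (Depth)
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2) \text{ independently}

% Null hypotheses matching the three research questions
H_0^{(1)}: \alpha_i = 0 \ \text{for all } i            % no pathway main effect
H_0^{(2)}: \beta_j = 0 \ \text{for all } j             % no depth main effect
H_0^{(3)}: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j  % no pathway-by-depth interaction
```

Here \mu is the overall mean, \alpha_i the effect of pathway i, \beta_j the effect of depth level j, and (\alpha\beta)_{ij} the interaction term that captures whether pathway differences change with depth.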

A detailed report of the analysis is available in the doc/ folder.

Usage

There are three ways to run the entire analysis.

For all of them, first download or clone this GitHub repository: Taracyc_Ocean_Virus_Analysis

Method 1: Using Docker

Install Docker
Use the command line to navigate to the root of this project directory
Run the following command in the terminal to download the Docker image:

docker pull hkaur112/taracyc_ocean_virus_analysis

Run the following command in the terminal to run the analysis, replacing PATH_ON_YOUR_COMPUTER with the absolute path to the root of this project on your computer:

docker run --rm -e PASSWORD=test -v PATH_ON_YOUR_COMPUTER:/home/rstudio/taracyc_analysis hkaur112/taracyc_ocean_virus_analysis make -C '/home/rstudio/taracyc_analysis' all

To clean the output of the analysis, run the following command in the terminal, again replacing PATH_ON_YOUR_COMPUTER with the absolute path to the root of this project on your computer:

docker run --rm -e PASSWORD=test -v PATH_ON_YOUR_COMPUTER:/home/rstudio/taracyc_analysis hkaur112/taracyc_ocean_virus_analysis make -C '/home/rstudio/taracyc_analysis' clean
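
To avoid typing the absolute path by hand, the two Docker commands above can be generated from the current directory. The following is a minimal sketch, assuming it is run from the root of the project; `taracyc_docker_cmd` is a hypothetical helper name, not something shipped with this repository. It only prints the commands (a dry run) so they can be checked before anything is executed.

```shell
# Hypothetical helper (not part of the repository): prints the docker command
# for a given make target, using the current directory as the project root.
taracyc_docker_cmd() {
  target="${1:-all}"   # make target: "all" (default) or "clean"
  printf 'docker run --rm -e PASSWORD=test -v "%s":/home/rstudio/taracyc_analysis hkaur112/taracyc_ocean_virus_analysis make -C /home/rstudio/taracyc_analysis %s\n' "$PWD" "$target"
}

# Dry run: show the commands instead of executing them.
taracyc_docker_cmd all
taracyc_docker_cmd clean
```

Once the printed command looks right, it can be copied into the terminal. The sketch uses an absolute path for `make -C` so that it does not depend on the working directory inside the container.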

Link: Dockerfile

Method 2: Using Make

Use the command line to navigate to the root of this project directory
Type the following code into terminal to run the analysis:

make all

To clean the output of the analysis, type the following code into the terminal:

make clean

Link: Makefile

Dependency diagram of the Makefile

Method 3: Shell Script

Use the command line to navigate to the root of this project directory
Run the following in your command shell:

bash run_all.sh

Link: Shell Script run_all.sh

Detailed Workflow

Step 1: Data Load

The first script, src/taracyc_data_load.R, downloads the data from a URL and stores it in the csv file data/taracyc_data.csv.

Step 2: Data Wrangling and Exploratory Data Analysis

The second script, src/taracyc_data_explore_clean.R, takes the output of the first script, explores the data, and cleans it. It produces 5 plots that are stored in results/figures as .png files. It also writes the cleaned data to the csv file taracyc_data_cleaned.csv.

Step 3: Data Analysis

The third script, src/taracyc_data_analysis.R, takes the output of the second script, runs a Two-Way ANOVA on the data, and stores the results in the csv file results/taracyc_results.csv.

Step 4: Compiling Results

The fourth script, src/taracyc_results.R, takes the output of the second script, produces a visual representation of the Two-Way ANOVA, and stores it in results/figures/fig7_results.png.

Step 5: Creating Report

The report in doc/taracyc_report.Rmd is rendered as a markdown file and an html file, both stored in the doc/ folder.
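
The five steps above can be sketched as a single shell pipeline. This is an illustration of the order of the steps, not the contents of run_all.sh or the Makefile: the helper names `run` and `pipeline` are hypothetical, and the scripts are called without arguments because this README does not list any (the real scripts may require input and output paths). By default the sketch only prints each command.

```shell
# Run a command, or only print it when DRY_RUN is 1 (the default here).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# The five workflow steps, in order; each step runs only if the previous
# one succeeded. Script arguments are an open assumption and are omitted.
pipeline() {
  run Rscript src/taracyc_data_load.R &&
  run Rscript src/taracyc_data_explore_clean.R &&
  run Rscript src/taracyc_data_analysis.R &&
  run Rscript src/taracyc_results.R &&
  run Rscript -e "rmarkdown::render('doc/taracyc_report.Rmd')"
}

pipeline
```

Setting `DRY_RUN=0` before calling `pipeline` would execute the commands instead of printing them. The output formats produced by `rmarkdown::render` depend on the YAML header of the report.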

Dependencies

R version 3.5.1 and the following R libraries:

Library	Version
`tidyverse`	1.2.1
`ggplot2`	3.0.0
`car`	3.0-2
`ggpubr`	0.2.999
`rmarkdown`	1.10
`knitr`	1.20
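
Before running Method 2 or Method 3, it can help to check that the required tools and libraries are present. The following is a minimal sketch under the assumption that `make`, `Rscript`, and (for Method 1 only) `docker` are expected on the PATH; the helper name `have` is hypothetical. It reports what is installed and does not compare versions against the table above.

```shell
# Return success if a command is available on PATH.
have() {
  command -v "$1" >/dev/null 2>&1
}

# Report which command-line tools are available.
for tool in make Rscript docker; do
  if have "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done

# If R is available, print the installed version of each listed library.
if have Rscript; then
  Rscript -e 'for (p in c("tidyverse", "ggplot2", "car", "ggpubr", "rmarkdown", "knitr")) cat(p, if (requireNamespace(p, quietly = TRUE)) as.character(packageVersion(p)) else "NOT INSTALLED", "\n")'
fi
```

Newer library versions than those listed may behave differently; the Docker image in Method 1 is the way to get the environment the analysis was built with.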