Conservation across sequences

iamciera commented 4 years ago

Goal: Learn how to use phastCons for scoring conservation.

Conservation is basically a score that estimates how similar the sequences are. phastCons was the only package I found that incorporates a tree to accomplish this, so I think we should start here and possibly, down the line, implement our own algorithm. But first, we need to understand what the program does. The best way to approach this is to go through tutorials online, apply our data to them, and read papers that use it.

We want to (1) look at the "conservation score" across whole alignment files, (2) look at it across specific motifs, and (3) learn the best way to visualize the scores.

1. Whole Alignment Files

We want to score conservation across the entire alignment file, as a way to normalize the dataset and get a global view of each region. For example, we can ask questions like "Are the enhancers with positive function (enhancer_func == 1) more conserved than those without (enhancer_func == 0)?"
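One way the enhancer_func question could be asked once each region has a single summary score is sketched below. The data frame, its `mean_cons` column, and every value in it are made-up placeholders, not results; the rank-sum test is just one reasonable choice for comparing two small groups.

```r
# Placeholder table: one row per enhancer region. mean_cons stands in for the
# mean per-base conservation score of that region (values are invented).
regions <- data.frame(
  region        = paste0("region_", 1:8),
  enhancer_func = c(1, 1, 1, 1, 0, 0, 0, 0),
  mean_cons     = c(0.82, 0.74, 0.91, 0.66, 0.41, 0.58, 0.37, 0.49)
)

# Split the scores by functional class.
func    <- regions$mean_cons[regions$enhancer_func == 1]
nonfunc <- regions$mean_cons[regions$enhancer_func == 0]

# Median conservation per group, named by the enhancer_func value.
group_medians <- tapply(regions$mean_cons, regions$enhancer_func, median)

# One-sided Wilcoxon rank-sum test: are functional enhancers more conserved?
test <- wilcox.test(func, nonfunc, alternative = "greater")

print(group_medians)
print(test$p.value)
```

With real data, `mean_cons` would come from averaging the per-base scores of each alignment, and other summaries (median, fraction of bases above a cutoff) may turn out to be more informative.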

The alignment files are located here: https://drive.google.com/open?id=1UEXg0QMDFKIrvwnTxo64t2AWseYOCfD9
The trees are located here: https://github.com/DiscoveryDNA/montium_5_TFBS_evolution/tree/master/data/tree.

The best tutorial I found is below, but feel free to look around. I will add more on this issue.

Dave Tang tutorial - feel free to do his tutorial in its entirety to see if you can get it to work. His sequences are retrieved differently and he uses .maf files, while we will be using .fasta files for the alignment input.

2. Single Motifs

I am not sure of the best way to approach this. It could be the same as doing the whole alignment, but I will have to look into it more. Please feel free to explore this yourself as well.

PhastCons question about short sequences
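If the motif case does turn out to be the same as the whole-alignment case, one option is to score the full alignment and then summarize the per-base scores inside the motif window. The sketch below assumes a data frame with `coord` and `post.prob` columns (the shape I believe rphast returns as `post.prob.wig`); the scores, the motif coordinates, and the `motif_mean` helper are all hypothetical.

```r
# Invented per-base scores for a 20 bp alignment.
scores <- data.frame(
  coord     = 1:20,
  post.prob = c(rep(0.1, 5), rep(0.9, 8), rep(0.2, 7))
)

# Mean score over alignment columns start..end (inclusive).
# Returns NA if the window does not overlap the scored coordinates.
motif_mean <- function(scores, start, end) {
  in_motif <- scores$coord >= start & scores$coord <= end
  if (!any(in_motif)) {
    return(NA_real_)
  }
  mean(scores$post.prob[in_motif])
}

motif_score <- motif_mean(scores, start = 6, end = 13)
background  <- mean(scores$post.prob)

print(c(motif = motif_score, background = background))
```

Comparing the motif mean against the whole-region mean gives a rough sense of whether the motif stands out, though very short windows will give noisy estimates.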

3. Visualize

At this point the best way to accomplish this is to play around with the data while doing the tutorial, but make sure you are paying attention and making notes on possible options while doing your work.

Other resources

Please use comments below to add resources, ask questions, and provide comments relating to conservation.

ZLoeiu commented 4 years ago

Tutorial: https://cran.r-project.org/web/packages/rphast/vignettes/vignette1.pdf

iamciera commented 4 years ago

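## For switching out the species names in other files.
## Assumes `dataset` is an already-loaded data frame whose `species` column holds lane IDs.
species_key <- read.csv("../data/montium_species_laneID.csv", stringsAsFactors = FALSE)
species <- species_key$species
lane <- species_key$lane_ID

## Replace all the lane IDs with species names.
## fixed = TRUE matches each lane ID literally rather than as a regular expression.
## Caveat: a lane ID that is a substring of another ID will also match inside it.
for (j in seq_along(lane)) {
  dataset$species <- gsub(lane[j], species[j], dataset$species, fixed = TRUE)
}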

ZLoeiu commented 4 years ago

Hi Ciera, the IDs you gave me don't match the tree names. The formatting of the species names is different, and some names are missing from both tree files. For example, D. baimaii is in the dictionary of names but not in any of the tree files.
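A quick way to list exactly which names fail to match might look like the sketch below. The Newick string, the key names, and the `normalize_name` rule (lowercase, runs of punctuation and spaces collapsed to `_`) are illustrative assumptions; the real tree files and key may need a different normalization.

```r
# Pull tip labels out of a Newick string using base R only.
# A tip label directly follows "(" or "," and ends at ":", ",", ")" or ";".
newick_tips <- function(newick) {
  m <- gregexpr("(?<=[(,])[^(),:;]+", newick, perl = TRUE)
  trimws(regmatches(newick, m)[[1]])
}

# Hypothetical normalization so "D. Baimaii" and "D_baimaii" compare equal.
normalize_name <- function(x) {
  tolower(gsub("[^A-Za-z0-9]+", "_", trimws(x)))
}

# Toy inputs standing in for a tree file and the species key.
toy_tree <- "((D_kikkawai:0.1,D_leontia:0.1):0.05,D_bocki:0.2);"
key_names <- c("D. kikkawai", "D. leontia", "D. Baimaii")

tree_names <- normalize_name(newick_tips(toy_tree))
key_norm   <- normalize_name(key_names)

missing_from_tree <- setdiff(key_norm, tree_names)  # in key, not in tree
missing_from_key  <- setdiff(tree_names, key_norm)  # in tree, not in key

print(missing_from_tree)
print(missing_from_key)
```

Running the same comparison against the sequence names in an alignment file would show whether forcing the names to match is only a formatting problem or whether species are truly absent.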

ZLoeiu commented 4 years ago

The vignette I linked uses a genePred file of gene annotations to get its neutral model, and I was wondering if we need this file.

iamciera commented 4 years ago

Force the species to match alignment names
Make the neutral model based on one region
Get to the conservation visualization on one region, no matter how weird the pipeline is, so we can use the visualization to see how the conservation changes depending on which neutral model we use.
We will then go back and build the neutral model on different regions to see how this affects the conservation model.
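The steps above could be wired together roughly as follows with rphast. This is a sketch under several assumptions: the function name `score_region` and the file paths are placeholders, the tree's tip names are assumed to already match the alignment's sequence names, and the `expected_length` and `target_coverage` values are untuned. Fitting the model on the same region that is being scored is a shortcut for getting to a plot, not a properly neutral model.

```r
# Requires the rphast package. Nothing runs until score_region() is called.
score_region <- function(fasta_path, tree_path,
                         expected_length = 12, target_coverage = 0.25) {
  # Read the multiple alignment for one region.
  align <- rphast::read.msa(fasta_path, format = "FASTA")

  # Read the tree as a single Newick string (tip names must match the alignment).
  tree <- paste(readLines(tree_path), collapse = "")

  # Step 2: fit a substitution model on this one region to use as the "neutral" model.
  neutral_mod <- rphast::phyloFit(align, tree = tree, subst.mod = "REV")

  # Step 3: score conservation; viterbi = TRUE also predicts conserved elements.
  rphast::phastCons(align, neutral_mod,
                    expected.length = expected_length,
                    target.coverage = target_coverage,
                    viterbi = TRUE)
}
```

A call such as `pc <- score_region("region.fa", "tree.nwk")` (both file names hypothetical) should give per-base scores in `pc$post.prob.wig`, which can then be plotted and compared across neutral models.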

iamciera commented 4 years ago

December 13, 2019

phastweb: GUI that essentially mimics script and command line workflow.

Link: https://github.com/DiscoveryDNA/montium_5_TFBS_evolution/blob/master/R/phastcons.Rmd

Try different neutral models to see if the conservation changes, including the 27way model provided in phastweb (find file) and a neutral model built from a larger alignment of our sequences.
Double check all arguments used and make sure that they make sense for our project - @iamciera. Parameters are currently the defaults from phastweb.
Next main steps: create a pipeline to run on the rest of the alignments.
Trees: Missing species in the trees are a problem. Email the Turelli lab for the most up-to-date tree - @iamciera. Need to make a whole-genome tree with the alignment files.
