DiscoveryDNA / montium_5_TFBS_evolution

analysis of TFBS data from montium genomes

Conservation across sequences #1

Open iamciera opened 4 years ago

iamciera commented 4 years ago

Goal: Learn how to use phastcons for scoring conservation.

Conservation is basically a score that estimates how similar the sequences are. Phastcons was the only package that incorporated a tree to accomplish this, so I think we should start here and possibly, down the line, implement our own algorithm. But first, we need to understand what their program does. The best way to do this is to go through tutorials online, apply our data to them, and read papers that use it.

We want to look at the "conservation score" 1.) across whole alignment files and 2.) for specific motifs, and 3.) learn the best way to visualize the scores.

1. Whole Alignment Files

We want to score conservation across the entire alignment file, as a way to normalize the dataset and get a global view of each region. For example, we can ask questions like "Are the enhancers with positive function (enhancer_func == 1) more conserved than those without (enhancer_func == 0)?"
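Once each region has a single summary score, one way to ask that question is a simple two-group comparison. This is only a minimal sketch in R: the data frame, the `mean_cons` column (e.g. the mean per-base score for a region), and all of the numbers are made up for illustration. Only `enhancer_func` comes from our data.

```r
# Hypothetical per-region summary table; all values are invented for illustration.
regions <- data.frame(
  region        = paste0("region_", 1:8),
  enhancer_func = c(1, 1, 1, 1, 0, 0, 0, 0),
  mean_cons     = c(0.82, 0.75, 0.91, 0.68, 0.55, 0.71, 0.40, 0.62)
)

# Mean conservation score per class (names are "0" and "1")
group_means <- tapply(regions$mean_cons, regions$enhancer_func, mean)
print(group_means)

# Rank-based test of whether the two groups differ; it makes no
# normality assumption, which suits scores bounded between 0 and 1.
test <- wilcox.test(mean_cons ~ enhancer_func, data = regions)
print(test$p.value)
```

With real data the summary score could just as well be a median or the fraction of bases in conserved elements; which one is most sensible is still an open question.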

The best tutorial I found is below, but feel free to look around. I will add more on this issue.

2. Single Motifs

I am not sure of the best way to approach this. It could be the same as doing the whole alignment, but I will have to look into it more. Please feel free to explore this yourself as well.
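One possible starting point, assuming we already have a per-base score for every alignment column, is to average the scores over the columns each motif covers and compare that to the region-wide mean. The sketch below uses made-up scores and hypothetical motif coordinates; the table and column names are assumptions, not anything from our files.

```r
# Hypothetical per-base conservation scores for one aligned region
scores <- data.frame(
  coord = 1:20,
  score = c(rep(0.1, 5), rep(0.9, 6), rep(0.2, 9))
)

# Hypothetical motif positions (alignment columns, inclusive)
motifs <- data.frame(
  motif = c("motif_A", "motif_B"),
  start = c(6, 14),
  end   = c(11, 18)
)

# Mean score over the columns each motif covers
motifs$mean_score <- mapply(
  function(s, e) mean(scores$score[scores$coord >= s & scores$coord <= e]),
  motifs$start, motifs$end
)

# Region-wide mean, as a baseline for comparison
region_mean <- mean(scores$score)

print(motifs)
print(region_mean)
```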

3. Visualize

At this point the best way to accomplish this is to play around with the data while doing the tutorial, but make sure you are paying attention and making notes on possible options while doing your work.

Other resources

Please use comments below to add resources, ask questions, and provide comments relating to conservation.

ZLoeiu commented 4 years ago

Tutorial: https://cran.r-project.org/web/packages/rphast/vignettes/vignette1.pdf

iamciera commented 4 years ago
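
## For switching out the species names in other files
## Key table: one row per sample, with lane_ID and the matching species name
species_key <- read.csv("../data/montium_species_laneID.csv", stringsAsFactors = FALSE)
species <- species_key$species
lane <- species_key$lane_ID

## Replace all the lane IDs with species names.
## `dataset` is assumed to be loaded already, with lane IDs in dataset$species.
## fixed = TRUE treats each lane ID as literal text rather than a regex.
for (j in seq_along(lane)) {
  dataset$species <- gsub(lane[j], species[j], dataset$species, fixed = TRUE)
}
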
ZLoeiu commented 4 years ago

Hi Ciera, the IDs you gave me don't match the tree names. The formatting of the species names is different, and some names are missing from both tree files. For example, D. Baimaii is in the dictionary of names but not in any of the tree files.
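A quick way to list the mismatches is to put both sets of names into one format and compare them. This is a rough sketch in base R; the example names and the `normalize` helper are hypothetical, and the real vectors would come from the species key and the tip labels of each tree file.

```r
# Hypothetical names for illustration only
key_species <- c("D. Baimaii", "D. kikkawai", "D. leontia")
tree_tips   <- c("D_kikkawai", "d_leontia", "D_bocki")

# Put both sets into one format: lower case, and every run of
# characters that is not a letter or digit becomes a single "_"
normalize <- function(x) gsub("[^a-z0-9]+", "_", tolower(trimws(x)))

key_norm  <- normalize(key_species)
tips_norm <- normalize(tree_tips)

# Names in the key with no matching tree tip, and the reverse
missing_from_tree <- key_species[!key_norm %in% tips_norm]
missing_from_key  <- tree_tips[!tips_norm %in% key_norm]

print(missing_from_tree)
print(missing_from_key)
```

Names that differ only in formatting drop out of both lists, so whatever is left is genuinely missing from one side.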

ZLoeiu commented 4 years ago

The vignette I linked uses a genepred file of gene annotations to get its neutral model. Do we need this file too?

iamciera commented 4 years ago
  1. Force the species to match alignment names
  2. Make the neutral model based on one region
  3. Get to the conservation visualization on one region, no matter how weird the pipeline is, so we can use the visualization to see how the conservation changes depending on which neutral model we use.
  4. We will go back and build the neutral model from different regions to see how this affects the conservation model.
iamciera commented 4 years ago

December 13, 2019

phastweb: GUI that essentially mimics script and command line workflow.

Link: https://github.com/DiscoveryDNA/montium_5_TFBS_evolution/blob/master/R/phastcons.Rmd