ScienceParkStudyGroup / studyGroup

Gather together a group to skill-share, co-work, and create community
https://www.scienceparkstudygroup.info

Messy Workflow! #31

Closed Farmiloe3 closed 5 years ago

Farmiloe3 commented 6 years ago

I have been looking into sequence conservation across a number of gene promoters. I am interested specifically in the indels and substitutions. To do this analysis I joined multiple sequences (from the same species) into one long sequence, with 'spacer' sequences in between, and aligned this against the equivalent amalgamated sequence from another species as a single alignment. Using this alignment I can then either split it up again into the individual promoter sequences or 'code' it based on substitutions/indels to create a sort of heatmap.
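To make the 'coding' step concrete, here is a minimal Python sketch of one way it could look. It assumes the two aligned rows are already available as equal-length strings with `-` as the gap character. The function name `code_alignment` and the letters M/S/I/D are illustrative choices, not part of the existing pipeline.

```python
def code_alignment(ref, other, gap="-"):
    """Label every alignment column relative to the reference row.

    M = match, S = substitution,
    D = base present in ref but missing in other (deletion in other),
    I = base present in other but missing in ref (insertion in other),
    - = gap in both rows.
    """
    if len(ref) != len(other):
        raise ValueError("aligned rows must have the same length")

    codes = []
    for a, b in zip(ref.upper(), other.upper()):
        if a == gap and b == gap:
            codes.append("-")
        elif a == gap:
            codes.append("I")
        elif b == gap:
            codes.append("D")
        elif a == b:
            codes.append("M")
        else:
            codes.append("S")
    return "".join(codes)
```

The resulting string has one character per alignment column, so it can be sliced back into individual promoters or mapped to numbers for a heatmap-style plot.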

Currently I am having difficulty pulling out the alignments that I am interested in based on certain criteria. I am also aware that my workflow is incredibly unsophisticated and messy: I start at UCSC, move to bash, then to R, and finally to Excel.

It would be amazing to get some input into what I can do to clean up my pipeline and write some slightly more sophisticated code!

Thanks,

Grace

mgalland commented 6 years ago

Hello Grace! Sounds indeed like something we can help with.

Could you think of a minimal example, such as two promoters from two species? You could provide these two sequences as FASTA for your input (please attach the file to your comment).

I am having some difficulty imagining your desired output. Is there an example from the literature or from your own work that you could provide?

Thanks, Marc

Farmiloe3 commented 6 years ago

The promoters I'm looking at are all bound by a specific protein. I'm trying to look at the evolutionary relationship between these promoters and the protein that binds them. The hypothesis is that there will be a few important promoters that were put under selective pressure for increased or decreased binding by these proteins. I am using this alignment/screening process to try to identify a few key genes/promoters to then investigate in the wet lab.

I am most interested in the alignments that have an insertion/deletion where the protein binds to the DNA. I have attached an example of the alignments I have done. This one covers about 400 promoters and is an alignment between the human and chimpanzee sequences.

I have had a look in the literature but I have not been able to find examples similar to what I am trying to do...

Ideally I want two outputs: a list of the promoters that show the pattern I am interested in, and a visualisation of the alignment across all the promoters.
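The first output (the list of promoters) could be sketched roughly as below, in Python. This assumes the big alignment has already been split back into one aligned pair per promoter, and that the binding site of each promoter is known as a start/end position in alignment columns. Both the `alignments` and `sites` dictionaries and the function names are hypothetical, made up for illustration only.

```python
def has_indel_in_site(ref, other, site_start, site_end, gap="-"):
    """True if either aligned row has a gap in columns [site_start, site_end)."""
    window = zip(ref[site_start:site_end], other[site_start:site_end])
    return any(a == gap or b == gap for a, b in window)


def promoters_with_site_indel(alignments, sites):
    """Return names of promoters whose binding site overlaps an indel.

    alignments: dict of name -> (ref_row, other_row), aligned strings
    sites:      dict of name -> (start, end), 0-based, end-exclusive,
                in alignment-column coordinates (an assumption)
    """
    hits = []
    for name, (ref, other) in alignments.items():
        start, end = sites[name]
        if has_indel_in_site(ref, other, start, end):
            hits.append(name)
    return hits
```

Called with one entry per promoter, `promoters_with_site_indel` returns the candidate list in input order. Substitution-based criteria could be added in the same way by changing the test inside `has_indel_in_site`.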

I am not sure if I am making things overly complicated for myself. Hopefully this makes some sense but I can explain more clearly on Tuesday?

Thanks, Grace

Align_100418_fa.txt