Adding peptide coverage information

brennanpincardiff / drawProteins

Creating package to draw proteins from Uniprot API

Other

33 stars 17 forks source link

Adding peptide coverage information #18

Open MiguelCos opened 4 years ago

MiguelCos commented 4 years ago

Hello. I am very glad to know about this package.

I am working on a way to represent peptide coverage over a protein sequence (i. e. represent peptides identified by MS over the whole protein). I think this could be a very good implementation for this package.

brennanpincardiff commented 3 years ago

Thanks for the comment Miguel and sorry for the slow response. Did you make any progress on doing this? Do you have some examples that you could share with me to help? If so, I would love to hear about it. Best wishes, Paul

MiguelCos commented 3 years ago

Hello Paul, I have indeed done some small progress on this front.

I have written a couple of functions using a similar sintax and that output a similar data structure to the one generated from the drawProteins package when extracting the information from uniprot.

My idea was to be able to use drawProteins functions to extract the protein features and plot the structural/PTM/processing information, and to be able to add peptide coverage information (i.e. coming from a MaxQuant search output) on top of that ggplot2 object.

So I wrote getcoverage, that takes peptides and ProteinID information from MaxQuant and transform it into a data format similar to the one to drawProteins; add_peptides which add these peptides per protein as 'features' on the drawProteins output dataframe; and draw_peptides which work similar to your draw_* functions, but adding the peptide features over the protein.

This is still in development and not polished but the functions and a sample script can be found in this repo.

https://github.com/MiguelCos/peptide_to_protein_mapping

Honestly we are still working on some ideas for implementations on specific types of data, to finally decide what we would do with these functions that I've written, but by the looks I believe that the getcoverage, drawPeptides and the add_peptides functions might be good candidates for a PR to drawProteins, as functionalities to plot peptide coverage over proteins.

I would be glad to hear your opinion.

Best wishes, Miguel

brennanpincardiff commented 3 years ago

Thanks for sharing your link, Miguel. Your code looks interesting and very compatible with drawProteins. Do you have some sample data that we can try to use to generate some sample illustration, please? If you could add that to the data folder in your Github repo, I would like to have a play to see if we can make it all work together. It's probably worth creating a coverage_devel fork within drawProteins to develop the code. How does that sound? Best wishes, Paul

MiguelCos commented 3 years ago

Hello Paul,

Sorry for the late reply, yes, I would be happy to share two sample inputs from MaxQuant for you to test the functions in our scripts. Sadly they are too heavy to be loaded through GitHub but I am sending a sharable link so you can get the data.

https://1drv.ms/u/s!ArJgSkDejwROluhgUlGg3zYku-ygHQ?e=EXcHsE

It would be very interesting for us to be able to include these new functions in the drawProteins package. To be honest, I am not completely used to collaborative development on GitHub but this would be a great chance to learn. Would I be able to create this fork then? Or should you do it and then I should use that fork as a remote to continue the development?

Many thanks in advance, Miguel

brennanpincardiff commented 3 years ago

Hi Miguel, You can create the fork and it's worth your while having a go... The method depends a little on how you are writing code - I use R-Studio - do you? In the first instance you would be working on your own version of drawProteins. When it works, you can create a pull request... If you're uncomfortable with that, I can create a fork... your choice... I've taken the first 5000 peptides from your data set and uploaded it to Github - I hope that is OK. The script below looks like the start of the a script to explore your peptides and turn them into something that could be plotted with drawProteins. How do you think this should be visualised. Do you have something in mind? There is more to do but this is a start... That's enough for now, Paul

# pull down sample file with peptides...
library(dplyr)
link <- "https://raw.githubusercontent.com/brennanpincardiff/peptide_data/main/peptides_5000.txt"
peptides <- read.delim(link, header = TRUE, sep = "\t")

# need list of proteins
prot_list <-  dplyr::distinct(peptides, Leading.razor.protein)

# select one of the proteins
prot1 <- prot_list[1,1]

# filter the peptides of this one protein
pep1 <- filter(peptides, Leading.razor.protein == prot1)

# create the data frame for plotting the peptides
# select columns - need begin, end, length, accession, order
pep1_dP <- select(pep1, Start.position, End.position, Leading.razor.protein)
colnames(pep1_dP) <- c("begin", "end", "accession")
pep1_dP$order <- 1
pep1_dP$type <- "SEQ"

# this looks OK.

# download the peptide length and put the peptides on this. 
library(devtools)
devtools::install_github("brennanpincardiff/drawProteins")

library(drawProteins)

# download information
drawProteins::get_features(prot1) -> features
drawProteins::feature_to_dataframe(features) -> prot_data

# show in console
head(prot_data[1:4])

## ----using_draw_canvas, fig.height=3, fig.wide = TRUE-------------------------
draw_canvas(prot_data) -> p
p

## ----using draw_chains, fig.height=3, fig.wide = TRUE-------------------------
p <- draw_chains(p, prot_data)
p

## let's add some peptide information...

p + ggplot2::geom_rect(data = pep1_dP,
                        mapping=ggplot2::aes(xmin=begin,
                                        xmax=end,
                                        ymin=order-0.2,
                                        ymax=order+0.2),
                       fill = "white",
                       colour = "black")

brennanpincardiff commented 3 years ago

So, now I have a longer script that explores some of the possibilities. The script below allows an offset (hard coded) and colours the peptides by score.

It only allows one protein to be shown at a time but this could be altered.

It creates this image...

Feedback please about the best way to develop this... I will probably write a blog post with this script too.. Best wishes, Paul

# pull down sample file with peptides...
library(dplyr)
link <- "https://raw.githubusercontent.com/brennanpincardiff/peptide_data/main/peptides_5000.txt"
peptides <- read.delim(link, header = TRUE, sep = "\t")

# need list of proteins
prot_list <-  dplyr::distinct(peptides, Leading.razor.protein)

# select one of the proteins
prot1 <- prot_list[2,1]

# filter the peptides of this one protein
pep1 <- filter(peptides, Leading.razor.protein == prot1)

# create the data frame for plotting the peptides
# select columns - need begin, end, length, accession, order
pep1_dP <- select(pep1, Start.position, End.position, Leading.razor.protein, Score)
colnames(pep1_dP) <- c("begin", "end", "accession", "score")
pep1_dP$order <- 1
pep1_dP$type <- "SEQ"

# this looks OK.
# we might be able to use score but not right now...

# next plan to draw the chain and then can add the peptides on top...
# other options might be to put them above the screen...

library(devtools)
devtools::install_github("brennanpincardiff/drawProteins")

library(drawProteins)

# download information
drawProteins::get_features(prot1) -> features
drawProteins::feature_to_dataframe(features) -> prot_data

# show in console
head(prot_data[1:4])

## ----using_draw_canvas, fig.height=3, fig.wide = TRUE-------------------------
draw_canvas(prot_data) -> p
p

## ----using draw_chains, fig.height=3, fig.wide = TRUE-------------------------
p <- draw_chains(p, prot_data)
p

## let's add some peptide information...

p + ggplot2::geom_rect(data = pep1_dP,
                        mapping=ggplot2::aes(xmin=begin,
                                        xmax=end,
                                        ymin=order-0.2,
                                        ymax=order+0.2),
                       fill = "white",
                       colour = "black")

# I would like to offset the peptides.

pep1_dP$order <- 1.4

p + ggplot2::geom_rect(data = pep1_dP,
                       mapping=ggplot2::aes(xmin=begin,
                                            xmax=end,
                                            ymin=order-0.05,
                                            ymax=order+0.05,
                                            fill = score),
                       colour = "black") +
    scale_fill_gradient(low = "white", high = "black") +
    annotate("text", x = -50, y = 1.4, label = "Peptides") +
    theme_bw(base_size = 20) + # white background
        theme(panel.grid.minor=element_blank(),
              panel.grid.major=element_blank()) +
        theme(axis.ticks = element_blank(),
              axis.text.y = element_blank()) +
        theme(panel.border = element_blank())

MiguelCos commented 3 years ago

Hello Paul,

I really like what you managed to do with the visualization of the peptides. My current attempt looks like this (including drawProteins in the workflow) which is not particularly nice:

I like to be able to visualize the peptide sequence but I think that in general, it looks way better to just show the coverage itself. At some point, we would actually like to visualize things like c-terminal and n-terminal cleavage window, so it would be interesting to have the option to put peptides on different 'orders'. That's why I included that info into the getcoverage function.

I was getting the coverage data both from the proteinGroups file and the peptides.txt file, but now it doesn't seem to make much sense. The peptide.txt file from MQ should be enough, as you are showing here.

One of the final goals for us with these scripts and functions would be to visualize n-terminomics/degradomics data. For example, it would be nice to have a way to visualize if there is some level of offset between a particular n-terminal peptide and an annotated/canonical processing site (i.e. processing of signal peptide as annotated in Uniprot). It would be awesome to be able to detect potential non-annoated/non-canonical processing sites. After managing to integrate that we would actually love to include quantitative information, let's say, fold-change information per-peptide after DE analysis. The implementation of this would work the same way as the score does, but we would need to use a different input.

We are currently working on producing the sample n-terminomics data that we would finally be analyzing with this series of tools.

I have to say that your solutions to extracting the interesting features from the data and format them in drawProteins format are way more straightforward than mine.

I will go on and fork this repo and continue working on the functions from there and would send a pull request at least when I have the coverage functions ready.

Many thanks for your input and collaboration.

Best wishes, Miguel