Post processing protein groups in peptide_summary_intensity file

antortjim commented 6 years ago

Hi there!

Thanks for developing moFF; I am finding it extremely useful in my project. I was wondering, however, whether you have any recommendations on how to post-process the moFF output given in the peptide summary so that no protein id appears in more than one protein group:

It can (and should) appear in several rows because several peptides are mapped to it.
But always accompanied by the same protein ids.

In other words, if we list the unique protein groups from the peptide summary file with

cut -f 2 peptide_summary_intensity.tab | tail -n +2 | sort | uniq -c

each protein id should appear in only one group.
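The shell pipeline above counts how often each group string occurs in the second column; checking the property itself means looking at the individual ids inside each group. A minimal sketch of such a check follows. It assumes that the second column holds the protein group and that the ids within a group are separated by ";". The separator, the column names in the example data and the function name `ids_in_multiple_groups` are all illustrative assumptions, not part of moFF.

```python
import csv
import io
from collections import defaultdict


def ids_in_multiple_groups(handle, group_col=1, sep=";"):
    """Map each protein id found in more than one distinct group to those groups.

    Assumption: column `group_col` (0-based) holds the protein group and
    `sep` separates the ids inside a group. Adjust both to the real file.
    """
    groups_by_id = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row, like `tail -n +2`
    for row in reader:
        if len(row) <= group_col:
            continue  # ignore short or empty lines
        group = frozenset(p.strip() for p in row[group_col].split(sep) if p.strip())
        for protein_id in group:
            groups_by_id[protein_id].add(group)
    return {pid: groups for pid, groups in groups_by_id.items() if len(groups) > 1}


# Made-up example data; the header names are hypothetical.
example = io.StringIO(
    "peptide\tprot\tsumIntensity_run1\n"
    "PEPA\tA\t10\n"
    "PEPB\tA;B\t20\n"
    "PEPC\tA;B\t30\n"
    "PEPD\tD\t40\n"
)
conflicts = ids_in_multiple_groups(example)
print(sorted(conflicts))  # ids that violate the "one group per id" property
```

An empty result would mean the file already has the desired property. For a real file, pass `open("peptide_summary_intensity.tab", newline="")` instead of the in-memory example.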

The solution produced by MaxQuant (proteinGroups.txt) also exhibits this property, so I think it would make sense for moFF to offer the possibility of further processing the protein groups using user-specified criteria.

I understand there are several ways to do this, all using different interpretations of Occam's razor. For example, the following situation:

Protein group	Protein IDs
1	A
2	A, B
3	A, B
4	A, B
5	A, B, C

could be solved by:

merging all 5 and assigning them Protein ID A only, ignoring B and C.
merging all 5 and assigning them IDs A and B.
merging the first 4 with IDs A and B and leaving the 5th alone.
etc
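As an illustration of the first kind of option, here is a minimal sketch that merges every set of groups connected through a shared id and then relabels all of them with either the ids common to all merged groups (giving A in the example) or with all ids seen (giving A, B, C). This is only one possible reading of Occam's razor. It is not what moFF or MaxQuant does, and the function name `merge_groups` and its `label` parameter are made up for this example.

```python
def merge_groups(groups, label="intersection"):
    """Merge groups that share any id and relabel every input group.

    label="intersection": keep only ids present in every merged group
                          (falls back to the union if nothing is shared).
    label="union":        keep every id seen in the merged groups.
    Returns one sorted id list per input group, in input order.
    """
    groups = [frozenset(g) for g in groups]
    parent = list(range(len(groups)))  # union-find forest over group indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # protein id -> index of the first group containing it
    for idx, group in enumerate(groups):
        for pid in group:
            if pid in first_seen:
                parent[find(idx)] = find(first_seen[pid])
            else:
                first_seen[pid] = idx

    members = {}
    for idx, group in enumerate(groups):
        members.setdefault(find(idx), []).append(group)

    merged = {}
    for root, component in members.items():
        shared = frozenset.intersection(*component)
        everything = frozenset.union(*component)
        merged[root] = shared if (label == "intersection" and shared) else everything

    return [sorted(merged[find(idx)]) for idx in range(len(groups))]


# The five groups from the example above.
example_groups = [{"A"}, {"A", "B"}, {"A", "B"}, {"A", "B"}, {"A", "B", "C"}]
print(merge_groups(example_groups))                 # all five relabelled as ['A']
print(merge_groups(example_groups, label="union"))  # all five relabelled as ['A', 'B', 'C']
```

Because groups are merged whenever they share an id, the output has the property asked for: no id ends up in two different groups. The price is that long chains of partially overlapping groups collapse into one large group, which is why the choice of rule should stay with the user.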

How would you go about this? Is there an implementation available that you could point us to? Thanks in advance!

Best regards, Antonio

Maux82 commented 6 years ago

Hi Antonio,

At the moment we do not implement any Occam's razor heuristic to get a unique protein id for shared peptides. I am aware that there are several ways to do that, and of course this must be a user choice. The idea behind moFF is that it is just one puzzle piece that can fit into a custom proteomics pipeline; it is not full proteomics software that covers the whole data analysis workflow (MS2 search, quantification at peptide level, quantification at protein level, label-free / labeling data, etc.).

To go from moFF peptide quantification to protein quantification, I can suggest two solutions:

Use the MSqRob R package: it handles data from the moFF peptide summary and performs robust statistical analysis at the protein level. Moreover, it has the function that you are looking for.
You can load moFF peptide intensities into MSnbase objects and perform the summarization at protein level using their methods. Here is some R code that you can use to load moFF results into MSnbase.

# moFF data into an MSnSet
library(MSnbase)

# read the moFF file into an MSnSet object
set = readMSnSet2(path, ecol = -c(1,2), sep = '\t')

# optional

# metadata for each sample; data frame with one row per sample
pd = data.frame(condition = ..., lab = ...)
rownames(pd) = sampleNames(set)

# metadata for each peptide; data frame with one row per peptide
fd = data.frame(contaminant = ...)
rownames(fd) = featureNames(set)

# add the metadata to the set object
set = MSnSet(exprs(set), fData = AnnotatedDataFrame(fd), pData = AnnotatedDataFrame(pd))

# robust summarisation
# replace `protein` with the name of the protein column in fData
protset <- combineFeatures(set, fun = "robust", groupBy = fData(set)$protein, cv = FALSE)



Cheers 
Andrea

antortjim commented 6 years ago

Hi Andrea

Thanks for your answer! I am indeed planning to combine moFF with MSqRob and perform peptide-based quantification :+1:

The function is actually very handy, thanks! It still returns groups with shared ids, though; the only difference is that the groups are always the same length.

I tried running the code snippet with MSnbase and I just changed:

protset <- combineFeaturesset,fun="robust", groupBy = fData(set)$protein(name of protein collumn in fData),cv = FALSE)

with

protset <- combineFeatures(set, fun="median", groupBy = fData(set)$Protein.IDs, cv = FALSE)

because there was a parenthesis missing.
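For readers unfamiliar with this step, a per-group median summarisation roughly amounts to grouping the peptide rows by their protein group and taking the median intensity per sample. The sketch below shows that idea in plain Python on made-up numbers. It is not MSnbase's implementation, it does not handle missing values, and the name `summarise_by_group` is hypothetical.

```python
from collections import defaultdict
from statistics import median


def summarise_by_group(rows, summary=median):
    """Collapse peptide-level intensities to one row per protein group.

    rows: iterable of (protein_group, [intensity per sample]) pairs,
          one pair per peptide; all intensity lists have the same length.
    """
    by_group = defaultdict(list)
    for group, intensities in rows:
        by_group[group].append(intensities)
    # zip(*peptides) regroups the values sample by sample
    return {
        group: [summary(sample) for sample in zip(*peptides)]
        for group, peptides in by_group.items()
    }


# Made-up peptide intensities for two samples.
peptide_rows = [
    ("A;B", [10.0, 12.0]),
    ("A;B", [14.0, 11.0]),
    ("A;B", [12.0, 16.0]),
    ("C", [5.0, 7.0]),
]
protein_level = summarise_by_group(peptide_rows)
print(protein_level)
```

Since the grouping key is the whole group string, two groups that share an id (for example "A" and "A;B") are summarised separately, which is why the group definitions need to be settled before this step.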

Thanks for your time.

Cheers Antonio

compomics / moFF