compomics / moFF

A modest Feature Finder (moFF) to extract MS1 intensities from Thermo raw file
Apache License 2.0

Post processing protein groups in peptide_summary_intensity file #28

Closed antortjim closed 6 years ago

antortjim commented 6 years ago

Hi there!

Thanks for developing moFF, I am finding it extremely useful in my project. I was wondering, however, whether you have any recommendation on how to post-process the moFF output given in the peptide summary so that no protein id appears in more than one protein group.

In other words, if we get the unique ids from the peptide summary file with

    cut -f 2 peptide_summary_intensity.tab | tail -n +2 | sort | uniq -c

each protein id should appear in only one group.
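The shell pipeline counts distinct group strings, not individual ids. A minimal Python sketch of the per-id check follows. It assumes the ids within a group are separated by commas, as in the example in this issue; the actual delimiter in a moFF file may differ. The function names are hypothetical and not part of moFF.

```python
from collections import defaultdict

def groups_per_protein(group_column, sep=","):
    """Map each protein id to the set of distinct protein groups it occurs in."""
    membership = defaultdict(set)
    for group in set(group_column):  # distinct group strings, like `sort | uniq`
        # Assumption: ids inside a group are separated by `sep`.
        ids = [p.strip() for p in group.split(sep) if p.strip()]
        for protein_id in ids:
            membership[protein_id].add(group)
    return dict(membership)

def shared_ids(group_column, sep=","):
    """Return the ids that appear in more than one distinct group."""
    membership = groups_per_protein(group_column, sep)
    return {p: sorted(g) for p, g in membership.items() if len(g) > 1}

# Second column of the peptide summary, one entry per peptide.
example = ["A", "A, B", "A, B", "A, B", "A, B, C"]
print(shared_ids(example))
```

An empty result would mean that every protein id belongs to exactly one group.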

The solution produced by MaxQuant (proteinGroups.txt) also exhibits this property, so I think it would make sense for moFF to offer the possibility of further processing the protein groups using user-specified criteria.

I understand there are several ways to do this, all using different interpretations of Occam's razor. For example, the following situation:

|                 | Protein ID |
| --------------- | ---------- |
| Protein group 1 | A          |
| Protein group 2 | A, B       |
| Protein group 3 | A, B       |
| Protein group 4 | A, B       |
| Protein group 5 | A, B, C    |

could be solved by:

  1. merging all 5 and assigning them Protein ID A only, ignoring B and C.
  2. merging all 5 and assigning them IDs A and B.
  3. merging the first 4 with IDs A and B and leaving the 5th alone.
  4. etc
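Option 1 can be sketched as a greedy set cover: repeatedly pick the id that explains the most still-unassigned groups. This is only one possible reading of Occam's razor, not something moFF or MaxQuant is known to implement. The function name and the alphabetical tie-breaking rule are assumptions made for illustration.

```python
def greedy_parsimony(groups):
    """Assign each protein group a single representative id (greedy set cover).

    groups: list of sets of protein ids, one set per protein group.
    Returns a list with one chosen id per group, in input order
    (None for a group that contains no ids).
    """
    assignment = [None] * len(groups)
    remaining = set(range(len(groups)))
    while remaining:
        # Count how many still-unassigned groups each id could explain.
        counts = {}
        for i in remaining:
            for protein_id in groups[i]:
                counts[protein_id] = counts.get(protein_id, 0) + 1
        if not counts:  # only empty groups are left
            break
        # Pick the id covering the most groups; ties are broken alphabetically.
        best = min(counts, key=lambda p: (-counts[p], p))
        for i in list(remaining):
            if best in groups[i]:
                assignment[i] = best
                remaining.discard(i)
    return assignment

example = [{"A"}, {"A", "B"}, {"A", "B"}, {"A", "B"}, {"A", "B", "C"}]
print(greedy_parsimony(example))  # ['A', 'A', 'A', 'A', 'A']
```

Options 2 and 3 would need a different rule, for example keeping every id that is shared by a minimum fraction of the merged groups.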

How would you go about this? Is there any implementation available you could point us to? Thanks in advance!

Best regards Antonio

Maux82 commented 6 years ago

Hi Antonio,

At the moment we do not implement any Occam's razor heuristic to assign a unique protein id to shared peptides. I am aware that there are several ways to do that, and of course this must be the user's choice. The idea behind moFF is to be just one puzzle piece that can fit into a custom proteomics pipeline; it is not a full proteomics software package that covers the whole data analysis workflow (MS2 search, quantification at peptide level, quantification at protein level, label-free / labeled data, etc.).

To go from moFF peptide quantification to protein quantification, I can suggest two solutions:

```r
library(MSnbase)

# moFF data into an MSnSet
# read the moFF file into an MSnSet object
set <- readMSnSet2(path, ecol = -c(1, 2), sep = "\t")

# optional
# metadata for each sample; data frame with one row per sample
pd <- data.frame(condition = ..., lab = ...)
rownames(pd) <- sampleNames(set)

# metadata for each peptide; data frame with one row per peptide
fd <- data.frame(contaminant = ...)
rownames(fd) <- featureNames(set)

# add the metadata to the set object
set <- MSnSet(exprs(set), fData = AnnotatedDataFrame(fd), pData = AnnotatedDataFrame(pd))

# robust summarisation
# replace `protein` with the name of the protein column in fData
protset <- combineFeatures(set, fun = "robust", groupBy = fData(set)$protein, cv = FALSE)
```



Cheers
Andrea

antortjim commented 6 years ago

Hi Andrea

Thanks for your answer! I am indeed planning to combine moFF with MSqRob and perform peptide-based quantification :+1:

The function is actually very handy, thanks! Though it still returns groups with shared ids, only they are always the same length.

I tried running the code snippet with MSnbase and I just changed:

    protset <- combineFeaturesset,fun="robust", groupBy = fData(set)$protein(name of protein collumn in fData),cv = FALSE)

to

    protset <- combineFeatures(set, fun="median", groupBy = fData(set)$Protein.IDs, cv = FALSE)

because some parentheses were missing.
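For readers outside R, here is a rough plain-Python sketch of what a per-group median summary computes: for every protein group string, the median peptide intensity in each sample. It is not the MSnbase implementation. The function name, the input layout and the handling of missing values (`None` is skipped) are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

def summarise_median(rows, n_samples):
    """Summarise peptide intensities to one row per protein group.

    rows: iterable of (protein_group, [intensity per sample]), one per peptide.
    Returns {protein_group: [median intensity per sample]}; None marks a
    sample with no observed value in that group.
    """
    by_group = defaultdict(list)
    for group, intensities in rows:
        by_group[group].append(intensities)

    summary = {}
    for group, peptide_rows in by_group.items():
        per_sample = []
        for j in range(n_samples):
            # Assumption: missing intensities are encoded as None and ignored.
            values = [r[j] for r in peptide_rows if r[j] is not None]
            per_sample.append(median(values) if values else None)
        summary[group] = per_sample
    return summary

peptides = [
    ("A", [1.0, 2.0]),
    ("A", [3.0, None]),
    ("A", [5.0, 4.0]),
    ("A, B", [7.0, 8.0]),
]
print(summarise_median(peptides, n_samples=2))
```

Because the grouping key is the whole group string, "A" and "A, B" stay separate rows. This matches the observation above that the summarised output still contains groups with shared ids.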

Thanks for your time.

Cheers Antonio