compomics / peptide-shaker

Interpretation of proteomics identification results
http://compomics.github.io/projects/peptide-shaker.html

Protein Inference problem in SC quantitation #372

Open knrhus opened 5 years ago

knrhus commented 5 years ago

Hi! I am quite new to PeptideShaker, so I am still not familiar with all the available options and, more importantly, with the methods the program uses to do certain things.

Right now I am struggling with data that contains many hits to related proteins or isoforms. As a result, peptides are often shared between different proteins, and I noticed that such peptides appear in the peptide section of each of those proteins (with a red square in the Protein Inference column).

I would like to do spectral counting, and I am not sure how such peptides are counted. Does the program do MS2 quantification using only unique peptides? Or does it include shared peptides, and if so, how is that done? I am afraid of summing the SC value of the same peptide several times and thereby introducing a large error into my data. If that is the case, is there any way to exclude redundant peptides from the quantitative analysis?

Thank you in advance for your response! ;)

mvaudel commented 5 years ago

Hi,

Apologies that this is not correctly documented; we will extend the documentation accordingly when time allows. When a peptide is shared between different groups, the spectral counts are distributed equally between the groups. E.g. a peptide ranked best and validated in ten spectra and found in two groups will give five PSMs to each group. This does not eliminate the problem of shared peptides, but I am not aware of any solution that does. And I don't think that it is the main shortcoming of spectral counting.
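To make the equal split concrete, here is a minimal Python sketch of the rule described above. It is only an illustration, not PeptideShaker's actual code: the function name `distribute_spectral_counts`, the input layout, and the group names are hypothetical, and the second peptide in the example is invented to show how shared and unique peptides add up.

```python
from collections import defaultdict


def distribute_spectral_counts(peptides):
    """Split each peptide's validated PSM count equally among its protein groups.

    `peptides` is an iterable of (validated_psm_count, group_ids) pairs.
    This input layout is an assumption made for the illustration.
    """
    counts = defaultdict(float)
    for n_psms, groups in peptides:
        if not groups:
            continue  # a peptide mapped to no group contributes nothing
        share = n_psms / len(groups)  # equal share for every group
        for group in groups:
            counts[group] += share
    return dict(counts)


# The case from the explanation above: ten validated spectra, two groups.
# The second, unique peptide is hypothetical.
example = [
    (10, ["group_A", "group_B"]),
    (4, ["group_A"]),
]
print(distribute_spectral_counts(example))
# {'group_A': 9.0, 'group_B': 5.0}
```

With this rule the shared peptide's ten spectra are counted once in total (five per group) rather than ten times in each group.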

As a rule of thumb, spectral counting provides rough estimates, but if you want to do quantitative analyses, intensity-based quantification will offer better performance.

Hope this helps,

Marc

knrhus commented 5 years ago

Hi,

Thank you for your reply. I deliberately delayed my response, mostly because I wanted to get more familiar with the software and to read more about the principles underlying the statistics and validation. In fact, the supplementary note of your 2015 article (Nat. Biotechnol.) contained a lot of useful information about the software, including passages on SC quantitation.

I am dealing with relatively small samples (~50 proteins), so based on what I have read, the protein FDR is not a good parameter for controlling my data. However, my data sometimes behaves strangely, so maybe you have an idea why. To the point:

  1. I have been experimenting with my validation parameters for some time and noticed something unusual. When I load my project into PS with mostly default settings (FDR 1% on all levels), I get 61 proteins, all marked as "doubtful". That does not surprise me, as the data set is small. When I then change the protein FDR to 10%, I get 86 identifications, again all marked as "doubtful". However, if I switch back to 1% protein FDR, I get 61 proteins again, but now 44 of them carry the green "validation sign" and are marked as "confident", while 17 remain "doubtful". This is quite odd. I am aware that such actions do not change the value of my data, but maybe there is an explanation for why the software behaves like that.

  2. Another question: is there any difference between setting the parameters (i.e. FDR levels) before the analysis in PS (in Identification Settings) and during the analysis in PS (in the Validation Tab)? I am asking mostly because when I change my FDR levels in the Validation Tab and get newly validated proteins, the MS2 quant. for those proteins is 0 when I export the data to Excel.

  3. I also have one more question that relates more to my data than to the software, but maybe you could suggest an explanation. I am quite new to shotgun proteomics, so it is sometimes still hard for me to recognize which results are genuine. Within my validated data, at least 10% of the proteins are related to various types of immunoglobulin chains. I am working with snake venoms, and so far I have never seen similar proteins in any venom. The identified hits come from TrEMBL, and I found that these transcripts were sequenced during venom gland analysis. I know this question is not easy, but maybe hits from Ig-like proteins are normal in shotgun proteomics and are regarded as sample contamination. I would be grateful for any information about the presence of Ig-like hits in shotgun proteomics results, because at this stage I am not sure whether these hits deserve more attention, or whether they are simple contamination and that is common knowledge in the field.

Finally, I would like to thank you and the whole CompOmics group for the great tools (software, lecture videos, and tutorials!), thanks to which I went from an absolute beginner in shotgun proteomics to being able to analyze complex data with your tools in a relatively short time!

knrhus commented 5 years ago

Hi, I would just like to amend my second point. Everything actually seems to work correctly. My earlier impression was a coincidence: the proteins newly validated under the less stringent FDR had such low abundance that the exported MS2 quant. was 0. With other proteins this problem does not occur, so it was just my oversight. Sorry for the confusion!