Difference b/w protein, peptide and PSM FDR?

nattzy94 commented 4 years ago

Hi,

I would like to check my understanding of the 3 different levels of validation thresholds available. Say for example I set the FDRs for protein, peptide and PSM at 1%, 1%, and 5%. This will mean that for each PSM, if I have 5 out of 100 spectra matching to the decoy database, then that PSM will be classed as a FP and have a doubtful validation.

In turn, as the peptide FDR is at 1%, this will mean that it will allow at most 1 unvalidated PSM out of 100 to match. So if I had 2 PSMs matching to one peptide and 1 is validated and the other unvalidated, then my peptide will have a doubtful validation.

Finally at the protein level, this will mean that I can have at most 1/100 unvalidated peptides matching to my protein for it to be validated.

In my case, I am looking for small proteins in a MS dataset. As the experiment was not optimised for small proteins, my supervisor suggested that I could change the FDR to find more validated hits. In this case, it would not make sense to change the protein FDR without changing the PSM FDR. E.g. setting protein FDR at 10% while leaving PSM at 1%. Would it make more sense to change all of the FDRs to 10%?

Thank you!

hbarsnes commented 4 years ago

Hi,

I would recommend having a look at our tutorial material at http://compomics.com/bioinformatics-for-proteomics and especially chapter 1.5 about Peptide and Protein Validation.

You are pretty much correct with regard to the numbers above. However, there is an important distinction between Validated/Not Validated and Confident/Doubtful. The former refers to the statistical validation using a given FDR threshold, while the latter refers to the additional quality filters PeptideShaker uses to categorize the statistically validated proteins/peptides/PSMs, where matches passing the filters are labeled Confident and matches not passing the filters are labeled Doubtful. Note here that both Confident and Doubtful are thus (statistically) Validated.

The types of filters depend on the level. For example, at the PSM level the precursor mass deviation and peptide sequence coverage are checked, while at the peptide and protein levels there is a requirement to have at least two confident PSMs or peptides, respectively.
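A minimal Python sketch of the two-step labelling described above, for illustration only. It is not PeptideShaker code: the `Match` class, the function names, the score direction (higher is better) and the threshold value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Match:
    score: float          # assumption: a higher score means a better match
    passes_quality: bool  # outcome of the additional quality filters


def label(match: Match, score_threshold: float) -> str:
    """Return 'Not Validated', 'Doubtful' or 'Confident' for one match."""
    # Step 1: statistical validation against the score threshold
    # that corresponds to the chosen FDR.
    if match.score < score_threshold:
        return "Not Validated"
    # Step 2: quality filters split the validated matches in two.
    return "Confident" if match.passes_quality else "Doubtful"


def is_validated(match_label: str) -> bool:
    """Both Confident and Doubtful count as statistically validated."""
    return match_label in ("Confident", "Doubtful")


def peptide_passes_quality(psm_labels: list[str]) -> bool:
    """Peptide-level filter from the text: at least two confident PSMs."""
    return sum(1 for l in psm_labels if l == "Confident") >= 2
```

With a hypothetical threshold of 0.5, `label(Match(0.9, False), 0.5)` gives `"Doubtful"`, which is still counted as validated by `is_validated`.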

With regard to changing the FDR levels, I don't think there is one correct answer here. It will very much depend on your experimental setup and the research question you are trying to answer. Increasing the FDR threshold to 10% simply means that you have to accept a higher number of false positives, and it is up to you whether you are willing to accept this or not. The FDR is not the only quality check you can make use of though, so if you increase the FDR you will have to rely more on manual inspection of the results and less on the statistical validation.
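To illustrate the trade-off, here is a rough sketch of a generic target-decoy estimate, where the FDR of a score-ranked list is approximated as decoys divided by targets. This is a simplified textbook version, not PeptideShaker's actual procedure, and the function name and the toy hit list are invented for the example.

```python
def validated_at_fdr(hits, max_fdr):
    """hits: list of (score, is_decoy) pairs, higher score = better.

    Returns (n_targets, n_decoys) for the longest score-ranked prefix
    whose estimated FDR (decoys / targets) does not exceed max_fdr.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    best = (0, 0)
    targets = decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best = (targets, decoys)
    return best


# Toy data (made up): 70 target hits with 5 decoys among the lower scores.
hits = (
    [(s, False) for s in range(100, 50, -1)]
    + [(50.5, True)]
    + [(s, False) for s in range(50, 40, -1)]
    + [(40.5, True), (40.4, True), (40.3, True), (40.2, True)]
    + [(s, False) for s in range(40, 30, -1)]
)

print(validated_at_fdr(hits, 0.01))  # (50, 0)
print(validated_at_fdr(hits, 0.10))  # (70, 5)
```

On this toy list, going from 1% to 10% validates 20 more target hits, but the decoy count suggests that about 5 of the 70 accepted hits are expected to be false positives.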

About having small proteins: yes, you will probably find fewer peptides for these proteins, but if the peptides you do find are good unique matches, the protein FDR is not the most important factor.

The reliance on the statistical validation is more of an issue if you either have very few spectra or a small FASTA file. In such cases, the statistical validation more or less breaks down due to a lack of data and you are left with having to manually verify the identifications in other ways.

Best regards, Harald

nattzy94 commented 4 years ago

Hi Harald,

Thanks very much for the reply.

> The types of filters depend on the level. For example, at the PSM level the precursor mass deviation and peptide sequence coverage are checked, while at the peptide and protein levels there is a requirement to have at least two confident PSMs or peptides, respectively.

The filters here are the additional quality filters of PS and not the statistical validation right?

I've read about the validation criteria in the tutorial. What I get is that for PSM level FDR, PSMs are scored based on the similarity of the experimental spectrum to the theoretical spectrum of that peptide. An FDR can then be set at a PSM score threshold to accommodate false positives. My confusion is about peptide FDR level. For peptide level FDR, what score is used to set the threshold? I.e. what is the equivalent PSM score for peptides? Is there some kind of score that averages the PSM scores mapping to that peptide?

In my case, I have searched a large database of small proteins (2 mil. sequences) and PS outputs a list of 17 validated proteins (sorfdb-search-08may_56h_0_Default_Protein_Report.txt). However, none of the proteins have validated peptides in the "#Validated peptides" column.

In this case, is the validated peptide column here referring to statistical or PS' internal quality filter?
If it is referring to the statistical validation, then why is it that the proteins are still reported given that the peptides mapping to them are statistically invalid?

Sorry for the many questions and thanks for your patience!

hbarsnes commented 4 years ago

> The filters here are the additional quality filters of PS and not the statistical validation right?

Correct.

> For peptide level FDR, what score is used to set the threshold? I.e. what is the equivalent PSM score for peptides? Is there some kind of score that averages the PSM scores mapping to that peptide?

You can find all the details about how the scores are calculated in the supplementary material to the PeptideShaker paper, but yes, the peptide scores are a combination of the PSM scores for the PSMs mapping to the given peptide, in the same way that the protein scores are a combination of the peptide scores.

> In this case, is the validated peptide column here referring to statistical or PS' internal quality filter?

It is referring to the number of statistically validated peptides, i.e. including peptides labeled as both Confident and Doubtful.

Maybe you already know this, but if you click the icons in the Validation column you will get more information about both the statistical validation and the quality filters for the selected protein, peptide or PSM.

> If it is referring to the statistical validation, then why is it that the proteins are still reported given that the peptides mapping to them are statistically invalid?

Given that the statistical validation is done separately for each level (i.e. PSM, peptide and protein), and there can be very different numbers of elements at each level, you may see cases where a protein is validated even if none of its peptides are validated. Basically, the score required to be considered statistically validated will be different at each level.

You may think of it as having lots of peptide evidence that alone does not allow you to trust any of the individual peptides, but by summing up all of the peptides for a given protein there is enough evidence to support that the protein is most likely there.

That being said, the fact that none of the peptides are considered validated might be cause for concern, hence the labelling of the proteins as Doubtful. This does not mean that the protein is not there, just that care should be taken before trusting the evidence.
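A toy calculation can make the "summing up" idea concrete. The sketch below assumes each peptide has an error probability and that these are independent, so the chance that every peptide of a protein is wrong is their product. This is a simplification chosen for illustration; the actual score combination is described in the supplementary material of the PeptideShaker paper, and the numbers and thresholds here are hypothetical.

```python
from math import prod


def combined_error(error_probabilities):
    """Probability that ALL supporting matches are wrong (independence assumed)."""
    return prod(error_probabilities)


# Five weak peptides for one protein, each 30% likely to be wrong (made-up values).
peptide_errors = [0.3, 0.3, 0.3, 0.3, 0.3]

# Hypothetical thresholds; in practice each level gets its own threshold.
peptide_threshold = 0.05
protein_threshold = 0.05

validated_peptides = [p for p in peptide_errors if p <= peptide_threshold]
protein_error = combined_error(peptide_errors)
protein_validated = protein_error <= protein_threshold

print(len(validated_peptides))  # 0
print(protein_validated)        # True
```

None of the individual peptides passes, yet the combined error of roughly 0.0024 is well below the protein threshold, which mirrors a validated protein with zero validated peptides.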

nattzy94 commented 4 years ago

> You can find all the details about how the scores are calculated in the supplementary material to the PeptideShaker paper, but yes, the peptide scores are a combination of the PSM scores for the PSMs mapping to the given peptide, in the same way that the protein scores are a combination of the peptide scores.

Thanks! The supplementary material is really helpful.

> This does not mean that the protein is not there, just that care should be taken before trusting the evidence.

Yes, indeed the experimental protocol was not optimised for small proteins and I am using this as a practice dataset.

> Maybe you already know this, but if you click the icons in the Validation column you will get more information about both the statistical validation and the quality filters for the selected protein, peptide or PSM.

This is actually the problem I am facing, which is that the GUI doesn't work for me. I would like to manually inspect the proteins identified. In particular, I am following methods from a paper on small proteins where one of the criteria is that the protein should have a sequence tag of 5 consecutive b and y ions. Unfortunately, when I load the project file into PS, it takes a really long time to load. After loading, the GUI usually hangs and I am unable to inspect any single protein entry as I can't click on anything. This is probably to do with the large search database I used.

Is there a way to analyze the results in greater detail without opening the entire file? I am really only interested in looking at the 17 proteins that were identified from my spectra, and in particular in verifying whether the ions match.

Another separate request. Is there a way to export the Validation plots using the command line e.g. Score vs. Confidence plots that are shown on page 3 of the peptide/protein validation tutorial?

Thanks very much!

hbarsnes commented 4 years ago

> This is actually the problem I am facing which is that the GUI doesn't work for me. [...] This is probably to do with the large search database I used.

Ah, yes, in that case you will rather have to depend on the exports of the proteins, peptides and PSMs. We are however working on a beta version that should be better at handling large sequence databases, and better at memory handling in general. This should hopefully allow you to inspect your data in the GUI as well.

Would it be possible for you to share your data with us so that we can test it in the upcoming beta version and hopefully make it possible to view your data in the GUI?

> Is there a way to analyze the results in greater detail without opening the entire file

No, I'm afraid not. The exports will then be your only option.
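If the matched fragment ion numbers for a PSM can be obtained outside the GUI, the sequence tag criterion mentioned above (5 consecutive b and y ions) could be checked with something like the following sketch. The input format (plain lists of matched ion numbers per series) and the function names are assumptions, and whether the run must be present in either series or in both depends on the paper being followed, so that is left as a parameter.

```python
def longest_consecutive_run(ion_numbers):
    """Length of the longest run of consecutive integers in ion_numbers."""
    present = set(ion_numbers)
    longest = 0
    for n in present:
        if n - 1 not in present:  # n is the start of a run
            length = 1
            while n + length in present:
                length += 1
            longest = max(longest, length)
    return longest


def has_sequence_tag(b_ions, y_ions, min_length=5, require_both=False):
    """Check for a run of at least min_length consecutive matched ions.

    require_both=False accepts a run in either the b or the y series;
    require_both=True demands such a run in both series.
    """
    b_ok = longest_consecutive_run(b_ions) >= min_length
    y_ok = longest_consecutive_run(y_ions) >= min_length
    return (b_ok and y_ok) if require_both else (b_ok or y_ok)
```

For example, matched b ions 2 to 6 give a run of length 5, so `has_sequence_tag([2, 3, 4, 5, 6], [1, 3])` is true with the default settings.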

> Another separate request. Is there a way to export the Validation plots using the command line e.g. Score vs. Confidence plots that are shown on page 3 of the peptide/protein validation tutorial?

No, sorry, these are only available from the GUI. You will have to recreate them from the exports.
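As a starting point for recreating such a plot, the sketch below reads a tab-separated export and collects (score, confidence) pairs. The column names `Score` and `Confidence [%]` are assumptions and may not match the columns in a given report, so they are passed as parameters; check the header line of the actual export first. The sample text is made up.

```python
import csv
import io


def score_confidence_points(handle, score_col="Score",
                            confidence_col="Confidence [%]"):
    """Return sorted (score, confidence) pairs from a tab-separated export."""
    reader = csv.DictReader(handle, delimiter="\t")
    points = []
    for row in reader:
        try:
            points.append((float(row[score_col]), float(row[confidence_col])))
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with missing or non-numeric values
    return sorted(points)


# Made-up sample standing in for an exported report file.
sample = io.StringIO(
    "Main Accession\tScore\tConfidence [%]\n"
    "P1\t12.5\t100.0\n"
    "P2\t3.0\t80.5\n"
    "P3\tn/a\t50\n"
)
points = score_confidence_points(sample)
print(points)  # [(3.0, 80.5), (12.5, 100.0)]
```

With a real export, replace the sample with `open(path, newline="")` and pass the resulting pairs to a plotting library of your choice.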

hbarsnes commented 4 years ago

This issue should hopefully be fixed in the new release that is available now. If not, please let us know by opening a new issue.

compomics / peptide-shaker #404