QC Plots independent from GUI

antortjim commented 6 years ago

Hi peptideShaker team!

I would like to generate the QC plots from the Validation tab and customise them at will using the information exported in the peptideShaker built-in reports. I can easily retrieve the PSM/Peptide/Protein Confidence by looking at the respective Default reports. However, the search engine's score (MS-GF+ in this case) is not exported. For the PSM report, I get a column called Probabilistic PTM score, but it's empty (it's filled with NAs if I read it with R). However, the mzid files from MSGF+ do contain scores:

grep score file.mzid
<cvParam cvRef="PSI-MS" accession="MS:1002470" name="PeptideShaker protein group score" value="4.2409"/>
...

but I haven't found an easy way of getting the PSM scores out of the file.

My question is thus if there's plans to support the generation of these plots from the CLI using peptideShaker CLI, or otherwise, how to extract the required data for doing that. Is it normal that the Probabilistic PTM score column is empty?

Thank you so much beforehand, and have a nice weekend!!

Best regards,

Antonio

hbarsnes commented 6 years ago

Hi Antonio,

All the values shown in the graphical user interface should be availble for inclusion in the reports. Have you looked into creating your own custom reports? To get the search engine scores check out the "Identifiction Algorithm Results" section.

Best regards, Harald

antortjim commented 6 years ago

Hi Harald,

Thanks for your answer!

Right, there's an option in the "Identification Algorithm Results" to export the Algorithm Score. I am however getting strange behaviour:

If I select the features Protein(s) or Decoy together with the score, I get an error after clicking on the export button . This happens both with my PS project and the example project.

Moreover, the MS-GF+ matches have higher confidence the closer to 0 the score is. I am guessing that's just due to how its specific scoring system works. But then, why does the score increase with the confidence on the plot on the left? Some processing of the score must be performed right?

Due to these problems, I am not able to reproduce the plots in the Validation tab.

I am pretty sure I need to remove Decoy hits, but I think I cannot without the Decoy column to tell me which hits are decoy. Maybe there's no decoy hits actually?

I am using peptideShaker 1.16.26, right now on a Windows 7.

Thanks once again!

Sincerely,

Antonio

hbarsnes commented 6 years ago

Hi Antonio,

If I select the features Protein(s) or Decoy together with the score, I get an error after clicking on the export button . This happens both with my PS project and the example project

There is indeed an error here that has already been fixed for the next major release of PeptideShaker.

Moreover, the MS-GF+ matches have higher confidence the closer to 0 the score is.

The MS-GF+ scores we extract from the mzid file are in fact e-values, i.e the lower the better. You can find more details on how the search engines scores are converted into confidence in the supplementary material to the PeptideShaker publication: https://www.nature.com/articles/nbt.3109.

I am pretty sure I need to remove Decoy hits, but I think I cannot without the Decoy column to tell me which hits are decoy. Maybe there's no decoy hits actually?

There is a separate option at the top when creating a custom report that allows you to decide which matches to include. By default only validated hits are included and decoys are excluded.

Best regards, Harald

antortjim commented 6 years ago

Hi Harald,

Thank you so much!

There is indeed an error here that has already been fixed for the next major release of PeptideShaker.

Alright, then we will patiently wait until then :+1:

You can find more details on how the search engines scores are converted into confidence in the supplementary material to the PeptideShaker publication: https://www.nature.com/articles/nbt.3109.

I cannot find an equation in the supplementary material linking MS-GF+ E-value to scores, so I just processed them by taking the logarithm. Everything else I need to get the right score are just constants, so it's not important in this case.

There is a separate option at the top when creating a custom report that allows you to decide which matches to include. By default only validated hits are included and decoys are excluded.

That's great! But since I cannot export the Decoy or Protein(s) features, I cannot know which hits are decoy if I export them together. That means, if I am correct, that I cannot generate the Target vs. Decoy or FDR vs. Coverage plots. I can attempt to generate the Score vs. Confidence by doing as explained above and using the score of the first Validated/Doubtful hit as a proxy for the intercept of the red line. Really looking forward to get the bug fixed soon! :smile:

This is the result I get. I am close but not quite there :stuck_out_tongue:

plot1_ps plot1_r

I don't think it's possible to share a custom report configuration? Below is the R code that creates the second image:

Thank you for your dedication

Best regards,

Antonio

library("ggplot2")
library("dplyr")
library("stringr")
## Read PS report
#####################################
filename <- "Custom_PSM.txt"
custom_report <- read.table(
  filename,
  stringsAsFactors = F, sep = "\t", fill=T, comment.char = "#", skip=1
)

column_names <- strsplit(readLines(filename,n = 1), split = "\\t") %>% unlist
column_names <- column_names %>% gsub(pattern = "\\#", replacement = "")
colnames(custom_report) <- column_names

custom_report[, "Algorithm Score"] <- custom_report[, "Algorithm Score"] %>%
  str_match(string = ., pattern = "MS-GF\\+ \\((\\d\\.\\d*.*)\\)") %>% .[,2] %>% as.numeric
custom_report <- custom_report %>% rename(score = `Algorithm Score`, confidence = `Algorithm Confidence [%]`)
custom_report <- custom_report[, -(c(1, ncol(custom_report)))]

## Process MS-GF+ scores
#####################################
custom_report$score <- -log(custom_report$score)
custom_report$score <- custom_report$score + 100 - max(custom_report$score)

## Find the score of the first validated hit.
## This value is the threshold set by PeptideShaker
custom_report <- custom_report %>% arrange(score)
xintercept <- custom_report[which(custom_report$Validation != "Not Validated")[1],"score"]

ggplot(data = custom_report %>% filter(`Identification Charge` == "2+"), aes(x=score, y=confidence)) + geom_line(col="blue") +
  scale_y_continuous(limits = c(0, 100), breaks = seq(0,100,by=5)) +
  scale_x_continuous(limits = c(0, 100), breaks = seq(0, 100, by=10)) +
  geom_vline(xintercept = xintercept, color="red", size=1) +
  labs(x="Score", y = "Confidence  [%]") +
  ggtitle("Score vs. Confidence") +
  theme_bw()

ggsave("plot1_R.png")

hbarsnes commented 6 years ago

Hi Antonio,

Just released PeptideShaker v1.16.27 which should fix the problem you experienced with the algorithm details export.

Best regards, Harald

antortjim commented 6 years ago

Hi Harald,

Great! Just tried it with the example dataset and it works. Let me check if I can generate the figures with this!!! :smile:

Thanks for looking into it so fast!

Best

Antonio

antortjim commented 6 years ago

Hi Harald,

The data I need is definitely there in the custom report I can now generate with the new PeptideShaker 1.16.27. However, I still don't fully understand how the figures in the Validation tab are generated:

How is the MS-GF+ score transformed into a PS score?

I am taking the negative logarithm of the scores, but there must be more to it as the score distributions I get are different from the ones created by PS.

In other words, what's f below??

PS Score = f(MS-GF+ score)

For example, in the screenshots below you can see the result of the PSM step for one of the spectra in my PS project. Several peptide candidates are found, all with score 0. However, one is selected and then the PSM gets score 100. I am guessing this has to do with the fact that MS-GF+ scores are in the order of the 10^-2. The max score in the project is 0.021 for example. Maybe PS rounds to the first unit in the display and thus all the scores shown are 0 (but really slightly > 0)?

I guess what is going on is that PS (I) internally saves the decimal digits, (II) makes the transformation, and (III) displays the transformed score in the first table in the Peptide Spectrum Matches table.

I really wonder what this transformation is. I have searched in old issues, in the suppl. information of the PS paper and in the Bioinformatics tutorial, and I yet haven't found how it is done. I would really appreciate if you could share it ;)

Thank you beforehand!!

Best, Antonio

issue322_2 issue322_1

hbarsnes commented 6 years ago

Hi Antonio,

Given that Marc answered your last question (https://github.com/compomics/peptide-shaker/issues/213), I will now close this issue. But please don't hesitate to open a new one if you have more questions.

Best regards, Harald

compomics / peptide-shaker

QC Plots independent from GUI #322