Add search engine score summaries

jpfeuffer commented 2 years ago

Based on idXMLs.

Histograms per search engine:

one series per file
always targets (dark color) and decoys (light color)
use e.g. specEvalue for MSGF and xcorr for comet

Histogram per file (e.g. dropdown menu) or histogram after merging

one series per search engine, after Percolator/IDPEP
always targets (dark color) and decoys (light color)
use PEP or PP as score

Either: bar plot of the number of matching search engine IDs per PSM

Or: scatter plot of the number of matching search engine IDs per PSM versus the best PEP/PP score
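A minimal sketch of the target/decoy split described above, assuming the PSMs have already been read from the idXMLs into plain dicts. The keys `file`, `target_decoy` and `score` are hypothetical names chosen for illustration, and treating everything that is not labelled exactly `decoy` as a target is an assumption, not a rule taken from this thread.

```python
from collections import defaultdict


def split_targets_decoys(psms):
    """Group PSM scores into {file: {"target": [...], "decoy": [...]}}."""
    series = defaultdict(lambda: {"target": [], "decoy": []})
    for psm in psms:
        # Assumption: only hits labelled exactly "decoy" go to the decoy series;
        # any other label (e.g. "target+decoy") is counted as a target.
        label = "decoy" if psm["target_decoy"] == "decoy" else "target"
        series[psm["file"]][label].append(psm["score"])
    return dict(series)


# Hypothetical records standing in for hits parsed from two idXML files.
example_psms = [
    {"file": "run1.idXML", "target_decoy": "target", "score": 2.8},
    {"file": "run1.idXML", "target_decoy": "decoy", "score": 0.9},
    {"file": "run2.idXML", "target_decoy": "target+decoy", "score": 3.1},
]
series = split_targets_decoys(example_psms)
```

Each file then yields two series, which could be drawn in a dark and a light shade of the same color.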

WangHong007 commented 2 years ago

Hi Julianus! @jpfeuffer An example is here: pmultiqc/multiqc_report.html

Search scores summary plot

Using mean Comet xcorr or MSGF SpecEvalue (the mean of each file)

PEPs summary plot

Using the mean of PEPs for both search engines (the mean of all the files)

Identified PSMs number plot

Question:

Barplot on number of matching search engine IDs per PSM

I haven't found any quantitative PSM information in the idXMLs yet. Does this need to be counted manually? If so, which piece of information should be counted, for example the frequency of each sequence in the PeptideHits?

jpfeuffer commented 2 years ago

Hi! Thanks a lot, that is a good start.

1) How hard would it be to do a histogram, as we did for most of the other diagrams? I think the mean is too uninformative. Maybe we could have the files for selection up there (where Comet and MSGF are now), and then have different plots for the search engines.

2) Same comment.

3) Here I meant more the agreement over different search engines. This could be done by checking the "support" metavalue in the idXML after consensusID. Again, a histogram would be better.

WangHong007 commented 2 years ago

Got it, I will also give an example when I finish it

WangHong007 commented 2 years ago

Hi Julianus! One more thing to confirm: do you want to use the Histogram class to count and plot different ranges of search scores and PEPs? Its current function is to count and plot data from specific ranges or values. In addition, plotting the search score or PEP for each PSM results in flat images, where the toolbox functions are disabled.

jpfeuffer commented 2 years ago

Hi! Yes, the histogram should support arbitrary ranges with a start value, an end value and a number of bins. We then have to find a good range for the search scores. Flat images are OK for now.
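A sketch of what "start value, end value and number of bins" could look like in plain Python. The function name `histogram` and its signature are assumptions made for illustration; this is not the existing Histogram class in pmultiqc.

```python
def histogram(values, start, end, n_bins):
    """Count values into n_bins equal-width bins covering [start, end]."""
    width = (end - start) / n_bins
    counts = [0] * n_bins
    for v in values:
        if v < start or v > end:
            continue  # values outside the range are ignored in this sketch
        # A value equal to `end` is placed in the last bin.
        idx = min(int((v - start) / width), n_bins - 1)
        counts[idx] += 1
    edges = [start + i * width for i in range(n_bins + 1)]
    return edges, counts


# Example: PEP-like values binned into 4 bins between 0 and 1.
edges, counts = histogram([0.0, 0.05, 0.5, 0.99, 1.0, 1.5], 0.0, 1.0, 4)
```

Here `edges` has one more entry than `counts`, so bin `i` spans `edges[i]` to `edges[i + 1]`.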

WangHong007 commented 2 years ago

A new example is here: pmultiqc/multiqc_report_6.html, with three sections. So far I haven't found any examples of using MultiQC to draw histograms, only bar graphs; the current Histogram class is also used to draw bar graphs. The range of search scores and PEPs is temporarily start=0, end=1, step=0.2.

jpfeuffer commented 2 years ago

Hi. I think a histogram is basically a bar graph, so that's fine. I would just use many more bins (around 100?), which means decreasing the step size. And maybe we can reduce the space between the bars a little, so that they are a bit closer to each other. Maybe that happens automatically if we increase the number of bins.

WangHong007 commented 2 years ago

Hi Julianus! The step size I adopted was 0.02, because 100 bars would lead to flat images, and they could not reflect the quantity information. Then I stacked the bars so that they looked closer together. An example with three sections: pmultiqc/multiqc_report.html

jpfeuffer commented 2 years ago

Yes, this is better. I think stacking is fine. But I think you need to adapt start and end for Xcorr and SpecEvalue. For SpecEvalue, -log10(SpecEvalue) is probably a better value to plot. These scores can have much broader ranges than 0-1.
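A small sketch of the -log10 transform suggested here. The `floor` parameter is an assumption added so that a SpecEvalue of exactly 0 does not raise an error; its default value is arbitrary.

```python
import math


def neg_log10(spec_evalue, floor=1e-300):
    """Return -log10(spec_evalue), clamping the input at `floor` (assumed guard)."""
    return -math.log10(max(spec_evalue, floor))


# Smaller SpecEvalues (better matches) map to larger transformed scores.
transformed = [neg_log10(v) for v in (1.0, 1e-5, 1e-10)]
```

After the transform, larger values mean better matches, so the histogram reads in the same direction as xcorr.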

WangHong007 commented 2 years ago

Got it.

WangHong007 commented 2 years ago

Hi Julianus! A new example is here: report. In this section, I use -lg(SpecEvalue) for MSGF+ and |xcorr| for Comet. The range of -lg(SpecEvalue) is start=-1, end=inf, step=0.1. Do I need to remove the first few empty bars (SpecEvalue >= 10) from the new bar plot?

jpfeuffer commented 2 years ago

Nice. But I kind of thought about adapting the end in the sense that the bins would then cover a larger range. What is the maximum value of the respective scores? It should be much higher. It does not make sense to show a huge bar for the last bin.

WangHong007 commented 2 years ago

Xcorr takes values greater than 0. For its histogram I use the range start=0, end=5, step=0.1.

SpecEvalue has a range of 0 to 1, so its negative logarithm has a range of 0 to infinity. For its histogram I use the range start=0, end=inf, step=0.4.
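One way to handle a score without an upper bound is to pick a finite end and collect everything at or above it in a single, explicitly labelled overflow bin. The sketch below assumes that approach; the function name `binned_counts` and the label format are hypothetical, and the numbers in the example are not the ranges used in the report.

```python
import math


def binned_counts(values, start, end, step):
    """Count values into fixed-width bins plus one '>= end' overflow bin."""
    n_bins = round((end - start) / step)
    counts = [0] * (n_bins + 1)  # the last slot is the overflow bin
    for v in values:
        if v < start:
            continue  # values below the range are ignored in this sketch
        if v >= end:
            idx = n_bins
        else:
            idx = min(int((v - start) / step), n_bins - 1)
        counts[idx] += 1
    labels = [f"{start + i * step:g} - {start + (i + 1) * step:g}" for i in range(n_bins)]
    labels.append(f">= {end:g}")
    return dict(zip(labels, counts))


# Example with a deliberately small range so the overflow bin is visible.
result = binned_counts([0.3, 1.9, 2.0, 7.5, math.inf], start=0, end=2, step=1)
```

If the overflow bin ends up much taller than the others, that is a hint that `end` should be raised.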

jpfeuffer commented 2 years ago

Hi! That looks fine for a start. Maybe we have to adjust the ranges later, after we have seen some more data. Please make sure that they are very easy to edit, maybe with a global constant in the module, e.g. XCORR_HIST_RANGE = (x, y).
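A sketch of how such module-level constants might look. Only the name XCORR_HIST_RANGE comes from this thread; the other names, the (start, end, step) tuple layout and the `hist_range` helper are assumptions. The numbers are the ranges tried so far in this discussion.

```python
import math

# (start, end, step) per score; edit these in one place when more data is seen.
XCORR_HIST_RANGE = (0.0, 5.0, 0.1)
SPEC_EVALUE_HIST_RANGE = (0.0, math.inf, 0.4)  # applies to -log10(SpecEvalue)
PEP_HIST_RANGE = (0.0, 1.0, 0.02)

# Hypothetical lookup table keyed by a short score name.
HIST_RANGES = {
    "xcorr": XCORR_HIST_RANGE,
    "spec_evalue": SPEC_EVALUE_HIST_RANGE,
    "pep": PEP_HIST_RANGE,
}


def hist_range(score_name):
    """Return (start, end, step) for a known score name, else raise KeyError."""
    try:
        return HIST_RANGES[score_name]
    except KeyError:
        raise KeyError(f"no histogram range configured for {score_name!r}") from None


start, end, step = hist_range("xcorr")
```

Keeping the values in one table means the plotting code never hard-codes a range.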

jpfeuffer commented 2 years ago

Other than that, can you upload a full report somewhere? So I can double-check the whole thing?

ypriverol commented 2 years ago

@WangHong007 I think you can make a proper PR that includes the examples of generation of the reports with the new changes.

WangHong007 commented 2 years ago

A new example is here: report. A global constant will be added. BTW, the Histogram class needs to be modified to accommodate these changes.

WangHong007 commented 2 years ago

@ypriverol Got it.

jpfeuffer commented 2 years ago

Yes, no problem with changing the Histogram class, as long as the old things still work and the class does not become too complicated.

jpfeuffer commented 2 years ago

By the way, by which logic did you do the consensus PSMs?

I think it is better if we call it: Number of agreeing search engines per PSM

And then just list the numbers: 1 or 2 (or, in the future, 3 or 4)
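A sketch of the counting step, assuming each PSM already carries an integer number of agreeing search engines. How that integer is derived from the "support" metavalue written by ConsensusID is not shown here and would need to be checked; the function name and the `n_engines` parameter are hypothetical.

```python
from collections import Counter


def count_agreeing_engines(agreeing_per_psm, n_engines=2):
    """Return {number of agreeing engines: number of PSMs} for 1..n_engines."""
    counts = Counter(agreeing_per_psm)
    # Always report every category, even if no PSM falls into it.
    return {n: counts.get(n, 0) for n in range(1, n_engines + 1)}


# Example: five PSMs, three of them identified by both search engines.
summary = count_agreeing_engines([1, 2, 2, 1, 2])
```

Raising `n_engines` to 3 or 4 later adds the extra categories without changing the plotting code.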

jpfeuffer commented 2 years ago

I can see it when you open the PR

bigbio / pmultiqc

Add search engine score summaries #62