Closed singjc closed 1 year ago
Thanks for the PR! I think there are many interesting ideas that you brought up.
Regaring the export of peptidoforms not part of the library: In the ideal world scenario, the spectral library would contain the coordinates for all peptidoforms of interest to achieve maximum sensitivity. With the DL-based predictors, this would be possible nowadays, but with the experimental libraries that we used before, the libraries have typically been sparse and site-localization has not always been very confident. For this reason, an assumption of IPF was to not trust the site-localization at all and to always infer this from the theoretical model. The peptidoforms that are then exported by default might differ from the original query site-localization but are typically better supported by the data. In some cases, we even export multiple peptidoforms and their scores for the same assay, with the idea that TRIC will sort this out. It's a bit convoluted in the "pyprophet export" code, but the idea is that TRIC would actually align over the peptidoform confidence instead of the peakgroup confidence.
So this was the idea behind that original implementation. I think your suggestion might make sense if we would use a complete library covering all options. In the original implementation, we just reported all results, let IPF and TRIC decide what is supported best by the data and then we would remove redundancy later on. It's important to note though, that the default export parameters try to prevent double assignment of the same peak group to different peptidoforms. That has frequently been a concern with the implementation, but with the measures taken, I have found that the approach to be quite robust.
How do you think about it? Would that modification make sense in your workflow? If yes, I think it would be great if the documentation and help could be extended to highlight the different usage scenarios.
Color Adjustments: Great, let's merge that. Would it be possible to put this in a separate PR?
Report pi0 Estimation: Same here, let's merge that.
Best regards, George
Thank you for your feedback George!
I agree, ideally we would want the library to contain all potential peptidoforms, and then infer which peptidoforms are best supported by data.
My thoughts for experiment generated libraries, was to only consider the peptidoforms that were experimentally identified, because as you mentioned the peptidoform coverage is generally sparse. But yeah I guess, these predicted in silico libraries, like DeepPhospho, would address this issue.
Sure, I can add to the openswath.org documentation to highlight these different scenarios. I guess the trade-off would be limiting your identifications based on prior experimental evidence, at the potential loss of peptidoforms that were not covered in the library that may actually be present.
I made a separate PR #102 for the color adjustment. I'll make a separate PR for the pi0 estimation report once #102 is merged.
Best,
Justin
Exporting TSV IPF Results
When exporting peptidoform IPF results, I noticed that a few of the theoretically generated forms (non-experimentally seen forms from the library) are being reported/exported. To my understanding, we would only want to report the forms that we actually have experimental evidence for that we identify in the library (i.e. from DDA data), and the theoretical forms are only used for the scoring and assessment of localization of peptidoforms to peak-group?
It also seems to report duplicate entries for some peptidoforms that have the same RT, but a different feature id from.
To fix these issues (if they are issues and not intented behaviour), I implemented a method to generate a UNIMOD_TO_CODENAME mapping table, which is just a mapping of the experimental UniMod entry peptide ID mapping to the corresponding CodeName entry peptide ID. I use this table in the
export
module to join the SCORE_IPF table on the UniMod peptide ID, instead of the CodeName peptide IDs. I wasn't sure where was best to put this table generation step, so I just put it in a separate module callededit-osw
. Ideally, I think this table should be created from the library generation step in OpenMS (OpenSwathAssayGenerator/Decoy Generator), but for now this is easier to implement. I also added a method to add a column to the PEPTIDE table that groups the sequence and peptidoforms, more so for downstream analyses, but is not necessary. I useNOTE: These methods (
create_unimod_codename_mapping
andcreate_peptidoform_group_mapping
) introduce two new dependencies/suggestions. The first dependency isfrom pyopenms import AASequence
required for handling the modified sequence strings conversions from UniMod to CodeName. The second is more of a suggested module, but not necessary. I use Dask to create a partioned DataFrame, which allows for parallel computation on the small subset data frames. If the user doesn't have this module, then it will just perform regulargroupby
andapply
serially.Run Command
Help
Example
Using the following export command:
pyprophet export --in merged_original.osw --out merged.tsv --format=legacy_merged --max_rs_peakgroup_qvalue=0.2 --ipf_max_peptidoform_pep=0.4 --max_global_peptide_qvalue=0.01 --max_global_protein_qvalue=0.01 --no-transition_quantification
If we look at a specific example sequence, IASPIQHEHDSGSR from the synthetic phospho dilution series data
Library
The library has two experimentally seen peptidoforms IAS(Phospho)PIQHEHDSGSR(Label:13C(6)15N(4)) and IASPIQHEHDS(Phospho)GSR(Label:13C(6)15N(4)), and then there is also a theoretical generated peptidoform IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4))
Current Export Results
If we look at the results that get exported, one of the experimentally seen forms IASPIQHEHDS(Phospho)GSR(Label:13C(6)15N(4)) is identified in run 2848806567456108792. The theoretical peptidoform IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4)) also get reported. Depending on what cutoff you use for m_score, some of these would probably get removed, so maybe it's not that big of deal? But there are a few cases where it might still pass through the cutoff as in 2848806567456108792, it gets an m_score of 0.009810127. As mentioned above, there are also duplciate entries of the same feature, from different assays; in 2848806567456108792 there is -4795326927976004004 and -7026778265845113779 mapping to the same RT.
Export Results Merged on UNIMOD_CODENAME_MAPPING
If we join the SCORE_IP table on the UniMod peptide ID instead, then we only get the experimentally seen peptidoform in the results. One caveat that comes up now, is the FullPeptideName that is returned is the one with the UniMod entry instead of the CodeName, which I guess would leave the user to have to manually look up what it responds to. If this is an issue, I could add a step to convert all UniMod to CodeName using AASequence.
Example of the UNIMOD_CODENAME_MAPPING
When generating the UNIMOD_CODENAME_MAPPING, if there is either no matching UniMod peptide ID to a corresponding CodeName peptideID or vice-versa, I simiply put a -1.
Example of the PEPTIDOFORM_GROUPING column in PEPTIDE table
Allowing Color Adjustment for Color Blind Friendly Reports
This is a minor addition to add color blind friendly color adjustment to the reports, as requested by #100
This change includes
color_blind_friendly()
method for extracting a pair of color blind friendly colors. There are 4 color palettes: 'normal'. 'protan', 'deutran', or 'tritan'. This introduces an extra argumentcolor_palette
for extracting one of the 4 color palette pairs, with the default being set to 'normal'.Example
Report and Message when pi0 Estimation fails
I extended the error messaged when the estimated pi0 <= 0, to show what the current lambda values were used. I also included a report pdf of the pvalue histogram used in the pi0 estimation, to give the user a better idea of why it may have failed.
Error: The estimated pi0 <= 0. Check that you have valid p-values or use a different range of lambda. Current lambda: [0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95]
Example