PyProphet / pyprophet

PyProphet: Semi-supervised learning and scoring of OpenSWATH results.
http://www.openswath.org
BSD 3-Clause "New" or "Revised" License
29 stars 21 forks source link

Fix IPF Export with Unimod to Codename Table #101

Closed singjc closed 1 year ago

singjc commented 2 years ago

Exporting TSV IPF Results

When exporting peptidoform IPF results, I noticed that a few of the theoretically generated forms (non-experimentally seen forms from the library) are being reported/exported. To my understanding, we would only want to report the forms that we actually have experimental evidence for that we identify in the library (i.e. from DDA data), and the theoretical forms are only used for the scoring and assessment of localization of peptidoforms to peak-group?

It also seems to report duplicate entries for some peptidoforms that have the same RT, but a different feature id from.

To fix these issues (if they are issues and not intented behaviour), I implemented a method to generate a UNIMOD_TO_CODENAME mapping table, which is just a mapping of the experimental UniMod entry peptide ID mapping to the corresponding CodeName entry peptide ID. I use this table in the export module to join the SCORE_IPF table on the UniMod peptide ID, instead of the CodeName peptide IDs. I wasn't sure where was best to put this table generation step, so I just put it in a separate module called edit-osw. Ideally, I think this table should be created from the library generation step in OpenMS (OpenSwathAssayGenerator/Decoy Generator), but for now this is easier to implement. I also added a method to add a column to the PEPTIDE table that groups the sequence and peptidoforms, more so for downstream analyses, but is not necessary. I use

NOTE: These methods (create_unimod_codename_mapping and create_peptidoform_group_mapping) introduce two new dependencies/suggestions. The first dependency is from pyopenms import AASequence required for handling the modified sequence strings conversions from UniMod to CodeName. The second is more of a suggested module, but not necessary. I use Dask to create a partioned DataFrame, which allows for parallel computation on the small subset data frames. If the user doesn't have this module, then it will just perform regular groupby and apply serially.

Run Command

pyprophet edit-osw --in test_data.osw

Help

sage: pyprophet edit-osw [OPTIONS]

  Edit/Add Tables OSW sqlite database file

Options:
  --in PATH                       PyProphet input file.  [required]
  --out PATH                      PyProphet output file.
  --unimod_codename_table / --no-unimod_codename_table
                                  Add a UNIMOD_TO_CODENAME Mapping table.
                                  [default: unimod_codename_table]
  --peptidoform_grouping / --no-peptidoform_grouping
                                  Add a PEPTIDOFORM_GROUP Column to PEPTIDE
                                  Table.  [default: no-peptidoform_grouping]
  --threads INTEGER               Number of threads used for generating
                                  mappings, using dask partioned dataframes.
                                  -1 means all available CPUs.  [default: 1]
  --help                          Show this message and exit.

Example

Using the following export command: pyprophet export --in merged_original.osw --out merged.tsv --format=legacy_merged --max_rs_peakgroup_qvalue=0.2 --ipf_max_peptidoform_pep=0.4 --max_global_peptide_qvalue=0.01 --max_global_protein_qvalue=0.01 --no-transition_quantification

If we look at a specific example sequence, IASPIQHEHDSGSR from the synthetic phospho dilution series data

Library

The library has two experimentally seen peptidoforms IAS(Phospho)PIQHEHDSGSR(Label:13C(6)15N(4)) and IASPIQHEHDS(Phospho)GSR(Label:13C(6)15N(4)), and then there is also a theoretical generated peptidoform IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4))

     ID UNMODIFIED_SEQUENCE                           MODIFIED_SEQUENCE DECOY
1: 1825      IASPIQHEHDSGSR IAS(Phospho)PIQHEHDSGSR(Label:13C(6)15N(4))     0
2: 1827      IASPIQHEHDSGSR       IAS(UniMod:21)PIQHEHDSGSR(UniMod:267)     0
3: 1841      IASPIQHEHDSGSR IASPIQHEHDS(Phospho)GSR(Label:13C(6)15N(4))     0  
4: 1842      IASPIQHEHDSGSR       IASPIQHEHDS(UniMod:21)GSR(UniMod:267)     0
5: 1843      IASPIQHEHDSGSR IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4))     0 <- Theoretical form

Current Export Results

If we look at the results that get exported, one of the experimentally seen forms IASPIQHEHDS(Phospho)GSR(Label:13C(6)15N(4)) is identified in run 2848806567456108792. The theoretical peptidoform IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4)) also get reported. Depending on what cutoff you use for m_score, some of these would probably get removed, so maybe it's not that big of deal? But there are a few cases where it might still pass through the cutoff as in 2848806567456108792, it gets an m_score of 0.009810127. As mentioned above, there are also duplciate entries of the same feature, from different assays; in 2848806567456108792 there is -4795326927976004004 and -7026778265845113779 mapping to the same RT.

                 run_id                             FullPeptideName                   id       RT ms2_m_score     m_score
1: -6705156405777379722 IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4)) -1351293215868780783  999.479 0.002784220 0.018014322
2: -6705156405777379722 IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4))  8897944053872681987  999.479 0.002784220 0.021234039
3:  2848806567456108792 IASPIQHEHDS(Phospho)GSR(Label:13C(6)15N(4))  2234005514377159751 1006.080 0.003506055 0.062661831
4:  2848806567456108792 IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4)) -4795326927976004004 1006.800 0.002784220 0.009810127
5:  2848806567456108792 IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4)) -7026778265845113779 1006.800 0.002784220 0.010875793
6:  6101854818078396380 IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4)) -1445808813367347286 1011.150 0.002784220 0.092236624

Export Results Merged on UNIMOD_CODENAME_MAPPING

If we join the SCORE_IP table on the UniMod peptide ID instead, then we only get the experimentally seen peptidoform in the results. One caveat that comes up now, is the FullPeptideName that is returned is the one with the UniMod entry instead of the CodeName, which I guess would leave the user to have to manually look up what it responds to. If this is an issue, I could add a step to convert all UniMod to CodeName using AASequence.

                run_id                       FullPeptideName                  id      RT ms2_m_score    m_score
1: 2848806567456108792 IASPIQHEHDS(UniMod:21)GSR(UniMod:267) 2234005514377159751 1006.08 0.003506055 0.06266183

Example of the UNIMOD_CODENAME_MAPPING

When generating the UNIMOD_CODENAME_MAPPING, if there is either no matching UniMod peptide ID to a corresponding CodeName peptideID or vice-versa, I simiply put a -1.

unimod_to_codename_mapping

Example of the PEPTIDOFORM_GROUPING column in PEPTIDE table

peptide_peptidoform_group

Allowing Color Adjustment for Color Blind Friendly Reports

This is a minor addition to add color blind friendly color adjustment to the reports, as requested by #100

This change includes color_blind_friendly() method for extracting a pair of color blind friendly colors. There are 4 color palettes: 'normal'. 'protan', 'deutran', or 'tritan'. This introduces an extra argument color_palette for extracting one of the 4 color palette pairs, with the default being set to 'normal'.

Example

pyprophet score --in merged_original.osw --color_palette deutran
pyprophet peptide --in merged_original.osw --context global --color_palette protan
pyprophet protein --in merged_original.osw --context global --color_palette tritan

color_blind_friendly

Report and Message when pi0 Estimation fails

I extended the error messaged when the estimated pi0 <= 0, to show what the current lambda values were used. I also included a report pdf of the pvalue histogram used in the pi0 estimation, to give the user a better idea of why it may have failed.

Error: The estimated pi0 <= 0. Check that you have valid p-values or use a different range of lambda. Current lambda: [0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95]

Example

image

grosenberger commented 2 years ago

Thanks for the PR! I think there are many interesting ideas that you brought up.

Regaring the export of peptidoforms not part of the library: In the ideal world scenario, the spectral library would contain the coordinates for all peptidoforms of interest to achieve maximum sensitivity. With the DL-based predictors, this would be possible nowadays, but with the experimental libraries that we used before, the libraries have typically been sparse and site-localization has not always been very confident. For this reason, an assumption of IPF was to not trust the site-localization at all and to always infer this from the theoretical model. The peptidoforms that are then exported by default might differ from the original query site-localization but are typically better supported by the data. In some cases, we even export multiple peptidoforms and their scores for the same assay, with the idea that TRIC will sort this out. It's a bit convoluted in the "pyprophet export" code, but the idea is that TRIC would actually align over the peptidoform confidence instead of the peakgroup confidence.

So this was the idea behind that original implementation. I think your suggestion might make sense if we would use a complete library covering all options. In the original implementation, we just reported all results, let IPF and TRIC decide what is supported best by the data and then we would remove redundancy later on. It's important to note though, that the default export parameters try to prevent double assignment of the same peak group to different peptidoforms. That has frequently been a concern with the implementation, but with the measures taken, I have found that the approach to be quite robust.

How do you think about it? Would that modification make sense in your workflow? If yes, I think it would be great if the documentation and help could be extended to highlight the different usage scenarios.

Color Adjustments: Great, let's merge that. Would it be possible to put this in a separate PR?

Report pi0 Estimation: Same here, let's merge that.

Best regards, George

singjc commented 2 years ago

Thank you for your feedback George!

I agree, ideally we would want the library to contain all potential peptidoforms, and then infer which peptidoforms are best supported by data.

My thoughts for experiment generated libraries, was to only consider the peptidoforms that were experimentally identified, because as you mentioned the peptidoform coverage is generally sparse. But yeah I guess, these predicted in silico libraries, like DeepPhospho, would address this issue.

Sure, I can add to the openswath.org documentation to highlight these different scenarios. I guess the trade-off would be limiting your identifications based on prior experimental evidence, at the potential loss of peptidoforms that were not covered in the library that may actually be present.

I made a separate PR #102 for the color adjustment. I'll make a separate PR for the pi0 estimation report once #102 is merged.

Best,

Justin