compomics / ms2rescore

Modular and user-friendly platform for AI-assisted rescoring of peptide identifications
https://ms2rescore.readthedocs.io
Apache License 2.0

Uncaught exception in DeepLCFeatureGenerator if not enough peptides for calibration set #130

Open vrkosk opened 3 months ago

vrkosk commented 3 months ago

I'm getting an uncaught exception when trying to use ms2rescore.feature_generators.deeplc.DeepLCFeatureGenerator. The error happens when there are not enough peptides in psm_list for the calibration set.

Here's how I create the environment:

C:\python\python309\python.exe -m venv venv_309_ms2rescore
venv_309_ms2rescore\Scripts\pip3 install ms2rescore==3.0.2

I'm calling the feature generator as instructed in MS2Rescore docs:

    from ms2rescore.feature_generators.deeplc import DeepLCFeatureGenerator

    fgen = DeepLCFeatureGenerator(
        lower_score_is_better=True,  # because we use expect value as 'score'
        spectrum_path=None,  # not relevant
        processes=processes,
        deeplc_retrain=False,
        calibration_set_size=0.15,
    )

    fgen.add_features(psm_list)

When there are only a few items in psm_list, there's an uncaught exception:

2024-03-22 11:17:35,204 INFO Running DeepLC for PSMs from run (1/1): `F981141_1.tsv9ig132dw.mgf`...
Traceback (most recent call last):
  File "C:\Users\villek\githead\mascot-proj\mascot\www\bin\ML_adapters\MS2RescoreAdapter.py", line 243, in <module>
    main()
  File "C:\Users\villek\githead\mascot-proj\mascot\www\bin\ML_adapters\MS2RescoreAdapter.py", line 218, in main
    _add_DeepLC_features(
  File "C:\Users\villek\githead\mascot-proj\mascot\www\bin\ML_adapters\MS2RescoreAdapter.py", line 126, in _add_DeepLC_features
    fgen.add_features(psm_list)
  File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\ms2rescore\feature_generators\deeplc.py", line 163, in add_features
    seq_df=self._psm_list_to_deeplc_peprec(psm_list_calibration)
  File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\ms2rescore\feature_generators\deeplc.py", line 211, in _psm_list_to_deeplc_peprec
    peprec = peprec.rename(
  File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\pandas\core\frame.py", line 3813, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\pandas\core\indexes\base.py", line 6070, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\pandas\core\indexes\base.py", line 6130, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['tr', 'seq', 'modifications'], dtype='object')] are in the [columns]"

The workaround in my script is to pass calibration_set_size=1.0 when round(calibration_set_size * len(psm_list[~psm_list['is_decoy']])) == 0. Then _psm_list_to_deeplc_peprec() gets a non-empty array and all is fine. Quite likely I shouldn't even use DeepLC if there aren't enough peptide matches!
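For reference, the workaround condition can be sketched as a small helper. This is only an illustration: the function name effective_calibration_set_size is hypothetical and not part of ms2rescore, and it assumes the calibration set is taken as a fraction of the target (non-decoy) PSMs, as the condition above implies.

```python
def effective_calibration_set_size(is_decoy, requested=0.15):
    """Return the calibration fraction to pass to the feature generator.

    Hypothetical helper (not part of ms2rescore). Falls back to 1.0 when
    the requested fraction would select zero target PSMs.
    """
    # Count target (non-decoy) PSMs; works for lists and boolean arrays.
    n_targets = sum(1 for decoy in is_decoy if not decoy)
    if round(requested * n_targets) == 0:
        # Too few targets for the requested fraction: use all of them.
        return 1.0
    return requested
```

In my script this would be called as effective_calibration_set_size(psm_list['is_decoy']) and the result passed as calibration_set_size. Note that it does not help when there are no target PSMs at all; in that case the calibration set is empty even at 1.0.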

RalfG commented 2 months ago

Hi, @vrkosk,

Thanks for reporting! We will look into this.

Best, Ralf

RalfG commented 2 months ago

For internal reference:

_psm_list_to_deeplc_peprec() has already been removed in the timsRescore branch in favor of sending the PSMList directly to DeepLC. However, we should still look into how this behaves when there are not enough PSMs (or none) for calibration.
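One possible way to handle this, sketched here only as an illustration and not as the actual fix: check the calibration set size up front and raise a descriptive error instead of letting a KeyError surface from pandas. The names NotEnoughCalibrationPSMsError and require_calibration_psms are hypothetical, and the sketch assumes the calibration set size is round(calibration_set_size * number of target PSMs), as in the condition reported above.

```python
class NotEnoughCalibrationPSMsError(ValueError):
    """Hypothetical exception: no PSMs would end up in the calibration set."""


def require_calibration_psms(is_decoy, calibration_set_size=0.15):
    """Return the number of calibration PSMs, or raise if it would be zero.

    Assumption: the calibration set is a fraction of the target PSMs.
    """
    # Count target (non-decoy) PSMs.
    n_targets = sum(1 for decoy in is_decoy if not decoy)
    n_calibration = round(calibration_set_size * n_targets)
    if n_calibration == 0:
        raise NotEnoughCalibrationPSMsError(
            f"Calibration set would be empty: {n_targets} target PSM(s) "
            f"at calibration_set_size={calibration_set_size}."
        )
    return n_calibration
```

A caller could catch this exception and either skip the DeepLC features for that run or log a warning, rather than crashing.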