compomics / ms2rescore

Modular and user-friendly platform for AI-assisted rescoring of peptide identifications
https://ms2rescore.readthedocs.io
Apache License 2.0

Fatal pickling exception when using [UNIMOD:nn] identifier in ProForma string #128

Closed: vrkosk closed this issue 3 months ago

vrkosk commented 3 months ago

I'm getting a mysterious pickling error when trying to use ms2rescore.feature_generators.ms2pip.MS2PIPFeatureGenerator. The error seems to be caused by having [UNIMOD:nn] variable mods in the ProForma string of a peptide.

Here's how I create the environment:

C:\python\python309\python.exe -m venv venv_309_ms2rescore
venv_309_ms2rescore\Scripts\pip3 install ms2rescore==3.0.2

I've written a script that creates a suitable psm_list and MGF file, then the script just calls:

from ms2rescore.feature_generators.ms2pip import MS2PIPFeatureGenerator

...

    fgen = MS2PIPFeatureGenerator(
        model=model,
        ms2_tolerance=ms2_tolerance,
        spectrum_path=mgf_file_name,
        processes=processes,
    )
    fgen.add_features(psm_list)

This works with unmodified peptides (so ProForma sequences like "PEPTIDEK/2"). However, every ProForma sequence with mods ends up with a strange stack trace:

    2024-03-22 09:38:23,962 INFO Running MS²PIP for PSMs from run (1/1) `F981139_1.tsvn36rl336.mgf`...
    2024-03-22 09:38:25,678 INFO Processing spectra and peptides...
    Traceback (most recent call last):
      File "C:\Users\villek\githead\mascot-proj\mascot\www\bin\ML_adapters\MS2RescoreAdapter.py", line 239, in <module>
        main()
      File "C:\Users\villek\githead\mascot-proj\mascot\www\bin\ML_adapters\MS2RescoreAdapter.py", line 205, in main
        _add_MS2PIP_features(
      File "C:\Users\villek\githead\mascot-proj\mascot\www\bin\ML_adapters\MS2RescoreAdapter.py", line 108, in _add_MS2PIP_features
        fgen.add_features(psm_list)
      File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\ms2rescore\feature_generators\ms2pip.py", line 207, in add_features
        self._calculate_features(psm_list_run, ms2pip_results)
      File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\ms2rescore\feature_generators\ms2pip.py", line 218, in _calculate_features
        for result, features in zip(
      File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\rich\progress.py", line 168, in track
        yield from progress.track(
      File "C:\Users\villek\tmp\venv_309_ms2rescore\lib\site-packages\rich\progress.py", line 1209, in track
        for value in sequence:
      File "C:\python\Python309\lib\multiprocessing\pool.py", line 420, in <genexpr>
        return (item for chunk in result for item in chunk)
      File "C:\python\Python309\lib\multiprocessing\pool.py", line 870, in next
        raise value
      File "C:\python\Python309\lib\multiprocessing\pool.py", line 537, in _handle_tasks
        put(task)
      File "C:\python\Python309\lib\multiprocessing\connection.py", line 211, in send
        self._send_bytes(_ForkingPickler.dumps(obj))
      File "C:\python\Python309\lib\multiprocessing\reduction.py", line 51, in dumps
        cls(buf, protocol).dump(obj)
    AttributeError: Can't pickle local object 'create_engine.<locals>.connect'

Here's what I've tried in a fresh venv:

Are the ProForma strings in the input the issue? In this case, there's just one peptide (output of psm_list.to_dataframe().to_csv()):

,peptidoform,spectrum_id,run,collection,spectrum,is_decoy,score,qvalue,pep,precursor_mz,retention_time,ion_mobility,protein_list,rank,source,provenance_data,metadata,rescoring_features
0,IPAM[UNIMOD:35][-63.998285]TIAK/2,query13_rank1_label1,F981123_1.tsvzrkjbv40.mgf,,,False,0.0475696443353316,,,430.732788,,,,,,"{'query': '13', 'rank': '1', 'label': '1'}",{},{}

The ProForma syntax is correct, or at least accepted by https://pyteomics.readthedocs.io/en/latest/api/proforma.html. If I change it to Unimod titles:

,peptidoform,spectrum_id,run,collection,spectrum,is_decoy,score,qvalue,pep,precursor_mz,retention_time,ion_mobility,protein_list,rank,source,provenance_data,metadata,rescoring_features
0,IPAM[UNIMOD:Oxidation][-63.998285]TIAK/2,query13_rank1_label1,F981123_1.tsvmwgalmq5.mgf,,,False,0.0475696443353316,,,430.732788,,,,,,"{'query': '13', 'rank': '1', 'label': '1'}",{},{}

Then I still get the same exception. If I change the input to use deltas instead of Unimod record IDs, like this:

0,IPAM[+15.994915][-63.998285]TIAK/2,query13_rank1_label1,F981123_1.tsvi5cnqcn8.mgf,,,False,0.0475696443353316,,,430.732788,,,,,,"{'query': '13', 'rank': '1', 'label': '1'}",{},{}

Then MS2PIP is happy. Note, the issue is not having multiple mods at the same site. The fault is clearly caused by the [UNIMOD:nn] field.

From grepping the venv, I can tell create_engine comes from sqlalchemy, and it looks like this has something to do with ms2pip's dlib support:

ms2pip\_utils\dlib.py:94:def open_sqlite(filename):
ms2pip\spectrum_output.py:763:        with open_sqlite(filename) as connection:

But I can't see what I can do at the call site to disable dlib (if it's the issue) or what setting to change.
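For context, the error message is generic Python behaviour rather than anything specific to ms2pip: pickle cannot serialize a function defined inside another function, and multiprocessing pickles every task it sends to a worker process. The sketch below reproduces the same kind of failure with a hypothetical `create_engine` stand-in (not SQLAlchemy's real function). It does not show where such an object gets attached to the PSMs, which this thread does not establish.

```python
import pickle


def create_engine():
    """Hypothetical stand-in: returns a function defined in a local scope."""

    def connect():
        return "connection"

    return connect


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the error message."""
    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError) as exc:
        return str(exc)
    return None


# A locally defined function cannot be pickled by reference.
error = try_pickle(create_engine())
print(error)

# A module-level or built-in function pickles without trouble.
print(try_pickle(len))
```

Any object that holds a reference to such a local function fails in the same way when it is handed to a `multiprocessing` pool, which matches the last frames of the traceback above.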

I will work around the issue by only using deltas in ProForma strings.
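A minimal sketch of that workaround, assuming a hand-maintained lookup table of mass deltas. `UNIMOD_DELTAS` and `unimod_to_delta` are hypothetical names, and only Oxidation (with the delta quoted above) is filled in; other modifications would have to be added by hand. It rewrites the string only and does not call pyteomics or psm_utils.

```python
import re

# Assumption: hand-maintained monoisotopic mass deltas, keyed by Unimod
# accession or title. Only Oxidation (value from this thread) is included.
UNIMOD_DELTAS = {
    "35": 15.994915,
    "Oxidation": 15.994915,
}

# Matches tags such as [UNIMOD:35] or [UNIMOD:Oxidation].
UNIMOD_TAG = re.compile(r"\[UNIMOD:([^\]]+)\]")


def unimod_to_delta(proforma: str) -> str:
    """Replace [UNIMOD:x] tags with signed mass deltas such as [+15.994915]."""

    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key not in UNIMOD_DELTAS:
            # Fail loudly rather than silently leaving the tag in place.
            raise KeyError(f"No mass delta known for UNIMOD:{key}")
        return f"[{UNIMOD_DELTAS[key]:+.6f}]"

    return UNIMOD_TAG.sub(replace, proforma)


print(unimod_to_delta("IPAM[UNIMOD:35][-63.998285]TIAK/2"))
# prints IPAM[+15.994915][-63.998285]TIAK/2
```

Tags that are already mass deltas, and unmodified sequences such as "PEPTIDEK/2", pass through unchanged.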

vrkosk commented 3 months ago

I'm getting the same exception if I use [UNIMOD:nn] with DeepLCFeatureGenerator. Using deltas with DeepLCFeatureGenerator kind of works, but it correctly logs warnings like:

2024-03-22 11:31:07,118 WARNING Skipping the following (not in library): [GenericModification('15.994915', None, None)]

Which means DeepLCFeatureGenerator is useless until the fault is fixed.

ttzzjt commented 3 months ago

Same here, tried different platforms. AttributeError: Can't pickle local object 'create_engine.<locals>.connect'

Got through by using the docker version.

vrkosk commented 3 months ago

Looking at the current Dockerfile, it specifies ubuntu:focal (20.04 LTS) as the starting point. I'm not sure which Python version it has; Distrowatch says 3.8.

I've installed Python 3.8.10, the latest 3.8.x with a 64-bit Windows installer, created the venv, and tried pip install ms2rescore. Unfortunately, it gets stuck compiling statsmodels and there are various errors about ms2pip wheels not being available. So I can't confirm or disprove that using Python 3.8 helps on Windows.

RalfG commented 3 months ago

Hi @vrkosk and @ttzzjt,

We are investigating the issue. For now, it seems like downgrading to Pyteomics 4.6 helps to avoid the issue. Just run:

pip install pyteomics==4.6.3

Or update psm_utils, which temporarily has the Pyteomics version restriction in place:

pip install --upgrade psm_utils

I will post here when we know more.

Best, Ralf

vrkosk commented 3 months ago

Yes, downgrading to pyteomics 4.6.3 works.
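To confirm which versions a given venv actually ended up with after the downgrade, something like the following can be run inside it. This is a small sketch using only the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the packages discussed in this thread.
for name in ("pyteomics", "ms2rescore"):
    print(name, installed_version(name))
```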

RalfG commented 3 months ago

A fix will be included in the next Pyteomics release: levitsky/pyteomics#144