MannLabs / alphapeptdeep

Deep learning framework for proteomics
Apache License 2.0

Retrain model from a spectral library tsv file containing custom modifications #125

Closed. cctsou closed this issue 6 months ago

cctsou commented 6 months ago

Hi

I am trying to retrain the MS2 model using a spectral library file (tsv format) which contains a custom modification on cysteine.

Here is the output I got. It looks like most of the spectra were skipped because of the unknown modification. Could you help me figure out what I did wrong? I have attached the YAML file and a trimmed version of the library TSV file. Thank you very much in advance.

peptdeep_transfer_2024-01-05--11-10-14.400893.yaml.txt NCIH_KRAS_GPF_Library_1_plus_2.report-lib_trimmed.tsv.txt

C:\>peptdeep transfer C:\Users\Chih-ChiangTsou\peptdeep\peptdeep_transfer_2024-01-05--11-10-14.400893.yaml

     ____             __  ____
    / __ \___  ____  / /_/ __ \___  ___  ____
   / /_/ / _ \/ __ \/ __/ / / / _ \/ _ \/ __ \
  / ____/  __/ /_/ / /_/ /_/ /  __/  __/ /_/ /
 /_/    \___/ .___/\__/_____/\___/\___/ .___/
           /_/                       /_/
....................................................
.                      1.1.1                       .
.       https://github.com/MannLabs/peptdeep       .
.                    Apache 2.0                    .
....................................................

2024-01-05 11:25:58> [PeptDeep] Running train task ...
2024-01-05 11:25:58> Platform information:
2024-01-05 11:25:58> system        - Windows
2024-01-05 11:25:58> release       - 10
2024-01-05 11:25:58> version       - 10.0.22631
2024-01-05 11:25:58> machine       - AMD64
2024-01-05 11:25:58> processor     - Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
2024-01-05 11:25:58> cpu count     - 16
2024-01-05 11:25:58> ram           - 56.0/79.7 Gb (available/total)
2024-01-05 11:25:58>
2024-01-05 11:25:58> Python information:
2024-01-05 11:25:58> alphabase        - 1.2.0
2024-01-05 11:25:58> alpharaw         - 0.4.0
2024-01-05 11:25:58> biopython        - 1.82
2024-01-05 11:25:58> click            - 8.1.7
2024-01-05 11:25:58> lxml             - 5.0.0
2024-01-05 11:25:58> numba            - 0.58.1
2024-01-05 11:25:58> numpy            - 1.22.3
2024-01-05 11:25:58> pandas           - 1.4.2
2024-01-05 11:25:58> peptdeep         - 1.1.1
2024-01-05 11:25:58> psutil           - 5.9.7
2024-01-05 11:25:58> pyteomics        - 4.6.3
2024-01-05 11:25:58> python           - 3.10.4
2024-01-05 11:25:58> scikit-learn     - 1.3.2
2024-01-05 11:25:58> streamlit        - 1.29.0
2024-01-05 11:25:58> streamlit-aggrid - 0.3.4.post3
2024-01-05 11:25:58> torch            - 2.1.2
2024-01-05 11:25:58> tqdm             - 4.66.1
2024-01-05 11:25:58> transformers     - 4.36.2
2024-01-05 11:25:58>
2024-01-05 11:26:00> Loading PSMs and extracting fragments ...
794604 Entries with unknown modifications are removed
100%|████████████████████████████████████████████████████████████████████████████| 4052/4052 [00:01<00:00, 3027.46it/s]
2024-01-05 11:26:21> Loaded 4052 PSMs for training and testing
2024-01-05 11:26:21> Training RT model ...
2024-01-05 11:26:21> 3862 PSMs for RT model training/transfer learning
2024-01-05 11:26:21> Training with fixed sequence length: 0
[Training] Epoch=1, lr=1e-05, loss=0.13977318226049343

jalew188 commented 6 months ago

Hi @cctsou, in the YAML settings, the following:

psm_modification_mapping:
      IADTB@C:
      - (IADTB)
      Oxidation@M:
      - (UniMod:35)
      Acetyl@Protein_N-term:
      - (UniMod:1)

must be changed to

psm_modification_mapping:
      IADTB@C:
      - C(IADTB)
      Oxidation@M:
      - M(UniMod:35)
      Acetyl@Protein_N-term:
      - _(UniMod:1)

Let me know if this addresses your issue:)
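
For readers following along, here is a minimal sketch of why the residue prefix matters when a mapping of this shape is matched against modified sequences. It is a toy parser written for illustration only, not the code peptdeep or alphabase actually runs: the function names, the regular expression, and the site numbering (0 for the N-terminus, then 1-based residue positions) are all assumptions of this example.

```python
import re

# Mapping in the corrected shape from the reply above:
# target modification name -> residue-prefixed strings as written in the library.
MAPPING = {
    "IADTB@C": ["C(IADTB)"],
    "Oxidation@M": ["M(UniMod:35)"],
    "Acetyl@Protein_N-term": ["_(UniMod:1)"],
}

# One residue (or "_" for a terminus), optionally followed by a "(...)" tag.
TOKEN = re.compile(r"([A-Z_])(\([^)]*\))?")


def invert_mapping(mapping):
    """Turn {name: [patterns]} into {pattern: name} for direct lookup."""
    return {pat: name for name, pats in mapping.items() for pat in pats}


def parse_modified_sequence(modseq, pattern_to_name):
    """Split a modified sequence into (bare sequence, mapped mods, unknown tags).

    Hypothetical helper: the lookup key is residue + tag, so a pattern
    written without its residue can never match.
    """
    sequence, mods, unknown = [], [], []
    for residue, tag in TOKEN.findall(modseq):
        if residue != "_":
            sequence.append(residue)
        if tag:
            key = residue + tag
            site = len(sequence)  # 0 before the first residue, else 1-based
            if key in pattern_to_name:
                mods.append((pattern_to_name[key], site))
            else:
                unknown.append(key)
    return "".join(sequence), mods, unknown


if __name__ == "__main__":
    lookup = invert_mapping(MAPPING)
    print(parse_modified_sequence("_(UniMod:1)PEPC(IADTB)M(UniMod:35)K_", lookup))

    # Same modification, but mapped without the residue prefix.
    unprefixed = invert_mapping({"IADTB@C": ["(IADTB)"]})
    print(parse_modified_sequence("PEPC(IADTB)K", unprefixed))
```

In this sketch the first call maps all three modifications, while the second reports `C(IADTB)` as unknown because the unprefixed pattern `(IADTB)` never equals the residue-plus-tag key.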

cctsou commented 6 months ago

Hi @jalew188, thanks. I did try that first, but those entries were still treated as unknown modifications. Any suggestions? Since our IADTB mod has the same mass as UniMod:2062, I also tried replacing every C(IADTB) string with C(UniMod:2062) in the spectral library TSV and changed the YAML accordingly, but still no luck.

jalew188 commented 6 months ago

@cctsou This is indeed a bug; I have fixed it in v1.1.3.

cctsou commented 6 months ago

It's working, thanks a lot!!

cctsou commented 6 months ago

A follow-up question: I was able to get the refined model, and I am trying to use it to predict a new library from a FASTA file. However, I encountered the following error message. It looks like it has something to do with the user-defined modification. Any clue?

[PeptDeep] Starting a new job 'C:\Users\Chih-ChiangTsou/peptdeep/tasks/queue\peptdeep_library_2024-01-06--12-47-56.749038.yaml'...
[PeptDeep] Predicting library ...
2024-01-06 12:47:59> [PeptDeep] Running library task ...
2024-01-06 12:47:59> Input files (fasta): ['D:/fasta/test.fasta']
2024-01-06 12:47:59> Platform information:
2024-01-06 12:47:59> system        - Windows
2024-01-06 12:47:59> release       - 10
2024-01-06 12:47:59> version       - 10.0.22631
2024-01-06 12:47:59> machine       - AMD64
2024-01-06 12:47:59> processor     - Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
2024-01-06 12:47:59> cpu count     - 16
2024-01-06 12:47:59> ram           - 56.2/79.7 Gb (available/total)
2024-01-06 12:47:59>
2024-01-06 12:47:59> Python information:
2024-01-06 12:47:59> alphabase        - 1.2.0
2024-01-06 12:47:59> alpharaw         - 0.4.0
2024-01-06 12:47:59> biopython        -
2024-01-06 12:47:59> click            - 8.1.7
2024-01-06 12:47:59> lxml             - 4.9.4
2024-01-06 12:47:59> numba            - 0.58.1
2024-01-06 12:47:59> numpy            - 1.26.2
2024-01-06 12:47:59> pandas           - 2.1.4
2024-01-06 12:47:59> peptdeep         - 1.1.3
2024-01-06 12:47:59> psutil           - 5.9.7
2024-01-06 12:47:59> pyteomics        - 4.6.3
2024-01-06 12:47:59> python           - 3.9.18
2024-01-06 12:47:59> scikit-learn     - 1.3.2
2024-01-06 12:47:59> streamlit        - 1.29.0
2024-01-06 12:47:59> streamlit-aggrid -
2024-01-06 12:47:59> torch            - 2.1.2
2024-01-06 12:47:59> tqdm             - 4.66.1
2024-01-06 12:47:59> transformers     - 4.36.2
2024-01-06 12:47:59>
2024-01-06 12:48:01> Using external ms2 model: 'C:/Users/Chih-ChiangTsou/peptdeep/refined_models/ms2.pth'
2024-01-06 12:48:01> Using external rt model: 'C:/Users/Chih-ChiangTsou/peptdeep/refined_models/rt.pth'
2024-01-06 12:48:01> Using external ccs model: 'C:/Users/Chih-ChiangTsou/peptdeep/refined_models/ccs.pth'
2024-01-06 12:48:01> xxx/library.tsv does not exist, use default IRT_PEPTIDE_DF to translate irt
2024-01-06 12:48:01> Generating the spectral library ...
2024-01-06 12:48:01> Loaded 17865 precursors.
2024-01-06 12:48:01> Predicting RT/IM/MS2 for 16892 precursors ...
2024-01-06 12:48:01> Using multiprocessing with 16 processes ...
2024-01-06 12:48:01> Predicting rt,mobility,ms2 ...
  0%|                                                                                           | 0/31 [00:15<?, ?it/s]
2024-01-06 12:48:18> multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "multiprocessing\pool.py", line 125, in worker
  File "peptdeep\pretrained_models.py", line 914, in _predict_func_for_mp
    return self.predict_all(
  File "peptdeep\pretrained_models.py", line 1084, in predict_all
    self.predict_rt(precursor_df,
  File "peptdeep\pretrained_models.py", line 877, in predict_rt
    df = self.rt_model.predict(precursor_df,
  File "peptdeep\model\model_interface.py", line 388, in predict
    features = self._get_features_from_batch_df(
  File "peptdeep\model\rt.py", line 161, in _get_features_from_batch_df
    self._get_mod_features(batch_df)
  File "peptdeep\model\model_interface.py", line 812, in _get_mod_features
    get_batch_mod_feature(batch_df)
  File "peptdeep\model\featurize.py", line 86, in get_batch_mod_feature
    mod_features_list = batch_df.mods.str.split(';').apply(
  File "pandas\core\series.py", line 4757, in apply
    return SeriesApply(
  File "pandas\core\apply.py", line 1209, in apply
    return self.apply_standard()
  File "pandas\core\apply.py", line 1289, in apply_standard
    mapped = obj._map_values(
  File "pandas\core\base.py", line 921, in _map_values
    return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)
  File "pandas\core\algorithms.py", line 1814, in map_array
    return lib.map_infer(values, mapper, convert=convert)
  File "lib.pyx", line 2926, in pandas._libs.lib.map_infer
  File "peptdeep\model\featurize.py", line 87, in <lambda>
    lambda mod_names: [
  File "peptdeep\model\featurize.py", line 88, in <listcomp>
    MOD_TO_FEATURE[mod] for mod in mod_names
KeyError: 'IADTB@C'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "peptdeep\pipeline_api.py", line 416, in generate_library
    lib_maker.make_library(lib_settings['infiles'])
  File "peptdeep\spec_lib\library_factory.py", line 105, in make_library
    self._predict()
  File "peptdeep\spec_lib\library_factory.py", line 68, in _predict
    self.spec_lib.predict_all()
  File "peptdeep\spec_lib\predict_lib.py", line 121, in predict_all
    res = self.model_manager.predict_all(
  File "peptdeep\pretrained_models.py", line 1127, in predict_all
    return self.predict_all_mp(
  File "peptdeep\pretrained_models.py", line 964, in predict_all_mp
    for ret_dict in process_bar(
  File "peptdeep\utils.py", line 27, in process_bar
    for i,iter in enumerate(iterator):
  File "multiprocessing\pool.py", line 870, in next
KeyError: 'IADTB@C'

'IADTB@C'
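
The traceback above ends in a dictionary lookup, `MOD_TO_FEATURE[mod]`, applied to every name obtained by splitting the `mods` column on `;`. A minimal sketch of a pre-check that lists which modification names would fail such a lookup is shown below. The function name and the sample data are hypothetical, and `known_mods` is only a stand-in for the keys of whatever lookup table the installed version uses; this is a diagnostic aid, not a fix for the error.

```python
def find_unknown_mods(mods_column, known_mods):
    """Count modification names that are absent from known_mods.

    mods_column: iterable of strings such as "IADTB@C;Oxidation@M"
                 (an empty string means an unmodified peptide).
    known_mods:  set of names the lookup table can resolve (assumed input).
    """
    missing = {}
    for mods in mods_column:
        if not mods:
            continue  # unmodified peptide, nothing to look up
        for mod in mods.split(";"):
            if mod not in known_mods:
                missing[mod] = missing.get(mod, 0) + 1
    return missing


if __name__ == "__main__":
    # Toy data: two known names and a column containing one unknown name.
    known = {"Oxidation@M", "Carbamidomethyl@C"}
    column = ["", "Oxidation@M", "IADTB@C;Oxidation@M", "IADTB@C"]
    print(find_unknown_mods(column, known))
```

With the toy data, the only name reported is `IADTB@C`, counted twice, which mirrors the key named in the `KeyError`.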