compomics / ms2rescore

Modular and user-friendly platform for AI-assisted rescoring of peptide identifications
https://ms2rescore.readthedocs.io
Apache License 2.0

Various issues while processing PeptideShaker mzIdentML results #165

Closed ofleitas closed 2 months ago

ofleitas commented 3 months ago

Hello

I am trying to run ms2rescore but get the following error:

Reading PSMs from file...
Reading PSMs from PSM file (1/1): `C:/Users/ofm83/OneDrive/Documents/Megaphobema_perterklaasi/output_PeptideShaker/Megaphobema_perterklaasi.mzid`...
undefined entity: line 35, column 2
Traceback (most recent call last):
  File "ms2rescore\gui\function2ctk.py", line 301, in run
    self.fn(*self.fn_args, **self.fn_kwargs)
  File "ms2rescore\gui\app.py", line 637, in function
    rescore(configuration=config)
  File "ms2rescore\core.py", line 40, in rescore
    psm_list = parse_psms(config, psm_list)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "ms2rescore\parse_psms.py", line 28, in parse_psms
    psm_list = _read_psms(config, psm_list)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "ms2rescore\parse_psms.py", line 90, in _read_psms
    id_file_psm_list = psm_utils.io.read_file(
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "psm_utils\io\__init__.py", line 158, in read_file
    reader = reader_cls(filename, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "psm_utils\io\mzid.py", line 143, in __init__
    self._source = self._infer_source()
                   ^^^^^^^^^^^^^^^^^^^^
  File "psm_utils\io\mzid.py", line 177, in _infer_source
    mzid_xml = ET.parse(self.filename)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "xml\etree\ElementTree.py", line 1218, in parse
  File "xml\etree\ElementTree.py", line 580, in parse
xml.etree.ElementTree.ParseError: undefined entity: line 35, column 2

What can I do?

RalfG commented 3 months ago

Hi @ofleitas,

Thanks for reaching out!

It seems like there is something in your mzIdentML file that cannot be read. It might be corrupt in some way. The program crashes while reading line 35 of the file. To investigate, you can open the mzIdentML file in any text editor (Notepad, Notepad++, VS Code), as long as the file is not too large.
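For files that are too large to open comfortably, the same check can be sketched in Python. This is only a minimal sketch: `find_xml_error` and its `context` parameter are hypothetical names, not part of ms2rescore or psm_utils. It relies on the standard-library `xml.etree.ElementTree` parser (the one shown in the traceback) and on the `position` attribute of `ParseError`, and it assumes the file can be decoded as UTF-8 for display, replacing undecodable bytes.

```python
import xml.etree.ElementTree as ET


def find_xml_error(path, context=2):
    """Return (line, column, nearby_lines) for the first XML parse error, or None.

    Hypothetical helper for illustration; not part of ms2rescore or psm_utils.
    """
    try:
        ET.parse(path)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple; the line is 1-based.
        line, column = err.position
        # Assumption: UTF-8 is good enough to display the surrounding lines.
        with open(path, encoding="utf-8", errors="replace") as handle:
            lines = handle.read().splitlines()
        start = max(line - 1 - context, 0)
        return line, column, lines[start : line + context]
    return None
```

Calling `find_xml_error("results.mzid")` (with your own file name) returns `None` when the file parses cleanly; otherwise it returns the reported line, the column, and a few surrounding lines to inspect.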

If you are able and willing, you can also send us the file. I'd be happy to take a look.

Best, Ralf

ofleitas commented 3 months ago

Hello RalfG

I solved the problem associated with line 35; it seems it was caused by a special character. But now I am getting this error:

Adding DeepLC-derived features to PSMs.
Running DeepLC for PSMs from run (1/1): 20220322_ID_6552...
Multiple modifications per site not supported in Peptide Record format.
Traceback (most recent call last):
  File "ms2rescore\gui\function2ctk.py", line 301, in run
    self.fn(*self.fn_args, **self.fn_kwargs)
  File "ms2rescore\gui\app.py", line 637, in function
    rescore(configuration=config)
  File "ms2rescore\core.py", line 76, in rescore
    fgen.add_features(psm_list)
  File "ms2rescore\feature_generators\deeplc.py", line 163, in add_features
    seq_df=self._psm_list_to_deeplc_peprec(psm_list_calibration)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "ms2rescore\feature_generators\deeplc.py", line 210, in _psm_list_to_deeplc_peprec
    peprec = peptide_record.to_dataframe(psm_list)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "psm_utils\io\peptide_record.py", line 505, in to_dataframe
    return pd.DataFrame([PeptideRecordWriter._psm_to_entry(psm) for psm in psm_list])
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "psm_utils\io\peptide_record.py", line 505, in <listcomp>
    return pd.DataFrame([PeptideRecordWriter._psm_to_entry(psm) for psm in psm_list])
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "psm_utils\io\peptide_record.py", line 285, in _psm_to_entry
    sequence, modifications, charge = proforma_to_peprec(psm.peptidoform)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "psm_utils\io\peptide_record.py", line 443, in proforma_to_peprec
    ms2pip_mods.append(_mod_to_ms2pip(mod, i + 1))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "psm_utils\io\peptide_record.py", line 433, in _mod_to_ms2pip
    raise InvalidPeprecModificationError(
psm_utils.io.peptide_record.InvalidPeprecModificationError: Multiple modifications per site not supported in Peptide Record format.

RalfG commented 3 months ago

Hi @ofleitas,

Regarding the first issue: this was most likely an encoding problem. For future reference, a possible fix is to open the mzIdentML file in an editor such as Windows Notepad and save it again with "UTF-8" encoding specified.
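The same re-save can be sketched in Python for cases where an editor is impractical. This is a minimal sketch under stated assumptions: `resave_as_utf8` is a hypothetical helper, and the default source encoding `cp1252` is only a guess at what a Windows tool might have written. If the XML declaration at the top of the file names a different encoding, that declaration would also need to be updated by hand.

```python
from pathlib import Path


def resave_as_utf8(src, dst, source_encoding="cp1252"):
    """Decode src with source_encoding and write the same text to dst as UTF-8.

    Hypothetical helper; the cp1252 default is an assumption, not a known fact
    about any particular mzIdentML file.
    """
    # Work on bytes so that line endings are preserved exactly.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
    return dst
```

Writing to a new file rather than overwriting the original keeps the untouched input available in case the assumed source encoding turns out to be wrong.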

For the Multiple modifications per site error: I believe this issue was fixed in one of the latest releases. Could you check if the problem persists with the latest release?

Best, Ralf

ofleitas commented 3 months ago

I installed the latest release and it solved the multiple modifications per site error. However, now I am getting this error:

Error occurred: index -3 is out of bounds for axis 0 with size 1

RalfG commented 3 months ago

Glad the second issue was also solved by updating.

Can you provide some more information on the error? Could you paste the full log? Thanks!

ofleitas commented 3 months ago

Hello Ralf

Here is the full log:

Reading PSMs from PSM file (1/1):
'C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/PeptideShaker/Sericopelma_sp.mzid'...
Removed 0 PSMs with rank >= 10.
Found 11699 PSMs, of which 15.46% are decoys.
Non-mapped modifications found: {'Carbamidomethyl', 'Deamidated',
'Oxidation'}
This can be ignored if they are Unimod modification labels.
Found 8383 identified PSMs with rank <= 1 at 0.01 FDR before rescoring.
Adding basic features to PSMs.
Adding MS²PIP-derived features to PSMs.
Running MS²PIP for PSMs from run (1/1) `20220322_ID_6556`...
Processing spectra and peptides...
Adding DeepLC-derived features to PSMs.
Running DeepLC for PSMs from run (1/1): `20220322_ID_6556`...
Percolator output:
Percolator version 3.07.1, Build Date Jun 20 2024 13:21:08

Copyright (c) 2006-9 University of Washington. All rights reserved.

Written by Lukas Käll ***@***.***) in the

Department of Genome Sciences at the University of Washington.

Issued command:

percolator --results-psms
C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.percolator.psms.pout
--decoy-results-psms
C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.percolator.decoy.psms.pout
--results-peptides
C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.percolator.peptides.pout
--decoy-results-peptides
C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.percolator.decoy.peptides.pout
--results-proteins
C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.percolator.proteins.pout
--decoy-results-proteins
C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.percolator.decoy.proteins.pout
--weights
C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.percolator.weights.tsv
--verbose 1 --num-threads 16 --post-processing-tdc
C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.pin

Started Wed Aug  7 17:36:10 2024

Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10

Finding protein decoy prefix for
C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.pin

Using protein decoy prefix ""

Concatenated search input detected and --post-processing-tdc flag set.
Applying target-decoy competition on Percolator scores.

Selecting Cpos by cross-validation.

Selecting Cneg by cross-validation.

Found 8383 test set positives with q<0.01 in initial direction

---Training with Cpos selected by cross validation, Cneg selected by cross
validation, initial_fdr=0.01, fdr=0.01

Found 8970 test set PSMs with q<0.01.

Selected best-scoring PSM per file+scan+expMass (target-decoy competition):
9890 target PSMs and 1809 decoy PSMs.

Tossing out "redundant" PSMs keeping only the best scoring PSM for each
unique peptide.

Calculating q values.

Final list yields 1845 target peptides with q<0.01.

Calculating posterior error probabilities (PEPs).

Removed 0 PSMs with rank >= 1.
Using 0 features:
Found 11699 PSMs.
  - 9890 target PSMs and 1809 decoy PSMs detected.
Assigning confidence...
Performing target-decoy competition...
Keeping the best match per index columns...
- Found 11699 PSMs from unique spectra.
- Found 2421 unique peptides.
Assiging q-values to PSMs...
- Found 8970 PSMs with q<=0.01
Assiging PEPs to PSMs...
Assiging q-values to peptides...
- Found 1846 peptides with q<=0.01
Assiging PEPs to peptides...
Identified 587 (7.00%) more PSMs with rank <= 1 at 0.01 FDR after rescoring.
Writing output to
C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.psms.tsv...
❌ feature weights:
'C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.mokapot.weights.tsv'
❌ log:
'C:/Users/ofm83/OneDrive/Documents/Sericopelma_sp/ms2rescore/ms2rescore.log.txt'
Using 0 features:
Found 11699 PSMs.
  - 9890 target PSMs and 1809 decoy PSMs detected.
Parsing FASTA files and digesting proteins...
  - Parsed and digested 18880 proteins.
  - 15 had no peptides.
  - Retained 18865 proteins.
Matching target to decoy proteins...
Building protein groups...
- Aggregated 18865 proteins into 7621 protein groups.
No decoy sequences were found in the FASTA file.
  - Creating decoy protein groups that mirror the target proteins.
Discarding shared peptides...
  - Discarded 67490 peptides and 58 proteins groups.
  - Retained 363859 peptides from 7563 protein groups.
Assigning confidence...
Performing target-decoy competition...
Keeping the best match per index columns...
- Found 11699 PSMs from unique spectra.
- Found 2421 unique peptides.
Mapping decoy peptides to protein groups...
92 out of 2421 peptides could not be mapped. Please check your digest
settings.
- Found 1016 unique protein groups.
Assiging q-values to PSMs...
- Found 8383 PSMs with q<=0.01
Assiging PEPs to PSMs...
index -3 is out of bounds for axis 0 with size 1
Traceback (most recent call last):
  File "ms2rescore\gui\function2ctk.py", line 301, in run
    self.fn(*self.fn_args, **self.fn_kwargs)
  File "ms2rescore\gui\app.py", line 664, in function
    rescore(configuration=config)
  File "ms2rescore\core.py", line 169, in rescore
    generate.generate_report(
  File "ms2rescore\report\generate.py", line 87, in generate_report
    confidence_before, confidence_after = get_confidence_estimates(psm_list, fasta_file)
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "ms2rescore\report\utils.py", line 72, in get_confidence_estimates
    confidence[when] = lin_psm_dataset.assign_confidence(scores=scores)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mokapot\dataset.py", line 607, in assign_confidence
    return LinearConfidence(
           ^^^^^^^^^^^^^^^^^
  File "mokapot\confidence.py", line 375, in __init__
    self._assign_confidence(desc=desc)
  File "mokapot\confidence.py", line 476, in _assign_confidence
    _, pep = qvality.getQvaluesFromScores(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "triqler\qvality.py", line 80, in getQvaluesFromScores
  File "triqler\qvality.py", line 334, in splineEval
IndexError: index -3 is out of bounds for axis 0 with size 1

RalfG commented 3 months ago

Hi @ofleitas,

Thanks for sharing the log. It seems that the issue occurs when calculating PEP values with qvality, which Mokapot calls through Triqler. I have not seen this problem before, though. If you are at liberty to share the input files that lead to this error, that would be very helpful. If I'm not mistaken, a *ms2rescore.psms.tsv file was already written before the error occurred? This file should suffice to help me understand the problem.

Thanks!

RalfG commented 2 months ago

The IndexError occurred due to input scores (before rescoring) that were all either 0 or 100 (PeptideShaker scores on this specific sample), from which PEPs cannot be calculated. The issue is addressed in #182 by catching the error and logging a descriptive warning. This fix will be part of the v3.1.2 release.
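As a rough illustration of the "catch the error and log a descriptive warning" approach, here is a minimal sketch. It is not the code from #182: `safe_assign_confidence` is a hypothetical wrapper, and the callable passed to it stands in for whatever function performs the confidence assignment.

```python
import logging

logger = logging.getLogger(__name__)


def safe_assign_confidence(assign_confidence, scores):
    """Call assign_confidence(scores); return None with a warning on IndexError.

    Hypothetical wrapper for illustration; not the actual ms2rescore fix.
    """
    try:
        return assign_confidence(scores)
    except IndexError as err:
        # Per the explanation above, this can happen when PEPs cannot be
        # calculated, for example when all scores are either 0 or 100.
        logger.warning(
            "Could not calculate PEPs from the provided scores (%s); "
            "skipping confidence estimates for these scores.",
            err,
        )
        return None
```

With a wrapper like this, the caller checks for `None` and can continue (for instance, skipping part of a report) instead of crashing after the rescoring output has already been written.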