Parsing mzIdentML is whitespace sensitive

lazear commented 1 year ago

Hi,

Since we were discussing integration of Sage (https://github.com/compomics/searchgui/issues/334), I wrote an mzIdentML module to write results, as I wanted to play around with PeptideShaker a bit more.

Unfortunately, parsing of Modification elements (and possibly other elements) appears to be whitespace dependent. The XML library I am using to write the mzIdentML files (serializing from Rust structs) does not support whitespace/indentation at this time. I have included links to two minimal examples of the same mzid file (which passes the PSI Validator tool): one is formatted by an external tool and loads in PS fine; the other is the unformatted version, which throws the error below:

Formatted mzid: https://gist.github.com/lazear/c7bc428bd7e5227d85a7b5745085c346
Unformatted mzid: https://gist.github.com/lazear/7dd0403d2df1c3f7dd2f0d08c91302f8

Notably, changing any Modification entry in the working file to a single line is sufficient to reproduce the issue.

<Modification monoisotopicMassDelta="15.9949" location="2"><cvParam cvRef="PSI-MS" accession="MS:1001460" name="unknown modification" /></Modification>
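Since the externally formatted file loads fine, one possible stopgap is to re-indent the unformatted output before importing it. The following is a minimal sketch using Python's standard `xml.dom.minidom`; the helper name `pretty_print_xml` is hypothetical, and whether PeptideShaker accepts this exact layout is an assumption that has not been verified here. Note that `minidom` loads the whole document into memory, so this may be impractical for very large mzid files.

```python
from xml.dom import minidom


def pretty_print_xml(xml_text: str, indent: str = "  ") -> str:
    """Re-serialize xml_text with one element per line (hypothetical helper)."""
    document = minidom.parseString(xml_text)
    return document.toprettyxml(indent=indent)


# The single-line Modification entry from the report above.
single_line = (
    '<Modification monoisotopicMassDelta="15.9949" location="2">'
    '<cvParam cvRef="PSI-MS" accession="MS:1001460" name="unknown modification" />'
    '</Modification>'
)

formatted = pretty_print_xml(single_line)
print(formatted)
```

For a real file, the same function could be applied to the text read from the unformatted mzid and the result written to a new file.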

Spectrum file is "b1906_293T_proteinID_01A_QE3_122212.raw" from http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD001468

Error message:


Sun Oct 30 11:51:01 PDT 2022: PeptideShaker version 2.2.17.
Memory given to the Java virtual machine: 4294967296.
Total amount of memory in the Java virtual machine: 138412032.
Free memory: 95266672.
Java version: 19.
1714 script command tokens
(C) 2009 Jmol Development
Jmol Version: 12.0.43  2011-05-03 14:21
java.vendor: Homebrew
java.version: 19
os.name: Mac OS X
memory: 54.2/134.2
processors available: 8
useCommandThread: false
WARNING: row index is bigger than sorter's row count. Most likely this is a wrong sorter usage.
java.lang.IllegalArgumentException: Could not parse PTM!
    at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.parsePeptide(MzIdentMLIdfileReader.java:392)
    at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.parseFile(MzIdentMLIdfileReader.java:293)
    at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.getAllSpectrumMatches(MzIdentMLIdfileReader.java:202)
    at eu.isas.peptideshaker.fileimport.FileImporter.importPsms(FileImporter.java:466)
    at eu.isas.peptideshaker.fileimport.FileImporter.importFiles(FileImporter.java:277)
    at eu.isas.peptideshaker.PeptideShaker.importFiles(PeptideShaker.java:219)
    at eu.isas.peptideshaker.gui.NewDialog$20.run(NewDialog.java:736)
    at java.base/java.lang.Thread.run(Thread.java:1589)

Also, while I'm here... is there a way to completely turn off all of PeptideShaker's filters & validation features? I would love to be able to use it as just a GUI/PSM visualizer that blindly trusts what is in the mzIdentML file - I understand if this doesn't align with the goals of the project though

hbarsnes commented 1 year ago

Yes, you are indeed correct in that it seems like our mzid parsing is formatting-specific. I guess we never considered that anyone would want to write an mzid file without any formatting, as it makes it near impossible to read for humans. I could look into trying to adapt it, but probably better that I prioritize the pin file import instead?

> Also, while I'm here... is there a way to completely turn off all of PeptideShaker's filters & validation features? I would love to be able to use it as just a GUI/PSM visualizer that blindly trusts what is in the mzIdentML file - I understand if this doesn't align with the goals of the project though

No, I'm afraid this is not currently supported. It has been talked about, but we concluded that it would require too many changes to the underlying code to be worth the effort. At least with the current limited resources.

lazear commented 1 year ago

Technically XML is supposed to be whitespace agnostic (except where it isn't), and I would assume that mzIdentML files follow that (given that the PSI Validator accepts unformatted mzid files). I can't imagine too many people prefer to read mzIdentML files over tsv/csv/etc!
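To illustrate the point, here is a small sketch using Python's standard `xml.etree.ElementTree`. The two strings are the indented and single-line forms of the Modification entry from the report; the `canonical` helper is hypothetical and simply drops whitespace-only text. The raw text nodes differ between the two forms (that is the "except where it isn't" part), but once whitespace-only text is ignored the two documents are identical.

```python
import xml.etree.ElementTree as ET

INDENTED = """\
<Modification monoisotopicMassDelta="15.9949" location="2">
    <cvParam cvRef="PSI-MS" accession="MS:1001460" name="unknown modification" />
</Modification>"""

SINGLE_LINE = (
    '<Modification monoisotopicMassDelta="15.9949" location="2">'
    '<cvParam cvRef="PSI-MS" accession="MS:1001460" name="unknown modification" />'
    '</Modification>'
)


def canonical(element):
    """Reduce an element to (tag, attributes, stripped text, children)."""
    text = (element.text or "").strip()
    return (element.tag, dict(element.attrib), text,
            [canonical(child) for child in element])


indented_root = ET.fromstring(INDENTED)
single_root = ET.fromstring(SINGLE_LINE)

# The indented form carries a whitespace-only text node; the other has none.
print(repr(indented_root.text))
print(repr(single_root.text))

# After ignoring whitespace-only text, both forms compare equal.
same_content = canonical(indented_root) == canonical(single_root)
print(same_content)
```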

Obviously not a pressing issue for me, but I figured I would document this in case of future bugs.

> but probably better that I prioritize the pin file import instead

Absolutely - I should be ready very soon.

hbarsnes commented 1 year ago

> Technically XML is supposed to be whitespace agnostic

"Should" is the keyword there. ;) But yes, this is clearly something that ought to be fixed in our homemade mzid parser. The reason for making our own parser was that the available ones, at least at the time, were all too slow and used too much memory. Our parser reads through the file only once, extracts just the data we need, and ignores everything else. I will try to find the time to look into improving it later.

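For reference, a single-pass reader that extracts only the needed elements does not have to depend on line layout. The sketch below is not the PeptideShaker parser; it uses Python's standard `ET.iterparse` as a stand-in for an event-driven reader, and the names `iter_modifications` and `local_name` as well as the small `Peptide` wrapper documents are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_modifications(source):
    """Yield one dict per Modification element in a single pass over source."""
    for _event, element in ET.iterparse(source, events=("end",)):
        if local_name(element.tag) != "Modification":
            continue  # everything else is ignored
        cv = next((c for c in element if local_name(c.tag) == "cvParam"), None)
        yield {
            "location": int(element.get("location")),
            "mass_delta": float(element.get("monoisotopicMassDelta")),
            "accession": cv.get("accession") if cv is not None else None,
        }
        # Drop the children and attributes of the processed element.
        # A production reader would also need to release parent elements.
        element.clear()


MODIFICATION_ONE_LINE = (
    '<Modification monoisotopicMassDelta="15.9949" location="2">'
    '<cvParam cvRef="PSI-MS" accession="MS:1001460" name="unknown modification" />'
    '</Modification>'
)
MODIFICATION_INDENTED = (
    '<Modification monoisotopicMassDelta="15.9949" location="2">\n'
    '    <cvParam cvRef="PSI-MS" accession="MS:1001460" name="unknown modification" />\n'
    '</Modification>'
)


def parse(modification_xml):
    """Wrap a Modification in a hypothetical Peptide element and parse it."""
    document = '<Peptide id="p1">' + modification_xml + '</Peptide>'
    return list(iter_modifications(io.BytesIO(document.encode("utf-8"))))


from_one_line = parse(MODIFICATION_ONE_LINE)
from_indented = parse(MODIFICATION_INDENTED)
print(from_one_line == from_indented)
```

Because the reader reacts to element end events rather than to line breaks, both layouts produce the same records.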
compomics / peptide-shaker

Parsing mzIdentML is whitespace sensitive #493