compomics / compomics-utilities

Open source Java library for computational proteomics
http://compomics.github.io/projects/compomics-utilities.html
29 stars 17 forks source link

MGF reader assumes TITLE lines to always be url-encoded #61

Closed elrubio closed 10 months ago

elrubio commented 10 months ago

The MgfFileIterator expects spectrum titles (TITLE=) in MGF files to be url-encoded.

This leads to errors (e.g. in PeptideShaker: "Spectrum not found") in cases where titles are in fact not url-encoded but contain decodable characters, e.g. a plus sign "+" that gets decoded into a space char.

I've never worked with MGF files having encoded titles and would argue that this is a wrong implementation of the MGF "standard" [1]. I would welcome a change that has decoding disabled by default. Although not trivial, you could guess if the string was encoded in the first place [2] or make decoding optional. The current behaviour actually requires me to change many files every time I want to parse them with PeptideShaker.

[1] Mascot itself doesn't use or mention url-encoded strings in its documentation and shows TITLE lines as not encoded. Other libraries like pyteomics don't decode either.

[2] Even if urldecode(s) == s.replace("+", " ") you couldn't tell between encoded spaces and true plus signs.

hbarsnes commented 10 months ago

I'm trying to figure out why we added the decoding of the spectrum titles in the first place. As I'm pretty sure we did not add it without reason... But I guess one can always argue that it was added for a non-standard mgf file and that most mgf files will not have encoded titles. I will dig a bit deeper and see what I can find.

In the meantime, I've added the following fix: String plusEncoded = spectrumTitle.replaceAll("\+", "%2b"); // special case for titles containing '+' spectrumTitle = URLDecoder.decode(plusEncoded, "utf-8");

Looks a bit like a hack that should not be required, but seems to solve the problem for your files at least?

I've updated the beta version so that you can have a try: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/3.0.0-beta/PeptideShaker-3.0.0-beta.zip

elrubio commented 10 months ago

Thank you for that ultra-quick fix. While this will surely solve my problem, it will raise the inverted issue for url-encoded titles that contained a space. Their space will now be converted to a "+". I'm afraid it's not possible to differentiate between these two cases:

  1. My fancy title --> urlencode() --> TITLE=My+fancy+title
  2. My+fancy+title --> no encoding --> TITLE=My+fancy+title
hbarsnes commented 10 months ago

I found the reason for the change! It was changed way back in Oct 24, 2012: https://github.com/compomics/compomics-utilities/commit/4f4af8bc9ccf762dd51868922ca92999e63b7af1.

And the reason given was "The spectrum titles in mgf files are now decoded from unicode to be compatible with unicode encoded titles in Mascot dat files."

Hence I'm assuming that it would now be safe to remove this decoding as it no longer seems to be required and instead causes issues.

I guess one can say that we have finally come full circle? And it only took 11 years. ;)

I've updated the beta version once more so that you can have another try (this time without any decoding or hacks to avoid it): https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/3.0.0-beta/PeptideShaker-3.0.0-beta.zip

hbarsnes commented 10 months ago

Note that you probably have to delete your old mgf index files, i.e. the cms files located next to the mgf files, for the fix to work. As otherwise the old indexes will simply be used instead.

elrubio commented 10 months ago

Works perfectly. A full circle indeed. Thank you so much for the ultra fast changes. 🙌