java error on msgf search results

chrishuges commented 3 years ago

Hi,

I am seeing an issue when processing some raw files using a pipeline with X!Tandem, MS-GF+, and Comet. This is with versions 4.0.4 of SearchGUI and 2.0.5 of PeptideShaker. I am running the CLI versions on a Linux system (CentOS 7). The error is:

An error occurred while loading spectrum matches from 'f04538_Prot_01_01.raw.msgf.mzid.gz'. This file will be ignored. Error: java.lang.ArrayIndexOutOfBoundsException: Index 39266 out of bounds for length 39266 See resources/PeptideShaker.log for details.

I am re-processing the CCLE proteomics data (https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=02cd1b6a7c674f3ebdbed300b5d9aa57) using a standard shell script I use for my typical analyses of this type. I have attached the script output, dataProcessing.txt. You can see that PeptideShaker fails to read in the MSGF results for specific files only (fractions 1, 3, 4, 5, 8). This is reproducible in the sense that it fails on these same fractions each time. The input parameters are here. In the PeptideShaker log, this is the output it has for the ones where it appears to fail:

Sun Dec 20 01:26:25 PST 2020: PeptideShaker version 2.0.5.
Memory given to the Java virtual machine: 128849018880.
Total amount of memory in the Java virtual machine: 2113929216.
Free memory: 2097782744.
Java version: 14.0.2.
java.lang.ArrayIndexOutOfBoundsException: Index 38796 out of bounds for length 38796
    at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.parsePsm(MzIdentMLIdfileReader.java:757)
    at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.parseFile(MzIdentMLIdfileReader.java:301)
    at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.getAllSpectrumMatches(MzIdentMLIdfileReader.java:202)
    at eu.isas.peptideshaker.fileimport.FileImporter.importPsms(FileImporter.java:466)
    at eu.isas.peptideshaker.fileimport.FileImporter.importFiles(FileImporter.java:277)
    at eu.isas.peptideshaker.PeptideShaker.importFiles(PeptideShaker.java:219)
    at eu.isas.peptideshaker.cmd.PeptideShakerCLI.createProject(PeptideShakerCLI.java:1173)
    at eu.isas.peptideshaker.cmd.PeptideShakerCLI.call(PeptideShakerCLI.java:243)
    at eu.isas.peptideshaker.cmd.PeptideShakerCLI.main(PeptideShakerCLI.java:1389)
java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException

Any ideas why I might be seeing this error? If I open these individual raw files in XCalibur software, they seem to be fine and not corrupt in any way. Also, X!Tandem and Comet seem to work fine on them. The pipeline and parameters I am using work fine on my own data; it is only with these CCLE data that I have seen this issue.

Thanks, Chris

hbarsnes commented 3 years ago

Hi Chris,

The problem is in the mapping between the MS-GF+ mzid results file and the corresponding spectrum file. We try to do this via the index element in the SpectrumIdentificationResult tag, but for some reason this seems to fail for some of the MS-GF+ files.
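To see which indexes an mzid file actually uses, the spectrumID attributes of the SpectrumIdentificationResult elements can be listed. The following is a minimal Python sketch, assuming spectrumID values of the form `index=N`. The function name `spectrum_indexes` and the sample XML are made up for illustration, and this is not how PeptideShaker itself does the mapping.

```python
import re
import xml.etree.ElementTree as ET

def spectrum_indexes(mzid_xml):
    """Return the integer index of every SpectrumIdentificationResult."""
    root = ET.fromstring(mzid_xml)
    indexes = []
    for elem in root.iter():
        # Compare the local tag name so any mzIdentML namespace version works.
        if elem.tag.rsplit("}", 1)[-1] != "SpectrumIdentificationResult":
            continue
        match = re.fullmatch(r"index=(\d+)", elem.get("spectrumID", ""))
        if match:
            indexes.append(int(match.group(1)))
    return indexes

# Hypothetical, heavily trimmed mzid fragment (not taken from the real file).
SAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <SpectrumIdentificationResult id="SIR_1" spectrumID="index=0"/>
  <SpectrumIdentificationResult id="SIR_2" spectrumID="index=39266"/>
</MzIdentML>"""

indexes = spectrum_indexes(SAMPLE)
print(min(indexes), max(indexes))
```

For a real `.mzid.gz` file, the same loop could probably be fed from `gzip.open` and `ET.iterparse` to avoid loading the whole document at once.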

How is the raw file converted? Is it done internally in SearchGUI via ThermoRawFileParser?

Could you share one of the mzid files that fail plus the raw file used? Hopefully I can then reproduce the issue and come up with a fix.

Best regards, Harald

chrishuges commented 3 years ago

Hi Harald,

Interesting. The raw file is converted using RawTools so that I can extract the TMT values at the same time. Happy to share the files. How should I get them to you?

Cheers, Chris

hbarsnes commented 3 years ago

Hi Chris,

> The raw file is converted using RawTools so that I can extract the TMT values at the same time.

Aha, I assume you are converting to mzML then and using it as input in SearchGUI? If so, which MS levels are you including in the mzML file? I think that our mzIdentML parsing (or at least the mapping back to the spectra) assumes that the mzML file only includes the MS2 spectra. At least if the index option in the mentioned SpectrumIdentificationResult tag is used.

Perhaps you can try providing the raw file directly to SearchGUI just to see if that works?

> How should I get them to you?

Probably best to use something like Dropbox and share the link?

Would be great to get one raw file, the RawTools-converted mzML and the resulting MS-GF+ mzIdentML file.

Best regards, Harald

chrishuges commented 3 years ago

Hi Harald,

Actually it is an MGF. RawTools doesn't do mzML conversion. Only the MS2 is found in the MGF.

I have dropped the raw file, the converted MGF, and the ZIP archive of the search results here. Let me know if you have any trouble getting the files.

Cheers, Chris

hbarsnes commented 3 years ago

Hi Chris,

As far as I can tell there is an error in the given mzid file produced by MS-GF+ with regard to the use of the index element in the SpectrumIdentificationResult tag. According to the mzid documentation the index is used for "referencing peak list files with multiple spectra, i.e. MGF, PKL, merged DTA files. Index is the spectrum number in the file, starting from 0".

In your MS-GF+ mzid file the following index is used: spectrumID="index=39266". However, the mgf file provided as input only contains 39266 spectra, hence the maximum valid index should be 39265 (given that it is zero-based).

A quick look at the corresponding mgf file indicates that the spectrum they are trying to refer to for the given psm is indeed the final spectrum in the mgf file. At least the properties seem to match.

On the other hand, the following index is also used in the same file: spectrumID="index=0". So a quick fix of reducing the index by one does not solve the problem either.
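The zero-based range check described above can be written out as a small Python sketch. The helper name `out_of_range` is hypothetical; the numbers are the ones quoted in this comment.

```python
def out_of_range(indexes, spectrum_count):
    """Return the indexes that fall outside 0 .. spectrum_count - 1."""
    return [i for i in indexes if not 0 <= i < spectrum_count]

# The mzid uses both index=0 and index=39266, while the mgf is
# counted as holding 39266 spectra (valid indexes 0 to 39265).
print(out_of_range([0, 39266], 39266))  # [39266]
```

Because index 0 is valid and in use, shifting every index down by one would only move the problem to the other end of the range.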

Why or how this happens I'm not really sure, but probably a good idea to set up an issue with the MS-GF+ developers: https://github.com/MSGFPlus/msgfplus/issues?

Best regards, Harald

chrishuges commented 3 years ago

Hmm, Ok thanks Harald. I will post an issue over there and will do some more investigation myself using your built-in raw file conversion as well as some other modifications. I will let you know if I learn anything.

Chris

FarmGeek4Life commented 3 years ago

@hbarsnes Did you get a count of spectra in the MGF file by opening it up?

I downloaded and opened it; doing a search/count for "BEGIN IONS" and SeeMS both return a total of 39267 spectra.

Even MS-GF+ reports a total of 39267 spectra in the MGF: 5 were ignored due to too few points (<10), and 39262 were then searched.
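The "BEGIN IONS" count can also be scripted. A minimal Python sketch, assuming a plain-text MGF in which every spectrum starts with a BEGIN IONS line; the function name `count_spectra` and the sample data are illustrative only.

```python
import io

def count_spectra(lines):
    """Count spectra by counting BEGIN IONS lines."""
    return sum(1 for line in lines if line.strip() == "BEGIN IONS")

# Tiny made-up MGF: the second spectrum has no peak lines at all.
SAMPLE_MGF = """BEGIN IONS
TITLE=example_1
PEPMASS=500.25
100.1 10.0
END IONS
BEGIN IONS
TITLE=example_2
PEPMASS=600.30
END IONS
"""

print(count_spectra(io.StringIO(SAMPLE_MGF)))  # 2
```

For a file on disk, passing the open file handle (for example `with open(path) as handle: count_spectra(handle)`) avoids reading the whole file into memory.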

hbarsnes commented 3 years ago

@FarmGeek4Life I just checked in PeptideShaker, but I just noticed that we are indeed ignoring one spectrum when indexing the mgf file. Trying to figure out why now. But at least that should explain why the index is off.

@chrishuges I think you can therefore close the MS-GF+ issue as that is not where the issue occurs.
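Why one ignored spectrum produces an out-of-bounds index can be illustrated with a short Python sketch. It assumes, purely for illustration, an indexer that drops spectra with zero peaks; the thread only establishes that one spectrum was ignored, not how the real indexer is written, and the function name `map_after_dropping_empty` is made up.

```python
def map_after_dropping_empty(peak_counts):
    """Map each original spectrum index to its position after empty spectra are dropped."""
    mapping = {}
    kept = 0
    for original, n_peaks in enumerate(peak_counts):
        if n_peaks == 0:
            mapping[original] = None  # spectrum skipped by the indexer
        else:
            mapping[original] = kept
            kept += 1
    return mapping, kept

# Five spectra, the third one empty.
mapping, kept = map_after_dropping_empty([12, 30, 0, 25, 18])
print(kept)     # 4
print(mapping)  # {0: 0, 1: 1, 2: None, 3: 2, 4: 3}
```

The search engine still numbers the spectra 0 to 4, so its last index (4) is out of bounds for an index holding only 4 entries. That is the same pattern as "Index 39266 out of bounds for length 39266".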

FarmGeek4Life commented 3 years ago

According to SeeMS, there is one spectrum in the file with 0 data points; that's probably the one PeptideShaker is ignoring.

Details on the spectra MS-GF+ would have ignored (index: data points):

38899: 0 points
38642: 4 points
38887: 4 points
38649: 7 points
38632: 9 points
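A list like this can also be produced without SeeMS. The Python sketch below assumes a plain-text MGF where peak lines start with a digit and header lines have the KEY=VALUE form. The function name `sparse_spectra` is hypothetical, and the default threshold of 10 is taken from the MS-GF+ message about spectra with fewer than 10 peaks.

```python
import io

def sparse_spectra(lines, min_peaks=10):
    """Return (zero-based index, title, peak count) for spectra below min_peaks."""
    found = []
    index, title, peaks, inside = -1, None, 0, False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            index += 1
            title, peaks, inside = None, 0, True
        elif line == "END IONS":
            if inside and peaks < min_peaks:
                found.append((index, title, peaks))
            inside = False
        elif inside and line:
            if line.startswith("TITLE="):
                title = line[len("TITLE="):]
            elif line[0].isdigit():
                # Peak lines start with the m/z value.
                peaks += 1
    return found

# Made-up MGF with 2, 0 and 1 peaks respectively.
SAMPLE_MGF = """BEGIN IONS
TITLE=a
PEPMASS=500.25
100.1 10.0
200.2 20.0
END IONS
BEGIN IONS
TITLE=b
PEPMASS=600.30
END IONS
BEGIN IONS
TITLE=c
PEPMASS=700.35
150.5 5.0
END IONS
"""

print(sparse_spectra(io.StringIO(SAMPLE_MGF), min_peaks=2))
```

Returning the title together with the index makes it easy to look up the same spectrum in other tools.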

hbarsnes commented 3 years ago

@FarmGeek4Life Can you tell me the spectrum title of that spectrum?

FarmGeek4Life commented 3 years ago

TITLE=Spectrum_82657

EditPad and regex search is helpful (I used \.raw\n([ 0-9.]\n){0,9}END IONS)

chrishuges commented 3 years ago

Using the built-in ThermoRawFileParser in the GUI the MS-GF+ input seems to be the same:

MS-GF+ Release (v2020.07.02) (5 August 2020)
Java 15.0.1 (Oracle Corporation)
Windows 10 (amd64, version 10.0)
Loading database files...
Warning: Sequence database contains 72 counts of letter 'U', which does not correspond to an amino acid.
Warning: Sequence database contains 134 counts of letter 'X', which does not correspond to an amino acid.
Creating the suffix array indexed file... Size: 22857707
AlphabetSize: 28
Suffix creation: 54.69% complete.
Sorting suffixes... Size: 17210368
Sorting: 34.28% complete.
Sorting: 66.24% complete.
Counting number of distinct peptides in uniprotHumanCrapDec2020_concatenated_target_decoy.csarr using uniprotHumanCrapDec2020_concatenated_target_decoy.cnlcp
Counting distinct peptides: 72.19% complete.
Loading database finished (elapsed time: 13.91 sec)
Reading spectra...
Skip spectrum controllerType=0 controllerNumber=1 scan=4 since activationMethod is HCD, not CID
Skip spectrum controllerType=0 controllerNumber=1 scan=9 since activationMethod is HCD, not CID
Skip spectrum controllerType=0 controllerNumber=1 scan=39 since activationMethod is HCD, not CID
Skip spectrum controllerType=0 controllerNumber=1 scan=48 since activationMethod is HCD, not CID
Skip spectrum controllerType=0 controllerNumber=1 scan=110 since activationMethod is HCD, not CID
Skip spectrum controllerType=0 controllerNumber=1 scan=117 since activationMethod is HCD, not CID
Skip spectrum controllerType=0 controllerNumber=1 scan=170 since activationMethod is HCD, not CID
Skip spectrum controllerType=0 controllerNumber=1 scan=178 since activationMethod is HCD, not CID
Skip spectrum controllerType=0 controllerNumber=1 scan=209 since activationMethod is HCD, not CID
Skip spectrum controllerType=0 controllerNumber=1 scan=215 since activationMethod is HCD, not CID
 ...
Ignoring 0 profile spectra.
Ignoring 5 spectra having less than 10 peaks.
Reading spectra finished (elapsed time: 86.82 sec)
Using 8 threads.
Search Parameters:
    PrecursorMassTolerance: 1.0 Da
    IsotopeError: 0,0
    TargetDecoyAnalysis: false
    FragmentationMethod: CID
    Instrument: LowRes (Low-res LCQ/LTQ)
    Enzyme: Tryp
    Enzyme file: D:\chughes\software\SearchGUI-4.0.7\resources\MS-GF+\params\enzymes.txt
    Enzyme info: Added new enzyme Trypsin_(no_P_rule) with target residues RK
    Enzyme info: Added new enzyme CNBr with target residues M
    Enzyme info: Added new enzyme Pepsin_A with target residues LF
    Enzyme info: Added new enzyme Thermolysin with target residues AILMFV
    Enzyme info: Added new enzyme Lys-C_(no_P_rule) with target residues K
    Enzyme info: Added new enzyme Arg-C_(no_P_rule) with target residues R
    Enzyme info: Added new enzyme Chymotrypsin_(no_P_rule) with target residues YLFW
    Enzyme info: Added new enzyme Asp-N_(ambic) with target residues DE
    Enzyme info: Added new enzyme Arg-N with target residues R
    Enzyme info: Added new enzyme LysargiNase with target residues RK
    Protocol: TMT
    NumTolerableTermini: 2
    MinPepLength: 8
    MaxPepLength: 30
    MinCharge: 2
    MaxCharge: 4
    NumMatchesPerSpec: 10
    MaxMissedCleavages: 2
    MaxNumModsPerPeptide: 2
    ChargeCarrierMass: 1.00727649 (proton)
    MinNumPeaksPerSpectrum: 10
    NumIsoforms: 128
Post translational modifications in use:
    Fixed (static):     Carbamidomethyl on C (+57.0215)
    Fixed (static):     TMT6plex on K (+229.1629)
    Fixed (static):     TMT6plex on * at the peptide N-terminus (+229.1629)
    Variable (dynamic): Oxidation on M (+15.9949)

Spectrum 0-39261 (total: 39262)
Splitting work into 24 tasks.
Search progress: 0 / 24 tasks, 0.00%        0.00 seconds elapsed

So the same MS2 spectra are being input here. This highlights a change I should make to RawTools to implement a filter to not output these low quality MS2 spectra. Realistically an MS2 with 0 points shouldn't make it into the MGF.
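As a rough illustration of such a filter (RawTools itself is not written in Python, so this is only a sketch of the idea applied to an existing MGF), the function below drops every spectrum block with fewer than `min_peaks` peak lines. The name `filter_mgf` and the sample data are made up, and peak lines are assumed to start with a digit.

```python
def filter_mgf(lines, min_peaks=1):
    """Yield the lines of every spectrum block that has at least min_peaks peaks."""
    block, peaks, inside = [], 0, False
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.strip()
        if stripped == "BEGIN IONS":
            block, peaks, inside = [line], 0, True
        elif inside:
            block.append(line)
            if stripped == "END IONS":
                if peaks >= min_peaks:
                    yield from block
                inside = False
            elif stripped and stripped[0].isdigit():
                peaks += 1  # peak lines start with the m/z value
        else:
            yield line  # keep anything outside spectrum blocks unchanged

# Made-up MGF: one normal spectrum and one without any peaks.
SAMPLE_MGF = """BEGIN IONS
TITLE=kept
PEPMASS=500.25
100.1 10.0
END IONS
BEGIN IONS
TITLE=empty
PEPMASS=600.30
END IONS
"""

kept_lines = list(filter_mgf(SAMPLE_MGF.splitlines()))
print("\n".join(kept_lines))
```

Filtering changes the index of every spectrum after a removed one, so it would have to happen before any search engine or indexer sees the file.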

Edit - PeptideShaker passes the MS-GF+ result without issue on the ThermoRawFileParser converted file in the GUI.

hbarsnes commented 3 years ago

> Realistically an MS2 with 0 points shouldn't make it into the MGF.

Probably not. I will still look into whether I can find a workaround on our end for such files though, as we do indeed kick them out when indexing the mgf.

hbarsnes commented 3 years ago

@chrishuges I've just released new versions of both SearchGUI and PeptideShaker that should solve the problem and allow you to load results also from mgf files with empty spectra. Important: Note that you will have to manually delete the old mgf indexes first. These are the cms files located next to the mgf files (and with the same file names as the mgf files they are indexing). Please let me know if you still experience any problems with the new versions and I'll reopen the issue.
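The manual cleanup of old indexes could be scripted roughly as below. This Python sketch assumes that the index for `name.mgf` is called `name.cms` in the same folder, which is one reading of "the same file names"; the function names are hypothetical, and the demo runs on a temporary folder with made-up files. The default is a dry run that only lists what would be removed.

```python
import tempfile
from pathlib import Path

def stale_index_files(folder):
    """Return .cms files that sit next to an .mgf file with the same base name."""
    folder = Path(folder)
    return sorted(
        mgf.with_suffix(".cms")
        for mgf in folder.glob("*.mgf")
        if mgf.with_suffix(".cms").is_file()
    )

def delete_stale_indexes(folder, dry_run=True):
    """List the index files, and delete them only when dry_run is False."""
    found = stale_index_files(folder)
    for path in found:
        if not dry_run:
            path.unlink()
    return found

# Demo on a throwaway folder: only a.cms has a matching mgf file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.mgf", "a.cms", "b.mgf", "notes.cms"):
        (Path(tmp) / name).touch()
    listed = [p.name for p in delete_stale_indexes(tmp)]  # dry run
    delete_stale_indexes(tmp, dry_run=False)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(listed)
print(remaining)
```

Checking the dry-run listing before deleting is a cheap safeguard, since a cms file without a matching mgf is left alone.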

@FarmGeek4Life Thanks for the help on debugging the issue! And sorry for initially trying to blame it on MS-GF+. ;)

FarmGeek4Life commented 3 years ago

@hbarsnes You're welcome! I understand how it looked to you, looking at it through tools you often use, and I think most of us have been guilty of that at some point - especially when dealing with files that have something uncommon in them.

compomics / searchgui

java error on msgf search results #275