MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

no valid spectra #144

Open simonklaes opened 1 year ago

simonklaes commented 1 year ago

Hello, I am trying to analyze samples that were acquired with (a) a 120 min LC gradient and (b) a 60 min LC gradient. The spectra from (a) work well with MS-GF+. However, the spectra from (b) do not, and the search fails with the error: no valid spectra. Other search engines (OMSSA, SequestHT) work well with both (a) and (b).

The MS spectra were acquired on an Orbitrap Fusion with HCD. The RAW file was converted to mzML with msConvert using --filter "peakPicking true [1,2]". MS-GF+ was run via a Galaxy server or SearchGUI.

Standard Output:

Running MSGFPlus search...
Standard output:
Running: /gpfs1/data/galaxy_server/galaxy/database/dependencies/_conda/envs/mulled-v1-54d4756e00e3946332c980b37546f36cf77c371e77d5f3c04aceb89fbc7c9a64/bin/java -Xmx3500m -jar /gpfs1/data/galaxy_server/galaxy/database/dependencies/_conda/envs/mulled-v1-54d4756e00e3946332c980b37546f36cf77c371e77d5f3c04aceb89fbc7c9a64/share/msgf_plus-2020.08.05-0/MSGFPlus.jar -s in/Tp1090_01_mzML.mzML -o /tmp/20230106_175040_node064_162279_1/msgfplus_output.mzid -d database/DecoyDatabase_on_data_5out.fa -t 5ppm -ti 0,1 -tda 0 -m 3 -inst 1 -e 1 -protocol 0 -ntt 2 -minLength 6 -maxLength 40 -minCharge 2 -maxCharge 4 -maxMissedCleavages -1 -n 1 -addFeatures 1 -tasks 0 -thread 1 -mod /tmp/20230106_175040_node064_162279_1/msgfplus_mods.txt
MS-GF+ Release (v2020.07.02) (5 August 2020)
Java 1.8.0_265 (Azul Systems, Inc.)
Linux (amd64, version 3.10.0-1160.el7.x86_64)
Loading database files...
Creating the suffix array indexed file... Size: 77151
AlphabetSize: 28
Sorting suffixes... Size: 17210368
Counting number of distinct peptides in DecoyDatabase_on_data_5out.csarr using DecoyDatabase_on_data_5__out.cnlcp
Loading database finished (elapsed time: 0.48 sec)
Reading spectra...
Skip spectrum controllerType=0 controllerNumber=1 scan=642 since it is not centroided
Skip spectrum controllerType=0 controllerNumber=1 scan=657 since it is not centroided
Skip spectrum controllerType=0 controllerNumber=1 scan=681 since it is not centroided
Skip spectrum controllerType=0 controllerNumber=1 scan=690 since it is not centroided
Skip spectrum controllerType=0 controllerNumber=1 scan=698 since it is not centroided
Skip spectrum controllerType=0 controllerNumber=1 scan=707 since it is not centroided
Skip spectrum controllerType=0 controllerNumber=1 scan=756 since it is not centroided
Skip spectrum controllerType=0 controllerNumber=1 scan=772 since it is not centroided
Skip spectrum controllerType=0 controllerNumber=1 scan=822 since it is not centroided
Skip spectrum controllerType=0 controllerNumber=1 scan=846 since it is not centroided
...
Ignoring 2636 profile spectra.
Ignoring 0 spectra having less than 10 peaks.

Standard error:
Picked up _JAVA_OPTIONS: -Xmx46g -Xms256m
[Error] in/Tp1090_01_mzML.mzML does not have any valid spectra
Process '/gpfs1/data/galaxy_server/galaxy/database/dependencies/_conda/envs/mulled-v1-54d4756e00e3946332c980b37546f36cf77c371e77d5f3c04aceb89fbc7c9a64/bin/java' did not finish successfully (exit code: ÿ). Please check the log.

MSGFPlusAdapter took 01:02 m (wall), 0.02 s (CPU), 0.01 s (system), 0.01 s (user); Peak Memory Usage: 25 MB.

Fileinfo says:

Peak type from metadata (or estimated from data)
level 1: Centroid (Centroid)
level 2: Centroid (Centroid)

FarmGeek4Life commented 1 year ago

MS-GF+ looks for the "centroid spectrum" CV param on each scan (accession 'MS:1000127'): https://github.com/MSGFPlus/msgfplus/blob/master/src/main/java/edu/ucsd/msjava/mzml/SpectrumConverter.java#L36 The output above says that it found only profile scans in that file. Please confirm that the scans are centroided. I have not previously seen this problem when the spectra in the mzML file include the "centroid spectrum" CV param, but it could happen if for some reason the accession is 'PSI-MS:1000127' (which I have seen in mzid files but not in mzML files).
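One way to confirm what the file declares is to count, per spectrum, which of the two CV params is present. The following is a minimal sketch rather than anything taken from MS-GF+ itself: the function names are made up for illustration, it assumes the cvParam elements sit directly under each spectrum element (params pulled in through a referenceableParamGroupRef would be missed), and it relies on 'MS:1000128' being the PSI-MS accession for "profile spectrum".

```python
import xml.etree.ElementTree as ET
from collections import Counter

CENTROID = "MS:1000127"  # PSI-MS "centroid spectrum"
PROFILE = "MS:1000128"   # PSI-MS "profile spectrum"


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://psi.hupo.org/ms/mzml}'."""
    return tag.rsplit("}", 1)[-1]


def count_spectrum_types(source):
    """Count spectra by declared peak type.

    `source` is a path or a binary file-like object containing mzML.
    Only cvParam elements that are direct children of <spectrum> are
    inspected (an assumption of this sketch).
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "spectrum":
            continue
        accessions = {
            child.get("accession")
            for child in elem
            if local_name(child.tag) == "cvParam"
        }
        has_centroid = CENTROID in accessions
        has_profile = PROFILE in accessions
        if has_centroid and has_profile:
            counts["both"] += 1
        elif has_centroid:
            counts["centroid"] += 1
        elif has_profile:
            counts["profile"] += 1
        else:
            counts["neither"] += 1
        elem.clear()  # free memory; mzML files can be large
    return counts
```

Calling `count_spectrum_types("in/Tp1090_01_mzML.mzML")` on the converted file would show whether any spectra carry the profile accession, both accessions, or neither.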

simonklaes commented 1 year ago

cvParam looks fine: <cvParam cvRef="MS" accession="MS:1000127" name="centroid spectrum" value=""/>

FarmGeek4Life commented 1 year ago

If both files contain that (and do not contain a "profile spectrum" entry on the same scan), and MS-GF+ still does not work on only one of them, then we will need to see data files to determine what is happening.

simonklaes commented 1 year ago

I invited you to my private repository containing the files for reproducing the issue.

FarmGeek4Life commented 1 year ago

What is happening: MS-GF+ performs a secondary check on each MSn spectrum, based on the median PPM difference between adjacent peaks. If the median difference is less than 50 PPM, it marks the scan as not centroided. I tested one spectrum, and its median PPM difference is 41.077 PPM. The data looks very clean overall, with most of the data points clustered tightly together. In that scan there is even a run of 5 consecutive peaks with less than 20 PPM between each consecutive pair.
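The check described above can be approximated outside MS-GF+ to see which scans would be rejected. This is only a sketch under stated assumptions: the function names are hypothetical, the PPM gap is taken relative to the lower m/z of each adjacent pair, and the actual MS-GF+ code may differ in such details. Only the 50 PPM threshold and the use of the median come from the description above.

```python
from statistics import median


def median_ppm_gap(mz_values):
    """Median gap, in PPM, between adjacent peaks of one spectrum.

    Returns None when there are fewer than two peaks. The gap is
    computed relative to the lower m/z of each pair (an assumption).
    """
    mzs = sorted(mz_values)
    if len(mzs) < 2:
        return None
    gaps = [(high - low) / low * 1e6 for low, high in zip(mzs, mzs[1:])]
    return median(gaps)


def looks_profile_like(mz_values, threshold_ppm=50.0):
    """True if the median adjacent-peak gap falls below the threshold."""
    gap = median_ppm_gap(mz_values)
    return gap is not None and gap < threshold_ppm
```

A spectrum like the one tested above, with a median gap of 41.077 PPM, would be flagged by `looks_profile_like` even though its metadata says it is centroided.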

simonklaes commented 1 year ago

Thanks for the quick reply. I would really appreciate it if the minimum median difference could be set freely or the secondary check could be turned off completely.

FarmGeek4Life commented 1 year ago

I'm looking at adding a parameter to allow ignoring the result of the secondary check if the input file says the spectrum is centroided.

FarmGeek4Life commented 1 year ago

See https://github.com/MSGFPlus/msgfplus/releases/tag/v2023.01.12 The zip file contains an updated MS-GF+ jar file that supports the parameter '-allowDenseCentroidedPeaks 1'. If you run a search without that parameter and centroid spectra are ignored because of the check mentioned previously, they are reported separately and the parameter is mentioned in the output.
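For anyone scripting the search, the new parameter is simply appended to the usual arguments. The sketch below only builds the argument list and does not run anything; the helper name and its defaults are made up for illustration, and the paths in the usage note are placeholders. The -s, -d and -o options are the ones visible in the log earlier in this thread.

```python
def build_msgfplus_command(jar, spectra, database, output,
                           allow_dense_centroided=True, java_heap="3500m"):
    """Assemble an MS-GF+ command line as a list of arguments.

    Hypothetical helper: it covers only the options needed to
    illustrate '-allowDenseCentroidedPeaks 1'.
    """
    cmd = [
        "java", f"-Xmx{java_heap}", "-jar", jar,
        "-s", spectra,
        "-d", database,
        "-o", output,
    ]
    if allow_dense_centroided:
        # Accept spectra declared as centroided even if the
        # peak-density check would otherwise reject them.
        cmd += ["-allowDenseCentroidedPeaks", "1"]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, for example with `build_msgfplus_command("MSGFPlus.jar", "sample.mzML", "db.fa", "out.mzid")`.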