bigbio / quantms

Quantitative mass spectrometry workflow. Currently supports proteomics experiments with complex experimental designs for DDA-LFQ, DDA-Isobaric and DIA-LFQ quantification.
https://quantms.org
MIT License
34 stars 37 forks source link

Error ms2rescore in big dataset for one run. #447

Open ypriverol opened 4 days ago

ypriverol commented 4 days ago

Description of the bug

This command is failing:

#!/usr/bin/env bash

set -e # Exit if a tool returns a non-zero status/exit code
set -u # Treat unset variables and parameters as an error
set -o pipefail # Returns the status of the last command to exit with a non-zero status or zero if all successfully execute
set -C # No clobber - prevent output redirection from overwriting files.

quantmsutilsc ms2rescore \
    --psm_file f07531_Prot_02_R2_F08_msgf.idXML \
    --spectrum_path . \
    --ms2_tolerance 0.4 \
    --output_path f07531_Prot_02_R2_F08_msgf_ms2rescore.idXML \
    --ms2pip_model_dir null \
    --processes 372 \
    --id_decoy_pattern ^DECOY_ \
    --ms2pip_model TMT --rescoring_engine percolator --calibration_set_size 0.15 --test_fdr 0.05 --feature_generators deeplc,ms2pip \
    2>&1 | tee f07531_Prot_02_R2_F08_ms2rescore.log

cat <<-END_VERSIONS > versions.yml
"NFCORE_QUANTMS:QUANTMS:TMT:ID:PSMRESCORING:MS2RESCORE":
    MS2Rescore: $(echo $(ms2rescore --version 2>&1) | grep -oP 'MS²Rescore \(v\K[^\)]+' )
END_VERSIONS

Error here:

Error: 2024-11-25 15:33:11,160 [<module>] - Creating converter from 7 to 5
2024-11-25 15:33:11,160 [<module>] - Creating converter from 5 to 7
2024-11-25 15:33:11,160 [<module>] - Creating converter from 7 to 5
2024-11-25 15:33:11,160 [<module>] - Creating converter from 5 to 7
2024-11-25 15:33:11,207 [register_filter] - Registered filter: bshuf (/usr/local/lib/python3.10/site-packages/hdf5plugin/plugins/libh5bshuf.so)
2024-11-25 15:33:11,210 [register_filter] - Registered filter: blosc (/usr/local/lib/python3.10/site-packages/hdf5plugin/plugins/libh5blosc.so)
2024-11-25 15:33:11,212 [register_filter] - Registered filter: blosc2 (/usr/local/lib/python3.10/site-packages/hdf5plugin/plugins/libh5blosc2.so)
2024-11-25 15:33:11,213 [register_filter] - Registered filter: bzip2 (/usr/local/lib/python3.10/site-packages/hdf5plugin/plugins/libh5bzip2.so)
2024-11-25 15:33:11,216 [register_filter] - Registered filter: fcidecomp (/usr/local/lib/python3.10/site-packages/hdf5plugin/plugins/libh5fcidecomp.so)
2024-11-25 15:33:11,217 [register_filter] - Registered filter: lz4 (/usr/local/lib/python3.10/site-packages/hdf5plugin/plugins/libh5lz4.so)
2024-11-25 15:33:11,220 [register_filter] - Registered filter: sz (/usr/local/lib/python3.10/site-packages/hdf5plugin/plugins/libh5sz.so)
2024-11-25 15:33:11,232 [register_filter] - Registered filter: sz3 (/usr/local/lib/python3.10/site-packages/hdf5plugin/plugins/libh5sz3.so)
2024-11-25 15:33:11,233 [register_filter] - Registered filter: zfp (/usr/local/lib/python3.10/site-packages/hdf5plugin/plugins/libh5zfp.so)
2024-11-25 15:33:11,234 [register_filter] - init_filter not found for filter zstd: Init phase skipped.
2024-11-25 15:33:11,234 [register_filter] - Registered filter: zstd (/usr/local/lib/python3.10/site-packages/hdf5plugin/plugins/libh5zstd.so)
2024-11-25 15:33:12,843 [<module>] - Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2024-11-25 15:33:15,677 [wrapper] - matplotlib data path: /usr/local/lib/python3.10/site-packages/matplotlib/mpl-data
2024-11-25 15:33:15,683 [wrapper] - CONFIGDIR=/homes/pst_prd/.config/matplotlib
2024-11-25 15:33:15,684 [<module>] - interactive is False
2024-11-25 15:33:15,684 [<module>] - platform is linux
2024-11-25 15:33:15,815 [wrapper] - CACHEDIR=/homes/pst_prd/.cache/matplotlib
2024-11-25 15:33:15,817 [_load_fontmanager] - Using fontManager instance from /homes/pst_prd/.cache/matplotlib/fontlist-v390.json
2024-11-25 15:33:16,706 [parse_cli_arguments_to_config] - Percolator rescoring engine has been specified. Use the idXML containing rescoring features and run Percolator in a separate step.
2024-11-25 15:33:16,706 [ms2rescore] - MS²Rescore config:
2024-11-25 15:33:16,706 [ms2rescore] - {'$schema': './config_schema.json', 'ms2rescore': {'feature_generators': {'ms2pip': {'model_dir': 'null', 'model': 'TMT', 'ms2_tolerance': 0.4}, 'deeplc': {'deeplc_retrain': False, 'calibration_set_size': 0.15}}, 'rescoring_engine': {}, 'co
nfig_file': None, 'psm_file': None, 'psm_file_type': 'infer', 'psm_reader_kwargs': {}, 'spectrum_path': '.', 'output_path': 'f07531_Prot_02_R2_F08_msgf_ms2rescore.idXML', 'log_level': 'info', 'id_decoy_pattern': '^DECOY_', 'psm_id_pattern': '(.*)', 'spectrum_id_pattern': '(.*)',
 'lower_score_is_better': True, 'modification_mapping': {}, 'fixed_modifications': {}, 'processes': 48, 'rename_to_usi': False, 'fasta_file': None, 'write_report': True, 'ms2pip_model_dir': 'null', 'ms2pip_model': 'TMT', 'ms2_tolerance': 0.4, 'calibration_set_size': 0.15, 'rng':
 4711}}
Warning: OPENMS_DATA_PATH environment variable not found and no share directory was installed. Some functionality might not work as expected.
Traceback (most recent call last):
  File "/usr/local/bin/quantmsutilsc", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/quantmsutils/quantmsutilsc.py", line 38, in main
    cli()
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/quantmsutils/rescoring/ms2rescore.py", line 375, in ms2rescore
    rescore_idxml(psm_file, output_path, config)
  File "/usr/local/lib/python3.10/site-packages/quantmsutils/rescoring/ms2rescore.py", line 123, in rescore_idxml
    psm_list = reader.read_file()
  File "/usr/local/lib/python3.10/site-packages/psm_utils/io/_base_classes.py", line 46, in read_file
    return PSMList(psm_list=[psm for psm in self.__iter__()])
  File "/usr/local/lib/python3.10/site-packages/psm_utils/io/_base_classes.py", line 46, in <listcomp>
    return PSMList(psm_list=[psm for psm in self.__iter__()])
  File "/usr/local/lib/python3.10/site-packages/psm_utils/io/idxml.py", line 112, in __iter__
    yield self._parse_psm(self.protein_ids, peptide_id, peptide_hit)
  File "/usr/local/lib/python3.10/site-packages/psm_utils/io/idxml.py", line 201, in _parse_psm
    rescoring_features={
  File "/usr/local/lib/python3.10/site-packages/psm_utils/io/idxml.py", line 202, in <dictcomp>
    key: float(peptide_hit.getMetaValue(key)) for key in self.rescoring_features
TypeError: float() argument must be a string or a real number, not 'NoneType'

Command used and terminal output

No response

Relevant files

Dataset here: https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/94508cae253c72810c2dc0200e2c3e/

System information

No response

daichengxin commented 2 days ago

There is a missing search engines feature that is causing the psm utils parse bug. A example image

I don't know why is this the only PSM identification that doesn't have a search engine features? it's probably an internal bug in openms, export search engines features?

ypriverol commented 2 days ago

@timosachsenberg @jpfeuffer looks like some of the search engine scores in MSGFAdapter for some of the psms are not written properly in the idXML.

Two things:

jpfeuffer commented 2 days ago

Which feature is missing?

daichengxin commented 2 days ago

These features image

ypriverol commented 2 days ago

All search engine scores.

jpfeuffer commented 2 days ago

This is right out of msgfplusadapter? Any logs?

ypriverol commented 2 days ago

Here the log from MSGFAdapter.

Running MSGFPlus search...
Using prefix decoy string 'DECOY_'
Info: using 'Trypsin/P' as enzyme (obtained from idXML) for digestion.
Peptide identification engine: MS-GF+
Enzyme: Trypsin/P
Info: using 'full' as enzyme specificity (obtained from idXML) for digestion.
    Progress of 'Load first DB chunk':
    -- done [took 0.21 s (CPU), 0.20 s (Wall)] -- 
Building trie ... done (0s)
Compressing trie to BFS format ...
 done (0s)
Mapping 58414 peptides to 82852 proteins.
Searching with up to 3 ambiguous amino acid(s) and 0 mismatch(es)!
    Progress of 'Aho-Corasick':
    -- done [took 3.99 s (CPU), 0.55 s (Wall)] -- 
Merge took: 0.01 s (wall), 0.02 s (CPU), 0.01 s (system), 0.01 s (user)
Memory usage (Aho-Corasick): 17 MB (working set delta), 0 MB (peak working set delta)

Aho-Corasick done:
  found 69282 hits for 58414 of 58414 peptides.
Peptide hits passing enzyme filter: 69282
     ... rejected by enzyme filter: 38187
-----------------------------------
Peptide statistics

  unmatched                : 0 (0 %)
  target/decoy:
    match to target DB only: 35021 (59 %)
    match to decoy DB only : 22109 (37 %)
    match to both          : 1284 (2 %)

  mapping to proteins:
    no match (to 0 protein)         : 0
    unique match (to 1 protein)     : 53568
    non-unique match (to >1 protein): 4846
-----------------------------------
Protein statistics

  total proteins searched: 82852
  matched proteins       : 33687 (33687 new)
  matched target proteins: 18291 (54 %)
  matched decoy proteins : 15396 (45 %)
  orphaned proteins      : 0 (all removed)
-----------------------------------
MS-GF+ PSM with missing NumMatchedMainIons skipped.
StdevErrorTop7 is NaN, setting as MeanErrorTop7 instead.
MSGFPlusAdapter took 17:29 m (wall), 38.82 s (CPU), 2.07 s (system), 36.75 s (user); Peak Memory Usage: 2619 MB.
<MS-GF+ PSM with missing NumMatchedMainIons skipped.> occurred 2805 times
timosachsenberg commented 2 days ago

which OpenMS version is this?

ypriverol commented 2 days ago

The nightly version.

timosachsenberg commented 2 days ago

if you use the nightly version, then you might want to remove the PSMFeatureExtractor from the workflow. I now integrate the annotation into MSGFPlusAdapter, Comet etc. This was the relevant PR: https://github.com/OpenMS/OpenMS/pull/7323 Maybe try again without PSMFeatureExtractor? Then we can narrow it down (could also be a MSGFPlus error that we "fixed" by calling the PSMFeatureExtractor).

ypriverol commented 2 days ago

This can't be the source of the error because, we have other files that pass.

jpfeuffer commented 2 days ago

<MS-GF+ PSM with missing NumMatchedMainIons skipped.> occurred 2805 times

This does not sound good.

timosachsenberg commented 2 days ago

Apparently, we handle this here as some known MSGFPlus issue: https://github.com/OpenMS/OpenMS/blob/develop/src/openms/source/ANALYSIS/ID/PercolatorFeatureSetHelper.cpp#L21-L101 Could also be that we calculated these generic meta values for every search engine result in the past... need to investigate a bi more.

timosachsenberg commented 2 days ago

Is it possible to get the mzid file written by MSGF+? I want to directly check if the output by MSGF+

looking at some non OpenMS code: https://github.com/percolator/percolator/blob/master/src/converters/MsgfplusReader.cpp#L310-L333

it indeed seems as if these meta values should be written by MSGF+ to mzid.

ypriverol commented 2 days ago

Ok, Im missing something why Percolator is mentioned here before ms2rescore, we should not have that step before I guess.

timosachsenberg commented 2 days ago

On the OpenMS side, we typically:

  1. run a search
  2. add percolator features so percolator later knows which scores to use (as they are search engine dependent).
    • this has been in the past done by PSMFeatureExtractor
    • now this should be built into each search engine (as they already know which are relevant features)

then one can e.g. run ms2rescore to add additional features and finally run percolator to do rescoring on all features

timosachsenberg commented 2 days ago

@daichengxin or @ypriverol can you run MSGF+ from the command line (e.g., without the adapter) and upload the mzid file?

ypriverol commented 2 days ago

I have to run the msgf+ in that file with that database, give me 1h. @timosachsenberg what version of msgf+ should I run, the one failing? Can you let me know which version is that one?

timosachsenberg commented 2 days ago

Yes please. The one that we currently use.

ypriverol commented 2 days ago

This is even more confusing:

MS-GF+ Release (v2023.01.12) (12 January 2023)
Java 17.0.11-internal (N/A)
Linux (amd64, version 4.18.0-513.24.1.el8_9.x86_64)
Loading database files...
Warning: Sequence database contains 144 counts of letter 'U', which does not correspond to an amino acid.
Warning: database does not contain all standard amino acids. Probability 0.05 will be used for all amino acids.
Counting number of distinct peptides in Homo-sapiens-uniprot-reviewed-contam-entrap-decoy-20241105.csarr using Homo-sapiens-uniprot-reviewed-contam-entrap-decoy-20241105.cnlcp
Counting distinct peptides: 33.30% complete.
Counting distinct peptides: 68.54% complete.
Loading database finished (elapsed time: 6.45 sec)
Reading spectra...
Opening mzML file f07531_Prot_02_R2_F08.mzML
Skip spectrum controllerType=0 controllerNumber=1 scan=16 since activationMethod is CID, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=17 since activationMethod is HCD, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=25 since activationMethod is CID, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=26 since activationMethod is HCD, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=156 since activationMethod is CID, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=157 since activationMethod is HCD, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=159 since activationMethod is CID, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=160 since activationMethod is HCD, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=233 since activationMethod is CID, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=234 since activationMethod is HCD, not ETD
 ...
Ignoring 0 profile spectra.
Ignoring 0 spectra having less than 10 peaks.
[Error] f07531_Prot_02_R2_F08.mzML does not have any valid spectra

The command that Im using:

java -Xmx36G -jar MSGFPlus.jar \
-s f07531_Prot_02_R2_F08.mzML \
-o f07531_Prot_02_R2_F08_msgf.mzid \
-d Homo-sapiens-uniprot-reviewed-contam-entrap-decoy-20241105.fasta \
-inst 3 \
-t 30ppm \
-ti 0,1 \
-e 1 \
-ntt 2 \
-minLength 6 \
-maxLength 40 \
-minCharge 2 \
-maxCharge 4 \
-n 1 \
-m 2 \
-mod mods.txt \
-addFeatures 1 \
-thread 8 \
2>&1 | tee f07531_Prot_02_R2_F08_msgf.log

FileInfo in that mzML:

hirdparty-sif-latest.img FileInfo -in f07531_Prot_02_R2_F08.mzML
Progress of 'loading mzML':
  Progress of 'loading spectra list':
  -- done [took 38.03 s (CPU), 38.19 s (Wall)] -- 
  Progress of 'loading chromatogram list':
  -- done [took 0.00 s (CPU), 0.01 s (Wall)] -- 
-- done [took 38.07 s (CPU), 38.23 s (Wall) @ 27.44 MiB/s] -- 

Memory usage (loading MS data): 959 MB (working set delta), 963 MB (peak working set delta)

-- General information --

File name: f07531_Prot_02_R2_F08.mzML
File type: mzML

Instrument: Orbitrap Fusion
  Mass Analyzer: Fourier transform ion cyclotron resonance mass spectrometer (resolution: 0)

MS levels: 1, 2, 3
Total number of peaks: 37152315
Number of spectra: 87911

Ranges:
  retention time: 0.04 .. 10799.43 sec (180.0 min)
  mass-to-charge: 0.00 .. 1999.99
    ion mobility: <none>
       intensity: 0.00 .. 5327283200.00

Number of spectra per MS level:
  level 1: 5512
  level 2: 41202
  level 3: 41197

Peak type from metadata (or estimated from data)
  level 1: Centroid (Centroid)
  level 2: Centroid (Centroid)
  level 3: Centroid (Centroid)

Activation methods
    MS-Level 2 & CID (Collision-induced dissociation): 41202
    MS-Level 3 & HCD (beam-type collision-induced dissociation): 411743

Precursor charge distribution:
  charge 0: 41197x
  charge 2: 17284x
  charge 3: 16495x
  charge 4: 5914x
  charge 5: 1228x
  charge 6: 281x

Number of chromatograms: 1
Number of chromatographic peaks: 87911

Number of chromatograms per type: 
  base peak chromatogram:                         1

FileInfo took 39.36 s (wall), 39.18 s (CPU), 1.15 s (system), 38.03 s (user); Peak Memory Usage: 998 MB.
jpfeuffer commented 2 days ago

You are neither using fragmentMethod, nor protocol TMT. This cannot work. I thought you wanted to use the Adapter in debug mode? You need to use the same settings as in the workflow.

ypriverol commented 1 day ago

OK, the adapter in debug mode only produces the mods.txt file. I have to find the corresponding parameters now. I'm investigating.

jpfeuffer commented 1 day ago

The intermediate output will be in a temp folder. Not in the workdir probably. You can check the logs on where it is located.

timosachsenberg commented 1 day ago

-mzid_out

ypriverol commented 1 day ago

Here is the msgfplus+

msgfplus_output.mzid.gz file.

timosachsenberg commented 1 day ago

ok, as this is the file directly generated by MSGF+ it is definitely an MSGF+ issue that some hits don't have these annotations. I also don't see a clear pattern regarding scores etc. when it is not written out. Should this then be something we want to fix in MSGF+? Alternatively, I could:

ypriverol commented 1 day ago

@timosachsenberg @jpfeuffer :

I think the following roadmap is the better, taking into account that msgfplus is under lower maintenance mode:

1 - Fix in the OpenMS for those cases, because at the very end, you need that information and if it happens in another place, you will have the error again.
2 - Fix it in quantms-utils also, because it could happen and we need to validate and show a warning about the failing PSMs https://github.com/bigbio/quantms-utils/pull/36 . 3 - Report the issue to MSGF+ for future fix.

If you want, you can avoid 1 and we implement 2 and 3.

timosachsenberg commented 1 day ago

Ok I can write out some dummy values for those cases. I am pretty sure that this will not have a negative impact on rescoring.

ypriverol commented 1 day ago

I think, if the scores are not there, the PSMs shouldn't be writing.

timosachsenberg commented 1 day ago

ok let's write them out for now. They can be easily filtered with IDFilter if needed and this let's us evaluate if we loose some good ids. If not I can just skip them

ypriverol commented 1 day ago

We will not introduce a new step IDFilter. But I guess when we pass it to PercolatorAdapter they will removed.

jpfeuffer commented 1 day ago

You can just have the filter in the same step. No need for additional steps.

ypriverol commented 1 day ago

@jpfeuffer, what do you mean? Do you mean for msgf+ add an extra command rigth after MSGFAdapter to filter out the nonsense psms?

If we decide to implement this solution what means, in the commandline, this is the help page of IDFILTER:

IDFilter -- Filters results from protein or peptide identification engines based on different criteria.
Full documentation: http://www.openms.de/doxygen/nightly/html/TOPP_IDFilter.html
Version: 3.4.0-pre-nightly-2024-11-27 Nov 28 2024, 02:39:09, Revision: 017cde2
To cite OpenMS:
 + Pfeuffer, J., Bielow, C., Wein, S. et al.. OpenMS 3 enables reproducible analysis of large-scale mass spec
   trometry data. Nat Methods (2024). doi:10.1038/s41592-024-02197-7.

Usage:
  IDFilter <options>

Options (mandatory options marked with '*'):
  -in <file>*                                               Input file  (valid formats: 'idXML', 'consensusXM
                                                            L')
  -out <file>*                                              Output file  (valid formats: 'idXML', 'consensusX
                                                            ML')

Filtering by precursor attributes (RT, m/z, charge, length):
  -precursor:rt [min]:[max]                                 Retention time range to extract. (default: ':')
  -precursor:mz [min]:[max]                                 Mass-to-charge range to extract. (default: ':')
  -precursor:length [min]:[max]                             Keep only peptide hits with a sequence length in 
                                                            this range. (default: ':')
  -precursor:charge [min]:[max]                             Keep only peptide hits with charge states in this
                                                             range. (default: ':')

Filtering by peptide/protein score.:
  -score:psm <score>                                        The score which should be reached by a peptide 
                                                            hit to be kept. (use 'NAN' to disable this filter
                                                            ) (default: 'NaN')
  -score:peptide <score>                                    The score which should be reached by a peptide 
                                                            hit to be kept.  (use 'NAN' to disable this filte
                                                            r) (default: 'NaN')
  -score:protein <score>                                    The score which should be reached by a protein 
                                                            hit to be kept. All proteins are filtered based 
                                                            on their singleton scores irrespective of groupin
                                                            g. Use in combination with 'delete_unreferenced_p
                                                            eptide_hits' to remove affected peptides. (use 
                                                            'NAN' to disable this filter) (default: 'NaN')
  -score:proteingroup <score>                               The score which should be reached by a protein 
                                                            group to be kept. Performs group level score filt
                                                            ering (including groups of single proteins). Use 
                                                            in combination with 'delete_unreferenced_peptide_
                                                            hits' to remove affected peptides. (use 'NAN' to 
                                                            disable this filter) (default: 'NaN')

Filtering by whitelisting (only peptides/proteins from a given set can pass):
  -whitelist:proteins <file>                                Filename of a FASTA file containing protein seque
                                                            nces.
                                                            All peptides that are not referencing a protein 
                                                            in this file are removed.
                                                            All proteins whose accessions are not present in 
                                                            this file are removed. (valid formats: 'fasta')
  -whitelist:protein_accessions <accessions>                All peptides that do not reference at least one 
                                                            of the provided protein accession are removed.
                                                            Only proteins of the provided list are retained.
  -whitelist:peptides <file>                                Only peptides with the same sequence and modifica
                                                            tion assignment as any peptide in this file are 
                                                            kept. Use with 'whitelist:ignore_modifications' 
                                                            to only compare by sequence.
                                                             (valid formats: 'idXML')

Filtering by blacklisting (only peptides/proteins NOT present in a given set can pass):
  -blacklist:proteins <file>                                Filename of a FASTA file containing protein seque
                                                            nces.
                                                            All peptides that are referencing a protein in 
                                                            this file are removed.
                                                            All proteins whose accessions are present in this
                                                             file are removed. (valid formats: 'fasta')
  -blacklist:protein_accessions <accessions>                All peptides that reference at least one of the 
                                                            provided protein accession are removed.
                                                            Only proteins not in the provided list are retain
                                                            ed.
  -blacklist:peptides <file>                                Peptides with the same sequence and modification 
                                                            assignment as any peptide in this file are filter
                                                            ed out. Use with 'blacklist:ignore_modifications'
                                                             to only compare by sequence.
                                                             (valid formats: 'idXML')

This filter option removes peptide hits which are not in the list of in silico peptides generated by the rule
s specified below:
  -in_silico_digestion:fasta <file>                         Fasta protein sequence database. (valid formats: 
                                                            'fasta')

This filter option removes peptide hits which do not confirm with the allowed missed cleavages specified belo
w.:
  -missed_cleavages:number_of_missed_cleavages [min]:[max]  Range of allowed missed cleavages in the peptide 
                                                            sequences.
                                                            For example: 0:1 -> peptides with two or more 
                                                            missed cleavages will be removed,
                                                            0:0 -> peptides with any missed cleavages will 
                                                            be removed (default: ':')

Filtering best hits per spectrum (for peptides) or from proteins:
  -best:n_spectra <integer>                                 Keep only the 'n' best spectra (i.e., PeptideIden
                                                            tifications) (for n > 0). A spectrum is considere
                                                            d better if it has a higher scoring peptide hit 
                                                            than the other spectrum. (default: '0') (min: 
                                                            '0')
  -best:n_peptide_hits <integer>                            Keep only the 'n' highest scoring peptide hits 
                                                            per spectrum (for n > 0). (default: '0') (min: 
                                                            '0')
  -best:spectrum_per_peptide <String>                       Keep one spectrum per peptide. Value determines 
                                                            if same sequence but different charges or modific
                                                            ations are treated as separate peptides or the 
                                                            same peptide. (default: false = filter disabled).
                                                             (default: 'false') (valid: 'false', 'sequence', 
                                                            'sequence+charge', 'sequence+modification', 'sequ
                                                            ence+charge+modification')
  -best:n_protein_hits <integer>                            Keep only the 'n' highest scoring protein hits 
                                                            (for n > 0). (default: '0') (min: '0')
  -best:strict                                              Keep only the highest scoring peptide hit.
                                                            Similar to n_peptide_hits=1, but if there are 
                                                            ties between two or more highest scoring hits, 
                                                            none are kept.

  -var_mods                                                 Keep only peptide hits with variable modification
                                                            s (as defined in the 'SearchParameters' section 
                                                            of the input file).
  -remove_shared_peptides                                   Only peptides matching exactly one protein are 
                                                            kept. Remember that isoforms count as different 
                                                            proteins!
  -keep_unreferenced_protein_hits                           Proteins not referenced by a peptide are retained
                                                             in the IDs.
  -remove_decoys                                            Remove proteins according to the information in 
                                                            the user parameters. Usually used in combination 
                                                            with 'delete_unreferenced_peptide_hits'.
  -delete_unreferenced_peptide_hits                         Peptides not referenced by any protein are delete
                                                            d in the IDs. Usually used in combination with 
                                                            'score:protein' or 'thresh:prot'.

Common TOPP options:
  -ini <file>                                               Use the given TOPP INI file
  -threads <n>                                              Sets the number of threads allowed to be used by 
                                                            the TOPP tool (default: '1')
  -write_ini <file>                                         Writes the default configuration file
  --help                                                    Shows options
  --helphelp                                                Shows all options (including advanced)
timosachsenberg commented 1 day ago

Probably something like:

-remove_peptide_hits_by_metavalue [NumMatchedMainIons eq 0]

right @jpfeuffer

ypriverol commented 1 day ago

@timosachsenberg @jpfeuffer can you take a look at this PR, we are also validating it in the ms2rescore, just in case in the future something happen similarly: https://github.com/bigbio/quantms-utils/pull/36

timosachsenberg commented 1 day ago

I merged the fix for MSGF+ here https://github.com/OpenMS/OpenMS/pull/7713

ypriverol commented 1 day ago

@timosachsenberg what will be the parameter for idfilter, if we added it in the msgf process step.

ypriverol commented 12 hours ago

@timosachsenberg @jpfeuffer I have re-run the dataset with the latest version of MSGF+ 202403.. Here the results:

Current version in OpenMS 2023: https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/94508cae253c72810c2dc0200e2c3e/msgfplus_output-2023.mzid.gz

Latest version MSGF+ 2024: https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/94508cae253c72810c2dc0200e2c3e/msgfplus_output-2024.mzid.gz

Can you double check if for the mzId from the latest version the problems remains?

timosachsenberg commented 12 hours ago

both have the issue. I am creating the container with the fix on OpenMS side.

Update: container should be updated now

ypriverol commented 12 hours ago

Ok, Do you want to report the issue to MSGF+ or should I do it? The conda package is still not available, my point is the deployment of the conda nightly.

timosachsenberg commented 11 hours ago

yes please report. If you are familiar with the code you might also give it a quick look why these values are not written out