Open ypriverol opened 4 days ago
There is a missing search engines feature that is causing the psm utils parse bug. A example
I don't know why is this the only PSM identification that doesn't have a search engine features? it's probably an internal bug in openms, export search engines features?
@timosachsenberg @jpfeuffer looks like some of the search engine scores in MSGFAdapter for some of the psms are not written properly in the idXML.
Two things:
Which feature is missing?
These features
All search engine scores.
This is right out of msgfplusadapter? Any logs?
Here the log from MSGFAdapter.
Running MSGFPlus search...
Using prefix decoy string 'DECOY_'
Info: using 'Trypsin/P' as enzyme (obtained from idXML) for digestion.
Peptide identification engine: MS-GF+
Enzyme: Trypsin/P
Info: using 'full' as enzyme specificity (obtained from idXML) for digestion.
Progress of 'Load first DB chunk':
-- done [took 0.21 s (CPU), 0.20 s (Wall)] --
Building trie ... done (0s)
Compressing trie to BFS format ...
done (0s)
Mapping 58414 peptides to 82852 proteins.
Searching with up to 3 ambiguous amino acid(s) and 0 mismatch(es)!
Progress of 'Aho-Corasick':
-- done [took 3.99 s (CPU), 0.55 s (Wall)] --
Merge took: 0.01 s (wall), 0.02 s (CPU), 0.01 s (system), 0.01 s (user)
Memory usage (Aho-Corasick): 17 MB (working set delta), 0 MB (peak working set delta)
Aho-Corasick done:
found 69282 hits for 58414 of 58414 peptides.
Peptide hits passing enzyme filter: 69282
... rejected by enzyme filter: 38187
-----------------------------------
Peptide statistics
unmatched : 0 (0 %)
target/decoy:
match to target DB only: 35021 (59 %)
match to decoy DB only : 22109 (37 %)
match to both : 1284 (2 %)
mapping to proteins:
no match (to 0 protein) : 0
unique match (to 1 protein) : 53568
non-unique match (to >1 protein): 4846
-----------------------------------
Protein statistics
total proteins searched: 82852
matched proteins : 33687 (33687 new)
matched target proteins: 18291 (54 %)
matched decoy proteins : 15396 (45 %)
orphaned proteins : 0 (all removed)
-----------------------------------
MS-GF+ PSM with missing NumMatchedMainIons skipped.
StdevErrorTop7 is NaN, setting as MeanErrorTop7 instead.
MSGFPlusAdapter took 17:29 m (wall), 38.82 s (CPU), 2.07 s (system), 36.75 s (user); Peak Memory Usage: 2619 MB.
<MS-GF+ PSM with missing NumMatchedMainIons skipped.> occurred 2805 times
which OpenMS version is this?
The nightly version.
if you use the nightly version, then you might want to remove the PSMFeatureExtractor from the workflow. I now integrate the annotation into MSGFPlusAdapter, Comet etc. This was the relevant PR: https://github.com/OpenMS/OpenMS/pull/7323 Maybe try again without PSMFeatureExtractor? Then we can narrow it down (could also be a MSGFPlus error that we "fixed" by calling the PSMFeatureExtractor).
This can't be the source of the error because, we have other files that pass.
<MS-GF+ PSM with missing NumMatchedMainIons skipped.> occurred 2805 times
This does not sound good.
Apparently, we handle this here as some known MSGFPlus issue: https://github.com/OpenMS/OpenMS/blob/develop/src/openms/source/ANALYSIS/ID/PercolatorFeatureSetHelper.cpp#L21-L101 Could also be that we calculated these generic meta values for every search engine result in the past... need to investigate a bi more.
Is it possible to get the mzid file written by MSGF+? I want to directly check if the output by MSGF+
looking at some non OpenMS code: https://github.com/percolator/percolator/blob/master/src/converters/MsgfplusReader.cpp#L310-L333
it indeed seems as if these meta values should be written by MSGF+ to mzid.
Ok, Im missing something why Percolator is mentioned here before ms2rescore, we should not have that step before I guess.
On the OpenMS side, we typically:
then one can e.g. run ms2rescore to add additional features and finally run percolator to do rescoring on all features
@daichengxin or @ypriverol can you run MSGF+ from the command line (e.g., without the adapter) and upload the mzid file?
I have to run the msgf+ in that file with that database, give me 1h. @timosachsenberg what version of msgf+ should I run, the one failing? Can you let me know which version is that one?
Yes please. The one that we currently use.
This is even more confusing:
MS-GF+ Release (v2023.01.12) (12 January 2023)
Java 17.0.11-internal (N/A)
Linux (amd64, version 4.18.0-513.24.1.el8_9.x86_64)
Loading database files...
Warning: Sequence database contains 144 counts of letter 'U', which does not correspond to an amino acid.
Warning: database does not contain all standard amino acids. Probability 0.05 will be used for all amino acids.
Counting number of distinct peptides in Homo-sapiens-uniprot-reviewed-contam-entrap-decoy-20241105.csarr using Homo-sapiens-uniprot-reviewed-contam-entrap-decoy-20241105.cnlcp
Counting distinct peptides: 33.30% complete.
Counting distinct peptides: 68.54% complete.
Loading database finished (elapsed time: 6.45 sec)
Reading spectra...
Opening mzML file f07531_Prot_02_R2_F08.mzML
Skip spectrum controllerType=0 controllerNumber=1 scan=16 since activationMethod is CID, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=17 since activationMethod is HCD, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=25 since activationMethod is CID, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=26 since activationMethod is HCD, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=156 since activationMethod is CID, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=157 since activationMethod is HCD, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=159 since activationMethod is CID, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=160 since activationMethod is HCD, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=233 since activationMethod is CID, not ETD
Skip spectrum controllerType=0 controllerNumber=1 scan=234 since activationMethod is HCD, not ETD
...
Ignoring 0 profile spectra.
Ignoring 0 spectra having less than 10 peaks.
[Error] f07531_Prot_02_R2_F08.mzML does not have any valid spectra
The command that Im using:
java -Xmx36G -jar MSGFPlus.jar \
-s f07531_Prot_02_R2_F08.mzML \
-o f07531_Prot_02_R2_F08_msgf.mzid \
-d Homo-sapiens-uniprot-reviewed-contam-entrap-decoy-20241105.fasta \
-inst 3 \
-t 30ppm \
-ti 0,1 \
-e 1 \
-ntt 2 \
-minLength 6 \
-maxLength 40 \
-minCharge 2 \
-maxCharge 4 \
-n 1 \
-m 2 \
-mod mods.txt \
-addFeatures 1 \
-thread 8 \
2>&1 | tee f07531_Prot_02_R2_F08_msgf.log
FileInfo in that mzML:
hirdparty-sif-latest.img FileInfo -in f07531_Prot_02_R2_F08.mzML
Progress of 'loading mzML':
Progress of 'loading spectra list':
-- done [took 38.03 s (CPU), 38.19 s (Wall)] --
Progress of 'loading chromatogram list':
-- done [took 0.00 s (CPU), 0.01 s (Wall)] --
-- done [took 38.07 s (CPU), 38.23 s (Wall) @ 27.44 MiB/s] --
Memory usage (loading MS data): 959 MB (working set delta), 963 MB (peak working set delta)
-- General information --
File name: f07531_Prot_02_R2_F08.mzML
File type: mzML
Instrument: Orbitrap Fusion
Mass Analyzer: Fourier transform ion cyclotron resonance mass spectrometer (resolution: 0)
MS levels: 1, 2, 3
Total number of peaks: 37152315
Number of spectra: 87911
Ranges:
retention time: 0.04 .. 10799.43 sec (180.0 min)
mass-to-charge: 0.00 .. 1999.99
ion mobility: <none>
intensity: 0.00 .. 5327283200.00
Number of spectra per MS level:
level 1: 5512
level 2: 41202
level 3: 41197
Peak type from metadata (or estimated from data)
level 1: Centroid (Centroid)
level 2: Centroid (Centroid)
level 3: Centroid (Centroid)
Activation methods
MS-Level 2 & CID (Collision-induced dissociation): 41202
MS-Level 3 & HCD (beam-type collision-induced dissociation): 411743
Precursor charge distribution:
charge 0: 41197x
charge 2: 17284x
charge 3: 16495x
charge 4: 5914x
charge 5: 1228x
charge 6: 281x
Number of chromatograms: 1
Number of chromatographic peaks: 87911
Number of chromatograms per type:
base peak chromatogram: 1
FileInfo took 39.36 s (wall), 39.18 s (CPU), 1.15 s (system), 38.03 s (user); Peak Memory Usage: 998 MB.
You are neither using fragmentMethod, nor protocol TMT. This cannot work. I thought you wanted to use the Adapter in debug mode? You need to use the same settings as in the workflow.
OK, the adapter in debug mode only produces the mods.txt file. I have to find the corresponding parameters now. I'm investigating.
The intermediate output will be in a temp folder. Not in the workdir probably. You can check the logs on where it is located.
-mzid_out
Here is the msgfplus+
msgfplus_output.mzid.gz file.
ok, as this is the file directly generated by MSGF+ it is definitely an MSGF+ issue that some hits don't have these annotations. I also don't see a clear pattern regarding scores etc. when it is not written out. Should this then be something we want to fix in MSGF+? Alternatively, I could:
@timosachsenberg @jpfeuffer :
I think the following roadmap is the better, taking into account that msgfplus is under lower maintenance mode:
1 - Fix in the OpenMS for those cases, because at the very end, you need that information and if it happens in another place, you will have the error again.
2 - Fix it in quantms-utils also, because it could happen and we need to validate and show a warning about the failing PSMs https://github.com/bigbio/quantms-utils/pull/36 .
3 - Report the issue to MSGF+ for future fix.
If you want, you can avoid 1 and we implement 2 and 3.
Ok I can write out some dummy values for those cases. I am pretty sure that this will not have a negative impact on rescoring.
I think, if the scores are not there, the PSMs shouldn't be writing.
ok let's write them out for now. They can be easily filtered with IDFilter if needed and this let's us evaluate if we loose some good ids. If not I can just skip them
We will not introduce a new step IDFilter. But I guess when we pass it to PercolatorAdapter
they will removed.
You can just have the filter in the same step. No need for additional steps.
@jpfeuffer, what do you mean? Do you mean for msgf+ add an extra command rigth after MSGFAdapter to filter out the nonsense psms?
If we decide to implement this solution what means, in the commandline, this is the help page of IDFILTER:
IDFilter -- Filters results from protein or peptide identification engines based on different criteria.
Full documentation: http://www.openms.de/doxygen/nightly/html/TOPP_IDFilter.html
Version: 3.4.0-pre-nightly-2024-11-27 Nov 28 2024, 02:39:09, Revision: 017cde2
To cite OpenMS:
+ Pfeuffer, J., Bielow, C., Wein, S. et al.. OpenMS 3 enables reproducible analysis of large-scale mass spec
trometry data. Nat Methods (2024). doi:10.1038/s41592-024-02197-7.
Usage:
IDFilter <options>
Options (mandatory options marked with '*'):
-in <file>* Input file (valid formats: 'idXML', 'consensusXM
L')
-out <file>* Output file (valid formats: 'idXML', 'consensusX
ML')
Filtering by precursor attributes (RT, m/z, charge, length):
-precursor:rt [min]:[max] Retention time range to extract. (default: ':')
-precursor:mz [min]:[max] Mass-to-charge range to extract. (default: ':')
-precursor:length [min]:[max] Keep only peptide hits with a sequence length in
this range. (default: ':')
-precursor:charge [min]:[max] Keep only peptide hits with charge states in this
range. (default: ':')
Filtering by peptide/protein score.:
-score:psm <score> The score which should be reached by a peptide
hit to be kept. (use 'NAN' to disable this filter
) (default: 'NaN')
-score:peptide <score> The score which should be reached by a peptide
hit to be kept. (use 'NAN' to disable this filte
r) (default: 'NaN')
-score:protein <score> The score which should be reached by a protein
hit to be kept. All proteins are filtered based
on their singleton scores irrespective of groupin
g. Use in combination with 'delete_unreferenced_p
eptide_hits' to remove affected peptides. (use
'NAN' to disable this filter) (default: 'NaN')
-score:proteingroup <score> The score which should be reached by a protein
group to be kept. Performs group level score filt
ering (including groups of single proteins). Use
in combination with 'delete_unreferenced_peptide_
hits' to remove affected peptides. (use 'NAN' to
disable this filter) (default: 'NaN')
Filtering by whitelisting (only peptides/proteins from a given set can pass):
-whitelist:proteins <file> Filename of a FASTA file containing protein seque
nces.
All peptides that are not referencing a protein
in this file are removed.
All proteins whose accessions are not present in
this file are removed. (valid formats: 'fasta')
-whitelist:protein_accessions <accessions> All peptides that do not reference at least one
of the provided protein accession are removed.
Only proteins of the provided list are retained.
-whitelist:peptides <file> Only peptides with the same sequence and modifica
tion assignment as any peptide in this file are
kept. Use with 'whitelist:ignore_modifications'
to only compare by sequence.
(valid formats: 'idXML')
Filtering by blacklisting (only peptides/proteins NOT present in a given set can pass):
-blacklist:proteins <file> Filename of a FASTA file containing protein seque
nces.
All peptides that are referencing a protein in
this file are removed.
All proteins whose accessions are present in this
file are removed. (valid formats: 'fasta')
-blacklist:protein_accessions <accessions> All peptides that reference at least one of the
provided protein accession are removed.
Only proteins not in the provided list are retain
ed.
-blacklist:peptides <file> Peptides with the same sequence and modification
assignment as any peptide in this file are filter
ed out. Use with 'blacklist:ignore_modifications'
to only compare by sequence.
(valid formats: 'idXML')
This filter option removes peptide hits which are not in the list of in silico peptides generated by the rule
s specified below:
-in_silico_digestion:fasta <file> Fasta protein sequence database. (valid formats:
'fasta')
This filter option removes peptide hits which do not confirm with the allowed missed cleavages specified belo
w.:
-missed_cleavages:number_of_missed_cleavages [min]:[max] Range of allowed missed cleavages in the peptide
sequences.
For example: 0:1 -> peptides with two or more
missed cleavages will be removed,
0:0 -> peptides with any missed cleavages will
be removed (default: ':')
Filtering best hits per spectrum (for peptides) or from proteins:
-best:n_spectra <integer> Keep only the 'n' best spectra (i.e., PeptideIden
tifications) (for n > 0). A spectrum is considere
d better if it has a higher scoring peptide hit
than the other spectrum. (default: '0') (min:
'0')
-best:n_peptide_hits <integer> Keep only the 'n' highest scoring peptide hits
per spectrum (for n > 0). (default: '0') (min:
'0')
-best:spectrum_per_peptide <String> Keep one spectrum per peptide. Value determines
if same sequence but different charges or modific
ations are treated as separate peptides or the
same peptide. (default: false = filter disabled).
(default: 'false') (valid: 'false', 'sequence',
'sequence+charge', 'sequence+modification', 'sequ
ence+charge+modification')
-best:n_protein_hits <integer> Keep only the 'n' highest scoring protein hits
(for n > 0). (default: '0') (min: '0')
-best:strict Keep only the highest scoring peptide hit.
Similar to n_peptide_hits=1, but if there are
ties between two or more highest scoring hits,
none are kept.
-var_mods Keep only peptide hits with variable modification
s (as defined in the 'SearchParameters' section
of the input file).
-remove_shared_peptides Only peptides matching exactly one protein are
kept. Remember that isoforms count as different
proteins!
-keep_unreferenced_protein_hits Proteins not referenced by a peptide are retained
in the IDs.
-remove_decoys Remove proteins according to the information in
the user parameters. Usually used in combination
with 'delete_unreferenced_peptide_hits'.
-delete_unreferenced_peptide_hits Peptides not referenced by any protein are delete
d in the IDs. Usually used in combination with
'score:protein' or 'thresh:prot'.
Common TOPP options:
-ini <file> Use the given TOPP INI file
-threads <n> Sets the number of threads allowed to be used by
the TOPP tool (default: '1')
-write_ini <file> Writes the default configuration file
--help Shows options
--helphelp Shows all options (including advanced)
Probably something like:
-remove_peptide_hits_by_metavalue [NumMatchedMainIons eq 0]
right @jpfeuffer
@timosachsenberg @jpfeuffer can you take a look at this PR, we are also validating it in the ms2rescore, just in case in the future something happen similarly: https://github.com/bigbio/quantms-utils/pull/36
I merged the fix for MSGF+ here https://github.com/OpenMS/OpenMS/pull/7713
@timosachsenberg what will be the parameter for idfilter, if we added it in the msgf process step.
@timosachsenberg @jpfeuffer I have re-run the dataset with the latest version of MSGF+ 202403.. Here the results:
Current version in OpenMS 2023: https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/94508cae253c72810c2dc0200e2c3e/msgfplus_output-2023.mzid.gz
Latest version MSGF+ 2024: https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/94508cae253c72810c2dc0200e2c3e/msgfplus_output-2024.mzid.gz
Can you double check if for the mzId from the latest version the problems remains?
both have the issue. I am creating the container with the fix on OpenMS side.
Update: container should be updated now
Ok, Do you want to report the issue to MSGF+ or should I do it? The conda package is still not available, my point is the deployment of the conda nightly.
yes please report. If you are familiar with the code you might also give it a quick look why these values are not written out
Description of the bug
This command is failing:
Error here:
Command used and terminal output
No response
Relevant files
Dataset here: https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/94508cae253c72810c2dc0200e2c3e/
System information
No response