MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.
Other
77 stars 36 forks source link

All peptides were ranked "1" in output file #156

Closed JangHuang closed 1 month ago

JangHuang commented 1 month ago

Hello,

I used MS-GF+ on the mzML file, using the .faa file as reference, and produce the .mzid file. All files are in the following link: https://figshare.com/s/b65fc594da19f0f9347f . The raw data was produced by Bruker Impact II (Q-TOF MS/MS), and was transformed into mzML file by ProteoWizard msconvert.

However, the mzid file (HH090441864_2) ended up matching multiple peptides to one same spectrum, and they were all ranked 1. It seems to have led to some problem when I try to joint the identification data into the spectrum data (using command countIdentifications in the package Spectra). My codes were as below. The MS-GF+ was run in Mac terminal; other codes were in R Studio.

java -version 
java version "1.8.0_421"

java -Xmx3500M -jar MSGFPlus.jar -s "HH090441864_2024-09-09_2648.mzML" -d "protein.faa" -tda 0 -inst 2 -e 1 -maxMissedCleavages 2 -o HH090441864_2.mzid 
library(Spectra)
library(PSMatch)
sp<-Spectra("HH090441864_2024-09-09_2648.mzML")
id<-readPSMs("HH090441864_2.mzid")

id_filtered<-filterPSMs(id)
id_f_r<-reducePSMs(id_filtered)

sp_ident <- joinSpectraData(sp, id_f_r,
                      by.x = "spectrumId",
                      by.y = "spectrumID")

sp_ident_count <- countIdentifications(sp_ident)

#Error: BiocParallel errors
  #1 remote errors, element index: 1
  #0 unevaluated and other errors
  #first remote error:
#Error in as.vector(x, mode): coercing an AtomicList object to an atomic vector is supported only for
  #objects with top-level elements of length <= 1

Thank you in advance for any thoughts!

FarmGeek4Life commented 1 month ago

So, if you look at the SpectrumIdentificationItem entries under a single SpectrumIdentificationResult, there should only be duplicate values for 'rank' in that same SpectrumIdentificationResult if their scores are identical. This is according to the mzIdentML specification, which states for 'rank': For an MS/MS result set, this is the rank of the identification quality as scored by the search engine. 1 is the top rank. If multiple identifications have the same top score, they should all be assigned rank =1. For PMF data, the rank attribute may be meaningless and values of rank = 0 should be given.

So, if those multiple matches with rank 1 for the same spectrum have the same specEValue, then the mzid is valid, and the issue lies in the Spectra package.

JangHuang commented 1 month ago

So, if you look at the SpectrumIdentificationItem entries under a single SpectrumIdentificationResult, there should only be duplicate values for 'rank' in that same SpectrumIdentificationResult if their scores are identical. This is according to the mzIdentML specification, which states for 'rank': For an MS/MS result set, this is the rank of the identification quality as scored by the search engine. 1 is the top rank. If multiple identifications have the same top score, they should all be assigned rank =1. For PMF data, the rank attribute may be meaningless and values of rank = 0 should be given.

So, if those multiple matches with rank 1 for the same spectrum have the same specEValue, then the mzid is valid, and the issue lies in the Spectra package.

Thank you very much for the respond! Is it reasonable that I have multiple peptide matches with the exact same score? Or should I assume there's something wrong with my parameters? I also tried specifying the "-n NumMatchesPerSpec" in MSGFPlus as below:

java -Xmx3500M -jar MSGFPlus.jar -s "/Users/jjaneyellow/Desktop/CBS-PhD Program/Methodology/MS-Proteomic (Renin AGT Alb)/Proteomic/20240925_MascotSearch/HH090441864_2024-09-09_2648.mzML" -d "/Users/jjaneyellow/Desktop/CBS-PhD Program/Methodology/MS-Proteomic (Renin AGT Alb)/F.catus_Fca126_mat1.0/ncbi_dataset/data/GCF_018350175.1/protein.faa" -tda 0 -inst 2 -e 1 -maxMissedCleavages 2 -o HH090441864_2.mzid -n 1

but I got the exact same error when running countIdentifications.

FarmGeek4Life commented 1 month ago

And... you can ignore almost everything I said in that reply. Your .mzid file does not have any cases of '2 different peptides matched the same spectrum with the same score'; all of the results in that file are a single peptide match per spectrum, and therefore all results are 'rank 1', since they are the highest-scoring peptide match for that spectrum.

What you are seeing is multiple 'PeptideEvidence' entries; 'PeptideEvidence' in .mzid files is a combination of protein, location in protein, start and end residues, and modified peptide. If you were to map those entries, you would see that each one links to the same modified peptide; they are different locations and proteins where that peptide occurs.

JangHuang commented 1 month ago

I think I do have multiple match to the same spectrum?

table(table(id_filtered $spectrumID))
id_tb<- table(id_filtered$spectrumID) %>% as.data.frame() #find the scans with 4 matches
id_4 <-id_filtered[id_filtered$spectrumID == "scan=2192", ] %>% as.data.frame()
id_4[, c("modName", "modLocation")]

After reducing the PSM object, each spectrum was reduced into one row; but I was wondering where this caused problems for countIdentifications.

id_f_r<-reducePSMs(id_filtered)
table(table(id_f_r $spectrumID))
FarmGeek4Life commented 1 month ago

But, when looking at the result with spectrumID="scan=2192", there is still only one peptide match, and only one PeptideEvidenceRef, meaning that this is also a single protein match. spectrumID="scan=5549" has 2 PeptideEvidenceRef entries, and when I look at them they refer to the exact same peptide sequence, in two different proteins.

I can see you are following the example from https://www.bioconductor.org/packages/release/bioc/vignettes/PSMatch/inst/doc/Fragments.html, but I noticed one error you have: why are you trying to join the identifications from HH090441864_2024-09-09_2648.mzML with the spectral data from HH090441860_2024-09-09_2646.mzML?

FarmGeek4Life commented 1 month ago

Also, looking at the code for countIdentifications, you might need to change how you call that; there is a second parameter that has a default value of 'sequence' that you probably need to change; see https://github.com/rformassspectrometry/Spectra/blob/1bb1a1d7ee5e313cb1e62d72d2d25a360ff29392/R/countIdentifications.R

JangHuang commented 1 month ago

Thank you very much for spotting the HH860 mistake!! I've corrected the code in the post. I still encountered the same error using both mzML file and mzId file from HH090441864; the files provided in figshare link were also correct (both were from the scan HH090441864). And yes exactly, I am following the textbook from the same author: https://rformassspectrometry.github.io/book/sec-id.html

And thank you so much for providing the information about countIdentifications codes! I'm sorry if the answer is obvious, but I couldn't find a suitable parameter to substitute "sequence"... is there any parameter that should work and would make good sense?

Thank you!

FarmGeek4Life commented 1 month ago

You will need to look at the column names in the DataFrame you are merging with the spectra, and find one that you want to use, or else use a script to add one; the code states that it wants a variable/column that will either be some value (like a peptide sequence) for a match, or 'NA' for no match.

JangHuang commented 1 month ago

I tried working on this, and I think it's possible to add one column or randomly remove some of the redundant peptide matches. However, I'm not sure how legit it is to do so, given that they were all ranked 1 by MSGFPlus...

In case this helps, these were the terminal messages I saw when running MSGFPlus:

java -Xmx3500M -jar MSGFPlus.jar -s "HH090441864_2024-09-09_2648.mzML" -d "protein.faa" -tda 0 -inst 2 -e 1 -maxMissedCleavages 2 -o HH090441864_1001.mzid
MS-GF+ Release (v2024.03.26) (26 March 2024)
Java 1.8.0_421 (Oracle Corporation)
Mac OS X (aarch64, version 14.5)
Loading database files...
Warning: Sequence database contains 53 counts of letter 'U', which does not correspond to an amino acid.
Warning: Sequence database contains 1062 counts of letter 'X', which does not correspond to an amino acid.
Counting number of distinct peptides in protein.csarr using protein.cnlcp
Counting distinct peptides: 62.73% complete.
Loading database finished (elapsed time: 3.53 sec)
Reading spectra...
Opening mzML file HH090441864_2024-09-09_2648.mzML
Ignoring 0 profile spectra.
Ignoring 0 spectra having less than 10 peaks.
Reading spectra finished (elapsed time: 25.24 sec)
Using 8 threads.
Search Parameters:
    PrecursorMassTolerance: 20.0 ppm
    IsotopeError: 0,1
    TargetDecoyAnalysis: false
    FragmentationMethod: As written in the spectrum or CID if no info
    Instrument: TOF
    Enzyme: Tryp
    Protocol: Automatic
    NumTolerableTermini: 2
    IgnoreMetCleavage: false
    MinPepLength: 6
    MaxPepLength: 40
    MinCharge: 2
    MaxCharge: 3
    NumMatchesPerSpec: 1
    MaxMissedCleavages: 2
    MaxNumModsPerPeptide: 3
    ChargeCarrierMass: 1.00727649 (proton)
    MinNumPeaksPerSpectrum: 10
    NumIsoforms: 128
Post translational modifications in use:
    Fixed (static):     Carbamidomethyl on C (+57.0215)

Spectrum 0-9092 (total: 9093)
Splitting work into 24 tasks.
Search progress: 0 / 24 tasks, 0.00%        0.00 seconds elapsed
Search progress: 1 / 24 tasks, 29.30%       26.34 seconds elapsed
Search progress: 2 / 24 tasks, 31.93%       30.96 seconds elapsed
Search progress: 3 / 24 tasks, 35.29%       36.10 seconds elapsed
Search progress: 4 / 24 tasks, 38.74%       40.83 seconds elapsed
Search progress: 5 / 24 tasks, 40.09%       43.30 seconds elapsed
Search progress: 6 / 24 tasks, 40.83%       44.07 seconds elapsed
Search progress: 7 / 24 tasks, 42.34%       45.73 seconds elapsed
Search progress: 8 / 24 tasks, 45.09%       47.51 seconds elapsed
Search progress: 8 / 24 tasks, 58.45%       1.00 minutes elapsed
Search progress: 9 / 24 tasks, 61.85%       1.11 minutes elapsed
Search progress: 10 / 24 tasks, 67.28%      1.35 minutes elapsed
Search progress: 11 / 24 tasks, 71.24%      1.52 minutes elapsed
Search progress: 12 / 24 tasks, 72.17%      1.56 minutes elapsed
Search progress: 13 / 24 tasks, 75.76%      1.61 minutes elapsed
Search progress: 14 / 24 tasks, 77.33%      1.66 minutes elapsed
Search progress: 15 / 24 tasks, 80.14%      1.69 minutes elapsed
Search progress: 16 / 24 tasks, 81.32%      1.70 minutes elapsed
Search progress: 17 / 24 tasks, 95.57%      1.99 minutes elapsed
Search progress: 17 / 24 tasks, 95.71%      2.00 minutes elapsed
Search progress: 18 / 24 tasks, 97.37%      2.14 minutes elapsed
Search progress: 19 / 24 tasks, 97.99%      2.19 minutes elapsed
Search progress: 20 / 24 tasks, 98.33%      2.20 minutes elapsed
Search progress: 21 / 24 tasks, 99.07%      2.25 minutes elapsed
Search progress: 22 / 24 tasks, 99.44%      2.27 minutes elapsed
Search progress: 23 / 24 tasks, 99.77%      2.28 minutes elapsed
Search progress: 24 / 24 tasks, 100.00%     2.28 minutes elapsed
Search progress: 24 / 24 tasks, 100.00%     2.28 minutes elapsed
Writing results...
Writing results finished (elapsed time: 4.99 sec)
File: HH090441864_1001.mzid
MS-GF+ complete (total elapsed time: 167.35 sec)
JangHuang commented 1 month ago

For multiple peptides given same rank, I viewed scan 30 as an example:

>test2<-as.data.frame(id_filtered)
>test3<-filter(test2,spectrumID="scan=30")

>test3
  sequence spectrumID chargeState rank passThreshold experimentalMassToCharge calculatedMassToCharge
1  LERRFGR    scan=30           3    1          TRUE                 312.1834               311.8507
2 EFGKGIKR    scan=30           3    1          TRUE                 312.1834               312.1871
    peptideRef modNum isDecoy post pre start end DatabaseAccess DBseqLength DatabaseSeq
1  Pep_LERRFGR      0   FALSE    M   R   147 153 XP_023107941.2         321            
2 Pep_EFGKGIKR      0   FALSE    G   R   872 879 XP_044889830.1         935            
                                                                                  DatabaseDescription
1 XP_023107941.2 LOW QUALITY PROTEIN: putative testis-specific Y-encoded-like protein 3 [Felis catus]
2               XP_044889830.1 xin actin-binding repeat-containing protein 2 isoform X4 [Felis catus]
  scan.number.s. scan.start.time acquisitionNum                     spectrumFile                idFile
1             30         189.737             30 HH090441864_2024-09-09_2648.mzML HH090441864_1001.mzid
2             30         189.737             30 HH090441864_2024-09-09_2648.mzML HH090441864_1001.mzid
  MS.GF.RawScore MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue modPeptideRef modName modMass modLocation
1             20                33     8.410246e-07     9.655486          <NA>    <NA>      NA          NA
2             20                33     8.410246e-07     9.804787          <NA>    <NA>      NA          NA
  subOriginalResidue subReplacementResidue subLocation
1               <NA>                  <NA>          NA
2               <NA>                  <NA>          NA
alchemistmatt commented 1 month ago

When you ran MS-GF+, the value for NumMatchesPerSpec is set to the default, which is 1 (see https://msgfplus.github.io/msgfplus/MSGFPlus.html ). Consequently, it typically reports just one peptide for each spectrum, specifically, the one with the best score (MSGF SpecEValue closest to zero). If two (or more) peptides have the same value for MSGF SpecEValue, all of those identically scoring peptides will be reported, and all will be assigned a rank of 1. This behavior cannot be changed.

In the data you posted, there are two peptides, and each has MS.GF.SpecEValue = 8.410246e-07, as shown in this table:

sequence spectrumID chargeState rank passThreshold experimentalMassToCharge calculatedMassToCharge DatabaseAccess scan.number.s. MS.GF.RawScore MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue
LERRFGR scan=30 3 1 TRUE 312.1834 311.8507 XP_023107941.2 30 20 33 8.41E-07 9.655486
EFGKGIKR scan=30 3 1 TRUE 312.1834 312.1871 XP_044889830.1 30 20 33 8.41E-07 9.804787
JangHuang commented 1 month ago

I see!!! Thank you very much. I am surprised there can be 2 sequence matching to one spectrum with the exact same E Value! Should I assume there is something wrong with the scan? (Or can it be, for example, that my MS/MS resolution is too low?)

alchemistmatt commented 1 month ago

There is nothing wrong with the data; the algorithm simply couldn't determine which peptide is better based on the observed fragment ions. Also, note that 8.41E-07 is likely not a good score (it's not horrible, but it's not great). To double check the score, you should examine the QValue column, which is a representation of FDR. Steps:

Using options: mzid path: "F:\Temp\HH090441864_2.mzid" tsv path: "F:\Temp\HH090441864_2.tsv"

Unroll results: False Show decoy: False Single result per spectrum: False Delimited protein name list: False

No filters are in use

Wrote 7,227 results to F:\Temp\HH090441864_2.tsv

Conversion finished!


* Note: optionally use ` --unroll` to show the same PSM on multiple rows if it maps to multiple proteins
* Open HH090441864_2.tsv with Excel (or LibreOffice Calc or Google Sheets)
* The data is sorted by QValue, then SpecEValue
  * All of the data has QValue = 0, meaning your FASTA file did not have any decoy sequences
  * If you want to obtain Q Values, use the Protein Digestion Simulator to create a FASTA file that has both forward sequences, and reverse sequences, then re-run your search
  * Download from https://github.com/PNNL-Comp-Mass-Spec/Protein-Digestion-Simulator/releases
  * As this example output file shows, the decoy proteins (aka reverse sequences) have names that start with `XXX_`
    * https://github.com/PNNL-Comp-Mass-Spec/Protein-Digestion-Simulator/blob/master/ExampleData/QC_Standards_2004-01-21_decoy.fasta
* Looking again in file HH090441864_2.tsv, the confident PSMs have SpecEValue of roughly 1E-11 or smaller, though that's hard to say for sure since Q Values were not computed
* To determine how many spectra have multiple PSMs, I entered the following formula in cell R3
  * `=IF(B2=B3, "Duplicate", "")`
* I then used "Fill Down" to copy that formula to cells R4 through R7228
* Next, in cell S2, I entered the formula `=COUNTIF(R2:R7228, "Duplicate")`
  * That shows 668 spectra have more than one peptide
* However, if we only count spectra where the Spec EValue is less than 1E-11, we find that there are only 4 PSMs, out of 684 spectra, a frequency of 0.6%
JangHuang commented 1 month ago

Thank you very much for walking through this!! I re-ran the MSGFPlus with decoy database and got the Q values. With decoy database, I got 25 scans with >1 sequence matches. It looks like most of them have a high Q values.

> table(lengths(id_f_r$sequence))

   1    2 
1700   25 
> which(lengths(id_f_r$sequence) > 1) 
scan=1611 scan=2296 scan=2301 scan=2543 scan=2586  scan=322 scan=3333 scan=4104 
      121       276       279       334       339       479       506       683 
scan=4108 scan=4111 scan=4289 scan=4709 scan=4802 scan=4961 scan=5122 scan=6177 
      685       686       720       811       834       879       927      1179 
scan=6234 scan=6463 scan=6618 scan=6788 scan=6943 scan=7136 scan=7650 scan=7660 
     1198      1245      1278      1316      1367      1420      1546      1548 
 scan=804 
     1611 
> j<-which(lengths(id_f_r$sequence) > 1) 
> id_2 <- id_f_r[j,] %>% as.data.frame()
> id_2$MS.GF.PepQValue
$`scan=1611`
[1] 0.7354839

$`scan=2296`
[1] 0.00286533

$`scan=2301`
[1] 0.00286533

$`scan=2543`
[1] 0.2244898 0.7325194

$`scan=2586`
[1] 0.8282763

$`scan=322`
[1] 0.6208363

$`scan=3333`
[1] 0.7325194

$`scan=4104`
[1] 0.7819706

$`scan=4108`
[1] 0

$`scan=4111`
[1] 0

$`scan=4289`
[1] 0.3525281

$`scan=4709`
[1] 0.7964435 0.7750455

$`scan=4802`
[1] 0.679975

$`scan=4961`
[1] 0.7907869

$`scan=5122`
[1] 0.8061413 0.7500000

$`scan=6177`
[1] 0.8052044

$`scan=6234`
[1] 0.75061610 0.03722084

$`scan=6463`
[1] 0.7506161

$`scan=6618`
[1] 0.7354839 0.8272879

$`scan=6788`
[1] 0.7750455

$`scan=6943`
[1] 0.6046888

$`scan=7136`
[1] 0.7630828

$`scan=7650`
[1] 0.7907869

$`scan=7660`
[1] 0.6184593 0.6172839

$`scan=804`
[1] 0.6213934

> id_2$MS.GF.QValue
 [1] 0.536654830 0.001030928 0.001030928 0.529647200 0.682225350 0.414898220
 [7] 0.529605270 0.610359000 0.000000000 0.000000000 0.177472170 0.638776060
[13] 0.477038800 0.631447600 0.648354600 0.647137170 0.561809060 0.563022400
[19] 0.677466150 0.595163350 0.384287740 0.575487260 0.631256940 0.408746100
[25] 0.415982720

As such, I think I will try removing these 25 scans from the data all together and see how it works. Thank you very much!!

JangHuang commented 1 month ago

I removed the 25 scans with >1 peptide match and identical E values, and the issue got solved. My thanks for all the helps!!

> j<-which(lengths(id_f_r$sequence) > 1) 
> id_2 <- id_f_r[j,] %>% as.data.frame()
> id_2$MS.GF.QValue
 [1] 0.536654830 0.001030928 0.001030928 0.529647200 0.682225350 0.414898220 0.529605270
 [8] 0.610359000 0.000000000 0.000000000 0.177472170 0.638776060 0.477038800 0.631447600
[15] 0.648354600 0.647137170 0.561809060 0.563022400 0.677466150 0.595163350 0.384287740
[22] 0.575487260 0.631256940 0.408746100 0.415982720
> # most of them have PepQvalue >0.05 (=FDR >0.05)
> # Remove these scans with >1 matches
> id_f_r_unique <- id_f_r[-j,] 
> table(lengths(id_f_r_unique$sequence))

   1 
1700 
> sp_ident <- joinSpectraData(sp, id_f_r_unique,
+                       by.x = "spectrumId",
+                       by.y = "spectrumID")
> countIdentifications(sp_ident,
+                      identification = "sequence") 
rsession-arm64(56087) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
MSn data (Spectra) with 8492 spectra in a MsBackendMzR backend:
       msLevel     rtime scanIndex
     <integer> <numeric> <integer>
1            1   181.969         1
2            1   182.175         2
3            1   182.376         3
4            1   182.577         4
5            1   182.778         5
...        ...       ...       ...
8488         1   2701.40      8488
8489         2   2701.81      8489
8490         2   2702.14      8490
8491         2   2702.47      8491
8492         2   2702.81      8492
 ... 69 more variables/columns.

file(s):
HH070341484_2024-07-10_2515.d.mzML.gz