Closed JangHuang closed 1 month ago
So, if you look at the SpectrumIdentificationItem entries under a single SpectrumIdentificationResult, there should only be duplicate values for 'rank' in that same SpectrumIdentificationResult if their scores are identical. This is according to the mzIdentML specification, which states for 'rank': For an MS/MS result set, this is the rank of the identification quality as scored by the search engine. 1 is the top rank. If multiple identifications have the same top score, they should all be assigned rank =1. For PMF data, the rank attribute may be meaningless and values of rank = 0 should be given.
So, if those multiple matches with rank 1 for the same spectrum have the same specEValue, then the mzid is valid, and the issue lies in the Spectra package.
So, if you look at the SpectrumIdentificationItem entries under a single SpectrumIdentificationResult, there should only be duplicate values for 'rank' in that same SpectrumIdentificationResult if their scores are identical. This is according to the mzIdentML specification, which states for 'rank':
For an MS/MS result set, this is the rank of the identification quality as scored by the search engine. 1 is the top rank. If multiple identifications have the same top score, they should all be assigned rank =1. For PMF data, the rank attribute may be meaningless and values of rank = 0 should be given.
So, if those multiple matches with rank 1 for the same spectrum have the same specEValue, then the mzid is valid, and the issue lies in the Spectra package.
Thank you very much for the respond! Is it reasonable that I have multiple peptide matches with the exact same score? Or should I assume there's something wrong with my parameters? I also tried specifying the "-n NumMatchesPerSpec" in MSGFPlus as below:
java -Xmx3500M -jar MSGFPlus.jar -s "/Users/jjaneyellow/Desktop/CBS-PhD Program/Methodology/MS-Proteomic (Renin AGT Alb)/Proteomic/20240925_MascotSearch/HH090441864_2024-09-09_2648.mzML" -d "/Users/jjaneyellow/Desktop/CBS-PhD Program/Methodology/MS-Proteomic (Renin AGT Alb)/F.catus_Fca126_mat1.0/ncbi_dataset/data/GCF_018350175.1/protein.faa" -tda 0 -inst 2 -e 1 -maxMissedCleavages 2 -o HH090441864_2.mzid -n 1
but I got the exact same error when running countIdentifications.
And... you can ignore almost everything I said in that reply. Your .mzid file does not have any cases of '2 different peptides matched the same spectrum with the same score'; all of the results in that file are a single peptide match per spectrum, and therefore all results are 'rank 1', since they are the highest-scoring peptide match for that spectrum.
What you are seeing is multiple 'PeptideEvidence' entries; 'PeptideEvidence' in .mzid files is a combination of protein, location in protein, start and end residues, and modified peptide. If you were to map those entries, you would see that each one links to the same modified peptide; they are different locations and proteins where that peptide occurs.
I think I do have multiple match to the same spectrum?
table(table(id_filtered $spectrumID))
id_tb<- table(id_filtered$spectrumID) %>% as.data.frame() #find the scans with 4 matches
id_4 <-id_filtered[id_filtered$spectrumID == "scan=2192", ] %>% as.data.frame()
id_4[, c("modName", "modLocation")]
After reducing the PSM object, each spectrum was reduced into one row; but I was wondering where this caused problems for countIdentifications.
id_f_r<-reducePSMs(id_filtered)
table(table(id_f_r $spectrumID))
But, when looking at the result with spectrumID="scan=2192", there is still only one peptide match, and only one PeptideEvidenceRef, meaning that this is also a single protein match. spectrumID="scan=5549" has 2 PeptideEvidenceRef entries, and when I look at them they refer to the exact same peptide sequence, in two different proteins.
I can see you are following the example from https://www.bioconductor.org/packages/release/bioc/vignettes/PSMatch/inst/doc/Fragments.html, but I noticed one error you have: why are you trying to join the identifications from HH090441864_2024-09-09_2648.mzML
with the spectral data from HH090441860_2024-09-09_2646.mzML
?
Also, looking at the code for countIdentifications, you might need to change how you call that; there is a second parameter that has a default value of 'sequence' that you probably need to change; see https://github.com/rformassspectrometry/Spectra/blob/1bb1a1d7ee5e313cb1e62d72d2d25a360ff29392/R/countIdentifications.R
Thank you very much for spotting the HH860 mistake!! I've corrected the code in the post. I still encountered the same error using both mzML file and mzId file from HH090441864; the files provided in figshare link were also correct (both were from the scan HH090441864). And yes exactly, I am following the textbook from the same author: https://rformassspectrometry.github.io/book/sec-id.html
And thank you so much for providing the information about countIdentifications codes! I'm sorry if the answer is obvious, but I couldn't find a suitable parameter to substitute "sequence"... is there any parameter that should work and would make good sense?
Thank you!
You will need to look at the column names in the DataFrame you are merging with the spectra, and find one that you want to use, or else use a script to add one; the code states that it wants a variable/column that will either be some value (like a peptide sequence) for a match, or 'NA' for no match.
I tried working on this, and I think it's possible to add one column or randomly remove some of the redundant peptide matches. However, I'm not sure how legit it is to do so, given that they were all ranked 1 by MSGFPlus...
In case this helps, these were the terminal messages I saw when running MSGFPlus:
java -Xmx3500M -jar MSGFPlus.jar -s "HH090441864_2024-09-09_2648.mzML" -d "protein.faa" -tda 0 -inst 2 -e 1 -maxMissedCleavages 2 -o HH090441864_1001.mzid
MS-GF+ Release (v2024.03.26) (26 March 2024)
Java 1.8.0_421 (Oracle Corporation)
Mac OS X (aarch64, version 14.5)
Loading database files...
Warning: Sequence database contains 53 counts of letter 'U', which does not correspond to an amino acid.
Warning: Sequence database contains 1062 counts of letter 'X', which does not correspond to an amino acid.
Counting number of distinct peptides in protein.csarr using protein.cnlcp
Counting distinct peptides: 62.73% complete.
Loading database finished (elapsed time: 3.53 sec)
Reading spectra...
Opening mzML file HH090441864_2024-09-09_2648.mzML
Ignoring 0 profile spectra.
Ignoring 0 spectra having less than 10 peaks.
Reading spectra finished (elapsed time: 25.24 sec)
Using 8 threads.
Search Parameters:
PrecursorMassTolerance: 20.0 ppm
IsotopeError: 0,1
TargetDecoyAnalysis: false
FragmentationMethod: As written in the spectrum or CID if no info
Instrument: TOF
Enzyme: Tryp
Protocol: Automatic
NumTolerableTermini: 2
IgnoreMetCleavage: false
MinPepLength: 6
MaxPepLength: 40
MinCharge: 2
MaxCharge: 3
NumMatchesPerSpec: 1
MaxMissedCleavages: 2
MaxNumModsPerPeptide: 3
ChargeCarrierMass: 1.00727649 (proton)
MinNumPeaksPerSpectrum: 10
NumIsoforms: 128
Post translational modifications in use:
Fixed (static): Carbamidomethyl on C (+57.0215)
Spectrum 0-9092 (total: 9093)
Splitting work into 24 tasks.
Search progress: 0 / 24 tasks, 0.00% 0.00 seconds elapsed
Search progress: 1 / 24 tasks, 29.30% 26.34 seconds elapsed
Search progress: 2 / 24 tasks, 31.93% 30.96 seconds elapsed
Search progress: 3 / 24 tasks, 35.29% 36.10 seconds elapsed
Search progress: 4 / 24 tasks, 38.74% 40.83 seconds elapsed
Search progress: 5 / 24 tasks, 40.09% 43.30 seconds elapsed
Search progress: 6 / 24 tasks, 40.83% 44.07 seconds elapsed
Search progress: 7 / 24 tasks, 42.34% 45.73 seconds elapsed
Search progress: 8 / 24 tasks, 45.09% 47.51 seconds elapsed
Search progress: 8 / 24 tasks, 58.45% 1.00 minutes elapsed
Search progress: 9 / 24 tasks, 61.85% 1.11 minutes elapsed
Search progress: 10 / 24 tasks, 67.28% 1.35 minutes elapsed
Search progress: 11 / 24 tasks, 71.24% 1.52 minutes elapsed
Search progress: 12 / 24 tasks, 72.17% 1.56 minutes elapsed
Search progress: 13 / 24 tasks, 75.76% 1.61 minutes elapsed
Search progress: 14 / 24 tasks, 77.33% 1.66 minutes elapsed
Search progress: 15 / 24 tasks, 80.14% 1.69 minutes elapsed
Search progress: 16 / 24 tasks, 81.32% 1.70 minutes elapsed
Search progress: 17 / 24 tasks, 95.57% 1.99 minutes elapsed
Search progress: 17 / 24 tasks, 95.71% 2.00 minutes elapsed
Search progress: 18 / 24 tasks, 97.37% 2.14 minutes elapsed
Search progress: 19 / 24 tasks, 97.99% 2.19 minutes elapsed
Search progress: 20 / 24 tasks, 98.33% 2.20 minutes elapsed
Search progress: 21 / 24 tasks, 99.07% 2.25 minutes elapsed
Search progress: 22 / 24 tasks, 99.44% 2.27 minutes elapsed
Search progress: 23 / 24 tasks, 99.77% 2.28 minutes elapsed
Search progress: 24 / 24 tasks, 100.00% 2.28 minutes elapsed
Search progress: 24 / 24 tasks, 100.00% 2.28 minutes elapsed
Writing results...
Writing results finished (elapsed time: 4.99 sec)
File: HH090441864_1001.mzid
MS-GF+ complete (total elapsed time: 167.35 sec)
For multiple peptides given same rank, I viewed scan 30 as an example:
>test2<-as.data.frame(id_filtered)
>test3<-filter(test2,spectrumID="scan=30")
>test3
sequence spectrumID chargeState rank passThreshold experimentalMassToCharge calculatedMassToCharge
1 LERRFGR scan=30 3 1 TRUE 312.1834 311.8507
2 EFGKGIKR scan=30 3 1 TRUE 312.1834 312.1871
peptideRef modNum isDecoy post pre start end DatabaseAccess DBseqLength DatabaseSeq
1 Pep_LERRFGR 0 FALSE M R 147 153 XP_023107941.2 321
2 Pep_EFGKGIKR 0 FALSE G R 872 879 XP_044889830.1 935
DatabaseDescription
1 XP_023107941.2 LOW QUALITY PROTEIN: putative testis-specific Y-encoded-like protein 3 [Felis catus]
2 XP_044889830.1 xin actin-binding repeat-containing protein 2 isoform X4 [Felis catus]
scan.number.s. scan.start.time acquisitionNum spectrumFile idFile
1 30 189.737 30 HH090441864_2024-09-09_2648.mzML HH090441864_1001.mzid
2 30 189.737 30 HH090441864_2024-09-09_2648.mzML HH090441864_1001.mzid
MS.GF.RawScore MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue modPeptideRef modName modMass modLocation
1 20 33 8.410246e-07 9.655486 <NA> <NA> NA NA
2 20 33 8.410246e-07 9.804787 <NA> <NA> NA NA
subOriginalResidue subReplacementResidue subLocation
1 <NA> <NA> NA
2 <NA> <NA> NA
When you ran MS-GF+, the value for NumMatchesPerSpec is set to the default, which is 1 (see https://msgfplus.github.io/msgfplus/MSGFPlus.html ). Consequently, it typically reports just one peptide for each spectrum, specifically, the one with the best score (MSGF SpecEValue closest to zero). If two (or more) peptides have the same value for MSGF SpecEValue, all of those identically scoring peptides will be reported, and all will be assigned a rank of 1. This behavior cannot be changed.
In the data you posted, there are two peptides, and each has MS.GF.SpecEValue = 8.410246e-07, as shown in this table:
sequence | spectrumID | chargeState | rank | passThreshold | experimentalMassToCharge | calculatedMassToCharge | DatabaseAccess | scan.number.s. | MS.GF.RawScore | MS.GF.DeNovoScore | MS.GF.SpecEValue | MS.GF.EValue |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LERRFGR | scan=30 | 3 | 1 | TRUE | 312.1834 | 311.8507 | XP_023107941.2 | 30 | 20 | 33 | 8.41E-07 | 9.655486 |
EFGKGIKR | scan=30 | 3 | 1 | TRUE | 312.1834 | 312.1871 | XP_044889830.1 | 30 | 20 | 33 | 8.41E-07 | 9.804787 |
I see!!! Thank you very much. I am surprised there can be 2 sequence matching to one spectrum with the exact same E Value! Should I assume there is something wrong with the scan? (Or can it be, for example, that my MS/MS resolution is too low?)
There is nothing wrong with the data; the algorithm simply couldn't determine which peptide is better based on the observed fragment ions. Also, note that 8.41E-07 is likely not a good score (it's not horrible, but it's not great). To double check the score, you should examine the QValue column, which is a representation of FDR. Steps:
MzidToTsvConverter.exe HH090441864_2.mzid
Using options: mzid path: "F:\Temp\HH090441864_2.mzid" tsv path: "F:\Temp\HH090441864_2.tsv"
Unroll results: False Show decoy: False Single result per spectrum: False Delimited protein name list: False
No filters are in use
Wrote 7,227 results to F:\Temp\HH090441864_2.tsv
Conversion finished!
* Note: optionally use ` --unroll` to show the same PSM on multiple rows if it maps to multiple proteins
* Open HH090441864_2.tsv with Excel (or LibreOffice Calc or Google Sheets)
* The data is sorted by QValue, then SpecEValue
* All of the data has QValue = 0, meaning your FASTA file did not have any decoy sequences
* If you want to obtain Q Values, use the Protein Digestion Simulator to create a FASTA file that has both forward sequences, and reverse sequences, then re-run your search
* Download from https://github.com/PNNL-Comp-Mass-Spec/Protein-Digestion-Simulator/releases
* As this example output file shows, the decoy proteins (aka reverse sequences) have names that start with `XXX_`
* https://github.com/PNNL-Comp-Mass-Spec/Protein-Digestion-Simulator/blob/master/ExampleData/QC_Standards_2004-01-21_decoy.fasta
* Looking again in file HH090441864_2.tsv, the confident PSMs have SpecEValue of roughly 1E-11 or smaller, though that's hard to say for sure since Q Values were not computed
* To determine how many spectra have multiple PSMs, I entered the following formula in cell R3
* `=IF(B2=B3, "Duplicate", "")`
* I then used "Fill Down" to copy that formula to cells R4 through R7228
* Next, in cell S2, I entered the formula `=COUNTIF(R2:R7228, "Duplicate")`
* That shows 668 spectra have more than one peptide
* However, if we only count spectra where the Spec EValue is less than 1E-11, we find that there are only 4 PSMs, out of 684 spectra, a frequency of 0.6%
Thank you very much for walking through this!! I re-ran the MSGFPlus with decoy database and got the Q values. With decoy database, I got 25 scans with >1 sequence matches. It looks like most of them have a high Q values.
> table(lengths(id_f_r$sequence))
1 2
1700 25
> which(lengths(id_f_r$sequence) > 1)
scan=1611 scan=2296 scan=2301 scan=2543 scan=2586 scan=322 scan=3333 scan=4104
121 276 279 334 339 479 506 683
scan=4108 scan=4111 scan=4289 scan=4709 scan=4802 scan=4961 scan=5122 scan=6177
685 686 720 811 834 879 927 1179
scan=6234 scan=6463 scan=6618 scan=6788 scan=6943 scan=7136 scan=7650 scan=7660
1198 1245 1278 1316 1367 1420 1546 1548
scan=804
1611
> j<-which(lengths(id_f_r$sequence) > 1)
> id_2 <- id_f_r[j,] %>% as.data.frame()
> id_2$MS.GF.PepQValue
$`scan=1611`
[1] 0.7354839
$`scan=2296`
[1] 0.00286533
$`scan=2301`
[1] 0.00286533
$`scan=2543`
[1] 0.2244898 0.7325194
$`scan=2586`
[1] 0.8282763
$`scan=322`
[1] 0.6208363
$`scan=3333`
[1] 0.7325194
$`scan=4104`
[1] 0.7819706
$`scan=4108`
[1] 0
$`scan=4111`
[1] 0
$`scan=4289`
[1] 0.3525281
$`scan=4709`
[1] 0.7964435 0.7750455
$`scan=4802`
[1] 0.679975
$`scan=4961`
[1] 0.7907869
$`scan=5122`
[1] 0.8061413 0.7500000
$`scan=6177`
[1] 0.8052044
$`scan=6234`
[1] 0.75061610 0.03722084
$`scan=6463`
[1] 0.7506161
$`scan=6618`
[1] 0.7354839 0.8272879
$`scan=6788`
[1] 0.7750455
$`scan=6943`
[1] 0.6046888
$`scan=7136`
[1] 0.7630828
$`scan=7650`
[1] 0.7907869
$`scan=7660`
[1] 0.6184593 0.6172839
$`scan=804`
[1] 0.6213934
> id_2$MS.GF.QValue
[1] 0.536654830 0.001030928 0.001030928 0.529647200 0.682225350 0.414898220
[7] 0.529605270 0.610359000 0.000000000 0.000000000 0.177472170 0.638776060
[13] 0.477038800 0.631447600 0.648354600 0.647137170 0.561809060 0.563022400
[19] 0.677466150 0.595163350 0.384287740 0.575487260 0.631256940 0.408746100
[25] 0.415982720
As such, I think I will try removing these 25 scans from the data all together and see how it works. Thank you very much!!
I removed the 25 scans with >1 peptide match and identical E values, and the issue got solved. My thanks for all the helps!!
> j<-which(lengths(id_f_r$sequence) > 1)
> id_2 <- id_f_r[j,] %>% as.data.frame()
> id_2$MS.GF.QValue
[1] 0.536654830 0.001030928 0.001030928 0.529647200 0.682225350 0.414898220 0.529605270
[8] 0.610359000 0.000000000 0.000000000 0.177472170 0.638776060 0.477038800 0.631447600
[15] 0.648354600 0.647137170 0.561809060 0.563022400 0.677466150 0.595163350 0.384287740
[22] 0.575487260 0.631256940 0.408746100 0.415982720
> # most of them have PepQvalue >0.05 (=FDR >0.05)
> # Remove these scans with >1 matches
> id_f_r_unique <- id_f_r[-j,]
> table(lengths(id_f_r_unique$sequence))
1
1700
> sp_ident <- joinSpectraData(sp, id_f_r_unique,
+ by.x = "spectrumId",
+ by.y = "spectrumID")
> countIdentifications(sp_ident,
+ identification = "sequence")
rsession-arm64(56087) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
MSn data (Spectra) with 8492 spectra in a MsBackendMzR backend:
msLevel rtime scanIndex
<integer> <numeric> <integer>
1 1 181.969 1
2 1 182.175 2
3 1 182.376 3
4 1 182.577 4
5 1 182.778 5
... ... ... ...
8488 1 2701.40 8488
8489 2 2701.81 8489
8490 2 2702.14 8490
8491 2 2702.47 8491
8492 2 2702.81 8492
... 69 more variables/columns.
file(s):
HH070341484_2024-07-10_2515.d.mzML.gz
Hello,
I used MS-GF+ on the mzML file, using the .faa file as reference, and produce the .mzid file. All files are in the following link: https://figshare.com/s/b65fc594da19f0f9347f . The raw data was produced by Bruker Impact II (Q-TOF MS/MS), and was transformed into mzML file by ProteoWizard msconvert.
However, the mzid file (HH090441864_2) ended up matching multiple peptides to one same spectrum, and they were all ranked 1. It seems to have led to some problem when I try to joint the identification data into the spectrum data (using command countIdentifications in the package Spectra). My codes were as below. The MS-GF+ was run in Mac terminal; other codes were in R Studio.
Thank you in advance for any thoughts!