MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

Issue with spectrum indexing #113

Closed chrishuges closed 3 years ago

chrishuges commented 3 years ago

Describe the bug: This is an issue I originally reported on the SearchGUI GitHub page here. After processing an MGF input created using RawTools, there seems to be an indexing error in specific files when the last entry of the MGF generates a match.

To Reproduce: Details are on the SearchGUI issue, but you can download the raw file, MGF input, and search result from here. I am running the command-line version of SearchGUI (version 4.0.4) with MS-GF+. The SearchCLI input and output for this command, including version details, are below:

ms-gf+ command: 
/gsc/software/linux-x86_64-centos7/jdk-14.0.2/bin/java -Xmx120g -jar /projects/ptx_analysis/chughes/software/SearchGUI-4.0.4/resources/MS-GF+/MSGFPlus.jar -s /projects/ptx_results/OtherDataSets/dataset20201217_ccleProteomicsPmid31978347/dataProcessing_Prot_01/f04538_Prot_01_01.raw.mgf -d /projects/ptx_results/OtherDataSets/dataset20201217_ccleProteomicsPmid31978347/dataProcessing_Prot_01/uniprotHumanCrapTargetDecoyDec2020.fasta -o /projects/ptx_results/OtherDataSets/dataset20201217_ccleProteomicsPmid31978347/dataProcessing_Prot_01/.SearchGUI_temp/f04538_Prot_01_01.raw.msgf.mzid -t 1.0Da -tda 0 -mod /projects/ptx_analysis/chughes/software/SearchGUI-4.0.4/resources/MS-GF+/params/Mods.txt -numMods 2 -minCharge 2 -maxCharge 4 -inst 0 -thread 32 -m 0 -e 1 -ntt 2 -protocol 4 -minLength 8 -maxLength 30 -n 10 -addFeatures 0 -maxMissedCleavages 2 -ti 0,1 

Sat Dec 19 16:00:15 PST 2020 Processing f04538_Prot_01_01.raw.mgf with MS-GF+.

MS-GF+ Release (v2020.07.02) (5 August 2020)
Java 14.0.2 (Oracle Corporation)
Linux (amd64, version 3.10.0-693.5.2.el7.x86_64)
Loading database files...
Warning: Sequence database contains 72 counts of letter 'U', which does not correspond to an amino acid.
Warning: Sequence database contains 134 counts of letter 'X', which does not correspond to an amino acid.
Creating the suffix array indexed file... Size: 22857707
AlphabetSize: 28
Suffix creation: 48.12% complete.
Suffix creation: 87.06% complete.
Sorting suffixes... Size: 17210368
Sorting: 39.51% complete.
Sorting: 74.95% complete.
Counting number of distinct peptides in uniprotHumanCrapTargetDecoyDec2020.csarr using uniprotHumanCrapTargetDecoyDec2020.cnlcp
Loading database finished (elapsed time: 14.56 sec)
Reading spectra...
Ignoring 0 profile spectra.
Ignoring 5 spectra having less than 10 peaks.
Reading spectra finished (elapsed time: 30.46 sec)
Using 32 threads.
Search Parameters:
    PrecursorMassTolerance: 1.0 Da
    IsotopeError: 0,0
    TargetDecoyAnalysis: false
    FragmentationMethod: As written in the spectrum or CID if no info
    Instrument: LowRes (Low-res LCQ/LTQ)
    Enzyme: Tryp
    Enzyme file: /projects/ptx_analysis/chughes/software/SearchGUI-4.0.4/resources/MS-GF+/params/enzymes.txt
    Enzyme info: Added new enzyme Trypsin_(no_P_rule) with target residues RK
    Enzyme info: Added new enzyme CNBr with target residues M
    Enzyme info: Added new enzyme Pepsin_A with target residues LF
    Enzyme info: Added new enzyme Thermolysin with target residues AILMFV
    Enzyme info: Added new enzyme Lys-C_(no_P_rule) with target residues K
    Enzyme info: Added new enzyme Arg-C_(no_P_rule) with target residues R
    Enzyme info: Added new enzyme Chymotrypsin_(no_P_rule) with target residues YLFW
    Enzyme info: Added new enzyme Asp-N_(ambic) with target residues DE
    Enzyme info: Added new enzyme Arg-N with target residues R
    Enzyme info: Added new enzyme LysargiNase with target residues RK
    Protocol: TMT
    NumTolerableTermini: 2
    MinPepLength: 8
    MaxPepLength: 30
    MinCharge: 2
    MaxCharge: 4
    NumMatchesPerSpec: 10
    MaxMissedCleavages: 2
    MaxNumModsPerPeptide: 2
    ChargeCarrierMass: 1.00727649 (proton)
    MinNumPeaksPerSpectrum: 10
    NumIsoforms: 128
Post translational modifications in use:
    Fixed (static):     Carbamidomethyl on C (+57.0215)
    Fixed (static):     TMT6plex on K (+229.1629)
    Fixed (static):     TMT6plex on * at the peptide N-terminus (+229.1629)
    Variable (dynamic): Oxidation on M (+15.9949)

Spectrum 0-39261 (total: 39262)
Splitting work into 96 tasks.
Search progress: 0 / 96 tasks, 0.00%        0.01 seconds elapsed
Search progress: 0 / 96 tasks, 4.51%        1.00 minutes elapsed
Search progress: 1 / 96 tasks, 9.91%        1.16 minutes elapsed
Search progress: 2 / 96 tasks, 10.46%       1.17 minutes elapsed
Search progress: 3 / 96 tasks, 10.62%       1.17 minutes elapsed
Search progress: 4 / 96 tasks, 10.74%       1.18 minutes elapsed
Search progress: 5 / 96 tasks, 10.92%       1.18 minutes elapsed
Search progress: 6 / 96 tasks, 18.77%       1.27 minutes elapsed
Search progress: 7 / 96 tasks, 19.62%       1.30 minutes elapsed
Search progress: 8 / 96 tasks, 21.11%       1.32 minutes elapsed
Search progress: 9 / 96 tasks, 22.74%       1.34 minutes elapsed
Search progress: 10 / 96 tasks, 24.85%      1.37 minutes elapsed
Search progress: 11 / 96 tasks, 25.95%      1.38 minutes elapsed
Search progress: 12 / 96 tasks, 29.49%      1.44 minutes elapsed
Search progress: 13 / 96 tasks, 38.09%      1.59 minutes elapsed
Search progress: 14 / 96 tasks, 38.38%      1.60 minutes elapsed
Search progress: 15 / 96 tasks, 38.46%      1.60 minutes elapsed
Search progress: 16 / 96 tasks, 38.59%      1.61 minutes elapsed
Search progress: 17 / 96 tasks, 38.87%      1.61 minutes elapsed
Search progress: 18 / 96 tasks, 39.15%      1.62 minutes elapsed
Search progress: 19 / 96 tasks, 40.36%      1.64 minutes elapsed
Search progress: 20 / 96 tasks, 40.75%      1.64 minutes elapsed
Search progress: 21 / 96 tasks, 40.85%      1.64 minutes elapsed
Search progress: 22 / 96 tasks, 40.98%      1.65 minutes elapsed
Search progress: 23 / 96 tasks, 41.06%      1.65 minutes elapsed
Search progress: 23 / 96 tasks, 40.45%      1.65 minutes elapsed
Search progress: 25 / 96 tasks, 41.60%      1.66 minutes elapsed
Search progress: 26 / 96 tasks, 41.85%      1.66 minutes elapsed
Search progress: 27 / 96 tasks, 41.91%      1.66 minutes elapsed
Search progress: 28 / 96 tasks, 42.07%      1.67 minutes elapsed
Search progress: 29 / 96 tasks, 42.24%      1.67 minutes elapsed
Search progress: 30 / 96 tasks, 42.36%      1.67 minutes elapsed
Search progress: 31 / 96 tasks, 42.78%      1.69 minutes elapsed
Search progress: 32 / 96 tasks, 43.40%      1.71 minutes elapsed
Search progress: 33 / 96 tasks, 44.83%      1.77 minutes elapsed
Search progress: 34 / 96 tasks, 44.98%      1.77 minutes elapsed
Search progress: 35 / 96 tasks, 45.06%      1.78 minutes elapsed
Search progress: 36 / 96 tasks, 45.29%      1.79 minutes elapsed
Search progress: 37 / 96 tasks, 45.36%      1.79 minutes elapsed
Search progress: 38 / 96 tasks, 45.44%      1.80 minutes elapsed
Search progress: 39 / 96 tasks, 45.66%      1.82 minutes elapsed
Search progress: 40 / 96 tasks, 45.87%      1.84 minutes elapsed
Search progress: 41 / 96 tasks, 46.27%      1.85 minutes elapsed
Search progress: 42 / 96 tasks, 46.38%      1.86 minutes elapsed
Search progress: 43 / 96 tasks, 46.48%      1.86 minutes elapsed
Search progress: 44 / 96 tasks, 46.54%      1.87 minutes elapsed
Search progress: 44 / 96 tasks, 47.11%      2.00 minutes elapsed
Search progress: 45 / 96 tasks, 48.17%      2.13 minutes elapsed
Search progress: 46 / 96 tasks, 50.74%      2.39 minutes elapsed
Search progress: 47 / 96 tasks, 50.80%      2.39 minutes elapsed
Search progress: 48 / 96 tasks, 53.33%      2.52 minutes elapsed
Search progress: 49 / 96 tasks, 55.77%      2.64 minutes elapsed
Search progress: 50 / 96 tasks, 58.19%      2.74 minutes elapsed
Search progress: 51 / 96 tasks, 58.82%      2.77 minutes elapsed
Search progress: 52 / 96 tasks, 59.16%      2.79 minutes elapsed
Search progress: 53 / 96 tasks, 59.40%      2.80 minutes elapsed
Search progress: 54 / 96 tasks, 59.80%      2.81 minutes elapsed
Search progress: 55 / 96 tasks, 62.62%      2.92 minutes elapsed
Search progress: 56 / 96 tasks, 65.89%      2.99 minutes elapsed
Search progress: 56 / 96 tasks, 66.28%      3.00 minutes elapsed
Search progress: 57 / 96 tasks, 70.69%      3.08 minutes elapsed
Search progress: 58 / 96 tasks, 71.71%      3.11 minutes elapsed
Search progress: 59 / 96 tasks, 76.09%      3.20 minutes elapsed
Search progress: 60 / 96 tasks, 76.84%      3.24 minutes elapsed
Search progress: 61 / 96 tasks, 76.97%      3.25 minutes elapsed
Search progress: 62 / 96 tasks, 77.56%      3.27 minutes elapsed
Search progress: 63 / 96 tasks, 77.92%      3.28 minutes elapsed
Search progress: 64 / 96 tasks, 80.29%      3.36 minutes elapsed
Search progress: 65 / 96 tasks, 80.37%      3.36 minutes elapsed
Search progress: 66 / 96 tasks, 80.52%      3.37 minutes elapsed
Search progress: 67 / 96 tasks, 81.47%      3.40 minutes elapsed
Search progress: 68 / 96 tasks, 81.54%      3.40 minutes elapsed
Search progress: 69 / 96 tasks, 81.59%      3.40 minutes elapsed
Search progress: 70 / 96 tasks, 81.93%      3.41 minutes elapsed
Search progress: 71 / 96 tasks, 82.05%      3.41 minutes elapsed
Search progress: 72 / 96 tasks, 84.98%      3.49 minutes elapsed
Search progress: 73 / 96 tasks, 85.37%      3.51 minutes elapsed
Search progress: 74 / 96 tasks, 86.13%      3.52 minutes elapsed
Search progress: 75 / 96 tasks, 86.89%      3.54 minutes elapsed
Search progress: 76 / 96 tasks, 88.61%      3.60 minutes elapsed
Search progress: 77 / 96 tasks, 89.91%      3.64 minutes elapsed
Search progress: 78 / 96 tasks, 89.98%      3.64 minutes elapsed
Search progress: 79 / 96 tasks, 91.80%      3.69 minutes elapsed
Search progress: 80 / 96 tasks, 94.43%      3.77 minutes elapsed
Search progress: 81 / 96 tasks, 96.18%      3.82 minutes elapsed
Search progress: 82 / 96 tasks, 96.35%      3.83 minutes elapsed
Search progress: 83 / 96 tasks, 97.15%      3.86 minutes elapsed
Search progress: 84 / 96 tasks, 97.35%      3.86 minutes elapsed
Search progress: 85 / 96 tasks, 98.41%      3.91 minutes elapsed
Search progress: 86 / 96 tasks, 98.54%      3.92 minutes elapsed
Search progress: 87 / 96 tasks, 98.61%      3.92 minutes elapsed
Search progress: 88 / 96 tasks, 98.72%      3.93 minutes elapsed
Search progress: 89 / 96 tasks, 99.17%      3.99 minutes elapsed
Search progress: 89 / 96 tasks, 99.22%      4.00 minutes elapsed
Search progress: 90 / 96 tasks, 99.43%      4.03 minutes elapsed
Search progress: 91 / 96 tasks, 99.53%      4.04 minutes elapsed
Search progress: 92 / 96 tasks, 99.72%      4.08 minutes elapsed
Search progress: 93 / 96 tasks, 99.81%      4.09 minutes elapsed
Search progress: 94 / 96 tasks, 99.88%      4.10 minutes elapsed
Search progress: 95 / 96 tasks, 99.93%      4.10 minutes elapsed
Search progress: 96 / 96 tasks, 100.00%     4.13 minutes elapsed
Search progress: 96 / 96 tasks, 100.00%     4.13 minutes elapsed
Writing results...
Writing results finished (elapsed time: 67.65 sec)
File: /projects/ptx_results/OtherDataSets/dataset20201217_ccleProteomicsPmid31978347/dataProcessing_Prot_01/.SearchGUI_temp/f04538_Prot_01_01.raw.msgf.mzid
MS-GF+ complete (total elapsed time: 345.89 sec)

Additional context: Happy to test other versions or provide more details/files if needed.

Thanks! Chris

FarmGeek4Life commented 3 years ago

See my comment on the SearchGUI issue. I don't think the issue is with MS-GF+: a text editor's "count matches" for "BEGIN IONS", SeeMS (included with ProteoWizard), and the MS-GF+ console output (39262 spectra searched plus 5 ignored for having fewer than 10 peaks) all agree that the MGF file contains a total of 39267 MS2 spectra, which means a zero-based index of 39266 would be valid.
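For anyone wanting to repeat the "BEGIN IONS" count without a text editor, here is a minimal Python sketch. The function name `count_mgf_spectra` and the two-spectrum sample are hypothetical and only for illustration. It assumes each spectrum in the MGF starts with a line consisting solely of `BEGIN IONS`, and that spectrum indices are zero-based, as the "Spectrum 0-39261" line in the console output suggests.

```python
import io


def count_mgf_spectra(lines):
    """Count spectra by counting lines that consist solely of 'BEGIN IONS'."""
    return sum(1 for line in lines if line.strip() == "BEGIN IONS")


# Tiny in-memory stand-in for an MGF file (made-up peaks, not the issue's data).
sample = io.StringIO(
    "BEGIN IONS\nTITLE=first\nPEPMASS=500.0\n100.0 1.0\nEND IONS\n"
    "BEGIN IONS\nTITLE=second\nPEPMASS=600.0\n200.0 2.0\nEND IONS\n"
)

total = count_mgf_spectra(sample)
last_valid_index = total - 1  # zero-based indexing assumed
print(f"spectra: {total}, last valid index: {last_valid_index}")
```

For a real file, pass an open file handle instead of the sample, for example `with open(path) as fh: count_mgf_spectra(fh)`, where `path` points at the MGF in question.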

chrishuges commented 3 years ago

Yes, my mistake. I should have checked the actual MS2 count in the MGF file. Thank you for looking into this so quickly, but I guess this can be closed!
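The same check can be made from the console output alone. The sketch below is an illustration, not part of MS-GF+ or SearchGUI, and `reconcile_spectrum_counts` is a hypothetical helper. It assumes the "total" on the "Spectrum 0-N" line counts only the spectra that were kept, and that the spectra reported as ignored still occupy index positions in the MGF, which is what the numbers in this thread imply.

```python
import re


def reconcile_spectrum_counts(log_text):
    """Return (searched, ignored, total_in_file, last_valid_index) from log text."""
    # e.g. "Spectrum 0-39261 (total: 39262)"
    searched = int(
        re.search(r"Spectrum \d+-\d+ \(total: (\d+)\)", log_text).group(1)
    )
    # e.g. "Ignoring 0 profile spectra." / "Ignoring 5 spectra having less than 10 peaks."
    ignored = sum(
        int(n) for n in re.findall(r"Ignoring (\d+) (?:profile )?spectra", log_text)
    )
    total_in_file = searched + ignored
    return searched, ignored, total_in_file, total_in_file - 1


# Lines copied from the console output quoted in this issue.
log_excerpt = """\
Ignoring 0 profile spectra.
Ignoring 5 spectra having less than 10 peaks.
Spectrum 0-39261 (total: 39262)
"""

searched, ignored, total_in_file, last_valid_index = reconcile_spectrum_counts(log_excerpt)
print(searched, ignored, total_in_file, last_valid_index)
```

With the lines from this issue, the searched and ignored counts add up to the 39267 spectra found by counting "BEGIN IONS", so an index of 39266 falls within the file.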