UWPR / Comet

A tandem mass spectrometry (MS/MS) sequence database search tool.
https://uwpr.github.io/Comet/
Apache License 2.0

Problem with scan numbers #23

Closed wsnoble closed 2 years ago

wsnoble commented 2 years ago

I am trying to run Comet on the attached MGF file, but it's having trouble with parsing the scan numbers. I guess it's because they have unusual formatting ("SCANS[0]=XXXX"). But even if Comet assigns its own scan numbers, it seems like it's re-using numbers because I'm seeing 148 PSMs with scan number 1, followed by a large number for scan 2, etc. Any idea what's going on?

Here is the log:

 Comet version "2022.01 rev. 1 (11cb28f)"

 Search start:  08/03/2022, 06:05:51 AM
 - Input file: fromjack/export_Mzidentml_F134282.mgf
   - Load spectra: 15022
     - Search progress:   
     - Post analysis:  done
   - Load spectra: 12620
     - Search progress:  
     - Post analysis:  done
 Search end:    08/03/2022, 06:17:00 AM, 11m:9s

comet.params.txt export_Mzidentml_F134282.mgf.gz export_Mzidentml_F134282.pin.gz

mhoopmann commented 2 years ago

I think I know what that unusual format is: If the MGF is compiled from multiple runs, that would be the index of the run. Normally when there is only a single run, that index isn't specified. Don't know why it was here, but it should be dealt with properly. I'll start fiddling with the MSToolkit to properly handle these cases.
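To illustrate the format described above, here is a minimal Python sketch of splitting an MGF header line into a key, an optional bracketed index, and a value. It assumes the `[n]` suffix is a run index, as suggested in this comment. The `parse_header` helper and its regular expression are hypothetical and are not MSToolkit code.

```python
import re

# Hypothetical helper, not MSToolkit code: matches KEY=value and KEY[n]=value.
# The "[n]" part is assumed to be an optional run index.
_HEADER = re.compile(r"^([A-Za-z_]+)(?:\[(\d+)\])?=(.*)$")


def parse_header(line):
    """Return (key, index, value), or None if the line is not a header line."""
    match = _HEADER.match(line.strip())
    if match is None:
        return None  # peak line, BEGIN IONS, END IONS, or blank line
    key, index, value = match.groups()
    return key.upper(), (int(index) if index is not None else None), value


print(parse_header("SCANS[0]=11643"))
print(parse_header("SCANS=11643"))
```

Both calls yield the same key and value; only the index differs (`0` versus `None`), so a reader that ignores the index could treat the two forms identically.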

jke000 commented 2 years ago

Thanks Mike. I was in the middle of composing a reply that I just deleted when you replied. Glad this issue is in good hands.

btw, there are multiple entries in this mgf file with the same SCANS line. For example "SCANS[0]=11643" appears twice in the file with slightly different PEPMASS masses. This just means that even if this unusual format were handled, the scan numbers in this particular file wouldn't all be unique. (I think it's perfectly fine that the output simply reports the scan number specified in the input file even if the scan numbers aren't unique; modify the input file if duplicate scan numbers aren't desired.)

BEGIN IONS
CHARGE=2+
PEPMASS=464.14355 11275.042
RAWSCANS[0]=sn11643
RTINSECONDS[0]=2618.2839
SCANS[0]=11643
TITLE=153: Scan 11643 (rt=43.6381) [\\romnas.ugent.be\storage\Proteomics\5911\H03892_5911_EXT392_ArnesenLab_TEST_SCX_RT_KO.raw]
210.498360 95.7
361.025340 149.5
431.083450 38.57
615.322000 60.72
242.111860 37.47
303.326080 44.29
408.637900 32.75
626.542120 36.86
198.945340 29.8
374.426890 35.8
684.879750 34.72
335.005230 33.79
END IONS

BEGIN IONS
CHARGE=2+
PEPMASS=464.23579 18546.72
RAWSCANS[0]=sn11643
RTINSECONDS[0]=2618.2839
SCANS[0]=11643
TITLE=153: Scan 11643 (rt=43.6381) [\\romnas.ugent.be\storage\Proteomics\5911\H03892_5911_EXT392_ArnesenLab_TEST_SCX_RT_KO.raw]
210.498360 95.7
361.025340 149.5
431.083450 38.57
615.322000 60.72
242.111860 37.47
303.326080 44.29
408.637900 32.75
626.542120 36.86
198.945340 29.8
374.426890 35.8
684.879750 34.72
335.005230 33.79
END IONS
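
For anyone who wants to check an input file for repeated scan numbers before searching, a rough Python sketch follows. The `duplicate_scans` helper is hypothetical and is not part of Comet or MSToolkit. It assumes each SCANS line holds a single integer, with or without the bracketed index, and it skips anything else, such as a scan range.

```python
import re
from collections import Counter

# Matches "SCANS=123" and "SCANS[0]=123". RAWSCANS lines do not match
# because the pattern is anchored at the start of the line.
_SCANS = re.compile(r"^SCANS(?:\[\d+\])?=(\d+)\s*$")


def duplicate_scans(lines):
    """Return {scan_number: count} for scan numbers seen more than once."""
    counts = Counter()
    for line in lines:
        match = _SCANS.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return {scan: n for scan, n in counts.items() if n > 1}


# Header lines taken from the two entries shown above (peak lists omitted).
example = """BEGIN IONS
PEPMASS=464.14355 11275.042
RAWSCANS[0]=sn11643
SCANS[0]=11643
END IONS
BEGIN IONS
PEPMASS=464.23579 18546.72
RAWSCANS[0]=sn11643
SCANS[0]=11643
END IONS
""".splitlines()

print(duplicate_scans(example))  # {11643: 2}
```

The function accepts any iterable of lines, so an open file handle can be passed directly; for a gzipped file, `gzip.open(path, "rt")` provides one.
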

jke000 commented 2 years ago

Incorporated MSToolkit update to fix this issue in release 2022.01.2.