compomics / ThermoRawFileParser

Thermo RAW file parser that runs on Linux/Mac and all other platforms that support Mono
Apache License 2.0

Error in the latest build of TRFP #187

Closed · ypriverol closed this issue 1 week ago

ypriverol commented 1 week ago

While testing quantms (https://github.com/bigbio/quantms/issues/432), we found an error with some of the raw files. FileInfo output for the file converted with TRFP 1.3.6:

Progress of 'loading mzML':
  Progress of 'loading spectra list':
  -- done [took 14.11 s (CPU), 14.10 s (Wall)] --
  Progress of 'loading chromatogram list':
  -- done [took 0.00 s (CPU), 0.00 s (Wall)] --
-- done [took 14.12 s (CPU), 14.11 s (Wall) @ 48.98 MiB/s] --
Memory usage (loading MS data): 853 MB (working set delta), 855 MB (peak working set delta)
-- General information --
File name: UPS1_125amol_R2.mzML
File type: mzML
Instrument: LTQ Orbitrap Velos
  Mass Analyzer: Fourier transform ion cyclotron resonance mass spectrometer (resolution: 0)
MS levels: 1, 2
Total number of peaks: 49814125
Number of spectra: 44115
Ranges:
  retention time: 0.33 .. 9299.35 sec (155.0 min)
  mass-to-charge: 0.00 .. 1999.96
    ion mobility: <none>
       intensity: 1.00 .. 137152720.00
Number of spectra per MS level:
  level 1: 7302
  level 2: 36813
Peak type from metadata (or estimated from data)
  level 1: Centroid (Centroid)
  level 2: Centroid (Centroid)
Activation methods
    MS-Level 2 & CID (Collision-induced dissociation): 36813
Precursor charge distribution:
  charge 2: 20840x
  charge 3: 13151x
  charge 4: 2427x
  charge 5: 347x
  charge 6: 48x
Number of chromatograms: 1
Number of chromatographic peaks: 44115
Number of chromatograms per type:
  base peak chromatogram:                         1
FileInfo took 14.59 s (wall), 28.19 s (CPU), 0.59 s (system), 27.60 s (user); Peak Memory Usage: 891 MB.

The same file converted with the latest version of TRFP:
Progress of 'loading mzML':
  Progress of 'loading spectra list':
  -- done [took 13.72 s (CPU), 13.73 s (Wall)] --
  Progress of 'loading chromatogram list':
  -- done [took 0.00 s (CPU), 0.00 s (Wall)] --
-- done [took 13.73 s (CPU), 13.75 s (Wall) @ 50.78 MiB/s] --
Memory usage (loading MS data): 853 MB (working set delta), 855 MB (peak working set delta)
-- General information --
File name: UPS1_125amol_R2.mzML
File type: mzML
Instrument: LTQ Orbitrap Velos
  Mass Analyzer: Fourier transform ion cyclotron resonance mass spectrometer (resolution: 0)
MS levels: 1, 2
Total number of peaks: 49837600
Number of spectra: 44115
Ranges:
  retention time: 0.33 .. 9299.35 sec (155.0 min)
  mass-to-charge: 0.00 .. 75550803599127773984755715932160.00
    ion mobility: <none>
       intensity: 0.00 .. 306005508541549913785241933185024.00
Number of spectra per MS level:
  level 1: 7302
  level 2: 36813
Peak type from metadata (or estimated from data)
  level 1: Centroid (Centroid)
  level 2: Centroid (Centroid)
Activation methods
    MS-Level 2 & CID (Collision-induced dissociation): 36813
Precursor charge distribution:
  charge 2: 20840x
  charge 3: 13151x
  charge 4: 2427x
  charge 5: 347x
  charge 6: 48x
Number of chromatograms: 1
Number of chromatographic peaks: 44115
Number of chromatograms per type:
  base peak chromatogram:                         1
FileInfo took 14.24 s (wall), 27.46 s (CPU), 0.62 s (system), 26.84 s (user); Peak Memory Usage: 891 MB.
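
To pinpoint which spectra carry the bogus values, one option is to scan the converted mzML for m/z values outside a plausible instrument range. A minimal sketch, assuming pyteomics is available and using the file name from the FileInfo output above (the threshold is an arbitrary choice, not a TRFP or OpenMS setting):

```python
# Sketch: list spectra whose maximum m/z exceeds a plausible instrument range.
# Assumes pyteomics is installed; MZ_MAX is an arbitrary sanity threshold.
from pyteomics import mzml

MZ_MAX = 10000.0  # generous upper bound for an LTQ Orbitrap Velos scan range

with mzml.read("UPS1_125amol_R2.mzML") as reader:
    for spectrum in reader:
        mz = spectrum["m/z array"]
        if len(mz) and float(mz.max()) > MZ_MAX:
            print(spectrum["id"], float(mz.min()), float(mz.max()))
```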
caetera commented 1 week ago

@ypriverol These are the files that produce an OutOfMemory error. Did I understand this right?

These are the two files in PRIDE:

https://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/UPS1_125amol_R1.raw
https://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/UPS1_125amol_R2.raw

ypriverol commented 1 week ago

Yes @caetera . Thanks a lot for looking into this.

jpfeuffer commented 1 week ago

Thanks! Just to clarify: it affects only the second file. The problem is that, if you look at the output Yasset posted, the OpenMS mzML parser parses at least one spectrum with an incredibly high m/z value, so high, in fact, that it is probably the result of reading an uninitialized memory location. It could be that the annotated length of the m/z array no longer matches the actual length of the spectrum to be decoded, or something even stranger.

Note also: the first output in the first post of this issue is from TRFP 1.3.6, while the second (faulty) output is from the latest version.

The out-of-memory error is a downstream consequence of this: some algorithms allocate memory proportional to the m/z range of an mzML file.
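
As a hypothetical back-of-the-envelope illustration (the fixed bin width is an assumption for the example, not how any particular OpenMS algorithm is implemented), gridding the m/z axis shows why the bogus maximum blows up such allocations:

```python
# Hypothetical illustration (not OpenMS code): an algorithm that grids the m/z
# axis at a fixed bin width needs memory proportional to the m/z range.
bin_width = 0.01  # assumed bin width in Th

sane_bins = (1999.96 - 0.0) / bin_width        # range from the TRFP 1.3.6 output
faulty_bins = (7.555080e31 - 0.0) / bin_width  # range from the latest output

print(f"{sane_bins:.0f} bins")    # ~200,000 bins: harmless
print(f"{faulty_bins:.3e} bins")  # ~7.6e33 bins: allocation fails downstream
```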

jpfeuffer commented 1 week ago

Maybe also important: note the different number of peaks that are parsed, while the number of spectra stays the same.
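
One way to see where the extra peaks enter is to diff the per-spectrum peak counts of the two conversions. A minimal sketch, assuming pyteomics and using placeholder names for the 1.3.6 and latest-version mzML files:

```python
# Sketch: report the first spectrum whose peak count differs between the two
# conversions of the same RAW file (file names are placeholders).
from pyteomics import mzml

with mzml.read("UPS1_125amol_R2_trfp_1.3.6.mzML") as old, \
     mzml.read("UPS1_125amol_R2_trfp_latest.mzML") as new:
    for spec_old, spec_new in zip(old, new):
        n_old = len(spec_old["m/z array"])
        n_new = len(spec_new["m/z array"])
        if n_old != n_new:
            print(spec_old["id"], n_old, n_new)
            break
```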

caetera commented 1 week ago

I had a chance to look into it.

The mzML file provided here (https://github.com/bigbio/quantms/issues/432) does indeed have a scan, 34628, with very large m/z values [1.968968e-19 -- 7.555080e+31]; this is the likely cause of the error.

However, when I downloaded the RAW file from PRIDE and tried to reproduce the error in TRFP, I could not. The corresponding scan is fine [352.092743 -- 1799.879639], as are all the others.

I can see in the mzML file that the SHA-1 checksum recorded for your RAW file is <cvParam cvRef="MS" accession="MS:1000569" value="7653a7116752cc168f9b7890c80fa4ab3edfea31" name="SHA-1" />; however, for the file I got from PRIDE, TRFP returns <cvParam cvRef="MS" accession="MS:1000569" value="8e16697ebbc3962b09a90385579fe79552a7d98c" name="SHA-1" />.

Could it be that the RAW file used for the conversion got corrupted? Could you please try to download the file again and reprocess it?
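
For reference, a quick way to check whether a local copy matches the file that was originally converted is to recompute the RAW file's SHA-1 and compare it with the checksum embedded in the mzML. A minimal sketch using only the Python standard library (file names are placeholders):

```python
# Sketch: compare the SHA-1 of a downloaded RAW file with the MS:1000569
# (SHA-1) cvParam that the converter wrote into the mzML's sourceFile entry.
import hashlib
import re

def sha1_of(path, chunk_size=1 << 20):
    """Stream the file so large RAW files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def recorded_raw_sha1(mzml_path):
    """Return the value of the first MS:1000569 (SHA-1) cvParam in the mzML."""
    with open(mzml_path, "r", errors="ignore") as handle:
        for line in handle:
            if "MS:1000569" in line:
                match = re.search(r'value="([0-9a-fA-F]{40})"', line)
                if match:
                    return match.group(1).lower()
    return None

print("downloaded RAW:  ", sha1_of("UPS1_125amol_R2.raw"))
print("recorded in mzML:", recorded_raw_sha1("UPS1_125amol_R2.mzML"))
```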

ypriverol commented 1 week ago

Interesting; it happens on two different machines, in two different places.

jpfeuffer commented 1 week ago

@ypriverol did we really try two different downloads of the file though? I only know of the one file that Dai shared.

caetera commented 1 week ago

I tested two different Windows systems (10 and 11) and two Linux systems (Ubuntu 20.04 LTS and 24.04 LTS, both running the latest version of Mono), i.e. four downloads in total. The SHA-1 checksums (calculated by TRFP and by sha1sum) are consistent between the downloads and differ from the one in the shared mzML. Conversion is successful in all cases.

Honestly, I tend to believe this is an issue outside TRFP: first, because the checksum is different; second, because the code used for spectral data array creation has not changed since version 1.2.x. If it was working before, it should not be broken now.

Could you provide more information on the platform you are running on? Is it containerized?

ypriverol commented 1 week ago

I have been trying to reproduce the error without success. We can leave it for now. I'm adding checksums to the SDRF, and also to our pipeline, so that we can trace this better in the future.
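
A minimal sketch of what recording those checksums could look like; the column names "comment[data file]" and "comment[file checksum]" are assumptions about the SDRF layout, not confirmed fields of the SDRF-Proteomics specification:

```python
# Sketch: append a SHA-1 checksum column to an SDRF-style TSV.
# Column names are assumed, not taken from the SDRF-Proteomics spec.
import csv
import hashlib
import os

def sha1_of(path, chunk_size=1 << 20):
    # Same streaming SHA-1 helper as in the earlier sketch.
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def add_checksums(sdrf_in, sdrf_out, raw_dir="."):
    with open(sdrf_in, newline="") as src, open(sdrf_out, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fields = list(reader.fieldnames) + ["comment[file checksum]"]
        writer = csv.DictWriter(dst, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for row in reader:
            raw_path = os.path.join(raw_dir, row["comment[data file]"])
            row["comment[file checksum]"] = sha1_of(raw_path)
            writer.writerow(row)
```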