compomics / ThermoRawFileParser

Thermo RAW file parser that runs on Linux/Mac and all other platforms that support Mono
Apache License 2.0

Need to read deconvoluted RAW built by FreeStyle #141

Open dtabb73 opened 2 years ago

dtabb73 commented 2 years ago

Hello, I am deconvoluting inclusion-list and targeted RAW files for top-down proteomics. FreeStyle 1.8's "Xtract All" feature creates a new, deconvolved RAW from a larger input RAW. Unfortunately I cannot seem to read the deconvoluted RAW in any of the tools I might use to make an MGF (and then an msAlign for TopPIC). ProteoWizard and OpenMS both die trying to read FreeStyle's deconvolved RAW.

I am testing with 20140723_F_EV_Hp1+Ni_repl1_Inclusion_HCD25_Xtract.raw from PXD032724.

This is the error ThermoRawFileParser generates when reading my experiment:

ThermoRawFileParser.exe -i C:\Xcalibur\data\20140723_F_EV_Hp1+Ni_repl1_Inclusion_HCD25_Xtract_SN1.Raw -f 0
2022-07-06 08:32:37 INFO Started parsing C:\Xcalibur\data\20140723_F_EV_Hp1+Ni_repl1_Inclusion_HCD25_Xtract_SN1.Raw
2022-07-06 08:32:37 INFO Processing 2236 scans
10%
2022-07-06 08:32:37 ERROR An unexpected error occured while parsing file:C:\Xcalibur\data\20140723_F_EV_Hp1+Ni_repl1_Inclusion_HCD25_Xtract_SN1.Raw
2022-07-06 08:32:37 ERROR System.ArgumentOutOfRangeException: There is no record available for scan 339. Available records: 1. Refer to Run Header
Parameter name: scanNumber
   at ThermoFisher.CommonCore.RawFileReader.StructWrappers.VirtualDevices.MassSpecDevice.GetValidatedTrailerExtraBlob(Int32 scanNumber)
   at ThermoFisher.CommonCore.RawFileReader.StructWrappers.VirtualDevices.MassSpecDevice.GetTrailerExtra(Int32 scanNumber)
   at ThermoFisher.CommonCore.RawFileReader.RawFileAccessBase.GetTrailerExtraInformation(Int32 scanNumber)
   at ThermoRawFileParser.Writer.MgfSpectrumWriter.Write(IRawDataPlus rawFile, Int32 firstScanNumber, Int32 lastScanNumber) in D:\Code\ThermoRawFileParser\Writer\MgfSpectrumWriter.cs:line 111
   at ThermoRawFileParser.RawFileParser.ProcessFile(ParseInput parseInput) in D:\Code\ThermoRawFileParser\RawFileParser.cs:line 134
   at ThermoRawFileParser.RawFileParser.TryProcessFile(ParseInput parseInput) in D:\Code\ThermoRawFileParser\RawFileParser.cs:line 62

Thank you!

caetera commented 2 years ago

Hi @dtabb73, thank you for reporting. From the stack trace it seems that one of the scans is missing its header information. I am not sure whether this happens often with deconvoluted files. Is it possible to share the problematic file, so I can look at it?

dtabb73 commented 2 years ago

I believe that this file will produce the same error, while being a simpler example (it's an inclusion list experiment with only four m/z values on the inclusion list). https://drive.google.com/file/d/1ycQiwAlcCHMWqfJlg7msdnKKoVrwZ6LK/view?usp=sharing

I created it in the course of filming the demonstrator video I posted to YouTube today. https://www.youtube.com/watch?v=aLHmG7R8uu4

caetera commented 2 years ago

Hi @dtabb73, I can confirm that the problem is missing header information (the so-called trailer). That seems to be a peculiarity of "Xtracted" files. In the second file you shared, it is missing for all scans except the first one. I have implemented a patch that shows a warning to the user instead of crashing. The file is processed successfully; however, we rely on the trailer for some of the processing, for example charge state and monoisotopic mass, so these might not be calculated as expected. I can share the updated version with you so you can test it further. Do you believe that a warning is enough, or is something more "disturbing" necessary, for example requiring the user to specify a command line parameter to treat this sort of error as a warning (like we do with -e for instrument properties errors)?
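To illustrate the trade-off being asked about, here is a minimal Python sketch of the "warn instead of crash" behaviour with an opt-in strict mode. It is not ThermoRawFileParser's code: `read_trailer`, `MissingTrailerError`, the `trailers` dictionary, and the `warnings_as_errors` switch are all hypothetical names chosen for the example.

```python
import logging

logger = logging.getLogger("trailer_sketch")


class MissingTrailerError(Exception):
    """Hypothetical error raised when a scan has no trailer record."""


def read_trailer(trailers, scan_number):
    # `trailers` is a hypothetical dict: scan number -> dict of trailer fields.
    try:
        return trailers[scan_number]
    except KeyError:
        raise MissingTrailerError(f"no trailer for scan {scan_number}") from None


def precursor_info(trailers, scan_number, warnings_as_errors=False):
    """Return (charge, mono_mz); both are None when the trailer is missing."""
    try:
        trailer = read_trailer(trailers, scan_number)
    except MissingTrailerError as exc:
        if warnings_as_errors:
            raise  # strict mode: stop the conversion
        # Lenient mode: the spectrum is still written, only these fields are unset.
        logger.warning("%s; charge state and monoisotopic m/z left unset", exc)
        return None, None
    return trailer.get("charge"), trailer.get("mono_mz")
```

In this sketch the lenient path is the default and the strict path has to be requested, which mirrors the "warning is enough" option; flipping the default would give the more "disturbing" variant.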

dtabb73 commented 2 years ago

I think flagging missing trailers with a warning is an appropriate response. I appreciate your looking into these files. Previously, I was only able to convert their contents to a text format by the Xcalibur File Converter, and the output text was a pretty verbose format. I wrote a little C# code to go from there to msAlign (a format very similar to MGF): https://github.com/dtabb73/XCalibur-Text-2-msAlign.

caetera commented 2 years ago

The new version can generate proper MGF files; the precursor charge state is, however, missing. It seems that the trailer is the only source of precursor charge information. In some cases it should be possible to detect the charge state from the previous MS1 scan (this is, however, not implemented). The MS1 scans in the file you shared look deisotoped and Xcalibur seems to discard the initial charge information, so I doubt it will be possible to detect the precursor charge state from "Xtracted" files. You are welcome to try version 1.4.1 (pre-release) and see if it works for you.

dtabb73 commented 2 years ago

Thank you for this rapid work! I am using the command line with these options: "-f 0 -P -L 1-"

I have a couple points of feedback:

  1. Even though I specify the inclusion of MS scans ("-L 1-"), the MGF excludes them without producing a warning or error.
  2. The warning resulting from the missing trailer could cause a user to believe that the MS/MS scan is excluded from the output, but the MS/MS is still reported. Given that some experiments lack these entirely, it might be nice if they can be bundled: "900 exceptions of this type were generated:".
  3. It is frustrating that the precursor charges are absent since I need those for mass estimates for TopPIC identification. I wrote some clumsy code for this purpose in my C# tool, but I need to make it useful in both neutral ('M') and charged ('MH+') deconvolutions. I frequently find that I cannot deduce the charge on the basis of the preceding MS scan, anyway.

Thank you again!

dtabb73 commented 2 years ago

Oh, I had one other question; do the RAW files make any indication of which MS/MS scans are combined to produce the "Average Spectra" seen in the MSn Browser of FreeStyle? These groupings are not apparent in the text conversion I produce in Xcalibur File Converter.

caetera commented 2 years ago

Hi @dtabb73, the MGF export omits MS1 scans by design. To the best of my knowledge, MGF is intended only for fragment spectra, so including MS1 scans would create unexpected entries in the MGF file. There is, indeed, no warning, since "it should never be necessary". It is possible to include MS1 scans in the MGF export; however, I believe it is better to do so with an "explicit" command line key.

I will try to think of better wording for the warning message, so that it is more evident that the spectrum has not been dropped.

Another, albeit rather ugly, solution might be to use the original file (i.e. before the Xtract processing) to get the charge states. If the scan indexes are preserved, it should be very easy to map fragmentation spectra between the two files.

I am not sure about the average spectra; were there any examples in the files you shared? If there were any, I can check whether there is anything specific about these scans. I am sorry, but it is often much easier to check things in the RAW file itself, since there are not many details in the documentation provided with the Thermo libraries. Jim Shofstahl (I believe he is still the one supporting the libraries) might know more.

dtabb73 commented 2 years ago

I am sad to report that the FreeStyle Xtract-produced RAWs do not preserve scan numbers. I speculate that the new RAW files skip over MS or MS/MS scans that do not contain any deconvolved masses.

Since my goal is to produce a TopPIC-ready msAlign file, I think I will need to start from your tool's mzML output; that way I have both MS and MS/MS scans. I can ask which deconvolved masses from the preceding MS scan are charge multiples of the precursor m/z.
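The check described above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name, the `tol_mz` and `max_charge` defaults, and the `masses_are_mh` flag (for 'MH+' rather than neutral 'M' deconvolution) are all hypothetical, and the precursor m/z is assumed to be the m/z of the isolated, still-charged ion.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def candidate_charges(precursor_mz, ms1_masses, max_charge=50,
                      tol_mz=0.5, masses_are_mh=False):
    """Return [(charge, neutral_mass), ...] consistent with the precursor m/z.

    ms1_masses: deconvolved masses from the preceding MS scan, either
    neutral ('M') or singly protonated ('MH+') when masses_are_mh is True.
    """
    matches = []
    for mass in ms1_masses:
        # Normalise both deconvolution modes to a neutral mass.
        neutral = mass - PROTON_MASS if masses_are_mh else mass
        for z in range(1, max_charge + 1):
            # m/z this neutral mass would show at charge z.
            expected_mz = neutral / z + PROTON_MASS
            if abs(expected_mz - precursor_mz) <= tol_mz:
                matches.append((z, neutral))
    return matches
```

An empty result corresponds to the case where the charge cannot be deduced from the preceding MS scan, and more than one match means the assignment is ambiguous at the chosen tolerance.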

Yes, Jim Shofstahl and I have traded many messages about this challenge. He has been very informative, and he recommended your work to me.

Yes, if you open the Xtract.RAW files I've mentioned in FreeStyle and look at the MSn Browser in the left pane, you'll see that it is grouping sets of MS/MS as Average Spectra. I'm just unsure on the rules it uses for that combination and the algorithm it uses to produce a composite peak list from the component deconvolved MS/MS.

caetera commented 2 years ago

From what I can see, the spectra in the MSn Browser are not stored in the RAW file but are generated on the fly based on the value of the precursor mass, i.e. all scans are grouped by precursor mass using a user-defined mass tolerance (for example, 0.5 amu) and then averaged. The averaging of scans is implemented by the RAW file parsing libraries (not included in TRFP, though), but I could not find anything that implements the grouping of scans by precursor (that is not difficult to implement, though).
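A rough Python sketch of such grouping and averaging is given below. It is not FreeStyle's algorithm, whose exact rules are unknown here: the scan tuples, the choice to anchor each group at its lowest precursor mass, and the fixed-width binning used for averaging are all assumptions made for the example.

```python
from collections import defaultdict


def group_by_precursor(scans, tol=0.5):
    """Group scans whose precursor masses lie within `tol` of a group's first member.

    scans: list of (scan_number, precursor_mass, peaks), where peaks is a
    list of (mass, intensity) pairs.
    """
    groups = []
    for scan in sorted(scans, key=lambda s: s[1]):
        # Compare against the first (lowest-mass) scan so groups cannot drift.
        if groups and scan[1] - groups[-1][0][1] <= tol:
            groups[-1].append(scan)
        else:
            groups.append([scan])
    return groups


def average_peaks(group, bin_width=0.01):
    """Average intensities over the scans of a group using fixed-width mass bins."""
    sums = defaultdict(float)
    for _, _, peaks in group:
        for mass, intensity in peaks:
            sums[round(mass / bin_width)] += intensity
    n_scans = len(group)
    # A peak absent from a scan counts as zero intensity for that scan.
    return [(b * bin_width, total / n_scans) for b, total in sorted(sums.items())]
```

Anchoring each group at its first member keeps every group no wider than the tolerance; comparing neighbouring scans instead would let a chain of closely spaced precursors merge into one large group.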

caetera commented 1 year ago

In the new release (1.4.1 from today), the total number of errors and warnings is reported at the end of processing (without grouping by type, though) as an info message and as the exit code. Some of the error and warning messages have been rewritten and are, hopefully, more digestible now.

@dtabb73, do you believe this issue can be closed, or is there still something that I have forgotten?
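For reference, the per-type bundling requested earlier in the thread could be sketched along these lines in Python. This is a hypothetical illustration, not what the release does: the `IssueTally` class, its method names, and the message wording are invented for the example.

```python
from collections import Counter


class IssueTally:
    """Hypothetical tally of warnings and errors, grouped by level and type."""

    def __init__(self):
        self.counts = Counter()

    def record(self, level, kind):
        # level: e.g. "warning" or "error"; kind: a short label for the issue type.
        self.counts[(level, kind)] += 1

    def total(self, level):
        return sum(n for (lvl, _), n in self.counts.items() if lvl == level)

    def summary_lines(self):
        # One line per (level, type) pair instead of one line per scan.
        return [
            f"{n} {level} message(s) of type '{kind}' were generated"
            for (level, kind), n in sorted(self.counts.items())
        ]
```

With a tally like this, a file with 900 scans lacking trailers would produce a single summary line for that warning type rather than 900 separate messages.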