compomics / ThermoRawFileParser

Thermo RAW file parser that runs on Linux/Mac and all other platforms that support Mono
Apache License 2.0

Silently failing on some raw files #144

Closed wsnoble closed 2 years ago

wsnoble commented 2 years ago

In using ThermoRFP to process raw files from a number of PRIDE projects, I find that the software reproducibly and silently fails on a small subset of files. The behavior is to run for ~1 s and then produce a zero-length output file. No error status is indicated by the program. I have verified that the download is working properly.

Here are some example raw files from project PXD027241 that fail to convert:

 proteome_GS_E9.raw
 proteome_CB_E8.raw
 proteome_CB_E3.raw
 as_p_e6.raw
 as_p_e13.raw
 Prot_7_180708021614.raw
 Prot_3_180707132051.raw
 P_E7.raw

In constructing a sample command line to show you, I noticed something very strange: the above behavior only happens when I use absolute pathnames. For example, this command gives the behavior described above:

 mono ~/software/ThermoRawFileParser.v1.4.0/ThermoRawFileParser.exe --output_file=/net/noble/vol1/data/crux-datasets/2022siddiqui-new/DDA/P_E7.mgf --logging=0 --format=0 --input=/net/noble/vol1/data/crux-datasets/2022siddiqui-new/DDA/P_E7.raw

However, if I cd to the directory /net/noble/vol1/data/crux-datasets/2022siddiqui-new/DDA and then issue the command without the pathnames, I see something quite different (and more informative):

$ mono ~/software/ThermoRawFileParser.v1.4.0/ThermoRawFileParser.exe --output_file=P_E7.mgf --format=0 --input=P_E7.raw
2022-09-22 09:39:01 INFO Started parsing P_E7.raw
2022-09-22 09:39:01 INFO Processing 1 scans
2022-09-22 09:39:01 ERROR An unexpected error occured while parsing file:P_E7.raw
2022-09-22 09:39:01 ERROR System.IndexOutOfRangeException: The scan number must be >= 0 and <= 0.
  at ThermoFisher.CommonCore.RawFileReader.StructWrappers.VirtualDevices.MassSpecDevice.ThrowScanNumberRangeException () [0x0002a] in <38096e861d444fd896cbda1ff335437f>:0 
  at ThermoFisher.CommonCore.RawFileReader.StructWrappers.VirtualDevices.MassSpecDevice.GetValidIndexIntoScanIndices (System.Int32 scanNumber) [0x00027] in <38096e861d444fd896cbda1ff335437f>:0 
  at ThermoFisher.CommonCore.RawFileReader.StructWrappers.VirtualDevices.MassSpecDevice.GetRetentionTime (System.Int32 spectrum) [0x00000] in <38096e861d444fd896cbda1ff335437f>:0 
  at ThermoFisher.CommonCore.RawFileReader.RawFileAccessBase.RetentionTimeFromScanNumber (System.Int32 scanNumber) [0x0000b] in <38096e861d444fd896cbda1ff335437f>:0 
  at ThermoRawFileParser.Writer.MgfSpectrumWriter.Write (ThermoFisher.CommonCore.Data.Interfaces.IRawDataPlus rawFile, System.Int32 firstScanNumber, System.Int32 lastScanNumber) [0x00093] in <7a789c83cb9d4db2bd14ef83a0b964a9>:0 
  at ThermoRawFileParser.RawFileParser.ProcessFile (ThermoRawFileParser.ParseInput parseInput) [0x000d9] in <7a789c83cb9d4db2bd14ef83a0b964a9>:0 
  at ThermoRawFileParser.RawFileParser.TryProcessFile (ThermoRawFileParser.ParseInput parseInput) [0x00000] in <7a789c83cb9d4db2bd14ef83a0b964a9>:0 

So now my question is two-fold: (1) why does this error message only show up when I don't include pathnames in the command line, and (2) what does this error message mean? Logically, "The scan number must be >= 0 and <= 0." makes no sense, right? It seems like the error message is saying that the scan number must be equal to zero!

The real question, of course, is how I can go about converting these problematic raw files.

caetera commented 2 years ago

Hi @wsnoble, thank you for reporting this issue. Regarding the second point, I have downloaded a couple of files from your list and they indeed report having 0 spectra (most likely, the files are corrupted), so I am afraid they cannot be converted with TRFP. I have to admit, though, that this is a bug: we assume that there is at least one spectrum in the file and try to read the spectrum at index 1, which is the reason for the IndexOutOfRange exception. Regarding the first point, the reason is not that obvious, so I will need to investigate it further.
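
To make the odd-looking message concrete, here is a toy Python model of the failure described above. It is only a sketch: `ToyRawFile`, its attributes, and `first_retention_time` are hypothetical stand-ins, not the Thermo RawFileReader or TRFP API. It also assumes that the reported valid range is built from a first and last scan number that are both 0 for an empty file.

```python
class ToyRawFile:
    """Hypothetical stand-in for a raw file reader (not the real API)."""

    def __init__(self, retention_times):
        self.retention_times = list(retention_times)
        # Assumption for this illustration: scans are numbered 1..N,
        # and an empty file reports first = last = 0.
        self.first_scan = 1 if self.retention_times else 0
        self.last_scan = len(self.retention_times)

    def retention_time(self, scan_number):
        in_range = self.first_scan <= scan_number <= self.last_scan
        if not self.retention_times or not in_range:
            # With zero scans the "valid" range collapses to [0, 0],
            # which produces the confusing wording seen in the log.
            raise IndexError(
                f"The scan number must be >= {self.first_scan} "
                f"and <= {self.last_scan}."
            )
        return self.retention_times[scan_number - 1]


def first_retention_time(raw):
    """Guarded read: reject empty files instead of assuming scan 1 exists."""
    if raw.last_scan < 1:
        raise ValueError("raw file contains no scans; nothing to convert")
    return raw.retention_time(1)
```

Under these assumptions, asking an empty file for scan 1 raises the same kind of message as in the log, while the guarded helper fails early with a clearer explanation.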

caetera commented 2 years ago

Hi @wsnoble, the reason for the silent failure is that all output is suppressed by --logging=0 in the first case (with the absolute path). In addition, the return codes are, indeed, not always correct (see #140).
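
Until the return codes can be trusted, a batch script can detect this kind of silent failure by inspecting the output file itself. The following is a minimal sketch, assuming (as reported in this thread) that a failed conversion leaves a zero-length or missing output file. The helper name `conversion_produced_output` is made up for illustration.

```python
import os


def conversion_produced_output(output_path):
    """Return True only if the output file exists and is non-empty."""
    try:
        return os.path.getsize(output_path) > 0
    except OSError:
        # Missing or unreadable file: treat as a failed conversion.
        return False
```

For example, after converting P_E7.raw one could call `conversion_produced_output("P_E7.mgf")` and flag the file for manual inspection when it returns False.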

wsnoble commented 2 years ago

Regarding the silent failure, I had --logging=0 on in both cases, and it failed silently only when I used absolute pathnames. It seems like if you want to not report an error message on failure (which, personally, I think is not a good idea in any case), this behavior should not depend on whether the file name is absolute or relative.

caetera commented 2 years ago

Could you please check that the issue is indeed connected to the relative/absolute path? In the examples you provided (the opening post), logging is only suppressed when the absolute path is used. In the relative path example the logging argument is not provided, so the default value (2) is used. I could reproduce the silent failure only with logging=0, both for the absolute and the relative path cases (tested on Ubuntu 20.04.4/Mono 6.8.0.105 and Windows 10/Native .NET). There is no special treatment for relative/absolute paths; TRFP fully relies on the OS to resolve either of them to a valid data stream. There is, of course, a possibility that the issue depends on the OS and/or Mono version. Which OS and Mono do you use?

Silent logging is used internally when writing to STDOUT, so all output (except the file content) is fully suppressed. Since fixing the error codes is on the to-do list, do you think the current behavior of silent logging, combined with a non-zero return code, will be sufficient?

wsnoble commented 2 years ago

Darn it, you're right. Sorry for the dumb mistake on my end. I'm not sure why I included logging=0 in only one case, but in retrospect that obviously explains the difference in behavior.

The current behavior is OK, I guess. At some point you might want to have different levels of logging so the user could get more or less verbose output, as needed.

caetera commented 2 years ago

No problem. A check for an empty file (fewer than 1 scan in total) has been implemented in 8eb3b34. Processing will fail with a clear error message and no output will be produced. The change is scheduled for the next release of TRFP.

caetera commented 2 years ago

I have changed some of the error reporting and fixed the OS exit codes, which now indicate the number of errors (and, optionally, warnings). I have also implemented two new logging levels that keep only warning (and more severe) or only error (and more severe) messages. I will close the issue for now, but feel free to reopen it if necessary.
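
With exit codes that reflect the number of errors, a calling script can detect failures even when logging is silenced. Below is a minimal Python sketch, assuming a TRFP release that includes this fix. The helper name `run_converter` is hypothetical, and the interpretation of a non-zero exit code as "at least one error" is taken from the comment above rather than from the tool's documentation.

```python
import subprocess


def run_converter(command):
    """Run a converter command and return (ok, exit_code).

    `command` is a list of arguments, e.g. the mono invocation
    used earlier in this thread. A non-zero exit code is treated
    as a failed conversion.
    """
    result = subprocess.run(command, check=False)
    return result.returncode == 0, result.returncode
```

A call might look like `run_converter(["mono", "ThermoRawFileParser.exe", "--input=P_E7.raw", "--output_file=P_E7.mgf", "--format=0"])`, using only the options that appear in this thread. The path to the executable is a placeholder.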