ProteoWizard / pwiz

The ProteoWizard Library is a set of software libraries and tools for rapid development of mass spectrometry and proteomic data analysis software.
http://proteowizard.sourceforge.net/
Apache License 2.0

Error converting raw files in Docker and Singularity #684

Closed: kevinkovalchik closed this issue 4 years ago

kevinkovalchik commented 4 years ago

Hello,

I have been trying to construct a pipeline that uses msconvert in a Singularity image. I have tried the images from proteowizard/pwiz-skyline-i-agree-to-the-vendor-licenses and chambm/pwiz-skyline-i-agree-to-the-vendor-licenses, both as converted Singularity images and in Docker, and I run into the same issue with all of them:

The process encounters this error: [SpectrumWorkerThreads::work] error in thread: [SpectrumList_Thermo::spectrum()] Unknown exception retrieving spectrum "controllerType=0 controllerNumber=1 scan=XXXX", where XXXX is some scan number. It seems to be random in nature, as sometimes a particular file will finish and other times not. When a file fails, the scan number is not reproducible.

Here is the complete output from a docker run:

root@cfdbe56b4176:/Data# wine msconvert ./2017-04-21_BK10_EM_MA_DR3_03b.raw --mzML --filter "peakPicking vendor"
format: mzML 
    m/z: Compression-None, 64-bit
    intensity: Compression-None, 32-bit
    rt: Compression-None, 64-bit
ByteOrder_LittleEndian
 indexed="true"
outputPath: .
extension: .mzML
contactFilename: 
runIndexSet: 

spectrum list filters:
  peakPicking vendor

chromatogram list filters:

filenames:
  .\2017-04-21_BK10_EM_MA_DR3_03b.raw

processing file: .\2017-04-21_BK10_EM_MA_DR3_03b.raw
0039:err:combase:RoGetActivationFactory Failed to find library for L"Windows.Foundation.Diagnostics.AsyncCausalityTracer"
calculating source file checksums
writing output file: .\2017-04-21_BK10_EM_MA_DR3_03b.mzML
[SpectrumList_PeakPicker]: one or more spectra are already centroided, no processing needed
[SpectrumWorkerThreads::work] error in thread: [SpectrumList_Thermo::spectrum()] Unknown exception retrieving spectrum "controllerType=0 controllerNumber=1 scan=2900"

The line 0039:err:combase:RoGetActivationFactory Failed to find library for L"Windows.Foundation.Diagnostics.AsyncCausalityTracer" looks like it might be related, since it has something to do with async process control and the error is coming up in SpectrumWorkerThreads, though that is just a guess on my part. It may also be something that is always there and is simply suppressed when passing -e WINEDEBUG=-all to Docker.

I have tried using many different .raw files. I have only tested on Orbitrap files. As an example, I have tried files from this dataset: "https://www.ebi.ac.uk/pride/archive/projects/PXD004451". They are larger files, but I have also had this problem with small files (<400 MB).

Thanks for any assistance! Kevin

chambm commented 4 years ago

See if this happens if you pass the --singleThreaded option. It probably will. Are you using the latest image?

kevinkovalchik commented 4 years ago

I am using the latest image from docker hub (chambm/pwiz-skyline-i-agree-to-the-vendor-licenses:latest). The --singleThreaded option actually does seem to fix the issue. This is okay for me because the pipeline will run on an HPC and will be parallelized. There do seem to be some other issues, though. I ran it with a few different options, described in the table below.

| Options | Errors | Comments |
| --- | --- | --- |
| --mzML --filter 'peakPicking vendor' | [SpectrumWorkerThreads::work] error in thread: [SpectrumList_Thermo::spectrum()] Unknown exception retrieving spectrum "controllerType=0 controllerNumber=1 scan=XXX" | XXX = some integer which is not reproducible. This causes the program to hang. |
| --mzML --singleThreaded --filter 'peakPicking vendor' | 001d:err:ole:PSFacBuf_CreateProxy Could not commit pages for proxy thunks 001d:err:ole:proxy_manager_create_ifproxy Could not create proxy for interface {5c6fb596-4828-4ed5-b9dd-293dad736fb5}, error 0x8007000e 001d:err:ole:CoUnmarshalInterface IMarshal::UnmarshalInterface failed, 0x8007000e | This doesn't crash the process and always happens for the same file, so probably unrelated to the original issue (NOTE: it is not always the same file, see next row). The .mzML file is still generated for this file. I haven't tested it, but the size looks appropriate. |
| --mzML --singleThreaded | 001d:err:ole:PSFacBuf_CreateProxy Could not commit pages for proxy thunks 001d:err:ole:proxy_manager_create_ifproxy Could not create proxy for interface {5c6fb596-4828-4ed5-b9dd-293dad736fb5}, error 0x8007000e 001d:err:ole:CoUnmarshalInterface IMarshal::UnmarshalInterface failed, 0x8007000e | Same progression as with --mzML --singleThreaded --filter 'peakPicking vendor'. I see this error, but everything still finishes fine. |
| --mzML | 001d:err:ole:PSFacBuf_CreateProxy Could not commit pages for proxy thunks 001d:err:ole:proxy_manager_create_ifproxy Could not create proxy for interface {5c6fb596-4828-4ed5-b9dd-293dad736fb5}, error 0x8007000e 001d:err:ole:CoUnmarshalInterface IMarshal::UnmarshalInterface failed, 0x8007000e | This one again, but this time it happens on the first file converted after approximately the same amount of wall clock time as the above occurrence. That makes me think it is not related to the files at all. |
| (same) | | A while after the above error happens all but one thread dies, memory usage goes through the roof and the process gets killed by the OS. This also repeatably happens after the same amount of wall clock time. A picture of the resource usage is pasted below the table. |

[image: resource usage during the conversion run]

Note that the error 001d:err:ole:PSFacBuf_CreateProxy Could not commit pages for proxy thunks 001d:err:ole:proxy_manager_create_ifproxy Could not create proxy for interface {5c6fb596-4828-4ed5-b9dd-293dad736fb5}, error 0x8007000e 001d:err:ole:CoUnmarshalInterface IMarshal::UnmarshalInterface failed, 0x8007000e also seems to occur without the --singleThreaded option, but in that case the program rarely runs long enough without crashing for it to appear.

You are using RawFileReader, right? Are you using RawFileReaderFactory.CreateThreadManager or RawFileReaderFactory.ReadFile for opening the .raw file? ReadFile isn't designed for multiple threads, so I can see it being an issue if that method were used. That would explain why the crash seems to happen randomly. Better to use CreateThreadManager and then CreateThreadAccessor for each thread.

I have no thoughts on the memory usage issue. Seems like a separate problem.

Hope there was something useful in all of that!

Thanks for the assistance, Kevin

kevinkovalchik commented 4 years ago

I just pulled the ProteoWizard distribution out of the container and put it in a real Windows environment, and it seems to run fine. So if threading is the issue, maybe it is a problem with .NET and Wine and not with RawFileReader. Which version of RawFileReader are you using? That is, do you need to have .NET 4.7.2 in there, or would 4.6.2 be fine? From what I have seen, 4.6.2 is more stable in Wine.

markmipt commented 4 years ago

I see the same issue with msconvert in Docker. By the way, the error appears only with data that contain MS/MS spectra. Processing LC-MS1 spectra works fine even without the --singleThreaded flag.

chambm commented 4 years ago

The WINE errors are irrelevant and it's why I recommend running with -e WINEDEBUG=-all until you actually have a problem to track down (and then the default WINEDEBUG I put in the Dockerfile is rarely adequate to track the problem, so I should probably just change the Dockerfile to -all). But I did see the weird memory explosion in some of my own testing. However it only happened when I tried to convert multiple RAW files in a single process. When I only convert a single file per invocation it never happened. Are either of you seeing the memory explosion with a single file invocation?

And yes we are using ThreadAccessor:

        rawManager_ = RawFileReaderAdapter::ThreadedFileFactory(managedFilename);
        raw_ = rawManager_->CreateThreadAccessor();

kevinkovalchik commented 4 years ago

I haven't experienced it with a single file invocation. Based on processing files of different sizes with and without peak picking, it seems to depend on how much time has passed, or how much data has been processed or accessed, within a single session. So it might still happen with a very large file, but I don't have anything large enough to test that.

Kevin

(Email reply to Matt Chambers's comment above, dated Tue, Nov 12, 2019, 10:22 AM.)

chambm commented 4 years ago

#757 makes --singleThreaded the default for Thermo on Wine and adds a warning in msconvert about converting multiple files at a time.
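
For readers on an older image, the one-file-per-invocation advice above could be sketched as a small shell wrapper. This is a minimal sketch under stated assumptions, not a script from the project: the image name, `wine msconvert`, `--mzML`, `--singleThreaded`, and `-e WINEDEBUG=-all` all come from this thread, while the function names, the `DOCKER` override, and the flat directory layout are hypothetical.

```shell
#!/bin/sh
# Image name and msconvert flags are copied from this thread.
IMAGE="chambm/pwiz-skyline-i-agree-to-the-vendor-licenses:latest"
# Hypothetical override: set DOCKER=echo to print the commands instead of running them.
DOCKER=${DOCKER:-docker}

# Convert exactly one RAW file in its own container (one msconvert process).
# $1 = host directory containing the file, $2 = file name within that directory.
convert_one() {
    "$DOCKER" run --rm -e WINEDEBUG=-all -v "$1":/data "$IMAGE" \
        wine msconvert --mzML --singleThreaded "/data/$2"
}

# Convert every *.raw file in a directory, one invocation per file.
# $1 = host directory to scan (not recursive).
convert_each() {
    for raw in "$1"/*.raw; do
        [ -e "$raw" ] || continue   # the glob matched nothing
        convert_one "$1" "$(basename "$raw")" ||
            echo "conversion failed: $raw" >&2
    done
}
```

Calling `convert_each /absolute/path/to/raw/files` would start one container per file, so no msconvert process ever handles more than one RAW file. Running it first with `DOCKER=echo` prints the commands without starting any containers.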

lbwfff commented 5 months ago

Hi developers, I'm using the Docker version of msconvert for file conversion and am running into an error. Here are my logs:

./Syn1-CER-P2_4.mzML
format: mzML 
    m/z: Compression-Zlib, 64-bit
    intensity: Compression-Zlib, 64-bit
    rt: Compression-Zlib, 64-bit
ByteOrder_LittleEndian
 indexed="true"
outputPath: .
extension: .mzML
contactFilename: 
runIndexSet: 

spectrum list filters:
  peakPicking
  zeroSamples removeExtra 1-

chromatogram list filters:

filenames:
  /data/.\Syn1-CER-P2_4.raw

processing file: /data/.\Syn1-CER-P2_4.raw
calculating source file checksums

writing output file: .\Syn1-CER-P2_4.mzML

Error writing run 1:
[SpectrumList_Thermo::spectrum()] Error retrieving spectrum "controllerType=0 controllerNumber=1 scan=13031": [SpectrumList_Thermo::getMultiFillTImes()] Unexpected fill time format: mature end of Content-Length del
Conversion failed for 1 runs in Syn1-CER-P2_4.raw.
Error processing file /data/.\Syn1-CER-P2_4.raw

I used the --singleThreaded parameter but it didn't solve the problem.

# Convert each .raw file that does not yet have a matching .mzML file.
find ./ -name '*.raw' | while IFS= read -r f;
do
    # Replace only the trailing .raw extension.
    f2="${f%.raw}.mzML"
    if [ ! -f "$f2" ];
    then
        echo "$f2"
        docker run --rm -e WINEDEBUG=-all -v /data/ADMS/down/cety_regi/:/data chambm/pwiz-skyline-i-agree-to-the-vendor-licenses:x64 wine msconvert --64 --zlib --singleThreaded --filter "peakPicking" --filter "zeroSamples removeExtra 1-" "/data/$f"
    fi
done

What is the possible cause of the error, and what should I do to fix it?

Thanks,

LeeLee

chambm commented 5 months ago

Sounds possibly like a corrupt raw file. Can you share it?

lbwfff commented 5 months ago

I re-downloaded the data and converted it again, and now it converts fine. It was indeed the file. Thanks for the suggestion!
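
A corrupt download like this one can sometimes be caught before conversion by comparing checksums. The sketch below is only an illustration and assumes something this thread does not confirm: that a reference SHA-256 digest for the RAW file is available from wherever the data came from. The function name `verify_sha256` is hypothetical, and `sha256sum` is the GNU coreutils tool.

```shell
#!/bin/sh
# Succeed only if the file's SHA-256 digest equals the expected value.
# $1 = path to the downloaded file
# $2 = expected SHA-256 as lowercase hex (assumed to be published by the data provider)
# Returns 0 on a match, 1 on a mismatch, 2 if the file cannot be read.
verify_sha256() {
    [ -r "$1" ] || return 2
    actual=$(sha256sum "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}
```

A conversion loop could then call `verify_sha256 "$f" "$expected" || continue` to skip, and later re-download, any file whose digest does not match.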