PaulBrack / Yamato

SWATH-QC metrics
Apache License 2.0

Yamato fails for big files. Are files loaded in-full to memory? #80

Closed mwalzer closed 4 years ago

mwalzer commented 4 years ago

A naive attempt at feeding a big file (22G) results in wild memory swapping and (probably) indefinite execution times. In a memory-controlled environment, the execution gets killed for overstepping (generous) memory limits.

Yamato.Console -i napedro_L120420_010_SW.mzML
2019-12-04 10:39:37.0159|INFO|Yamato.Console.Program|Verbose output selected: enabled logging for all levels
2019-12-04 10:39:37.0462|INFO|Yamato.Console.Program|Loading file: napedro_L120420_010_SW.mzML

> exit
PaulBrack commented 4 years ago

This is expected behaviour at present - a whole run is stored in memory, so the memory footprint is large. There are a few potential solutions to this - see issue #21

mwalzer commented 4 years ago

I probably misunderstood something along the way. How did you test Yamato for real-world scenarios? If we can't go beyond SWATH files of 5GB on a reasonable machine, I think this is a blocker.

Ozzard commented 4 years ago

What do you call a "reasonable" machine? Our HPC cluster machines are 16-core with 256G or 512G of memory. For any "real" HPC, this is a reasonable machine.

PaulBrack commented 4 years ago

@marinaPauw's test files have ~20,000 scans in them, and Yamato runs OK for them on a low-RAM box. A 22GB profile mzML will have ~250,000 scans in it, and this is where our problems arise on a box with <32GB RAM.

There are a few ways of handling this - presently, "run on a box with enough RAM for your files" is the suggested workaround. The options below would probably reduce the RAM burden significantly:

  1. Use an already peak-picked mzML (PWiz can do this)
  2. We implement our own peak picking in Yamato (proof of concept completed)
  3. Do some aggressive thresholding - most of the actual data points are background noise, so this might be enough (rough sketch below)
  4. Take a caching approach as described in #21
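
To make (3) concrete, here is a rough sketch - placeholder types and an arbitrary noise floor, not actual Yamato code:

```csharp
using System.Collections.Generic;
using System.Linq;

// Placeholder type standing in for a single m/z-intensity pair (not Yamato's actual class).
public record SpectrumPoint(double Mz, double Intensity);

public static class NoiseFilter
{
    // Drop everything below an intensity floor before a scan is kept in memory.
    // On profile data most points are baseline noise, so this alone shrinks the footprint a lot.
    public static List<SpectrumPoint> Threshold(IEnumerable<SpectrumPoint> points, double noiseFloor)
        => points.Where(p => p.Intensity >= noiseFloor).ToList();
}
```
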
mwalzer commented 4 years ago

So I tried and failed with the same config that runs the OpenSWATH analysis in our workflow. This is what I would call reasonable, that is 16-32GB of memory. So if you need an HPC cluster to do QC, then our use-case seems further removed than initially discussed.

Ozzard commented 4 years ago

OpenSWATH works by burning a whole load of disk I/O early in its run to split the input file into n intermediate files (one per window), then processing each one independently, then merging the results. That disk phase reduces RAM usage at the expense of disk usage and a heavy I/O load. We could look at spilling spectra to, for example, a memory-mapped disk file using a heavily customised Spectrum class - would you be comfortable with a temp file of a similar size to the mzML?
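
Very rough sketch of the kind of thing I mean, using .NET's memory-mapped file support - the class and member names below are invented for illustration, not the real Spectrum class:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Invented example class: the m/z and intensity arrays live in a temp file on disk and
// are only materialised in RAM while Load() is being used.
public sealed class SpilledSpectrum : IDisposable
{
    private readonly MemoryMappedFile _map;
    private readonly string _path;
    private readonly int _length;

    public SpilledSpectrum(double[] mz, double[] intensity)
    {
        _length = mz.Length;
        _path = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".spill");
        long bytes = (long)_length * sizeof(double) * 2;
        _map = MemoryMappedFile.CreateFromFile(_path, FileMode.Create, null, bytes);
        using var view = _map.CreateViewAccessor();
        view.WriteArray(0, mz, 0, _length);                                      // first half: m/z
        view.WriteArray((long)_length * sizeof(double), intensity, 0, _length);  // second half: intensity
    }

    public (double[] Mz, double[] Intensity) Load()
    {
        var mz = new double[_length];
        var intensity = new double[_length];
        using var view = _map.CreateViewAccessor();
        view.ReadArray(0, mz, 0, _length);
        view.ReadArray((long)_length * sizeof(double), intensity, 0, _length);
        return (mz, intensity);
    }

    public void Dispose()
    {
        _map.Dispose();
        File.Delete(_path);  // the temp file is roughly the size of the decoded spectra
    }
}
```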


marinaPauw commented 4 years ago

In our lab, space is definitely an issue, but then again, of all the files I have received from collaborators or downloaded from repositories, none has been larger than 6GB and all have between 15,000 and 60,000 scans, for which the current way of running is not a problem. Perhaps a choice between the two approaches via arguments?

Kind regards, Marina Kriek (née Pauw)


Ozzard commented 4 years ago

Looking at the current algorithm, there are a couple of points to note:

- Spectrum objects (the single largest data soak) are not used after a scan has been put through FindBasePeaks. That method could blank scan.Spectrum as it generates bp.Spectrum, which would slightly reduce the footprint.
- Do we have to have one phase to set up BasePeak objects and then a second to add their SpectrumPoints, or can we do it *either* all in one go *or* with a concurrent filler that does FindBasePeaks on anything more than RtTolerance behind the scans currently being read from the mzML and discards older scan.Spectrum objects? I suspect it's the latter, as we may detect a BasePeak and then want to fill in its contents from spectra earlier in the run.
- Is there anything in the mzML spec that mandates an order for the spectra, for example retaining the order of generation, or can spectra arrive (in theory) in any order? If the latter, it's going to be very difficult to optimise this.

marinaPauw commented 4 years ago

Blanking the spectrum seems like a good idea.

The reason behind the two-phase BasePeak objects is that with the forward parser we do not know ahead of time which basepeaks will be present. Therefore we cannot collect spectra for them before we pick them up as basepeaks, and if we collect spectra only after we have picked them up as basepeaks, our chromatogram metrics will not reflect the true shape of the peaks.

Inside the mzML there is an attribute "index" (under the spectrum element) which occurs in our Sciex, Waters and Thermo files, so we could use that if we would like. The scanStartTime property is also sequential, so currently we would use that to sort the scans. Or do you mean the order within the binary data array?
PaulBrack commented 4 years ago

@mwalzer I've written a low RAM solution for this problem by caching to disk - it's slower and more IO intensive but I think it's a reasonable solution for low RAM environments.

I'll be including this as a switch on input and merging today or tomorrow.

PaulBrack commented 4 years ago

@mwalzer please test with --cacheSpectraToDisk true

Each spectrum is serialised using a protocol buffer and stored to disk in its own file in the temp folder as it's read. If these spectra are needed later, they're deserialised for just as long as they're needed.
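
Roughly, the caching looks like the following sketch (this assumes protobuf-net and uses illustrative type names rather than the actual Yamato classes):

```csharp
using System;
using System.IO;
using ProtoBuf;   // protobuf-net

// Sketch only: an invented stand-in for the cached spectrum type.
[ProtoContract]
public class CachedSpectrum
{
    [ProtoMember(1)] public double[] Mz { get; set; }
    [ProtoMember(2)] public double[] Intensity { get; set; }
}

public static class SpectrumCache
{
    // Write each spectrum to its own file in the temp folder as it is read from the mzML.
    public static string Write(CachedSpectrum spectrum)
    {
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".spectrum");
        using var stream = File.Create(path);
        Serializer.Serialize(stream, spectrum);
        return path;
    }

    // Pull a spectrum back only for as long as it is actually needed.
    public static CachedSpectrum Read(string path)
    {
        using var stream = File.OpenRead(path);
        return Serializer.Deserialize<CachedSpectrum>(stream);
    }
}
```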

In addition, the thread queue is capped at a maximum size of 2000, which limits RAM use to around 1GB in my testing.
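
In outline, the bounding works along these lines (again a sketch with placeholder types, not the real pipeline code):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Placeholder scan type; the real pipeline types differ.
public record Scan(int Index);

public static class BoundedPipeline
{
    public static void Run(IEnumerable<Scan> scans)
    {
        // Add() blocks once 2000 scans are waiting, so the reader can never race far ahead
        // of the consumers and RAM use stays roughly proportional to the queue capacity.
        var queue = new BlockingCollection<Scan>(boundedCapacity: 2000);

        var consumer = Task.Run(() =>
        {
            foreach (var scan in queue.GetConsumingEnumerable())
                Process(scan);               // stand-in for the QC metric work
        });

        foreach (var scan in scans)
            queue.Add(scan);                 // blocks while the queue is full

        queue.CompleteAdding();              // lets GetConsumingEnumerable finish
        consumer.Wait();
    }

    private static void Process(Scan scan) { /* metric calculations would go here */ }
}
```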

PaulBrack commented 4 years ago

Closing as per #21