Closed: mwalzer closed this issue 4 years ago.
This is expected behaviour at present - a whole run is stored in memory, so the memory footprint is large. There are a few potential solutions to this - see issue #21.
I probably misunderstood something along the way. How did you test Yamato for real-world scenarios? If we can't go beyond SWATH files of 5 GB on a reasonable machine, I think this is a blocker.
What do you call a "reasonable" machine? Our HPC cluster machines have 16 cores and 256 GB or 512 GB of memory. For any "real" HPC, this is a reasonable machine.
@marinaPauw's test files have ~20,000 scans in them, and Yamato runs fine on them on a low-RAM box. A 22 GB profile mzML will have ~250,000 scans, and this is where our problems arise on a box with <32 GB RAM.
There are a few ways of handling this - at present, "run on a box with enough RAM for your files" is the suggested workaround. The approaches discussed below would probably reduce the RAM burden significantly.
So I tried and failed with the same config that runs the OpenSWATH analysis in our workflow. That is what I would call reasonable: 16-32 GB of memory. If you need an HPC cluster to do QC, then our use case seems further removed than initially discussed.
OpenSWATH works by burning a whole load of disk I/O early in its run to split the input file into n intermediate files (one per window), then processing each one independently, then merging the results. That disk phase reduces RAM usage at the expense of disk usage and a heavy I/O load. We could look at spilling spectra to, for example, a memory-mapped disk file using a heavily custom Spectrum class - would you be comfortable with a temp file of a similar size to the mzML?
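For illustration only, a minimal sketch of what spilling intensity arrays to a memory-mapped temp file could look like; the class name, layout, and offset handling here are placeholders for the idea, not Yamato's actual Spectrum code.

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

// Rough sketch: spill each spectrum's intensity array into one shared
// memory-mapped temp file and read it back on demand. Per-spectrum offsets
// would have to be tracked by the (hypothetical) caller.
class SpectrumSpillFile : System.IDisposable
{
    private readonly MemoryMappedFile _file;

    public SpectrumSpillFile(string path, long capacityBytes)
    {
        // The OS pages this file in and out as needed, so resident RAM stays
        // well below the total size of all spilled spectra.
        _file = MemoryMappedFile.CreateFromFile(path, FileMode.Create, null, capacityBytes);
    }

    public void Write(long offsetBytes, double[] intensities)
    {
        using (var view = _file.CreateViewAccessor(offsetBytes, intensities.Length * sizeof(double)))
            view.WriteArray(0, intensities, 0, intensities.Length);
    }

    public double[] Read(long offsetBytes, int count)
    {
        var intensities = new double[count];
        using (var view = _file.CreateViewAccessor(offsetBytes, count * sizeof(double)))
            view.ReadArray(0, intensities, 0, count);
        return intensities;
    }

    public void Dispose() => _file.Dispose();
}
```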
In our lab, space is definitely an issue, but then again, of all the files I have received from collaborators or downloaded from repositories, none has been larger than 6 GB and all have between 15,000 and 60,000 scans, for which the current way of running is not a problem. Perhaps a choice between the two via arguments?
Kind regards, Marina Kriek (née Pauw)
Looking at the current algorithm, there are a couple of points to note:
Blanking the spectrum seems like a good idea (a sketch of what that could look like follows these notes).
The reason behind the two-phase BasePeak objects is that with the forward parser we do not know ahead of time which basepeaks will be present. Therefore we cannot collect spectra for them before we pick them up as basepeaks, and if we collect a spectrum only after we have picked it up as a basepeak, our chromatogram metrics will not reflect the true shape of the peaks.
Inside the mzML there is an attribute "index" (under <spectrum>).
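On the blanking point above: a minimal sketch, using a stand-in Spectrum type rather than Yamato's actual class, of what dropping the heavy arrays after use could look like.

```csharp
// Stand-in Spectrum type, for illustration only: once every BasePeak that
// needs this scan has recorded its contribution, the heavy peak arrays can
// be dropped ("blanked") while the cheap per-scan metadata is kept.
public class Spectrum
{
    public int ScanNumber { get; set; }
    public double RetentionTime { get; set; }
    public double[] Mzs { get; set; }
    public double[] Intensities { get; set; }

    public void Blank()
    {
        // Let the garbage collector reclaim the bulk of the memory.
        Mzs = null;
        Intensities = null;
    }
}
```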
@mwalzer I've written a low-RAM solution for this problem by caching to disk - it's slower and more I/O-intensive, but I think it's a reasonable solution for low-RAM environments.
I'll be including this as an input switch and merging it today or tomorrow.
@mwalzer please test with --cacheSpectraToDisk true
Each spectrum is serialised using a protocol buffer and stored to disk in its own file in the temp folder as it's read. If a spectrum is needed later, it's deserialised for just as long as it's needed.
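As a rough illustration of that pattern using protobuf-net; the type, fields, and cache folder name here are assumptions for the example, not the actual Yamato code.

```csharp
using System.IO;
using ProtoBuf;   // protobuf-net

// Illustrative cached-spectrum shape; the real fields in Yamato may differ.
[ProtoContract]
public class CachedSpectrum
{
    [ProtoMember(1)] public int ScanNumber { get; set; }
    [ProtoMember(2)] public double[] Mzs { get; set; }
    [ProtoMember(3)] public double[] Intensities { get; set; }
}

public static class SpectrumCache
{
    // Placeholder cache location for the sketch.
    private static readonly string CacheDir =
        Path.Combine(Path.GetTempPath(), "spectrum-cache");

    // Serialise one spectrum to its own file as soon as it has been read.
    public static void Store(CachedSpectrum spectrum)
    {
        Directory.CreateDirectory(CacheDir);
        using (var stream = File.Create(Path.Combine(CacheDir, spectrum.ScanNumber + ".bin")))
            Serializer.Serialize(stream, spectrum);
    }

    // Deserialise it again only when (and for as long as) it is needed.
    public static CachedSpectrum Load(int scanNumber)
    {
        using (var stream = File.OpenRead(Path.Combine(CacheDir, scanNumber + ".bin")))
            return Serializer.Deserialize<CachedSpectrum>(stream);
    }
}
```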
In addition, the thread queue is capped at 2000 entries, which limited RAM use to around 1 GB in my tests.
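For context, a bounded hand-off of this kind can be expressed with a BlockingCollection; the element type and sizes below are placeholders for illustration, not the Yamato queue itself.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Add() blocks once 2000 items are queued, so the reader thread cannot race
// ahead of the workers and fill RAM with parsed spectra.
var pending = new BlockingCollection<double[]>(boundedCapacity: 2000);

var consumer = Task.Run(() =>
{
    foreach (var intensities in pending.GetConsumingEnumerable())
    {
        // ... compute metrics for one spectrum, then let it be collected ...
    }
});

for (int scan = 0; scan < 250_000; scan++)
    pending.Add(new double[1024]);   // stand-in for one parsed spectrum

pending.CompleteAdding();
consumer.Wait();
```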
Closing as per #21
A naive attempt at feeding it a big file (22 GB) results in wild memory swapping and (probably) indefinite execution times. In a memory-controlled environment, the execution gets killed for overstepping (generous) memory limits.