Hi @ax3l - yes, as you have noticed, this code really is intended to work on data that can be loaded into memory. A workaround would be to load and process the data one 'chunk' at a time - this is already possible with the IAEA DataLoader, and could be (but has not been) implemented for the other data loaders.
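As a rough illustration of the chunk-at-a-time idea, here is a minimal sketch that streams a fixed-record binary file and accumulates a weighted mean energy. The record layout is purely illustrative - the real IAEA format defines its fields in a separate header file:

```python
import numpy as np

# Illustrative record layout only; the actual IAEA phase space format is
# described by an accompanying header file.
record = np.dtype([("energy", "f4"), ("x", "f4"), ("y", "f4"),
                   ("z", "f4"), ("weight", "f4")])

def mean_energy(path, chunk_records=1_000_000):
    """Stream the file chunk by chunk; peak memory stays at one chunk,
    regardless of total file size."""
    wsum = 0.0
    wtot = 0.0
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=record, count=chunk_records)
            if chunk.size == 0:
                break
            wsum += float(np.sum(chunk["energy"] * chunk["weight"]))
            wtot += float(np.sum(chunk["weight"]))
    return wsum / wtot
```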
Early on, I considered whether I should try to build a more abstract framework for much larger datasets, and I had a look at polars instead of pandas, which could have enabled 'lazy' evaluation. But ultimately, all the data I work with easily fits into memory (albeit my workstation has 128 GB of RAM :-P), and I was sort of trying to solve problems I didn't have, so I just decided to keep it simple...
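For reference, a lazy polars query looks something like the sketch below (file and column names are made up): nothing is read when the query is built, and execution only happens at `.collect()`, which lets polars stream over the file rather than materializing it:

```python
import polars as pl

# Build a query plan; no data is read at this point.
lazy = (
    pl.scan_csv("phase_space.csv")  # illustrative file name
    .filter(pl.col("weight") > 0)
    .select(
        ((pl.col("energy") * pl.col("weight")).sum() / pl.col("weight").sum())
        .alias("mean_energy")
    )
)
result = lazy.collect()  # execution happens here
```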
Dask looks interesting. At first glance, it seems more geared towards parallelizing operations than towards memory management?
Thank you for the details!
Yes, I think pandas might already have some support for chunked operations, and upgrades to the mentioned backends could enable this in the future.
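For instance, pandas can already iterate over a CSV in bounded chunks. A sketch (file name, column names, and energy range are illustrative) that accumulates an energy spectrum one chunk at a time - histograms combine by simple addition, which is exactly the map-reduce pattern:

```python
import numpy as np
import pandas as pd

bins = np.linspace(0.0, 20.0, 101)  # illustrative energy range [MeV]
counts = np.zeros(len(bins) - 1)

# Only `chunksize` rows are held in memory at any time.
for chunk in pd.read_csv("phase_space.csv", chunksize=1_000_000):
    counts += np.histogram(chunk["energy"], bins=bins,
                           weights=chunk["weight"])[0]
```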
For Dask: parallelization includes memory management; often, limited shared memory per node is the driving reason why one parallelizes :)
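To make that concrete, a minimal sketch (file pattern and column names assumed): Dask splits the dataset into partitions, and each worker loads, reduces, and releases one partition at a time, so the full dataset never has to fit in RAM - the same mechanism that gives parallelism also bounds memory:

```python
import dask.dataframe as dd

# Each CSV matching the pattern becomes one or more partitions.
df = dd.read_csv("phase_space-*.csv")

# Lazily build the reduction, then execute it out-of-core.
mean_energy = (
    (df["energy"] * df["weight"]).sum() / df["weight"].sum()
).compute()
```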
Hey @ax3l - I won't realistically be addressing this concern any time soon, but I think it is a very valid point. As such, I've added a page to the docs called 'Limitations', which documents this as what I think is the major limitation of this code at present...
This is perfect and great scoping guidance for users and potential future directions! Thanks a lot.
I am closing this as part of the JOSS review, but feel free to reopen it if you'd like to keep it as an issue for tracking potential future developments/contributions.
Thank you for the JOSS submission in https://github.com/openjournals/joss-reviews/issues/5375.
This is a follow-up question to #156.
In the design of this package, what are the envisioned data sizes for phase space data to be processed? Up to the size of a laptop RAM/single node?
I was looking at https://bwheelz36.github.io/ParticlePhaseSpace/new_data_loader.html and am wondering: are not most of the operations here map-reduce operations that could be implemented to stream over arbitrary data sizes, e.g., when large simulation data is being processed? (A sketch of the idea is at the end of this comment.)
I did some experiments on processing such data with Dask: https://github.com/openPMD/openPMD-api/pull/963#issuecomment-873350174
and wonder if something similar could be used as the backend here to scale up? :rocket:
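To illustrate the map-reduce point: the second-order moments behind quantities like RMS emittance reduce to weighted sums that can be accumulated per chunk and combined at the end. A sketch under that assumption - `chunks` can come from any chunked reader, the column names are made up, and a production version would want a numerically stabler (Welford-style) accumulation:

```python
import numpy as np

def rms_emittance(chunks):
    """Streamed RMS emittance: sqrt(<x^2><x'^2> - <x x'>^2), with all
    moments accumulated one chunk at a time (bounded memory)."""
    s_w = s_x = s_xp = s_xx = s_xpxp = s_xxp = 0.0
    for c in chunks:  # each c: dict-like with "x", "xp", "weight" arrays
        w = c["weight"]
        s_w += w.sum()
        s_x += (w * c["x"]).sum()
        s_xp += (w * c["xp"]).sum()
        s_xx += (w * c["x"] ** 2).sum()
        s_xpxp += (w * c["xp"] ** 2).sum()
        s_xxp += (w * c["x"] * c["xp"]).sum()
    # Convert raw moments to central moments, then to emittance.
    var_x = s_xx / s_w - (s_x / s_w) ** 2
    var_xp = s_xpxp / s_w - (s_xp / s_w) ** 2
    cov = s_xxp / s_w - (s_x / s_w) * (s_xp / s_w)
    return np.sqrt(var_x * var_xp - cov ** 2)
```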