bwheelz36 / ParticlePhaseSpace

Import, analysis, and export of particle phase space data

https://bwheelz36.github.io/ParticlePhaseSpace/

GNU General Public License v3.0

Data Sizes #158

Closed (ax3l closed 1 year ago)

ax3l commented 1 year ago

Thank you for the JOSS submission in https://github.com/openjournals/joss-reviews/issues/5375 .

This is a follow-up question to #156.

In the design of this package, what are the envisioned data sizes for phase space data to be processed? Up to the size of a laptop's RAM or a single node?

I was looking at https://bwheelz36.github.io/ParticlePhaseSpace/new_data_loader.html

and am wondering whether most of the operations here are map-reduce operations that could be implemented to stream over arbitrary data sizes, e.g., when large simulation data is being processed?

I did some experiments on processing such data with Dask: https://github.com/openPMD/openPMD-api/pull/963#issuecomment-873350174

and wonder if something similar could be used as the backend here to scale up? :rocket:
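
To illustrate the map-reduce idea, here is a minimal sketch that uses only the Python standard library. It assumes the quantity of interest is a weighted mean and standard deviation of particle energies. The names `PartialStats`, `map_chunk`, `iter_chunks`, and `streamed_stats` are hypothetical and are not part of ParticlePhaseSpace. Each chunk is reduced to a small summary, and summaries are merged, so only one chunk needs to be in memory at a time.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Iterator, Sequence, Tuple

Chunk = Tuple[Sequence[float], Sequence[float]]  # (energies, weights)


@dataclass
class PartialStats:
    """Mergeable summary of one chunk of weighted particle energies (hypothetical helper)."""
    weight_sum: float = 0.0
    weighted_energy_sum: float = 0.0
    weighted_energy_sq_sum: float = 0.0

    def merge(self, other: "PartialStats") -> "PartialStats":
        # Reduce step: partial sums simply add, so chunk order does not matter.
        return PartialStats(
            self.weight_sum + other.weight_sum,
            self.weighted_energy_sum + other.weighted_energy_sum,
            self.weighted_energy_sq_sum + other.weighted_energy_sq_sum,
        )

    @property
    def mean(self) -> float:
        return self.weighted_energy_sum / self.weight_sum

    @property
    def std(self) -> float:
        # Population standard deviation from accumulated sums.
        variance = self.weighted_energy_sq_sum / self.weight_sum - self.mean ** 2
        return math.sqrt(max(variance, 0.0))


def map_chunk(energies: Sequence[float], weights: Sequence[float]) -> PartialStats:
    """Map step: summarise a single chunk that fits in memory."""
    stats = PartialStats()
    for energy, weight in zip(energies, weights):
        stats.weight_sum += weight
        stats.weighted_energy_sum += weight * energy
        stats.weighted_energy_sq_sum += weight * energy * energy
    return stats


def iter_chunks(energies: Sequence[float], weights: Sequence[float],
                chunk_size: int) -> Iterator[Chunk]:
    """Stand-in for a chunked file reader: yields slices of the input."""
    for start in range(0, len(energies), chunk_size):
        yield energies[start:start + chunk_size], weights[start:start + chunk_size]


def streamed_stats(chunks: Iterable[Chunk]) -> PartialStats:
    """Stream over chunks, keeping only the running summary in memory."""
    total = PartialStats()
    for energies, weights in chunks:
        total = total.merge(map_chunk(energies, weights))
    return total


energies = [1.0, 2.0, 3.0, 4.0]
weights = [1.0, 1.0, 2.0, 2.0]
streamed = streamed_stats(iter_chunks(energies, weights, chunk_size=2))
in_memory = map_chunk(energies, weights)
```

Here `streamed` and `in_memory` hold the same sums, which is what makes the operation safe to stream. The sum-of-squares form of the variance is chosen for brevity; for data with a large mean relative to its spread, a more numerically careful update would be preferable.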

bwheelz36 commented 1 year ago

Hi @ax3l - yes, as you have noticed, this code really is intended to work on data that can be loaded into memory. A workaround would be to load and process data one 'chunk' at a time - this is already possible with the IAEA DataLoader, and could be (but has not been) implemented for other data loaders.

Early on I considered whether I should try to build a more abstract framework for much larger datasets, and I had a look at polars instead of pandas, which could have enabled 'lazy' evaluation. But ultimately, all the data I work with easily fits into memory (albeit my workstation has 128 GB of RAM :-P) and I was sort of trying to solve problems I didn't have, so I just decided to keep it simple...

Dask looks interesting. At first glance, it seems more geared towards parallelizing operations rather than memory management?

ax3l commented 1 year ago

Thank you for the details!

Yes, I think with pandas you might already have some support for chunked operations, and upgrades to the mentioned backends could enable this in the future.
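
As one example of the chunked support pandas already offers, `pandas.read_csv` accepts a `chunksize` argument and then returns an iterator of DataFrames rather than a single one. The sketch below assumes a CSV-like phase space table. The column names (`Ek`, `weight`) and the helper `chunked_weighted_mean` are illustrative assumptions, not the ParticlePhaseSpace data model or API.

```python
import io

import pandas as pd

# Hypothetical phase space table; the column names are illustrative only.
CSV_TEXT = """x,y,Ek,weight
0.0,0.1,1.0,1.0
0.2,0.3,2.0,1.0
0.4,0.5,3.0,2.0
0.6,0.7,4.0,2.0
0.8,0.9,5.0,4.0
"""


def chunked_weighted_mean(csv_source, column, weight_column="weight", chunksize=2):
    """Weighted mean of `column`, reading `chunksize` rows at a time."""
    weighted_sum = 0.0
    weight_total = 0.0
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        weighted_sum += float((chunk[column] * chunk[weight_column]).sum())
        weight_total += float(chunk[weight_column].sum())
    return weighted_sum / weight_total


chunked_result = chunked_weighted_mean(io.StringIO(CSV_TEXT), "Ek")

# Reference value computed with the whole table in memory.
full = pd.read_csv(io.StringIO(CSV_TEXT))
full_result = float((full["Ek"] * full["weight"]).sum() / full["weight"].sum())
```

Both results agree, but the chunked version never holds more than `chunksize` rows at once, which is the property that matters for files larger than RAM.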

For Dask: parallelization includes memory management; limited shared memory per node is often the driving reason why one parallelizes :)

bwheelz36 commented 1 year ago

Hey @ax3l - I won't realistically be addressing this concern any time soon, but I think it is a very valid point - as such, I've added a page to the docs called limitations, which details this as what I think is the major limitation of this code at present...

ax3l commented 1 year ago

This is perfect and great scoping guidance for users and potential future directions! Thanks a lot.

I am closing this as part of the JOSS review, but feel free to reopen it if you would like to keep it as an issue for tracking potential future developments/contributions.