ATTPC / Spyral

A Python analysis library for AT-TPC data
GNU General Public License v3.0

Shared memory Interpolation Mesh #144

Closed. gwm17 closed this issue 3 weeks ago.

gwm17 commented 1 month ago

Creating this issue after starting the work, but at least it will be here as some kind of record of what we're doing...

So right now our parallel scaling is limited by the memory footprint of the interpolation mesh, mostly because we insist on loading a full copy of the mesh in every process. This was done in an effort to avoid debugging potential race conditions and to speed up development, at the cost of limiting some performance down the line.

Now it is time to unchain the interpolation mesh and host it in shared memory across all processes. There are a bunch of ways to do this, but all of them have some drawback.

The first is the Linux fork copy-on-write (CoW) method, which is obviously not applicable: not everyone uses Linux, and polars doesn't play nicely with fork.

An alternative is Python's multiprocessing.sharedctypes module. This is somewhat promising: you don't need a lock, and in principle the memory has no access overhead. However, there is no explicit control over when the memory is freed, which makes me ... nervous. Also, like many of the lock-less methods described here, it requires that we never write to the mesh after it is loaded into the shared space.

We could instead go with Python's multiprocessing.shared_memory module, which provides explicit freeing semantics per instance of use, which is not exactly what we're looking for. Plus these want you to use a Manager, which requires pickling and networking and bleh...
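For comparison, a minimal sketch of the shared_memory route (illustrative names, not Spyral's code). Each process attaches to the named block and must close() its own handle, and exactly one owner unlink()s the block when everyone is done:

```python
from multiprocessing import shared_memory

import numpy as np


def publish_mesh(data: np.ndarray) -> shared_memory.SharedMemory:
    # Create a named block and copy the mesh in exactly once.
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    view = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    view[:] = data  # readers must treat the block as immutable afterwards
    return shm


def attach_mesh(name: str, shape, dtype):
    # Any process can attach by name; keep the handle alive while reading,
    # and drop all numpy views before calling close() on the handle.
    shm = shared_memory.SharedMemory(name=name)
    return np.ndarray(shape, dtype=dtype, buffer=shm.buf), shm
```

Here the explicit per-handle close()/unlink() calls are exactly the "freeing semantics per instance of use" mentioned above.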

So work in progress, but the benefit could be huge!

gwm17 commented 3 weeks ago

Went with the manager approach, based on the relatively infrequent requesting of the mesh. Some notes from testing.

It is very difficult to accurately track memory usage as it pertains specifically to the mesh. Notably, platform memory tooling has difficulty determining whether the memory is owned by one of the child processes. For example, on Ubuntu Linux, the built-in monitor often shows all of the Spyral processes owning no memory related to the mesh whatsoever, even though the total memory pressure increases accordingly. On Apple Silicon, by contrast, I see diagnostics that show the correct total memory pressure, but it looks like each process is still allocating an entire mesh! So it may be good to add a little blurb to the docs about tracking memory pressure.
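For the record, a manager-style setup can be sketched roughly as below (a sketch only, not Spyral's actual implementation; the Mesh class and lookup are illustrative). A single manager process owns the mesh, and workers call it through a proxy, paying a pickling cost on each request, which is tolerable when requests are infrequent:

```python
from multiprocessing.managers import BaseManager

import numpy as np


class Mesh:
    """Lives in the manager process; workers talk to it through a proxy."""

    def __init__(self) -> None:
        # Stand-in for loading the real interpolation mesh from disk.
        self._grid = np.linspace(0.0, 1.0, 1_000)

    def lookup(self, index: int) -> float:
        # Only the return value is pickled back to the caller, never the mesh.
        return float(self._grid[index])


class MeshManager(BaseManager):
    pass


MeshManager.register("Mesh", Mesh)

if __name__ == "__main__":
    manager = MeshManager()
    manager.start()  # spawns the single process that owns the mesh
    mesh = manager.Mesh()  # proxy object; calls travel over a local connection
    print(mesh.lookup(0), mesh.lookup(999))
    manager.shutdown()
```

This layout would also explain the confusing diagnostics above: the mesh allocation belongs to the manager process, not to any of the workers the monitors are inspecting.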

gwm17 commented 3 weeks ago

Done in v0.9.0