Open JHopeCollins opened 1 month ago
Thinking about longer term solutions to parallel outputting to solve this, here are some first thoughts on different options:
This is also relevant to #378 but posting here as this feels a more relevant place right now. Option 1 seems the most sensible to me ...
I think that it would be a lot more sustainable (for hpc installation etc) to move Gusto over to checkpoint files for IO, and then develop offline tools for transferring to and from netcdf as needed.
Of the three options that Tom identified, I agree that option 1 is the best. Option 2 gets around the out-of-memory errors but will scale poorly for larger jobs.
But, I do think that Colin's suggestion that checkpoint files are more sustainable. It would also by more Firedrake-onic (is that a word? Pythonic is. Fire-draconic maybe?), as well as being the more efficient and standard approach - dump data to disk as quickly as possible during the compute job when you're using lots of nodes, then do the postprocessing later on a smaller number of nodes.
It would also make it easier a) for other people who might have existing postprocessing workflows that use plain Firedrake rather than Gusto, and b) to compare to results from other Firedrake libraries.
As an aside, postprocessing large files from previous runs is one of the reasons most HPC machines have high memory nodes, which usually have at least twice the amount of RAM of the usual nodes. So that's one option for doing data processing that relies on reductions onto a single rank.
I'm going to stick my oar in here and disagree with @tommbendall and @JHopeCollins: I think option 1 is actually a bad idea as I don't think it will scale well on many systems, and it doesn't leverage any of the features of MPI-IO that enable the scalable writing of files. I think the "right" solution for scalability is either:
Use Firedrake HDF5 checkpointing as Colin suggested.
Both of these options will utilise MPI-IO and hopefully prevent performance and scalability issues. Happy to expand on these points if you want more information.
Only downside of Firedrake hd5 is that it creates errors on non affine meshes. So we should really work out how to fix that.
although we can keep a hedgehog mesh around for such situations.
Thanks for the discussion all!
I hadn't realised that we might be able to use the existing parallel option with netCDF4. That looks like it should be straightforward so I'll give that a try.
Since we already have checkpointing capability, it should also be straightforward to allow people to only write diagnostic output to Firedrake checkpoint files if they would prefer.
I would definitely want to keep some form of online netCDF outputting, rather than post-processing to do this, as setting that up sounds like more work!
Python parallel NetCDF does not support NetCDF4 features. To get around this, the entire coordinate field must be communicated to a single MPI rank to use NetCDF outputting. This happens in
Coordinates.register_space
: https://github.com/firedrakeproject/gusto/blob/bd9e2fe0be7d7bc872be654a9d6ac701013a81c7/gusto/core/coordinates.py#L73This is ok for smaller jobs, but is impractical/impossible for larger resolutions or processor counts. However, the
Domain
will always register the DG space: https://github.com/firedrakeproject/gusto/blob/bd9e2fe0be7d7bc872be654a9d6ac701013a81c7/gusto/core/domain.py#L147This means that even if you don't use the NetCDF outputting, the coordinate field will still be collected on one rank. Ideally this collection would be optional. Either by telling the
Domain
explicitly, or even better by delaying theregister_space
call until it is actually needed. Then ifdump_nc=False
is passed to theOutputParameters
thenregister_space
is never called. This might go in this method, but I'm not familiar enough to really tell. https://github.com/firedrakeproject/gusto/blob/bd9e2fe0be7d7bc872be654a9d6ac701013a81c7/gusto/core/io.py#L714