firedrakeproject / gusto

Three dimensional atmospheric dynamical core using the Gung Ho numerics.
http://firedrakeproject.org/gusto/
MIT License
12 stars 11 forks source link

Entire coordinate field is sent to one MPI rank for NetCDF outputting. #556

Open JHopeCollins opened 1 month ago

JHopeCollins commented 1 month ago

Python parallel NetCDF does not support NetCDF4 features. To get around this, the entire coordinate field must be communicated to a single MPI rank to use NetCDF outputting. This happens in Coordinates.register_space: https://github.com/firedrakeproject/gusto/blob/bd9e2fe0be7d7bc872be654a9d6ac701013a81c7/gusto/core/coordinates.py#L73

This is ok for smaller jobs, but is impractical/impossible for larger resolutions or processor counts. However, the Domain will always register the DG space: https://github.com/firedrakeproject/gusto/blob/bd9e2fe0be7d7bc872be654a9d6ac701013a81c7/gusto/core/domain.py#L147

This means that even if you don't use the NetCDF outputting, the coordinate field will still be collected on one rank. Ideally this collection would be optional. Either by telling the Domain explicitly, or even better by delaying the register_space call until it is actually needed. Then if dump_nc=False is passed to the OutputParameters then register_space is never called. This might go in this method, but I'm not familiar enough to really tell. https://github.com/firedrakeproject/gusto/blob/bd9e2fe0be7d7bc872be654a9d6ac701013a81c7/gusto/core/io.py#L714

tommbendall commented 1 month ago

Thinking about longer term solutions to parallel outputting to solve this, here are some first thoughts on different options:

  1. Having each rank write out to its own netCDF file. For plotting we'd presumably then need a routine to stitch these files back together.
  2. Send data to the first rank in batches, e.g. for data on an extruded mesh, send it level-by-level. The aim would be to avoid the first rank needing to allocate memory for the whole data array. But this doesn't necessarily help with any communications bottlenecks.
  3. Using pnetcdf. This would likely involve restructuring the contents of the output file as pnetcdf doesn't support several HDF5 features. It would feel odd not to me not use the HDF5 standard...

This is also relevant to #378 but posting here as this feels a more relevant place right now. Option 1 seems the most sensible to me ...

colinjcotter commented 1 month ago

I think that it would be a lot more sustainable (for hpc installation etc) to move Gusto over to checkpoint files for IO, and then develop offline tools for transferring to and from netcdf as needed.

JHopeCollins commented 1 month ago

Of the three options that Tom identified, I agree that option 1 is the best. Option 2 gets around the out-of-memory errors but will scale poorly for larger jobs.

But, I do think that Colin's suggestion that checkpoint files are more sustainable. It would also by more Firedrake-onic (is that a word? Pythonic is. Fire-draconic maybe?), as well as being the more efficient and standard approach - dump data to disk as quickly as possible during the compute job when you're using lots of nodes, then do the postprocessing later on a smaller number of nodes.

It would also make it easier a) for other people who might have existing postprocessing workflows that use plain Firedrake rather than Gusto, and b) to compare to results from other Firedrake libraries.

As an aside, postprocessing large files from previous runs is one of the reasons most HPC machines have high memory nodes, which usually have at least twice the amount of RAM of the usual nodes. So that's one option for doing data processing that relies on reductions onto a single rank.

JDBetteridge commented 1 month ago

I'm going to stick my oar in here and disagree with @tommbendall and @JHopeCollins: I think option 1 is actually a bad idea as I don't think it will scale well on many systems, and it doesn't leverage any of the features of MPI-IO that enable the scalable writing of files. I think the "right" solution for scalability is either:

  1. Use the existing parallel IO available in netcdf4.
  2. Use Firedrake HDF5 checkpointing as Colin suggested.

    Both of these options will utilise MPI-IO and hopefully prevent performance and scalability issues. Happy to expand on these points if you want more information.

colinjcotter commented 1 month ago

Only downside of Firedrake hd5 is that it creates errors on non affine meshes. So we should really work out how to fix that.

colinjcotter commented 1 month ago

although we can keep a hedgehog mesh around for such situations.

tommbendall commented 1 month ago

Thanks for the discussion all!

I hadn't realised that we might be able to use the existing parallel option with netCDF4. That looks like it should be straightforward so I'll give that a try.

Since we already have checkpointing capability, it should also be straightforward to allow people to only write diagnostic output to Firedrake checkpoint files if they would prefer.

I would definitely want to keep some form of online netCDF outputting, rather than post-processing to do this, as setting that up sounds like more work!