DedalusProject / dedalus

A flexible framework for solving PDEs with modern spectral methods.
http://dedalus-project.org/
GNU General Public License v3.0

Gather file handlers can overflow for large simulations #226

Open evanhanders opened 1 year ago

evanhanders commented 1 year ago

Hi everyone,

I'm trying to run a pretty large simulation (~1024x512x1024 or so) and my default file handlers are 'gather' file handlers. During the task evaluation (presumably during the checkpoint -- for the largest fields), I'm running into the following error:

File "compressible_dynamics.py", line 234, in <module>
    solver.step(timestep)
File "/nobackupp16/swbuild/eanders/conda_install/src/dedalus-d3/dedalus/core/solvers.py", line 645, in step
    self.timestepper.step(dt, wall_elapsed)
File "/nobackupp16/swbuild/eanders/conda_install/src/dedalus-d3/dedalus/core/timesteppers.py", line 141, in step
    evaluator.evaluate_scheduled(wall_time=wall_time, timestep=dt, sim_time=sim_time, iteration=iteration)
File "/nobackupp16/swbuild/eanders/conda_install/src/dedalus-d3/dedalus/core/evaluator.py", line 106, in evaluate_scheduled
    self.evaluate_handlers(scheduled_handlers, wall_time=wall_time, sim_time=sim_time, iteration=iteration, **kw)
File "/nobackupp16/swbuild/eanders/conda_install/src/dedalus-d3/dedalus/core/evaluator.py", line 165, in evaluate_handlers
    handler.process(**kw)
File "/nobackupp16/swbuild/eanders/conda_install/src/dedalus-d3/dedalus/core/evaluator.py", line 574, in process
    self.write_task(file, task)
File "/nobackupp16/swbuild/eanders/conda_install/src/dedalus-d3/dedalus/core/evaluator.py", line 626, in write_task
    data = out.gather_data()
File "/nobackupp16/swbuild/eanders/conda_install/src/dedalus-d3/dedalus/core/field.py", line 747, in gather_data
    pieces = self.dist.comm.gather(self.data, root=root)
File "mpi4py/MPI/Comm.pyx", line 1578, in mpi4py.MPI.Comm.gather
File "mpi4py/MPI/msgpickle.pxi", line 773, in mpi4py.MPI.PyMPI_gather
File "mpi4py/MPI/msgpickle.pxi", line 778, in mpi4py.MPI.PyMPI_gather
File "mpi4py/MPI/msgpickle.pxi", line 191, in mpi4py.MPI.pickle_allocv
File "mpi4py/MPI/msgpickle.pxi", line 182, in mpi4py.MPI.pickle_alloc
SystemError: Negative size passed to PyBytes_FromStringAndSize

I searched around a bit and found a lead on the mpi4py Google group.

Seems like the lowercase gather() uses pickle and that's what's causing the problem. Using uppercase Gather could get around this but requires a bit more preparation. I don't have the bandwidth to deal with this right now, but wanted to bring it up while I'm thinking about it!
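For anyone wanting to sanity-check the sizes: a rough sketch of the arithmetic, assuming the "negative size" comes from the pickled byte count overflowing a signed 32-bit C int (my reading of the error, not something confirmed in the traceback). The helper names `wrap_int32` and `gathered_nbytes` are made up for illustration, and pickle overhead is ignored.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit C int can hold


def wrap_int32(n):
    """Interpret n as a signed 32-bit integer, wrapping around on overflow."""
    return (n + 2**31) % 2**32 - 2**31


def gathered_nbytes(global_shape, itemsize=8):
    """Raw payload in bytes of one gathered field (pickle overhead ignored)."""
    n = itemsize
    for s in global_shape:
        n *= s
    return n


# A float64 scalar field at roughly the resolution in this issue.
field_bytes = gathered_nbytes((1024, 512, 1024))
print(field_bytes, field_bytes > INT32_MAX)

# Hypothetical 3 GiB payload: the wrapped 32-bit value is negative.
print(wrap_int32(3 * 2**30))
```

A single float64 field at this resolution is already about 4 GiB, twice what a 32-bit count can describe, so any gather that sizes its receive buffer with a C int would be in trouble at this scale.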

kburns commented 8 months ago

It looks like this may now be fixed in MPI 4.0 -- some discussion on mpi4py here: https://github.com/mpi4py/mpi4py/issues/23. I think this is a good reason to try making gather the default for now. And in the future we can still move to the uppercase/vector versions for better performance. Any thoughts, @jsoishi?
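For reference, the extra preparation the vector version needs is mostly bookkeeping: the root has to know how many elements each rank sends and where each piece lands in the receive buffer. A minimal sketch of that layout step, assuming each rank contributes one flattened, contiguous buffer (the helper name `gatherv_layout` is hypothetical and not part of Dedalus or mpi4py):

```python
def gatherv_layout(local_sizes):
    """Return (counts, displs), in elements, for a Gatherv receive buffer.

    local_sizes[i] is the number of elements rank i will send.
    """
    counts = list(local_sizes)
    displs = []
    offset = 0
    for c in counts:
        displs.append(offset)  # rank i's piece starts where the previous one ended
        offset += c
    return counts, displs


# Example: three ranks sending 4, 4, and 3 elements.
counts, displs = gatherv_layout([4, 4, 3])
print(counts, displs)

# With mpi4py, the result would then be passed along these lines (not run here):
#   comm.Gatherv(local_array, [recvbuf, counts, displs, MPI.DOUBLE], root=0)
```

Note that the counts here are in elements rather than bytes, so the buffer-based call has more headroom than the pickled one, but before MPI 4.0 large-count support they would still be limited to what a C int can hold.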