DedalusProject / dedalus

A flexible framework for solving PDEs with modern spectral methods.
http://dedalus-project.org/
GNU General Public License v3.0

Failing parallel execution of Rayleigh Benard example #301

Closed LTMeyer closed 2 months ago

LTMeyer commented 2 months ago

Context

I installed Dedalus version 3.0.2 via the suggested pip command after having installed MPI and FFTW3 manually.

Installation process

```bash
# OpenMPI 5.0.5 install
./configure
make -j
sudo make install

# FFTW3 3.3.10 install
./configure CC=mpicc CXX=mpicxx F77=mpif90 MPICC=mpicc MPICXX=mpicxx --enable-shared --enable-mpi --enable-threads --enable-openmp
make -j
sudo make install

# Install Dedalus
CC=mpicc pip3 install --no-cache --no-build-isolation dedalus
```

I took the rayleigh_benard.py example matching the Dedalus version I have installed and tried to run it in parallel using the suggested command, which produced the error below. Note that when the script is run directly with Python, without MPI, it terminates successfully.


mpiexec -n 4 python rayleigh_benard.py 

Error

File "dedalus/core/transposes.pyx", line 86, in dedalus.core.transposes.FFTWTranspose.__init__
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'

Full Error Log

> hwloc/linux: Ignoring PCI device with non-16bit domain. Pass --enable-32bits-pci-domain to configure to support such devices (warning: it would break the library ABI, don't enable unless really needed).
> PMIx was unable to find a usable compression library on the system. We will therefore be unable to compress large data streams. This may result in longer-than-normal startup times and larger memory footprints. We will continue, but strongly recommend installing zlib or a comparable compression library for better user experience.
> You can suppress this warning by adding "pcompress_base_silence_warning=1" to your PMIx MCA default parameter file, or by adding "PMIX_MCA_pcompress_base_silence_warning=1" to your environment.
>
2024-08-08 15:56:13,930 subsystems 0/4 INFO :: Building subproblem matrices 1/32 (~3%) Elapsed: 0s, Remaining: 1s, Rate: 2.4e+01/s
2024-08-08 15:56:13,988 subsystems 0/4 INFO :: Building subproblem matrices 4/32 (~12%) Elapsed: 0s, Remaining: 1s, Rate: 4.0e+01/s
2024-08-08 15:56:14,065 subsystems 0/4 INFO :: Building subproblem matrices 8/32 (~25%) Elapsed: 0s, Remaining: 1s, Rate: 4.5e+01/s
2024-08-08 15:56:14,142 subsystems 0/4 INFO :: Building subproblem matrices 12/32 (~38%) Elapsed: 0s, Remaining: 0s, Rate: 4.7e+01/s
2024-08-08 15:56:14,220 subsystems 0/4 INFO :: Building subproblem matrices 16/32 (~50%) Elapsed: 0s, Remaining: 0s, Rate: 4.8e+01/s
2024-08-08 15:56:14,302 subsystems 0/4 INFO :: Building subproblem matrices 20/32 (~62%) Elapsed: 0s, Remaining: 0s, Rate: 4.8e+01/s
2024-08-08 15:56:14,379 subsystems 0/4 INFO :: Building subproblem matrices 24/32 (~75%) Elapsed: 0s, Remaining: 0s, Rate: 4.9e+01/s
2024-08-08 15:56:14,457 subsystems 0/4 INFO :: Building subproblem matrices 28/32 (~88%) Elapsed: 1s, Remaining: 0s, Rate: 4.9e+01/s
2024-08-08 15:56:14,535 subsystems 0/4 INFO :: Building subproblem matrices 32/32 (~100%) Elapsed: 1s, Remaining: 0s, Rate: 4.9e+01/s
2024-08-08 15:56:14,544 __main__ 0/4 INFO :: Starting main loop
2024-08-08 15:56:14,579 __main__ 3/4 ERROR :: Exception raised, triggering end of main loop.
2024-08-08 15:56:14,579 __main__ 2/4 ERROR :: Exception raised, triggering end of main loop.
2024-08-08 15:56:14,579 __main__ 1/4 ERROR :: Exception raised, triggering end of main loop.
2024-08-08 15:56:14,579 __main__ 0/4 ERROR :: Exception raised, triggering end of main loop.
2024-08-08 15:56:14,580 solvers 0/4 INFO :: Final iteration: 0
2024-08-08 15:56:14,580 solvers 0/4 INFO :: Final sim time: 0.0
Traceback (most recent call last):
  File "/home/Documents/dedalus/rayleigh_benard.py", line 122, in
Traceback (most recent call last):
  File "/home/Documents/dedalus/rayleigh_benard.py", line 122, in
Traceback (most recent call last):
  File "/home/Documents/dedalus/rayleigh_benard.py", line 122, in
Traceback (most recent call last):
  File "/home/Documents/dedalus/rayleigh_benard.py", line 122, in
    solver.step(timestep)
    solver.step(timestep)
    solver.step(timestep)
    solver.step(timestep)
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 646, in step
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 646, in step
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 646, in step
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 646, in step
    self.enforce_hermitian_symmetry(self.state)
    self.enforce_hermitian_symmetry(self.state)
    self.enforce_hermitian_symmetry(self.state)
    self.enforce_hermitian_symmetry(self.state)
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 633, in enforce_hermitian_symmetry
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 633, in enforce_hermitian_symmetry
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 633, in enforce_hermitian_symmetry
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 633, in enforce_hermitian_symmetry
    f.change_scales(f.domain.dealias)
    f.change_scales(f.domain.dealias)
    f.change_scales(f.domain.dealias)
    f.change_scales(f.domain.dealias)
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 613, in change_scales
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 613, in change_scales
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 613, in change_scales
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 613, in change_scales
    self.require_coeff_space(axis)
    self.require_coeff_space(axis)
    self.require_coeff_space(axis)
    self.require_coeff_space(axis)
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 659, in require_coeff_space
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 659, in require_coeff_space
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 659, in require_coeff_space
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 659, in require_coeff_space
    self.towards_coeff_space()
    self.towards_coeff_space()
    self.towards_coeff_space()
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 641, in towards_coeff_space
    self.towards_coeff_space()
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 641, in towards_coeff_space
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 641, in towards_coeff_space
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/field.py", line 641, in towards_coeff_space
    self.dist.paths[index-1].decrement([self])
    self.dist.paths[index-1].decrement([self])
    self.dist.paths[index-1].decrement([self])
    self.dist.paths[index-1].decrement([self])
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 784, in decrement
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 784, in decrement
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 784, in decrement
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 784, in decrement
    self.decrement_single(*fields)
    self.decrement_single(*fields)
    self.decrement_single(*fields)
    self.decrement_single(*fields)
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 807, in decrement_single
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 807, in decrement_single
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 807, in decrement_single
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 807, in decrement_single
    plan = self._single_plan(field)
    plan = self._single_plan(field)
    plan = self._single_plan(field)
    ^^^^^^^^^^^^^^^^^^^^^^^^
    plan = self._single_plan(field)
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 744, in _single_plan
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 744, in _single_plan
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 744, in _single_plan
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 744, in _single_plan
    return self._plan(ncomp, sub_shape, chunk_shape, field.dtype)
    return self._plan(ncomp, sub_shape, chunk_shape, field.dtype)
    return self._plan(ncomp, sub_shape, chunk_shape, field.dtype)
    return self._plan(ncomp, sub_shape, chunk_shape, field.dtype)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/tools/cache.py", line 86, in __call__
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/tools/cache.py", line 86, in __call__
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/tools/cache.py", line 86, in __call__
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/tools/cache.py", line 86, in __call__
    result = self.function(*args, **kw)
    result = self.function(*args, **kw)
    result = self.function(*args, **kw)
    result = self.function(*args, **kw)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 737, in _plan
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 737, in _plan
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 737, in _plan
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Documents/Software/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/distributor.py", line 737, in _plan
    return TransposePlanner(full_sub_shape, full_chunk_shape, dtype, axis+1, self.comm_sub)
    return TransposePlanner(full_sub_shape, full_chunk_shape, dtype, axis+1, self.comm_sub)
    return TransposePlanner(full_sub_shape, full_chunk_shape, dtype, axis+1, self.comm_sub)
    return TransposePlanner(full_sub_shape, full_chunk_shape, dtype, axis+1, self.comm_sub)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "dedalus/core/transposes.pyx", line 86, in dedalus.core.transposes.FFTWTranspose.__init__
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "dedalus/core/transposes.pyx", line 86, in dedalus.core.transposes.FFTWTranspose.__init__
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "dedalus/core/transposes.pyx", line 86, in dedalus.core.transposes.FFTWTranspose.__init__
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "dedalus/core/transposes.pyx", line 86, in dedalus.core.transposes.FFTWTranspose.__init__
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'
> During handling of the above exception, another exception occurred:
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> Traceback (most recent call last):
  File "/home/Documents/dedalus/rayleigh_benard.py", line 133, in
  File "/home/Documents/dedalus/rayleigh_benard.py", line 133, in
  File "/home/Documents/dedalus/rayleigh_benard.py", line 133, in
  File "/home/Documents/dedalus/rayleigh_benard.py", line 133, in
    solver.log_stats()
    solver.log_stats()
    solver.log_stats()
    solver.log_stats()
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 706, in log_stats
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 706, in log_stats
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 706, in log_stats
  File "/home/Documents/miniconda3/envs/dedalus/lib/python3.12/site-packages/dedalus/core/solvers.py", line 706, in log_stats
    logger.info(f"Setup time (init - iter 0): {self.start_time:{format}} sec")
    logger.info(f"Setup time (init - iter 0): {self.start_time:{format}} sec")
    logger.info(f"Setup time (init - iter 0): {self.start_time:{format}} sec")
    ^^^^^^^^^^^^^^^
    logger.info(f"Setup time (init - iter 0): {self.start_time:{format}} sec")
    ^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^
AttributeError: 'InitialValueSolver' object has no attribute 'start_time'
AttributeError: 'InitialValueSolver' object has no attribute 'start_time'
AttributeError: 'InitialValueSolver' object has no attribute 'start_time'
AttributeError: 'InitialValueSolver' object has no attribute 'start_time'
> prterun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

How can I fix this error? I wonder whether the libraries were installed incorrectly, that is, whether OpenMPI or FFTW3 was compiled with missing or improper options.

LTMeyer commented 2 months ago

I downgraded the scipy version from 1.14.0 to 1.12.0 and the error disappeared.

The changelog for Dedalus 3.0.2 mentions a lower version of scipy, which gave me the idea to downgrade it. I still don't know what causes the issue, though.

csskene commented 2 months ago

Did numpy also downgrade to be <2.0? I think the Cython changes in numpy 2.0 may cause the issues you've seen.
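A quick way to see which versions are actually installed in the active environment is sketched below. It only uses `importlib.metadata` from the standard library; the helper names `installed_versions` and `major_version` are made up for illustration and are not part of Dedalus.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("numpy", "scipy", "dedalus")):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def major_version(version_string):
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version_string.split(".")[0])


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If `major_version` of the reported numpy version is 2 or higher, the environment is on the NumPy 2 series discussed here.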

LTMeyer commented 2 months ago

> Did numpy also downgrade to be <2.0? I think the Cython changes in numpy 2.0 may cause the issues you've seen.

Yes, downgrading scipy also forces numpy down to version 1.26.4.

Would you mind giving me some pointers so that I can better understand the origin of the error? How do the Cython changes in numpy and scipy result in the buffer error?

Should this error be fixed in Dedalus, or does it depend on a fix from scipy/numpy? If the latter, should the version of scipy be pinned in the dependencies?

csskene commented 2 months ago

I think it's related to this. I guess it's not really Cython itself, but how numpy interfaces with Cython.

The most up-to-date Dedalus master version on github works with the newest version of scipy. However, as far as I'm aware it's not yet updated to work with numpy 2.0, and this would require changes to Dedalus.

LTMeyer commented 2 months ago

> The most up-to-date Dedalus master version on github works with the newest version of scipy. However, as far as I'm aware it's not yet updated to work with numpy 2.0, and this would require changes to Dedalus.

Judging from SciPy's pyproject.toml, it requires NumPy 2 by default, although it works well with NumPy 1.

> I think it's related to this. I guess it's not really Cython itself, but how numpy interfaces with Cython.

If I understand correctly, a fix for Dedalus to support NumPy 2 would be to update the types of the arrays in transposes.pyx.

Changing the types of the arrays does indeed seem to remove the error. However, if all the arrays are declared as long, it may break backward compatibility with NumPy 1.

csskene commented 2 months ago

Just to add a little to this, I've tried a few things and think I have isolated the source of the problem. In transposes.pyx chunk_shape is a tuple with dtype=np.int64. This eventually causes B2*ranks on line 85 to become an int64 as well due to how numpy 2.0 now promotes data types. With numpy<2 the data promotion is different and B2*ranks becomes an int32 (hence no error here). Whilst setting all arrays to long would fix the issue as you say, perhaps something like casting chunk_shapes to int32's at the top of transposes.pyx is easier and maintains the original intended dtypes?
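The promotion difference can be reproduced outside Dedalus with a few lines of NumPy. This is only a sketch under assumptions: `ranks` and `B2` are hypothetical stand-ins with made-up values, not the actual variables from transposes.pyx, and the two workarounds shown are generic options rather than the change that was eventually made in Dedalus.

```python
import numpy as np

# Hypothetical stand-ins for the quantities discussed above (values are made up).
ranks = np.arange(4, dtype=np.int32)  # 32-bit integer array, like a C int buffer
B2 = np.int64(16)                     # scalar derived from an int64 chunk shape

# NumPy >= 2 promotes by dtype alone (int64 scalar * int32 array -> int64).
# NumPy < 2 looked at the scalar's value and kept int32, so the dtype printed
# here depends on the installed NumPy version.
offsets = B2 * ranks
print(np.__version__, offsets.dtype)

# Option 1: convert the scalar to a plain Python int first. Python ints do not
# force an upcast, so the array's int32 dtype is preserved in both NumPy series.
offsets_int = int(B2) * ranks

# Option 2: cast the result back to int32 explicitly.
offsets_cast = (B2 * ranks).astype(np.int32)

print(offsets_int.dtype, offsets_cast.dtype)
```

Either option hands a 32-bit buffer to code that expects C `int`, which is the type the `Buffer dtype mismatch` message complains about.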

kburns commented 2 months ago

Hmm I'm a little confused because the conda-forge feedstock for Dedalus is currently successfully creating and testing builds with numpy > 2, with scipy pinned < 1.14.

csskene commented 2 months ago

The tests all pass for me too. I think this is because the problem only occurs in parallel, for example running the Rayleigh-Bénard example with four processors causes it.

LTMeyer commented 2 months ago

> The tests all pass for me too. I think this is because the problem only occurs in parallel, for example running the Rayleigh-Bénard example with four processors causes it.

I confirm the issue only occurs when running in parallel. Sequential invocation of the Rayleigh-Bénard example works fine. With MPI, however, the example fails with the error described above.

> Whilst setting all arrays to long would fix the issue as you say, perhaps something like casting chunk_shapes to int32's at the top of transposes.pyx is easier and maintains the original intended dtypes?

I think casting the problematic data to the correct data type is indeed a good idea.

kburns commented 2 months ago

Thank you both for digging in to this. I just pushed a fix in 02cdaec that I think should take care of it.

LTMeyer commented 2 months ago

> Thank you both for digging in to this. I just pushed a fix in 02cdaec that I think should take care of it.

Thank you. I've tried again with your commit, numpy 2.0.1 and scipy 1.14. The Rayleigh-Bénard example now runs in parallel without any issue.

I'm thus closing the issue as you've fixed it.