Open RTSandberg opened 5 months ago
Thank you, @RTSandberg !
For reproducibility, can you please add the OS you used, versions of Python, pyAMReX, PyTorch, MPI flavor and version, and mpi4py version?
If we can reduce this problem to a pure mpi4py
+ PyTorch
issue, then we could also report this upstream in PyTorch
: https://github.com/pytorch/pytorch/issues
On my local machine, PyTorch has some internal multithreaded functionality that doesn't get along with AMReX. Unless I set PyTorch.set_num_threads(1 or 2), then the attached script will hang when the neural network tries to set its initial parameters.
This script downloads some neural network parameters from Zenodo archive to then load them, and the
load_state_dict
function is the specific point of failure.pytorch_amrex_hang_reproducer_v2.py.txt