Open jpoto opened 1 year ago
On LUMI we are able to run on single nodes, but have problems with MPI.
I have attached the outputs for OpenMPI.
qforce.0.log qforce.1.log qforce.2.log qforce.3.log job_gnu_mpi.slurm.txt output.log
For Cray-MPICH I get the same error. output.log
With the help of the Frontier support staff at the Frontier hackathon, we were able to localize the problem: the MPI_Improbe and MPI_Imrecv functions are not supported by the current Slingshot (libfabric) stack; see also https://docs.nersc.gov/current/#ongoing-issues.
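For anyone unfamiliar with these calls, here is a minimal sketch of the matched-probe pattern (MPI_Improbe followed by MPI_Imrecv) next to a possible fallback that uses only MPI_Iprobe and MPI_Recv. It is written against mpi4py for brevity and is not taken from the ExaTensor sources. The function name `try_receive` and the `use_matched_probe` switch are hypothetical, `comm` is assumed to be an mpi4py communicator such as `MPI.COMM_WORLD`, and `buf` a preallocated buffer. Whether such a fallback actually avoids the failure on Slingshot has not been verified here.

```python
def try_receive(comm, buf, source, tag, use_matched_probe=True):
    """Poll once for a message from (source, tag); receive it into buf if present.

    Returns True if a message was received, False if nothing was pending.
    Assumes an explicit source and tag (no wildcards).
    """
    if use_matched_probe:
        # MPI_Improbe: non-blocking probe that removes the message from the
        # matching queue and hands back a message handle (None if no message).
        msg = comm.Improbe(source=source, tag=tag)
        if msg is None:
            return False
        # MPI_Imrecv: non-blocking receive of exactly the matched message.
        msg.Irecv(buf).Wait()
        return True

    # Fallback: MPI_Iprobe only reports that a message is pending; the
    # following MPI_Recv matches it again. This is only safe when a single
    # thread receives on this communicator with this source and tag.
    if not comm.Iprobe(source=source, tag=tag):
        return False
    comm.Recv(buf, source=source, tag=tag)
    return True
```

With mpi4py installed, a receiver could call `try_receive(MPI.COMM_WORLD, buf, source=0, tag=7, use_matched_probe=False)` in its polling loop. The fallback gives up the thread-safety guarantee that matched probes were introduced for, so it only fits code where one thread per communicator does the receiving.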
This issue is meant for discussing the deployment of ExaTensor on systems such as the LUMI and Frontier supercomputers.