RelMBdev / ExaTENSOR

Basic numerical tensor algebra library for distributed heterogeneous HPC platforms
BSD 3-Clause "New" or "Revised" License

Run ExaTensor on CRAY/AMD accelerators #3

Open jpoto opened 1 year ago

jpoto commented 1 year ago

This is an issue to discuss the deployment of ExaTensor on systems like the LUMI and Frontier supercomputers.

jpoto commented 1 year ago

On LUMI we are able to run on single nodes, but have problems with MPI.

I have attached the outputs for OpenMPI.

qforce.0.log qforce.1.log qforce.2.log qforce.3.log job_gnu_mpi.slurm.txt output.log

jpoto commented 1 year ago

For Cray-MPICH I get the same error. output.log

jpoto commented 1 year ago

With the help of the Frontier support staff at the Frontier hackathon, we were able to localize the problem: the MPI_Improbe and MPI_Imrecv functions are not supported by the current Slingshot (libfabric) implementation. See also https://docs.nersc.gov/current/#ongoing-issues.