amd / InfinityHub-CI


hpl-mxp: PMIX ERROR when running with singularity #10

Closed daviteix closed 5 months ago

daviteix commented 5 months ago

Why does the following command fail with the error below?

singularity run --writable-tmpfs --pwd /benchmark ./hpl-ai.sif mpirun -np 8 --map-by node:PE=1 hpl-ai -P 4 -B 2560 -N 332800

PMIX ERROR: NO-PERMISSIONS in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/common/dstore/dstore_base.c at line 237

My config: ROCm 5.7, RHEL8, Singularity 4.0.2, 4xMI250

cmcknigh commented 5 months ago

@daviteix I tried to reproduce this with a similar setup. Node configuration:

ROCm 6.0.2
RHEL8.9
Singularity 3.10.2 and 4.1.2
4xMI250

Using the same input, I was able to complete the HPL-MxP run successfully:

=============================================================================
=               HPL-AI Mixed-Precision Benchmark for AMD GPUs               =
=============================================================================
#BEGIN: Mon Apr 22 14:32:07 2024
#END__: Mon Apr 22 14:32:34 2024
...

PMIX is a library used by OpenMPI. The issue likely lies in the permissions your user has on the host system, which can limit the permissions available within a container.

daviteix commented 5 months ago

Yes, PMIX is related to OpenMPI, but isn't it using the OpenMPI provided inside the container? Did you run your tests as root (in which case you would not see the issue)?

cmcknigh commented 5 months ago

I am not running as root, so that can be ruled out. The message appears to refer to a source file, but the source code was removed after install, so it would make sense that it has no permissions on something that doesn't exist. Try adding the argument --mca pmix_base_verbose 100 before hpl-ai to see whether the debug output tells us more.

daviteix commented 5 months ago

Thank you for looking into this. The error message only mentions the file path and line number, which C/C++ code can reference through the macros __FILE__ and __LINE__: the compiler replaces them at compile time and embeds the values in the output binary. The source code does not have to be present when the app runs. The message points to where in the code the error occurred; it does not mean the program is trying to access that particular file. I will try adding the extra parameters and let you know.

daviteix commented 5 months ago

The error appears if my tmp directory is on NFS. If I set it to a local disk, it works. The error is caused by a failing unlink(2) system call. Thank you, I will close this issue.