charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.

UCX: applications do not work without charmrun/mpirun #2477

Open matthiasdiener opened 4 years ago

matthiasdiener commented 4 years ago

Crash observed on application startup on golub:

charm/tests/charm++/simplearrayhello $ ./hello
------- Partition 0 Processor 0 Exiting: Called CmiAbort ------
Reason: UCX: UcxInitEps: runtime_kvs_put error failed: 5
[0] Stack Traceback:
  [0:0] _Z14CmiAbortHelperPKcS0_S0_ii+0x4d  [0x54800d]
  [0:1]   [0x54812d]
  [0:2] _Z8LrtsInitPiPPPcS_S_+0xd6f  [0x54d2bf]
  [0:3] ConverseInit+0x1ce  [0x54d8de]
  [0:4] charm_main+0x27  [0x48c307]
  [0:5] __libc_start_main+0xf5  [0x2abf9af0b495]
  [0:6]   [0x486360]

Running with Charmrun works fine:

charm/tests/charm++/simplearrayhello $ ./charmrun ./hello

Running on 1 processors:  ./hello
Charm++> Running in non-SMP mode: 1 processes (PEs)
Converse/Charm++ Commit ID: v6.9.0-535-g549280a73
[...]
matthiasdiener commented 4 years ago

Running with charmrun, but with the ++local option, exhibits the same problem.
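For reference, a sketch of that invocation, assuming the same hello test binary as above and a single PE (++local keeps all processes on the local node instead of launching through a remote shell):

./charmrun +p1 ./hello ++local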

s-sajid-ali commented 4 years ago

I'm seeing the same error when trying to run namd@git-master built against charmpp@6.10.1 with the UCX back-end. (I tried compiling namd@git-master against charm++@git-master, but that failed, so I ended up building against the 6.10.1 version.) Adding charmrun in front of the application does not solve the problem:

charmrun +p16 namd2 stmv.namd ++mpiexec ++remote-shell mpiexec --oversubscribe </dev/null &> mpirun_$run
srun --mpi=pmix_v2 -N 2 -n 8 -c 2 charmrun namd2 stmv.namd </dev/null &> srun_$run

Both fail with an error similar to the one above, but the srun case first prints the following lines a couple of times before crashing with the UcxInitEps error:

Running on 1 processors:  namd2 stmv.namd
charmrun>  /usr/bin/setarch x86_64 -R  mpirun -np 1  namd2 stmv.namd

What would the workaround for this case be?

I hope this is the right place for this question, as opposed to the NAMD mailing list; let me know if that's not the case and I'll post there instead.

Edit: attached the NAMD build dependencies and configuration: namd_build_config.txt

s-sajid-ali commented 4 years ago

I realized that this could be because OpenMPI was built without PMIx support. When I built charm++ with the slurmPMI2 backend, everything worked as expected. Apologies for the unnecessary post on the issue tracker.
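For anyone else hitting this, the working build corresponds to something like the following, assuming the standard ./build syntax for the UCX machine layer (the target triplet and extra options here are assumptions on my part; slurmpmi2 takes the place of ompipmix as the process-management option):

./build charm++ ucx-linux-x86_64 slurmpmi2 --with-production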

It might be good to add a check during charm++ configuration that raises a warning when the ompipmix flag is passed but OpenMPI is missing PMIx support.
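A minimal sketch of what such a check could look like, assuming the build scripts can shell out to ompi_info and that grepping its component list for PMIx is a good enough signal (where exactly this would hook into the charm++ build system is not something I know):

# hypothetical configure-time check: warn if ompipmix is requested but
# the detected Open MPI installation reports no PMIx component
if ! ompi_info 2>/dev/null | grep -qi pmix; then
  echo "Warning: ompipmix requested, but this Open MPI build does not appear to include PMIx support" >&2
fi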