We discovered that HPE's internal systems have more built-in k8s capabilities than LLNL's hetchy.
It seems that the security settings for running on hetchy are stricter than those in the test environment used at HPE. We found that we needed to add the following security context to the demo worker container in order for ssh (and mpirun) to successfully connect to the worker:
nnf-sos should be updated to ensure these capabilities are added to the user container's workflow. Flux/users shouldn't need to define these themselves.
Without this in the worker's podSpec, the MPI launcher container could not ssh to the MPI worker container on hetchy. We haven't seen this issue on HPE systems.
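
For illustration, a container-level security context that adds Linux capabilities has roughly the shape below. This is only a sketch: the container name, the image, and the specific capability list are assumptions (capabilities that sshd commonly needs), not necessarily the ones required on hetchy.

```yaml
# Hypothetical worker container fragment from a podSpec.
# The name, image, and capability list are assumptions for illustration.
containers:
  - name: mpi-worker                 # hypothetical container name
    image: example/mpi-worker:latest # hypothetical image
    securityContext:
      capabilities:
        add:
          - NET_BIND_SERVICE  # bind sshd to a privileged port such as 22
          - SYS_CHROOT        # sshd privilege separation calls chroot
          - AUDIT_WRITE       # sshd writes audit records on login
          - SETUID            # sshd switches to the login user
          - SETGID            # sshd switches to the login group
```

If nnf-sos injected a block like this into the worker container by default, users and Flux would not need to carry it in their own specs.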