NearNodeFlash / NearNodeFlash.github.io

View this document https://nearnodeflash.github.io/
Apache License 2.0

User Containers: Ensure linux capabilities are set for MPI workers #77

Closed bdevcich closed 1 year ago

bdevcich commented 1 year ago

We discovered that HPE's internal systems have more built-in k8s capabilities than LLNL's hetchy does:

It seems that the security settings for running on hetchy are more strict than those in the test environment used at HPE. We found that we needed to add the following security context to the demo worker container in order for ssh (and mpirun) to successfully connect to the worker:

securityContext:
  allowPrivilegeEscalation: true
  capabilities:
    add:
    - NET_BIND_SERVICE
    - SYS_CHROOT
    - AUDIT_WRITE

nnf-sos should be updated to ensure these capabilities are added to the user container's workflow. Flux/users shouldn't need to define these themselves.

Without this in the worker's podSpec, the MPI launcher container could not ssh to the MPI worker container on hetchy. We haven't seen this issue on HPE systems.
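
For context, here is a minimal sketch of where that security context would sit in a worker pod. In Kubernetes, `capabilities` is set on the container's `securityContext`, not the pod-level one. The pod name, container name, and image below are hypothetical placeholders, not the actual spec generated by nnf-sos:

```yaml
# Illustrative only: metadata.name, the container name, and the image
# are assumptions made for this example.
apiVersion: v1
kind: Pod
metadata:
  name: mpi-worker-example
spec:
  containers:
  - name: worker
    image: example.com/mpi-worker:latest
    # Container-level security context with the capabilities quoted above.
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add:
        - NET_BIND_SERVICE
        - SYS_CHROOT
        - AUDIT_WRITE
```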

bdevcich commented 1 year ago

This has been fixed via https://github.com/NearNodeFlash/nnf-sos/pull/209. The workaround in the nnf-container-example should no longer be needed.