RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

likwid-bridge: a helper tool to run likwid in containers like Apptainer #626

Open pradt2 opened 1 month ago

pradt2 commented 1 month ago

Hi,

Introduction

Most HPC systems grant only a basic set of permissions (i.e. no sudo, no CAP_SYS_ADMIN). This means that only container technologies that work without sudo are readily available to end users.

Using likwid in unprivileged containers like Apptainer can be problematic. The perf_event method does work provided that the perf_event_paranoid setting is low enough, but the data is often limited (the clusters I've tested all had perf_event_paranoid=2). Setuid binaries don't get any additional privileges when executed inside an unprivileged container, so spawning access daemons inside the container isn't an option either.
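To see which situation a given node is in, the setting can be read from procfs. The following is a minimal sketch, assuming a Linux system where the value is exposed at /proc/sys/kernel/perf_event_paranoid; the function names are hypothetical and not part of likwid.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse the integer level from the text of the
 * perf_event_paranoid file. Returns 0 on success, -1 if the text does
 * not start with an integer. */
int parse_paranoid_level(const char *text, int *level)
{
    assert(text != NULL && level != NULL);
    return sscanf(text, "%d", level) == 1 ? 0 : -1;
}

/* Hypothetical helper: read the current level from procfs (Linux only).
 * Returns 0 on success, -1 if the file is missing or unreadable. */
int read_paranoid_level(int *level)
{
    char buf[32];
    FILE *fp = fopen("/proc/sys/kernel/perf_event_paranoid", "r");
    if (fp == NULL)
        return -1;
    char *line = fgets(buf, sizeof(buf), fp);
    fclose(fp);
    return line != NULL ? parse_paranoid_level(buf, level) : -1;
}
```

A caller could use read_paranoid_level() to decide early whether perf_event is likely to deliver enough data or whether the access daemon route is needed.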

Idea

This PR demonstrates how we can utilise the access daemon installed on the host machine from inside Apptainer. The general idea is to have a process on the host that listens for requests from within Apptainer, spawns the access daemon processes on the host, and communicates back the paths of the newly spawned access daemons, so that likwid can use them from within Apptainer.

This is possible because the /tmp filesystem is shared between the host and Apptainer, so socket-based communication over sockets located under /tmp works as expected.
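The mechanism relied on here is a plain Unix domain socket whose file lives under /tmp. The sketch below shows the generic pattern only; it is not the code from bridge.c, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill a sockaddr_un for `path`; returns -1 if the path is too long. */
static int fill_unix_addr(struct sockaddr_un *addr, const char *path)
{
    assert(addr != NULL && path != NULL);
    if (strlen(path) >= sizeof(addr->sun_path))
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    strcpy(addr->sun_path, path);
    return 0;
}

/* Host side (hypothetical): create a listening socket at `path`.
 * Returns the listening fd, or -1 on error. */
int listen_on_unix_socket(const char *path)
{
    struct sockaddr_un addr;
    if (fill_unix_addr(&addr, path) < 0)
        return -1;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    unlink(path); /* remove a stale socket file, if any */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 8) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Container side (hypothetical): connect to the socket at `path`.
 * Returns the connected fd, or -1 on error. */
int connect_to_unix_socket(const char *path)
{
    struct sockaddr_un addr;
    if (fill_unix_addr(&addr, path) < 0)
        return -1;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Because both sides only need to agree on a filesystem path, the pattern works across the container boundary as long as that path is visible on both sides.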

Implementation

This PR consists of two changes: a new command-line utility that I call likwid-bridge, in src/bridge/bridge.c, and a change to the src/access_client.c file.

The new utility listens on the host side and spawns access daemons on demand. It should be invoked as follows:

> likwid-bridge apptainer exec mycontainer.sif likwid-perfctr args...

The bridge binary acts as a wrapper - it starts listening for requests, and executes whatever arguments we give it in a child process. When the child process finishes, it stops listening for new requests, cleans up, and exits.
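The wrapper behaviour described above follows the usual fork/exec/wait pattern. A minimal sketch, assuming a POSIX system; run_wrapped_command is a hypothetical name, and the request handling that the real bridge performs while the child runs is left out.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the NULL-terminated command line `argv` in a child process and wait
 * for it to finish. Returns the child's exit status, or -1 if the child
 * could not be started or did not exit normally. */
int run_wrapped_command(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* only reached if execvp() failed */
    }

    /* A real wrapper would serve requests here until the child exits. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In a main() function this would typically be called as run_wrapped_command(&argv[1]), so that everything after the wrapper's own name is treated as the command to execute.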

When likwid-bridge starts listening for requests, it exposes a new environment variable, LIKWID_BRIDGE_PATH, which is then used by likwid-perfctr to decide how to spawn the access daemons (either directly via execve() or via the bridge). All the necessary changes to likwid-perfctr are confined to the src/access_client.c file.
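The decision described here can be pictured as a small environment check. This is only a sketch of the idea, not the code in src/access_client.c: the enum, the function names, and the choice to treat an empty value like an unset variable are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical: the two ways of starting an access daemon. */
enum spawn_mode { SPAWN_DIRECT, SPAWN_VIA_BRIDGE };

/* Pick the mode from the value of LIKWID_BRIDGE_PATH (NULL if unset).
 * Assumption: an empty value is treated the same as an unset variable. */
enum spawn_mode select_spawn_mode(const char *bridge_path)
{
    return (bridge_path != NULL && bridge_path[0] != '\0')
               ? SPAWN_VIA_BRIDGE
               : SPAWN_DIRECT;
}

/* Convenience wrapper that consults the process environment. */
enum spawn_mode spawn_mode_from_env(void)
{
    return select_spawn_mode(getenv("LIKWID_BRIDGE_PATH"));
}

/* Human-readable name, e.g. for debug output. */
const char *spawn_mode_name(enum spawn_mode mode)
{
    assert(mode == SPAWN_DIRECT || mode == SPAWN_VIA_BRIDGE);
    return mode == SPAWN_VIA_BRIDGE ? "bridge" : "direct";
}
```

Keeping the check in one function means the rest of the client code does not need to know whether it runs inside a container.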

Code state

The changes to src/access_client.c mimic the style of the file. I think minor changes may be required.

The new file src/bridge/bridge.c is at the moment a completely standalone file. It does not use many parts of the project that it perhaps should (e.g. logging). Furthermore, it's not hooked into the project build system. I expect that if you are willing to incorporate these changes, some potentially major work will be needed to make this new utility a first-class citizen of this project.

Summary

This PR introduces changes that allow me to use likwid inside Apptainer. If you're interested in incorporating them, I'm willing to put in the effort to make the changes that are needed to achieve this.

TomTheBear commented 1 month ago

Many thanks for looking into this. It is definitely required nowadays to have some support for containers.

The code looks reasonable, but I have to test how it behaves in various situations (errors in the container runtime, errors in likwid-perfctr inside the container, errors in the host's accessDaemon, ...). Some error printing or debug information would probably help to analyze future issues with the bridge.

The incorporation into the build system is on me; it might get iffy. But what is your opinion: should the bridge always be installed, or should it be put behind a configuration option?

pradt2 commented 1 month ago

I think I'd put it behind a configuration flag, much like other likwid utilities are set up (e.g. likwid-setFreq). From my point of view as a user, I'd prefer the feature to be enabled by default, so that I don't have to ask the sysadmins for it explicitly.