geopmd is showing up as a leftover process on the GPUs at the conclusion of an app run

bgeltz commented 1 year ago

Describe the bug I ran an application with GEOPM and expected the GPUs to show no clients at the conclusion of the run. Instead, a client named "geopmd" remains.

GEOPM version
Installed service version: 2.0.0+dev411g47f3f6e3
Userspace runtime version: 4cf0e9596
build.sh invocation: GEOPM_BASE_CONFIG_OPTIONS="--with-sqlite3=${HOME}/build/sqlite3 --enable-beta" GEOPM_SKIP_SERVICE_INSTALL=yes ./integration/config/build.sh

Expected behavior Run the GPU workload with GEOPM:

$ geopmagent -a gpu_activity -p 0.5 > gpu_activity.policy
$ geopmlaunch pals -n 1 --geopm-affinity-disable --geopm-agent=gpu_activity --geopm-policy=gpu_activity.policy -- ~/geopm/integration/apps/parres/Kernels/Cxx11/dgemm-onemkl 10 1600
... <app_output> ...
$ echo $?
0

Observe that there are no clients on the GPUs at the conclusion of the run:

$ ls -la /sys/class/drm/card*/clients/*
ls: cannot access '/sys/class/drm/card*/clients/*': No such file or directory

Actual behavior Every card* directory has one of these entries indicating a geopmd process is still on the GPUs:

$ ls -la /sys/class/drm/card0/clients
total 0
drwxr-xr-x 3 root root 0 May 23 06:11 .
drwxr-xr-x 8 root root 0 May 23 06:11 ..
drwxr-xr-x 4 root root 0 May 24 16:52 856

$ cat /sys/class/drm/card0/clients/856/name
geopmd
$ cat /sys/class/drm/card0/clients/856/pid
201474

$ systemctl status geopm
● geopm.service - Global Extensible Open Power Manager Service
     Loaded: loaded (/usr/lib/systemd/system/geopm.service; enabled; vendor preset: disabled)
     Active: active (running) since Wed 2023-05-24 16:50:44 PDT; 3min 56s ago
   Main PID: 201474 (geopmd)
      Tasks: 3
     CGroup: /system.slice/geopm.service
             └─201474 /usr/bin/python3 /usr/bin/geopmd
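
The per-card checks above could be combined into one pass over all cards. This is a minimal sketch, assuming the sysfs layout shown in this report (a name file and a pid file in each client directory). The function name list_drm_clients and its optional root-directory argument are hypothetical, added only so the loop can be pointed at a test tree:

```shell
# Print one line per DRM client: "<card> <client id> <name> <pid>".
# The optional first argument overrides the sysfs root (hypothetical, for testing).
list_drm_clients() {
    drm_root="${1:-/sys/class/drm}"
    for client_dir in "$drm_root"/card*/clients/*; do
        # When no clients exist the glob stays unexpanded; skip it.
        [ -r "$client_dir/name" ] || continue
        clients_dir=$(dirname "$client_dir")
        card=$(basename "$(dirname "$clients_dir")")
        printf '%s %s %s %s\n' \
            "$card" \
            "$(basename "$client_dir")" \
            "$(cat "$client_dir/name")" \
            "$(cat "$client_dir/pid" 2>/dev/null)"
    done
}
```

Calling list_drm_clients with no argument reads /sys/class/drm. A pid that matches the Main PID reported by systemctl status geopm would indicate the client belongs to the service itself.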

Additional context Restarting the service cleans up the processes on the GPUs.

asmaalrawi commented 2 months ago

Check if this is still reproducible.

cmcantalupo commented 1 month ago

I think this is expected when PlatformIO::read_signal() or PlatformIO::write_control() is called through the ServiceIOGroup. The batch interface does not show this issue because the forked process (the batch server) is the one that initiates the "client" on the GPU.

bgeltz commented 1 month ago

This is still happening. On a fresh compute node allocation:

[bgeltz@node0 ~]$ cat /sys/class/drm/card*/clients/*/name
geopmd
geopmd
geopmd
geopmd
geopmd
geopmd

I can remove these clients by restarting the service. They will return upon any invocation of geopmread, geopmwrite, geopmsession, etc.

bgeltz commented 1 month ago

Open documentation tasks for:

- This issue. Sysadmins of L0/PVC systems may want to kill or restart the service to purge these clients after a run (e.g. in a job epilog).
- Timing of the first call to geopmread, etc. after a service restart (i.e. IOGroup initialization).
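
The job-epilog idea might look something like the following. This is a minimal sketch, assuming the epilog runs with permission to restart geopm.service and that leftover clients report the name geopmd, as shown earlier in this issue. The function names and the GEOPM_RESTART_CMD override are hypothetical and not part of GEOPM:

```shell
# Return success if any DRM client is named "geopmd".
# The optional first argument overrides the sysfs root (hypothetical, for testing).
has_geopmd_client() {
    drm_root="${1:-/sys/class/drm}"
    for name_file in "$drm_root"/card*/clients/*/name; do
        # When no clients exist the glob stays unexpanded; skip it.
        [ -r "$name_file" ] || continue
        [ "$(cat "$name_file")" = "geopmd" ] && return 0
    done
    return 1
}

# Epilog step: restart the service only when leftover clients are present.
# GEOPM_RESTART_CMD is a hypothetical override for dry runs and testing.
purge_geopmd_clients() {
    if has_geopmd_client "$1"; then
        ${GEOPM_RESTART_CMD:-systemctl restart geopm}
    fi
}
```

An epilog script could call purge_geopmd_clients with no argument. Restarting only when a client is found avoids paying the service startup cost after runs that left nothing behind.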

cmcantalupo commented 1 month ago

Opened new issues to deal with the problem.

geopm/geopm #2956