BryanQuigley opened this issue 7 months ago
@BryanQuigley, thank you for the suggestion. Today, dcgm-exporter uses the following logic: when the no-hostname config option is false (the default), dcgm-exporter attempts to get the hostname from the NODE_NAME environment variable, falling back to the container hostname.
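For example, a minimal sketch of running the exporter standalone with NODE_NAME set to the host's name (the image tag here is illustrative; setting env vars with -e is standard docker):
docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 \
  -e NODE_NAME="$(hostname)" \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04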
Can you check what you see on the PBS instance?
Thanks for the quick reply! Sorry, I missed some key bits. We can't run dcgm-exporter in each container, which I believe is what was described above.
We want to deploy dcgm-exporter on the host and have it report on all containers on the host. The hostname field is currently working as we would want. We want to add the container field, similar to what is done in k8s.
The simple case is just one container to parse, but on some nodes there may be 8 containers we want metrics from (and the reason we are moving away from nvidia-smi metrics is MIG support, so that means potentially a lot of individual containers).
Docker inspect path
PBS Specific
There may be other ways to associate it with a process or control group, but I don't see an obvious way to get back to the container name. Thanks!
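(For reference, a sketch of the docker inspect route mentioned above: print each running container's name and env vars, then filter for containers that set NVIDIA_VISIBLE_DEVICES. Only standard docker flags are used here.)
docker inspect --format '{{ .Name }} {{ .Config.Env }}' $(docker ps -q) | grep NVIDIA_VISIBLE_DEVICES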
@BryanQuigley, please run the hostname command inside the docker container.
@BryanQuigley, I see where the problem is. Today, dcgm-exporter uses the k8s API to get a list of pods and containers and uses this information to map containers to devices. We need to evaluate and prioritize this feature request.
@BryanQuigley, what is PBS?
A high performance computing workload manager (it has open source and closed-source versions; I believe for this purpose they are the same): https://openpbs.org/ https://altair.com/pbs-professional/
@BryanQuigley, re: the GPU files in /var/spool/pbs/mom_priv/jobs/.GPU - is that configurable? What is inside the ".GPU" file? Can the file contain a job name?
If the file contains a job name, we may read the files and provide labels: GPU => Job Name, for example.
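A metric could then look something like this (a sketch; the hpc_job label name is an assumption, while DCGM_FI_DEV_GPU_UTIL and the gpu/Hostname labels already appear in dcgm-exporter output):
DCGM_FI_DEV_GPU_UTIL{gpu="2",Hostname="node01",hpc_job="job123"} 87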
So currently it's:
cat /var/spool/pbs/mom_priv/jobs/12345.hpcq.GPU
/dev/nvidia2
Are you saying it could work today if it was instead:
cat /var/spool/pbs/mom_priv/jobs/12345.hpcq.GPU
/dev/nvidia2=12345.hpcq
Or have one file mapping all the job IDs to nvidia devices?
I have a similar request about another HPC workload manager. We are considering a file format something like this:
Is it something that you can configure in your environment?
I'll check on how configurable that is.
We do have containers that use more than one GPU, so the file ends up looking like /dev/nvidia7 ... /dev/nvidia0
We should be able to create other custom files to your spec. Which files to pull from will be configurable, yes?
Just as long as a GPU can be associated with multiple containers we should be good. This should work with MIG device names too?
@BryanQuigley, what do you mean by "configurable what files to pull from"? We can do the following, for example: you pass a path to a directory where dcgm-exporter can find files; the file name is a GPU ID (numeric: 0, 1, 2, etc.), and each line of the file is assumed to be a job label value.
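A sketch of that layout (the directory path is hypothetical; file names are GPU IDs, one job label per line):
ls /path/to/job-mapping
0  2
cat /path/to/job-mapping/2
12345.hpcq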
path to a directory works ^
Sorry for the delay. Yes, it is configurable. Ideally we'd get one job mapped to multiple GPUs, though.
@BryanQuigley, you can try our integration with HPC in the new version: https://github.com/NVIDIA/dcgm-exporter/releases/tag/3.3.6-3.4.2. Link to the readme: https://github.com/NVIDIA/dcgm-exporter/blob/main/README.md#how-to-include-hpc-jobs-in-metric-labels
Thanks! Will give it a try.
Hi, I have a question regarding environments that are not running on top of HPC (only a single server with a GPU). How can I get the container name in the metrics?
I'm not using K8S but want to collect the container name as part of the metrics. Each job runs in a container, and the container name matches the job ID we want to query by.
I'm hoping I'm just missing an option to have the container name collected via docker (or podman) env variables. docker inspect shows what devices are visible via NVIDIA_VISIBLE_DEVICES.
Or grab the output from a file which lists device names like:
/dev/nvidia5
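To make that concrete, a rough sketch of what I'm imagining: generate per-GPU files, named after the GPU index with one container name per line, following the scheme discussed earlier in this thread. The output directory is hypothetical, and this only handles plain comma-separated GPU indices in NVIDIA_VISIBLE_DEVICES, not "all" or UUID lists.
#!/bin/sh
# Sketch only: build per-GPU mapping files (file name = GPU index,
# one container name per line) from running docker containers.
OUT=/path/to/job-mapping   # hypothetical directory
mkdir -p "$OUT" && rm -f "$OUT"/*
for id in $(docker ps -q); do
  name=$(docker inspect --format '{{ .Name }}' "$id" | sed 's|^/||')
  devs=$(docker inspect --format '{{ range .Config.Env }}{{ println . }}{{ end }}' "$id" | sed -n 's/^NVIDIA_VISIBLE_DEVICES=//p')
  case "$devs" in
    ''|all|*GPU-*) continue ;;   # skip unset, "all", and UUID lists
  esac
  for gpu in $(echo "$devs" | tr ',' ' '); do
    echo "$name" >> "$OUT/$gpu"   # append: a GPU may serve several containers
  done
done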
Any other approaches welcome!