NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.
Apache License 2.0

Does DCGM support creating groups of GPUs from different hosts? #146

Open deferen2 opened 8 months ago

deferen2 commented 8 months ago

I’ve read the section on groups in the documentation, but I’m still unclear about the limitations of the Groups feature. I’m not sure if it’s restricted to creating a group composed of GPUs from a single host, or if it’s possible to group cards from different hosts.

There is this line in the DCGM documentation that makes me think that GPU groups are limited to a single host: "Almost all DCGM operations take place on groups. Users can create, destroy and modify collections of GPUs on the local node"

But this limitation is not mentioned again, and the overview only says, generically: "... and individual users managing groups of NVIDIA GPUs."

So I was wondering, are the groups limited to single hosts?

Thanks.

nikkon-dev commented 8 months ago

@deferen2,

Yes, groups are limited to a single nv-hostengine instance. Internally, a group is just a list of entities local to that hostengine, without any special logic attached to it.

WBR, Nik
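
To illustrate the answer above: since a group is only a list of entities local to one nv-hostengine, a "group" that spans hosts would have to be kept on the client side as one local group per host. The following is a minimal sketch of that idea in plain Python. It does not use the DCGM API: `HostEngine`, `create_group`, `create_logical_group`, and the `node-a`/`node-b` addresses are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HostEngine:
    """Hypothetical stand-in for one nv-hostengine instance (not DCGM API)."""
    address: str
    local_gpu_ids: set[int]
    groups: dict[str, list[int]] = field(default_factory=dict)

    def create_group(self, name: str, gpu_ids: list[int]) -> None:
        # A group is only a list of entities local to this host engine,
        # so any GPU ID the engine does not own is rejected.
        unknown = [g for g in gpu_ids if g not in self.local_gpu_ids]
        if unknown:
            raise ValueError(f"{self.address}: GPUs {unknown} are not local")
        self.groups[name] = list(gpu_ids)


def create_logical_group(engines, name, wanted):
    """Client-side 'cross-host group': one local group per host engine."""
    for engine in engines:
        engine.create_group(name, wanted[engine.address])
    return {engine.address: engine.groups[name] for engine in engines}


# Assumed example hosts; the addresses are placeholders.
node_a = HostEngine("node-a:5555", {0, 1})
node_b = HostEngine("node-b:5555", {0, 1, 2, 3})

logical = create_logical_group(
    [node_a, node_b],
    "training",
    {"node-a:5555": [0, 1], "node-b:5555": [2, 3]},
)
print(logical)
```

In this sketch the mapping from host address to local group lives entirely in the client; each host engine still only knows about its own GPUs, which matches the single-host limitation described in the reply.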