NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Build output does not include libnvperf_dcgm_host.so #160

Closed. pintohutch closed this issue 3 months ago

pintohutch commented 3 months ago

Hello,

I am running the build.sh script to build DCGM, but I do not see the libnvperf_dcgm_host.so file in the build output under _out/Linux-amd64-release/.

Is there a flag I need to pass to the script to generate this? Or is the library not built using the source from this repo?

Thanks

bmarchant commented 3 months ago

@pintohutch running the build script will produce the libnvperf_dcgm_host.so file in the following directory: _out/Linux-amd64-release/lib/. I just ran it to verify after deleting the _out directory. It is also found in the tests directory: _out/Linux-amd64-release/share/dcgm_tests/apps/amd64/.

pintohutch commented 3 months ago

Thanks for the quick response @bmarchant.

I'm on master commit 18b87c715750d0d44b185dfeb9a7d8e2597443a4.

I'm not seeing it. Maybe I'm doing something wrong? I just removed my previous _out/ and re-ran:

cd dcgmbuild
sudo ./build.sh
cd ..
./build.sh

Checking the output contents:

ls _out/Linux-amd64-release/lib
cmake                         libdcgmmodulediag.so          libdcgmmoduleintrospect.so.3      libdcgmmodulepolicy.so.3.3.5  libdcgm_stub.a
libdcgm_cublas_proxy10.so     libdcgmmodulediag.so.3        libdcgmmoduleintrospect.so.3.3.5  libdcgmmodulesysmon.so        libnvml_injection.so
libdcgm_cublas_proxy11.so     libdcgmmodulediag.so.3.3.5    libdcgmmodulenvswitch.so          libdcgmmodulesysmon.so.3      libnvml_injection.so.1
libdcgm_cublas_proxy12.so     libdcgmmodulehealth.so        libdcgmmodulenvswitch.so.3        libdcgmmodulesysmon.so.3.3.5  libnvml_injection.so.1.0
libdcgmmoduleconfig.so        libdcgmmodulehealth.so.3      libdcgmmodulenvswitch.so.3.3.5    libdcgm.so
libdcgmmoduleconfig.so.3      libdcgmmodulehealth.so.3.3.5  libdcgmmodulepolicy.so            libdcgm.so.3
libdcgmmoduleconfig.so.3.3.5  libdcgmmoduleintrospect.so    libdcgmmodulepolicy.so.3          libdcgm.so.3.3.5

ls _out/Linux-amd64-release/share/dcgm_tests/apps/amd64
configuration_sample       field_value_sample         libdcgmmoduleconfig.so.3      libdcgmmodulehealth.so.3          libdcgmmodulenvswitch.so.3      libdcgmmodulesysmon.so.3      libnvml_injection.so.1    stub_library_test
dcgmi                      health_sample              libdcgmmoduleconfig.so.3.3.5  libdcgmmodulehealth.so.3.3.5      libdcgmmodulenvswitch.so.3.3.5  libdcgmmodulesysmon.so.3.3.5  libnvml_injection.so.1.0  testdcgmunittests
dcgmproftester10           libdcgm_cublas_proxy10.so  libdcgmmodulediag.so          libdcgmmoduleintrospect.so        libdcgmmodulepolicy.so          libdcgm.so                    modules_sample
dcgmproftester11           libdcgm_cublas_proxy11.so  libdcgmmodulediag.so.3        libdcgmmoduleintrospect.so.3      libdcgmmodulepolicy.so.3        libdcgm.so.3                  nv-hostengine
dcgmproftester12           libdcgm_cublas_proxy12.so  libdcgmmodulediag.so.3.3.5    libdcgmmoduleintrospect.so.3.3.5  libdcgmmodulepolicy.so.3.3.5    libdcgm.so.3.3.5              policy_sample
DcgmProfTesterKernels.ptx  libdcgmmoduleconfig.so     libdcgmmodulehealth.so        libdcgmmodulenvswitch.so          libdcgmmodulesysmon.so          libnvml_injection.so          process_stats_sample

bmarchant commented 3 months ago

@pintohutch Sorry for the confusion. That library is closed source, so it is not built from this repo; it provides "continuous mode profiling" for DC profiling. Apologies for my earlier response, I was looking at the wrong repo.
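
For anyone who wants to confirm which libraries an OSS build did or did not produce, here is a minimal sketch in Python. It is not part of DCGM: the function name `missing_libraries` is made up for illustration, and it only looks at file names in a directory, treating `name`, `name.3`, `name.3.3.5` and so on as the same library.

```python
from pathlib import Path


def missing_libraries(lib_dir, expected):
    """Return the names in `expected` that have no matching file in `lib_dir`.

    A file matches a name if it is called exactly that, or that name followed
    by a version suffix such as ".3" or ".3.3.5". This helper is hypothetical
    and only inspects file names; it does not load or validate the libraries.
    """
    directory = Path(lib_dir)
    if not directory.is_dir():
        # Nothing was built there, so everything counts as missing.
        return list(expected)

    present = {entry.name for entry in directory.iterdir()}
    missing = []
    for name in expected:
        found = any(f == name or f.startswith(name + ".") for f in present)
        if not found:
            missing.append(name)
    return missing
```

Against the listing earlier in the thread, calling `missing_libraries("_out/Linux-amd64-release/lib", ["libdcgm.so", "libnvperf_dcgm_host.so"])` would be expected to report only `libnvperf_dcgm_host.so`, since `libdcgm.so` is present and the closed-source library is not.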

pintohutch commented 3 months ago

Ah thanks for confirming @bmarchant.

I suppose the best way to get a compatible version of the library would be to pull it from a Docker image with a matching version of the compiled source?

nikkon-dev commented 3 months ago

@pintohutch,

That's right. You can get that library from any official DCGM package (docker/deb/rpm) and place it in a location where nv-hostengine can find it. You will also need to grab the libdcgmmoduleprofiling library.
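
As a rough illustration of that step, the sketch below copies the two closed-source libraries out of an already extracted official package into a directory of your choice. Everything here is an assumption for illustration: the function name `copy_closed_source_libs` is hypothetical, the profiling module file is assumed to be named `libdcgmmoduleprofiling.so` (following the `libdcgmmodule*` naming seen in the build output), and how you extract the deb/rpm/docker contents and which directory nv-hostengine actually searches are left to you.

```python
import shutil
from pathlib import Path

# File name prefixes for the libraries named in the thread (assumed spelling).
CLOSED_SOURCE_PREFIXES = ("libnvperf_dcgm_host.so", "libdcgmmoduleprofiling.so")


def copy_closed_source_libs(package_root, dest_dir, prefixes=CLOSED_SOURCE_PREFIXES):
    """Copy files whose names start with one of `prefixes` into `dest_dir`.

    `package_root` is a directory holding the extracted contents of an
    official package. Returns the sorted list of copied file names.
    Versioned symlinks are copied as regular files, not recreated as links.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    # Materialize the search first so newly copied files are not revisited.
    for path in sorted(Path(package_root).rglob("*")):
        if path.is_file() and path.name.startswith(prefixes):
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return sorted(copied)
```

The package version should match the version of the source you built (3.3.5 in the listing above), as suggested earlier in the thread.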

pintohutch commented 3 months ago

Thanks @nikkon-dev - I didn't notice that was missing as well.

pintohutch commented 3 months ago

@nikkon-dev @bmarchant - my follow-up question here is: which field IDs do the libraries built from this OSS repo expose, compared to those only available through the closed-source libraries (e.g. libdcgmmoduleprofiling and libnvperf_dcgm_host)?

Is there any documentation around that?

I can close this issue and open a new one to make the ask clearer if that's better.

nikkon-dev commented 3 months ago

@pintohutch,

Unfortunately, that is not currently documented. However, I have created a ticket to update the documentation with more accurate details about the modules that provide each Field ID and the differences between OSS and official builds.

All modules not included in OSS can be utilized from official DCGM packages, and DCGM will supply all Field IDs. However, it is difficult to determine which field corresponds to which module.

pintohutch commented 3 months ago

Ok thanks @nikkon-dev.

Lemme know if there's a place I can track that effort. If it's internally tracked, that's fine too.

Feel free to close this as my original question has been answered - thanks for the prompt responses!

nikkon-dev commented 3 months ago

@pintohutch,

That's internally tracked, as we have not open-sourced the documentation sources (thus no GitHub issues).

WBR, Nik

pintohutch commented 3 months ago

Hey @nikkon-dev or @bmarchant - qq: are there any plans to open-source the profiling modules in the future?

nikkon-dev commented 3 months ago

@pintohutch,

I want to clarify that there are currently no plans to use the profiling module for newer architectures. This module was designed for pre-Hopper architectures, and newer architectures utilize GPM functionality via NVML, so it is not needed at all.

It's worth noting that the profiling module relies on undocumented and unofficial APIs that we cannot make open source.

pintohutch commented 3 months ago

Hey @nikkon-dev - thanks for the response and for clarifying this.