NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

Cannot build from source via Ansible #323

Open Godson-A opened 5 months ago

Godson-A commented 5 months ago

What is the version?

3.3.5-3.4.1

What happened?

I installed dcgm-exporter from source via the terminal on one of my GPU nodes. Following the documentation, the installation initially failed at the make binary step, so I ran the following command from the Makefile directly:

cd cmd/dcgm-exporter
sudo go build -v -ldflags "-X main.BuildVersion=3.3.5-3.4.1"

The output was as follows:

go: downloading github.com/sirupsen/logrus v1.9.3
go: downloading go.uber.org/automaxprocs v1.5.3
go: downloading github.com/urfave/cli/v2 v2.27.1
go: downloading github.com/NVIDIA/go-dcgm v0.0.0-20240118201113-3385e277e49f
go: downloading github.com/stretchr/testify v1.8.4
go: downloading github.com/bits-and-blooms/bitset v1.13.0
go: downloading github.com/gorilla/mux v1.8.1
go: downloading github.com/prometheus/exporter-toolkit v0.11.0
go: downloading golang.org/x/sync v0.5.0
go: downloading google.golang.org/grpc v1.61.1
go: downloading k8s.io/api v0.29.2
go: downloading k8s.io/apimachinery v0.29.2
go: downloading k8s.io/client-go v0.29.2
go: downloading k8s.io/kubelet v0.29.2
go: downloading github.com/NVIDIA/go-nvml v0.12.0-2
go: downloading github.com/go-kit/log v0.2.1
go: downloading golang.org/x/sys v0.16.0
go: downloading github.com/coreos/go-systemd/v22 v22.5.0
go: downloading github.com/prometheus/common v0.47.0
go: downloading golang.org/x/crypto v0.18.0
go: downloading gopkg.in/yaml.v2 v2.4.0
go: downloading github.com/davecgh/go-spew v1.1.1
go: downloading github.com/pmezard/go-difflib v1.0.0
go: downloading gopkg.in/yaml.v3 v3.0.1
go: downloading github.com/Masterminds/semver v1.5.0
go: downloading github.com/go-logfmt/logfmt v0.6.0
go: downloading github.com/gogo/protobuf v1.3.2
go: downloading github.com/mwitkow/go-conntrack v0.0.0-20190716064945-2f068394615f
go: downloading golang.org/x/net v0.20.0
go: downloading golang.org/x/oauth2 v0.16.0
go: downloading github.com/cpuguy83/go-md2man/v2 v2.0.3
go: downloading github.com/xrash/smetrics v0.0.0-20201216005158-039620a65673
go: downloading github.com/jpillora/backoff v1.0.0
go: downloading github.com/prometheus/client_golang v1.18.0
go: downloading github.com/google/gofuzz v1.2.0
go: downloading google.golang.org/genproto/googleapis/rpc v0.0.0-20240102182953-50ed04b92917
go: downloading github.com/russross/blackfriday/v2 v2.1.0
go: downloading gopkg.in/inf.v0 v0.9.1
go: downloading k8s.io/klog/v2 v2.110.1
go: downloading k8s.io/utils v0.0.0-20240102154912-e7106e64919e
go: downloading sigs.k8s.io/structured-merge-diff/v4 v4.4.1
go: downloading github.com/golang/protobuf v1.5.3
go: downloading google.golang.org/protobuf v1.33.0
go: downloading sigs.k8s.io/json v0.0.0-20221116044647-bc3834ca7abd
go: downloading github.com/beorn7/perks v1.0.1
go: downloading github.com/cespare/xxhash/v2 v2.2.0
go: downloading github.com/prometheus/client_model v0.6.0
go: downloading github.com/prometheus/procfs v0.12.0
go: downloading github.com/go-logr/logr v1.4.1
go: downloading github.com/json-iterator/go v1.1.12
go: downloading golang.org/x/text v0.14.0
go: downloading github.com/google/gnostic-models v0.6.8
go: downloading golang.org/x/time v0.5.0
go: downloading golang.org/x/term v0.16.0
go: downloading k8s.io/kube-openapi v0.0.0-20240220201932-37d671a357a5
go: downloading sigs.k8s.io/yaml v1.4.0
go: downloading github.com/modern-go/reflect2 v1.0.2
go: downloading github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd
go: downloading github.com/emicklei/go-restful/v3 v3.11.1
go: downloading github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822
go: downloading github.com/google/uuid v1.5.0
go: downloading github.com/go-openapi/jsonreference v0.20.4
go: downloading github.com/go-openapi/swag v0.22.7
go: downloading github.com/go-openapi/jsonpointer v0.20.2
go: downloading github.com/mailru/easyjson v0.7.7
go: downloading github.com/josharian/intern v1.0.0

Then I installed the binary using sudo install binary, and I was able to curl the metrics:

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4868 0 4868 0 0 4753k 0
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).

The issue: when I do the same with Ansible, it does not work as expected and fails at the make binary step with the following error:

go: downloading github.com/go-openapi/swag v0.22.7
go: downloading github.com/google/uuid v1.5.0
go: downloading github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822
go: downloading github.com/go-openapi/jsonpointer v0.20.2
go: downloading github.com/mailru/easyjson v0.7.7
go: downloading github.com/josharian/intern v1.0.0
# github.com/NVIDIA/go-nvml/pkg/nvml
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-2/pkg/nvml/types_gen.go:9:10: undefined: _Ctype_struct_nvmlDevice_st
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-2/pkg/nvml/types_gen.go:320:10: undefined: _Ctype_struct_nvmlUnit_st
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-2/pkg/nvml/types_gen.go:358:10: undefined: _Ctype_struct_nvmlEventSet_st
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-2/pkg/nvml/types_gen.go:505:10: undefined: _Ctype_struct_nvmlGpuInstance_st
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-2/pkg/nvml/types_gen.go:548:10: undefined: _Ctype_struct_nvmlComputeInstance_st
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-2/pkg/nvml/types_gen.go:552:10: undefined: _Ctype_struct_nvmlGpmSample_st
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-2/pkg/nvml/device.go:22:19: undefined: MemoryErrorType
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-2/pkg/nvml/device.go:25:29: undefined: Return
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-2/pkg/nvml/device.go:32:49: undefined: Return
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-2/pkg/nvml/device.go:39:54: undefined: Return
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-2/pkg/nvml/device.go:39:54: too many errors
# github.com/NVIDIA/go-dcgm/pkg/dcgm
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20240118201113-3385e277e49f/pkg/dcgm/api.go:22:13: undefined: mode
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20240118201113-3385e277e49f/pkg/dcgm/api.go:67:41: undefined: Field_Entity_Group
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20240118201113-3385e277e49f/pkg/dcgm/api.go:77:33: undefined: Device
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20240118201113-3385e277e49f/pkg/dcgm/api.go:82:35: undefined: DeviceStatus
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20240118201113-3385e277e49f/pkg/dcgm/api.go:87:39: undefined: P2PLink
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20240118201113-3385e277e49f/pkg/dcgm/api.go:93:24: undefined: GroupHandle
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20240118201113-3385e277e49f/pkg/dcgm/api.go:98:27: undefined: GroupHandle
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20240118201113-3385e277e49f/pkg/dcgm/api.go:98:53: undefined: ProcessInfo
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20240118201113-3385e277e49f/pkg/dcgm/api.go:103:38: undefined: DeviceHealth
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20240118201113-3385e277e49f/pkg/dcgm/api.go:108:60: undefined: policyCondition
/root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20240118201113-3385e277e49f/pkg/dcgm/api.go:108:60: too many errors
make: *** [Makefile:34: binary] Error 1

What did you expect to happen?

I expected the installation of the exporter via Ansible to succeed, since I was able to do it manually (even though make binary itself did not work as expected).

Instead, the Ansible run produces the same output as shown under "What happened?" above, ending in make: *** [Makefile:34: binary] Error 1. I have also tried Ansible privilege escalation, with the same result.

What is the GPU model?

No response

What is the environment?

No response

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response

nvvfedorov commented 5 months ago

This is not a bug. You need to check your Go environment settings in the Ansible script.
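
As a rough sketch of how that check could be done: errors such as undefined: _Ctype_struct_nvmlDevice_st are what the Go compiler typically reports when files that import "C" are left out of the build, which happens when cgo is disabled or no C compiler is found. The helper below prints the settings involved so that the output from an interactive terminal can be compared with the output from an Ansible task. The name report_cgo_env is hypothetical and not part of dcgm-exporter, and gcc as the C compiler is an assumption.

```shell
# Hypothetical helper: prints the settings that decide whether cgo code can build.
report_cgo_env() {
  if command -v go >/dev/null 2>&1; then
    echo "go=$(command -v go)"
    # CGO_ENABLED=0 excludes every file that imports "C" from the build.
    echo "CGO_ENABLED=$(go env CGO_ENABLED 2>/dev/null)"
    echo "CC=$(go env CC 2>/dev/null)"
  else
    echo "go=missing"
  fi
  # cgo needs a C compiler on PATH; gcc is assumed here.
  if command -v gcc >/dev/null 2>&1; then
    echo "gcc=$(command -v gcc)"
  else
    echo "gcc=missing"
  fi
  # Ansible tasks often run with a different PATH than a login shell.
  echo "PATH=$PATH"
}

report_cgo_env
```

If the two outputs differ (for example CGO_ENABLED=0, gcc=missing, or a different go binary under Ansible), that difference is a likely place to start looking.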

jz543fm commented 5 months ago

@nvvfedorov https://github.com/NVIDIA/dcgm-exporter/issues/321 is the same issue. I think the problem is that he is building on a node without a GPU. If you check that issue, I had the same error log. The README.md still does not mention that you need a GPU to build from source.

nvvfedorov commented 5 months ago

@jz543fm, I don't think the issue is the absence of a GPU. I suspect that datacenter-gpu-manager is missing on the build machine.
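
A minimal sketch for testing that suspicion on the build machine follows. It assumes the package is named datacenter-gpu-manager, as in the comment above, and that the shared library name contains libdcgm.so; both may differ between DCGM versions and distributions. The function name dcgm_status is hypothetical.

```shell
# Hypothetical helper: reports whether DCGM appears to be installed on this machine.
dcgm_status() {
  # Debian/Ubuntu: ask the package database.
  if command -v dpkg-query >/dev/null 2>&1 &&
     dpkg-query -W -f='${Status}' datacenter-gpu-manager 2>/dev/null |
       grep -q 'install ok installed'; then
    echo "installed"
    return 0
  fi
  # RPM-based distributions.
  if command -v rpm >/dev/null 2>&1 &&
     rpm -q datacenter-gpu-manager >/dev/null 2>&1; then
    echo "installed"
    return 0
  fi
  # Fallback: look for the shared library in the dynamic linker cache
  # (the libdcgm.so name is an assumption).
  if command -v ldconfig >/dev/null 2>&1 &&
     ldconfig -p 2>/dev/null | grep -q 'libdcgm\.so'; then
    echo "installed"
    return 0
  fi
  echo "missing"
}

dcgm_status
```

Running this both manually and from the Ansible play would show whether the two environments see the same DCGM installation.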

nvvfedorov commented 5 months ago

The GPU is necessary for running tests.