NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

failed to transform metrics for transform 'podMapper'; err: failure getting pod resources; #408

Open jicki opened 3 weeks ago

jicki commented 3 weeks ago

What is the version?

3.3.8-3.6.0-ubuntu22.04

What happened?

The dcgm-exporter pod is in CrashLoopBackOff:

dcgm-exporter-m9prp   0/1     CrashLoopBackOff

Its logs repeat the following errors:

time="2024-10-29T09:58:01Z" level=error msg="Failed to write response." error="failed to transform metrics for transform 'podMapper'; err: failure getting pod resources; err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4724376 vs. 4194304)"
2024/10/29 09:58:01 http: superfluous response.WriteHeader call from github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter.(*MetricsServer).Metrics (server.go:124)
time="2024-10-29T09:58:26Z" level=error msg="Failed to collect metrics; err: failed to transform metrics for transform 'podMapper'; err: failure getting pod resources; err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4724552 vs. 4194304)"
time="2024-10-29T09:58:31Z" level=error msg="Failed to write response." error="failed to transform metrics for transform 'podMapper'; err: failure getting pod resources; err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4724492 vs. 4194304)"
2024/10/29 09:58:31 http: superfluous response.WriteHeader call from github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter.(*MetricsServer).Metrics (server.go:124)
time="2024-10-29T09:58:32Z" level=error msg="Failed to write response." error="failed to transform metrics for transform 'podMapper'; err: failure getting pod resources; err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4724492 vs. 4194304)"
2024/10/29 09:58:32 http: superfluous response.WriteHeader call from github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter.(*MetricsServer).Metrics (server.go:124)
time="2024-10-29T09:58:56Z" level=error msg="Failed to collect metrics; err: failed to transform metrics for transform 'podMapper'; err: failure getting pod resources; err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4724376 vs. 4194304)"
time="2024-10-29T09:59:01Z" level=error msg="Failed to write response." error="failed to transform metrics for transform 'podMapper'; err: failure getting pod resources; err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4724376 vs. 4194304)"
2024/10/29 09:59:01 http: superfluous response.WriteHeader call from github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter.(*MetricsServer).Metrics (server.go:124)
time="2024-10-29T09:59:02Z" level=error msg="Failed to write response." error="failed to transform metrics for transform 'podMapper'; err: failure getting pod resources; err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4724376 vs. 4194304)"
2024/10/29 09:59:02 http: superfluous response.WriteHeader call from github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter.(*MetricsServer).Metrics (server.go:124)
time="2024-10-29T09:59:26Z" level=error msg="Failed to collect metrics; err: failed to transform metrics for transform 'podMapper'; err: failure getting pod resources; err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4724492 vs. 4194304)"
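
To make the numbers in these errors easier to read: 4194304 bytes is exactly 4 MiB, which is gRPC-Go's default maximum receive size, and the pod-resources responses here are about 4.5 MiB, roughly 518 KiB over that limit. The sketch below is only an illustration for reading the log, not part of dcgm-exporter; `parseSizes` and `sizePattern` are hypothetical names, and the pattern assumes the message format shown in the log above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sizePattern matches the size pair in the ResourceExhausted text from the log,
// e.g. "received message larger than max (4724376 vs. 4194304)".
var sizePattern = regexp.MustCompile(`received message larger than max \((\d+) vs\. (\d+)\)`)

// parseSizes is a hypothetical helper: it returns the received message size
// and the configured limit, both in bytes, from one log line.
func parseSizes(logLine string) (received, limit int64, err error) {
	m := sizePattern.FindStringSubmatch(logLine)
	if m == nil {
		return 0, 0, fmt.Errorf("no gRPC size pair found in log line")
	}
	if received, err = strconv.ParseInt(m[1], 10, 64); err != nil {
		return 0, 0, err
	}
	if limit, err = strconv.ParseInt(m[2], 10, 64); err != nil {
		return 0, 0, err
	}
	return received, limit, nil
}

func main() {
	const mib = 1 << 20 // bytes per MiB

	// Message fragment copied from the log in this report.
	line := `err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4724376 vs. 4194304)`

	received, limit, err := parseSizes(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("limit:    %d bytes (%.2f MiB)\n", limit, float64(limit)/mib)
	fmt.Printf("received: %d bytes (%.2f MiB)\n", received, float64(received)/mib)
	fmt.Printf("over by:  %d bytes\n", received-limit)
}
```

In gRPC-Go, a client can raise this limit with the `grpc.MaxCallRecvMsgSize` call option. Whether this dcgm-exporter version exposes a setting for it is not stated in this report.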

What did you expect to happen?

dcgm-exporter should run without crashing.

What is the GPU model?

No response

What is the environment?

No response

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response