Open Dylan-M opened 1 year ago
Looping in @suffiank for implementation details here.
Any thoughts? This has come up in discussion again on a slack group.
Hi @Dylan-M,
Suffian is quite busy so I took a look. I'm not the primary point of contact for GPU-related stuff so this is best effort, but this is what I found.
I think getting rid of the C code would be good. The library does look like it will do what that little C++ load generator does. This would get rid of C++ code in this codebase which would be cleaner.
However, this is unfortunately not the only CGO
dependency in this receiver. The NVML client we use is simply Go bindings for libnvidia-ml
. See https://github.com/NVIDIA/go-nvml for more info.
What this means is that we probably can indeed get rid of that C++ code and make our tests a bit easier to work with, but it will not unburden this receiver from CGO
entirely.
As such I don't think we're going to be able to prioritize this any time soon, but we welcome a PR to make that change that gets rid of C++ in this codebase. Thank you for raising the issue!
Thank you for having a look, a coworker also pointed out what you said about linking to that external library; which I had forgotten to look at external deps. Still a change that is worth making, however I fully agree with your assessment and prioritization.
In Ops Agent we run a demo app from the CUDA SDK: https://github.com/GoogleCloudPlatform/ops-agent/blob/master/integration_test/third_party_apps_data/applications/nvml/exercise
But as @braydonk mentions, go-nvml requires cgo for everything, so we can't get GPU metrics without it.
Currently, the nvmlreceiver is using a single dependency on a C library here: https://github.com/GoogleCloudPlatform/opentelemetry-operations-collector/blob/master/receiver/nvmlreceiver/testcudakernel/test_cuda_kernel.cc
Based on my review of, but with limited understanding of the Nvidia libraries in general, this could be replaced with a native golang library.
From: https://docs.nvidia.com/cuda/cublas/index.html#id92
From: https://pkg.go.dev/gorgonia.org/cu/blas#Standard.Dgemm
These function definitions seem to be identical, with the exception of the cublasXtHandle. That handle wouldn't be needed in the pure golang implementation; so it makes sense for it to be missing.
If I am correct, this would be a nice change. Anything that can strip a collector down to a pure golang implementation is good in my book.
If I am incorrect in my belief that these libraries are functionally the same, I'm happy to be corrected on it.