GoogleCloudPlatform / opentelemetry-operations-collector

Apache License 2.0
41 stars 29 forks source link

Seeming unnecessary CGO depdenency in nvmlreceiver #180

Open Dylan-M opened 1 year ago

Dylan-M commented 1 year ago

Currently, the nvmlreceiver is using a single dependency on a C library here: https://github.com/GoogleCloudPlatform/opentelemetry-operations-collector/blob/master/receiver/nvmlreceiver/testcudakernel/test_cuda_kernel.cc

Based on my review of, but with limited understanding of the Nvidia libraries in general, this could be replaced with a native golang library.

From: https://docs.nvidia.com/cuda/cublas/index.html#id92

cublasStatus_t cublasXtDgemm(cublasXtHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const double          *alpha,
                           const double          *A, int lda,
                           const double          *B, int ldb,
                           const double          *beta,
                           double          *C, int ldc)

From: https://pkg.go.dev/gorgonia.org/cu/blas#Standard.Dgemm

func (impl *Standard) Dgemm(tA, tB blas.Transpose, m, n, k int, alpha float64, a []float64, lda int, b []float64, ldb int, beta float64, c []float64, ldc int)

These function definitions seem to be identical, with the exception of the cublasXtHandle. That handle wouldn't be needed in the pure golang implementation; so it makes sense for it to be missing.

If I am correct, this would be a nice change. Anything that can strip a collector down to a pure golang implementation is good in my book.

If I am incorrect in my belief that these libraries are functionally the same, I'm happy to be corrected on it.

braydonk commented 1 year ago

Looping in @suffiank for implementation details here.

Dylan-M commented 1 year ago

Any thoughts? This has come up in discussion again on a slack group.

braydonk commented 1 year ago

Hi @Dylan-M,

Suffian is quite busy so I took a look. I'm not the primary point of contact for GPU-related stuff so this is best effort, but this is what I found.

I think getting rid of the C code would be good. The library does look like it will do what that little C++ load generator does. This would get rid of C++ code in this codebase which would be cleaner.

However, this is unfortunately not the only CGO dependency in this receiver. The NVML client we use is simply Go bindings for libnvidia-ml. See https://github.com/NVIDIA/go-nvml for more info.
What this means is that we probably can indeed get rid of that C++ code and make our tests a bit easier to work with, but it will not unburden this receiver from CGO entirely.

As such I don't think we're going to be able to prioritize this any time soon, but we welcome a PR to make that change that gets rid of C++ in this codebase. Thank you for raising the issue!

Dylan-M commented 1 year ago

Thank you for having a look, a coworker also pointed out what you said about linking to that external library; which I had forgotten to look at external deps. Still a change that is worth making, however I fully agree with your assessment and prioritization.

quentinmit commented 1 year ago

In Ops Agent we run a demo app from the CUDA SDK: https://github.com/GoogleCloudPlatform/ops-agent/blob/master/integration_test/third_party_apps_data/applications/nvml/exercise

But as @braydonk mentions, go-nvml requires cgo for everything, so we can't get GPU metrics without it.