NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)
Apache License 2.0
290 stars 62 forks source link

Add internal method to get device handle #119

Closed elezar closed 4 months ago

elezar commented 4 months ago

This was done in go-nvlib in https://github.com/NVIDIA/go-nvlib/pull/17 but not included when we implemented the interfaces here.

Without the explict nvmlDevice conversion we see:

=== RUN   TestGetTopologyCommonAncestor
--- FAIL: TestGetTopologyCommonAncestor (0.00s)
panic: interface conversion: nvml.Device is nvml.wrappedDevice, not nvml.nvmlDevice [recovered]
        panic: interface conversion: nvml.Device is nvml.wrappedDevice, not nvml.nvmlDevice

goroutine 19 [running]:
testing.tRunner.func1.2({0x1030fc500, 0xc0000b10b0})
        /usr/local/Cellar/go/1.22.2/libexec/src/testing/testing.go:1631 +0x24a
testing.tRunner.func1()
        /usr/local/Cellar/go/1.22.2/libexec/src/testing/testing.go:1634 +0x377
panic({0x1030fc500?, 0xc0000b10b0?})
        /usr/local/Cellar/go/1.22.2/libexec/src/runtime/panic.go:770 +0x132
github.com/NVIDIA/go-nvml/pkg/nvml.nvmlDevice.GetTopologyCommonAncestor(...)
        /Users/elezar/src/go-nvml/pkg/nvml/device.go:223
github.com/NVIDIA/go-nvml/pkg/nvml.TestGetTopologyCommonAncestor(0xc0000c29c0)
        /Users/elezar/src/go-nvml/pkg/nvml/device_test.go:38 +0x85
testing.tRunner(0xc0000c29c0, 0x10313b4c8)
        /usr/local/Cellar/go/1.22.2/libexec/src/testing/testing.go:1689 +0xfb
created by testing.(*T).Run in goroutine 1
        /usr/local/Cellar/go/1.22.2/libexec/src/testing/testing.go:1742 +0x390
FAIL    github.com/NVIDIA/go-nvml/pkg/nvml      0.612s
klueska commented 4 months ago

I don't think this approach will work in general. It works for new types that embed the Device type, but not for those that want to implement the Device interface from scratch (e.g. our mock implementation) because they have no way of providing an implementation for the unexported method.

Would an alternative be to implement the following internal function directly? I realize it relies on reflection, but maybe this is OK since it won't actually be called very often...

func nvmlDeviceHandle(d Device) nvmlDevice {
    if nvmlDevice, ok := d.(nvmlDevice); ok {
        return nvmlDevice
    }

    val := reflect.ValueOf(d)
    typ := val.Type()
    for i := 0; i < typ.NumField(); i++ {
        if !typ.Field(i).Anonymous {
            continue
        }
        if _, ok := val.Field(i).Interface().(Device); !ok {
            continue
        }
        return nvmlDeviceHandle(val.Field(i).Interface().(Device))
    }
    panic("Unable to convert Device to underlying nvmlDevice")
}

Here it is in the go playground: https://go.dev/play/p/AnJXQpCrzGZ?v=gotip.go?download=true.go?download=true

elezar commented 4 months ago

There is definitely a concern that this would not work in general. I also considered reflect, but didn't have time to properly flesh this out.

Note that the issue here is not the concrete type to which the function is attached, but the fact that we're accepting an interface type for one of the functions and expect this argument to be convertable to the base type because we need to call an internal function with this type. It is this call to the internal function that causes a panic when the input cannot be converted to the concrete type.

I have incorportated your nvmlDeviceHandle function in the latest revision and we will still panic in cases where we cannot perform this conversion, but at least cover the wrapped cases. For users implementing their own Device types from scratch, they will have to update the relevant functions accordingly.

Another option would be to add a function that provides the underlying reference as a public function. This may still not make sense in mocks, but would allow consumers to implement this as they require.

klueska commented 4 months ago

I'm still sad there isn't a good way to do this without reflection, but I think the current implementation is pretty easy to understand and covers all our bases.