NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)
Apache License 2.0
311 stars 64 forks source link

bug for using DeviceGetFieldValues #46

Closed dreamryx closed 2 years ago

dreamryx commented 2 years ago

This api does not put the target result value into the memory. Instead, it adds the value(dirty data in the memory where plan to store the result from DeviceGetFieldValues api) with the real result to return. Therefore, when we use the same memory structure to repeatedly call the api, the results obtained each time will increase linearly.

For example, this api need to use following struct:

type FieldValue struct {
    FieldId     uint32
    ScopeId     uint32
    Timestamp   int64
    LatencyUsec int64
    ValueType   uint32
    NvmlReturn  uint32
    Value       [8]byte
}

The Value [8]byte is used to store and return the result value.

  1. Set the FieldId to 5(which means the NVML_FI_DEV_ECC_SBE_AGG_TOTAL) to get the total single bit aggregate (persistent) ECC errors.
  2. What's more, set the Value [8]byte of the input argument FieldValue to int8(8) value.
  3. Suppose the current value of SBE(single bit aggregate (persistent) ECC errors) from gpu0 is 10.
  4. The result get from the Value [8]byte is 18.
  5. And if we use the same memory(FieldValue) as the input argument to call DeviceGetFieldValues to check the SBE of gpu0 repeatedly. The result could be: 18, 28, 38, 48, 58, 68, 78, 88....... increase linearly!
elezar commented 2 years ago

@dreamryx looking at how this is implemented in the wrapper, we see:

// nvml.DeviceGetFieldValues()
func DeviceGetFieldValues(Device Device, Values []FieldValue) Return {
    ValuesCount := len(Values)
    return nvmlDeviceGetFieldValues(Device, int32(ValuesCount), &Values[0])
}

Which leads me to assume that the "addition" is happening in nvml itself. Would you be able to repeat the test with code in C instead to confirm this?

dreamryx commented 2 years ago

@elezar C is OK. I set nvmlValue_st.value.uiVal to a number(eg. 8) and use it as input argument to call NVML_FI_DEV_ECC_SBE_AGG_TOTAL. The result is correct.

klueska commented 2 years ago

Hmm. Somthing strange is going on then because this is the entirety of the go code for this:

func DeviceGetFieldValues(Device Device, Values []FieldValue) Return {
    ValuesCount := len(Values)
    return nvmlDeviceGetFieldValues(Device, int32(ValuesCount), &Values[0])
}

func nvmlDeviceGetFieldValues(Device Device, ValuesCount int32, Values *FieldValue) Return {
    cDevice, _ := *(*C.nvmlDevice_t)(unsafe.Pointer(&Device)), cgoAllocsUnknown
    cValuesCount, _ := (C.int)(ValuesCount), cgoAllocsUnknown
    cValues, _ := (*C.nvmlFieldValue_t)(unsafe.Pointer(Values)), cgoAllocsUnknown
    __ret := C.nvmlDeviceGetFieldValues(cDevice, cValuesCount, cValues)
    __v := (Return)(__ret)
    return __v
}

It just translates the C call into the Go world and back. It never looks at or does anything with any of the fields.

elezar commented 2 years ago

Since FieldValue includes a union does it require special handling: https://groups.google.com/g/golang-nuts/c/G02z30Jncbk/m/RdPtTmllSTwJ ?

klueska commented 2 years ago

@elezar What is your question exactly? The union field in the C world is a [8]byte in the Go world so that you can treat the 8 bytes as whatever you want similar to a union. The difference being that you have to do an explicit cast in the Go world (from the type you want to assign to a [8]byte) instead of an uncasted assignment in the C world (to the name of the field associated with the type you want to assign).

klueska commented 2 years ago

@dreamryx Maybe it would help to see the exact Go code and the exact C code that you are running so that we can reproduce (and / or) spot an issue in the assignment being made.

dreamryx commented 2 years ago

@elezar @klueska Sorry, I made a mistake. Only when I set FieldId to NVML_FI_DEV_ECC_SBE_AGG_TOTAL, this bug can be seen. The C api has this bug.

dreamryx commented 2 years ago

@klueska @elezar 30 minutes ago, when I set FieldId to NVML_FI_DEV_NVLINK_LINK_COUNT and value.uiVal to a number(eg. 8), the result is correct(12). But if I set to NVML_FI_DEV_ECC_SBE_AGG_TOTAL, the bug shows in both go and C api.

klueska commented 2 years ago

The fact that the field ID has AGG_TOTAL in it (i.e. aggregate total) may suggest that this is working as expected. You would need to check with the NVML team though. We only maintain the Go wrapper.

The best place to file an issue for NVML is here: https://forums.developer.nvidia.com/c/development-tools/other-tools/system-management-and-monitoring-nvml

dreamryx commented 2 years ago

The fact that the field ID has AGG_TOTAL in it (i.e. aggregate total) may suggest that this is working as expected. You would need to check with the NVML team though. We only maintain the Go wrapper.

The best place to file an issue for NVML is here: https://forums.developer.nvidia.com/c/development-tools/other-tools/system-management-and-monitoring-nvml

OK. Thanks a lot!