Cambricon / mlu-exporter

Apache License 2.0

request metrics result in "was collected before with the same name and label values" #2

Closed: lxyzhangqing closed this issue 1 year ago

lxyzhangqing commented 1 year ago

1. Issue or feature description

Requesting metrics returns the following error:

An error has occurred while serving metrics:

15 error(s) occurred:
* collected metric "process_jpu_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"0" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110302781" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"81273010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "process_jpu_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"1" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110300618" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"18063010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "process_jpu_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"1" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110300618" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"18063010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "process_memory_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"1" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110300618" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"18063010-2153-0000-0000-000000000000" > gauge:<value:63 > } was collected before with the same name and label values
* collected metric "process_memory_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"1" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110300618" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"18063010-2153-0000-0000-000000000000" > gauge:<value:63 > } was collected before with the same name and label values
* collected metric "process_memory_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"0" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110302781" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"81273010-2153-0000-0000-000000000000" > gauge:<value:40 > } was collected before with the same name and label values
* collected metric "process_vpu_encode_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"0" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110302781" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"81273010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "process_vpu_encode_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"1" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110300618" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"18063010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "process_vpu_encode_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"1" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110300618" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"18063010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "process_vpu_decode_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"0" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110302781" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"81273010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "process_vpu_decode_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"1" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110300618" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"18063010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "process_vpu_decode_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"1" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110300618" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"18063010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "process_ipu_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"0" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110302781" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"81273010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "process_ipu_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"1" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110300618" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"18063010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "process_ipu_utilization" { label:<name:"driver" value:"v4.20.14" > label:<name:"mcu" value:"v1.1.5" > label:<name:"mlu" value:"1" > label:<name:"model" value:"MLU370-S4" > label:<name:"node" value:"" > label:<name:"pid" value:"1" > label:<name:"sn" value:"532110300618" > label:<name:"type" value:"mlu370" > label:<name:"uuid" value:"18063010-2153-0000-0000-000000000000" > gauge:<value:0 > } was collected before with the same name and label values

2. Steps to reproduce the issue

3. Information

My environment:

I did some debugging and found that the cause is an incorrect return value from the function GetDeviceProcessUtil in pkg/cndev/cndev.go: every element of the returned pids array has the same value.

My debugging info is below. Source code:

func (c *cndev) GetDeviceProcessUtil(idx uint) ([]uint32, []uint32, []uint32, []uint32, []uint32, []uint32, error) {
        fmt.Println("GetDeviceProcessUtil enter")
        defer fmt.Println("GetDeviceProcessUtil leave")
        processCount := 10 // maximum number of processes running on an MLU is 10
        var util C.cndevProcessUtilization_t
        utils := (*C.cndevProcessUtilization_t)(C.malloc(C.size_t(processCount) * C.size_t(unsafe.Sizeof(util))))
        defer C.free(unsafe.Pointer(utils))
        utils.version = C.uint(version)
        r := C.cndevGetProcessUtilization((*C.uint)(unsafe.Pointer(&processCount)), utils, C.int(idx))
        if err := errorString(r); err != nil {
                fmt.Printf("GetDeviceProcessUtil failed: %v\n", err)
                return nil, nil, nil, nil, nil, nil, err
        }
        fmt.Printf("GetDeviceProcessUtil processCount=%v\n", processCount)
        pids := make([]uint32, processCount)
        ipuUtils := make([]uint32, processCount)
        jpuUtils := make([]uint32, processCount)
        memUtils := make([]uint32, processCount)
        vpuDecUtils := make([]uint32, processCount)
        vpuEncUtils := make([]uint32, processCount)
        array := (*[10]C.cndevProcessUtilization_t)(unsafe.Pointer(utils))
        results := array[:processCount]
        for i := 0; i < processCount; i++ {
                pids[i] = uint32(results[i].pid)
                ipuUtils[i] = uint32(results[i].ipuUtil)
                jpuUtils[i] = uint32(results[i].jpuUtil)
                memUtils[i] = uint32(results[i].memUtil)
                vpuDecUtils[i] = uint32(results[i].vpuDecUtil)
                vpuEncUtils[i] = uint32(results[i].vpuEncUtil)
        }
        fmt.Printf("GetDeviceProcessUtil pids = %v\n", pids)
        fmt.Printf("GetDeviceProcessUtil ipuUtils = %v\n", ipuUtils)
        fmt.Printf("GetDeviceProcessUtil jpuUtils = %v\n", jpuUtils)
        fmt.Printf("GetDeviceProcessUtil memUtils = %v\n", memUtils)
        fmt.Printf("GetDeviceProcessUtil vpuDecUtils = %v\n", vpuDecUtils)
        fmt.Printf("GetDeviceProcessUtil vpuEncUtils = %v\n", vpuEncUtils)
        return pids, ipuUtils, jpuUtils, memUtils, vpuDecUtils, vpuEncUtils, nil
}

Output:

GetDeviceProcessUtil enter
GetDeviceProcessUtil processCount=2
GetDeviceProcessUtil pids = [1 1]
GetDeviceProcessUtil ipuUtils = [7 0]
GetDeviceProcessUtil jpuUtils = [0 0]
GetDeviceProcessUtil memUtils = [40 40]
GetDeviceProcessUtil vpuDecUtils = [0 0]
GetDeviceProcessUtil vpuEncUtils = [0 0]
GetDeviceProcessUtil leave
GetDeviceProcessUtil enter
GetDeviceProcessUtil processCount=3
GetDeviceProcessUtil pids = [1 1 1]
GetDeviceProcessUtil ipuUtils = [20 0 0]
GetDeviceProcessUtil jpuUtils = [0 0 0]
GetDeviceProcessUtil memUtils = [62 62 62]
GetDeviceProcessUtil vpuDecUtils = [0 0 0]
GetDeviceProcessUtil vpuEncUtils = [0 0 0]
GetDeviceProcessUtil leave
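
The output above suggests why the collisions occur: the per-process metrics in the error message carry a pid label, so when every entry for a device reports pid 1, several samples end up with identical label sets. A minimal sketch of a check that would flag this condition, using only the Go standard library; duplicatePIDs is a hypothetical helper for illustration and is not part of mlu-exporter:

```go
package main

import (
	"fmt"
	"sort"
)

// duplicatePIDs returns, in ascending order, every PID that appears more
// than once in pids. Hypothetical helper, not part of mlu-exporter.
func duplicatePIDs(pids []uint32) []uint32 {
	counts := make(map[uint32]int, len(pids))
	for _, pid := range pids {
		counts[pid]++
	}
	var dups []uint32
	for pid, n := range counts {
		if n > 1 {
			dups = append(dups, pid)
		}
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Slice(dups, func(i, j int) bool { return dups[i] < dups[j] })
	return dups
}

func main() {
	// PID slices copied from the debug output above.
	fmt.Println(duplicatePIDs([]uint32{1, 1}))    // prints [1]
	fmt.Println(duplicatePIDs([]uint32{1, 1, 1})) // prints [1]
	// Distinct PIDs (made-up values) produce no duplicates.
	fmt.Println(duplicatePIDs([]uint32{101, 202})) // prints []
}
```

A non-empty result for a single device means that at least two samples would share the same pid label and, with the other labels being per-device, the same full label set.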

YuxiJin-tobeyjin commented 1 year ago

Yes, I think the description in our README of how to run the exporter directly in Docker is wrong.

The YAML we use for K8s sets an env variable and hostPID, so the equivalent Docker command should be:

docker run -d -p 30108:30108 --pid=host -e ENV_NODE_NAME={nodeName} --privileged=true cambricon-mlu-exporter:v1.6.7

@lxyzhangqing Thanks a lot for pointing out this problem. Could you please try the command above and let us know if there is any problem?

@rayroot Please update our README as soon as possible.

lxyzhangqing commented 1 year ago

Yes, when I added --pid=host -e ENV_NODE_NAME={nodeName}, it worked fine. Thanks. @YuxiJin-tobeyjin