Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

vgpu-monitor panic #318

Open · kebe7jun opened this issue 3 months ago

kebe7jun commented 3 months ago


1. Issue or feature description

vgpu-monitor panics with:

I0524 07:18:41.241050  580468 metrics.go:324] Initializing metrics for vGPUmonitor
I0524 07:18:41.241133  580468 pathmonitor.go:159] server listening at [::]:9395
I0524 07:18:46.364588  580468 pathmonitor.go:126] Adding ctr dirname /usr/local/vgpu/containers/2b1034f8-6043-4a75-a8b2-81186aea4c6d_neko-notebook in monitorpath
I0524 07:18:46.364614  580468 pathmonitor.go:56] Checking path /usr/local/vgpu/containers/2b1034f8-6043-4a75-a8b2-81186aea4c6d_neko-notebook
I0524 07:18:46.364672  580468 pathmonitor.go:126] Adding ctr dirname /usr/local/vgpu/containers/5c8316be-0c88-479c-85fc-475f3874d05f_stable-diffusion in monitorpath
I0524 07:18:46.364675  580468 pathmonitor.go:56] Checking path /usr/local/vgpu/containers/5c8316be-0c88-479c-85fc-475f3874d05f_stable-diffusion
I0524 07:18:46.364840  580468 pathmonitor.go:83] getvGPUMemoryInfo success with utilizationSwitch=1, recentKernel=2, priority=1
I0524 07:18:46.364871  580468 pathmonitor.go:126] Adding ctr dirname /usr/local/vgpu/containers/ac28ea1c-3f9e-40cd-ae86-034722d70f0f_dataset-loader in monitorpath
sizeof= 1197896 cachestr= 1 2
I0524 07:18:46.364873  580468 pathmonitor.go:56] Checking path /usr/local/vgpu/containers/ac28ea1c-3f9e-40cd-ae86-034722d70f0f_dataset-loader
sizeof= 1197896 cachestr= 1 2
I0524 07:18:46.365207  580468 pathmonitor.go:83] getvGPUMemoryInfo success with utilizationSwitch=1, recentKernel=2, priority=1
I0524 07:18:46.365239  580468 pathmonitor.go:126] Adding ctr dirname /usr/local/vgpu/containers/f41d2f6f-fc03-487f-8e7a-cf14ee5ec701_pytorch in monitorpath
I0524 07:18:46.365252  580468 pathmonitor.go:56] Checking path /usr/local/vgpu/containers/f41d2f6f-fc03-487f-8e7a-cf14ee5ec701_pytorch
unexpected fault address 0x7f74bc5ae73c
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f74bc5ae73c pc=0x16219b1]

goroutine 58 [running]:
runtime.throw({0x1a7037b?, 0x7f74bc48a000?})
    /usr/local/go/src/runtime/panic.go:1077 +0x5c fp=0xc00028fb48 sp=0xc00028fb18 pc=0x448cfc
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:858 +0x116 fp=0xc00028fba8 sp=0xc00028fb48 pc=0x45ee36
main.mmapcachefile({0xc0007ec080, 0x72}, 0xc00028fd68)
    /k8s-vgpu/cmd/vGPUmonitor/cudevshr.go:146 +0x151 fp=0xc00028fc90 sp=0xc00028fba8 pc=0x16219b1
main.getvGPUMemoryInfo(0xc00028fd68)
    /k8s-vgpu/cmd/vGPUmonitor/cudevshr.go:154 +0x31 fp=0xc00028fcc0 sp=0xc00028fc90 pc=0x1621b51
main.checkfiles({0xc0007e6230, 0x47})
    /k8s-vgpu/cmd/vGPUmonitor/pathmonitor.go:79 +0x23b fp=0xc00028fd98 sp=0xc00028fcc0 pc=0x16255bb
main.monitorpath(0x12a05f200?)
    /k8s-vgpu/cmd/vGPUmonitor/pathmonitor.go:127 +0x3ba fp=0xc00028ff60 sp=0xc00028fd98 pc=0x1625cba
main.watchAndFeedback()
    /k8s-vgpu/cmd/vGPUmonitor/feedback.go:277 +0x91 fp=0xc00028ffe0 sp=0xc00028ff60 pc=0x16229f1
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00028ffe8 sp=0xc00028ffe0 pc=0x47afa1
created by main.main in goroutine 1
    /k8s-vgpu/cmd/vGPUmonitor/main.go:36 +0x137

2. Steps to reproduce the issue

I'm not sure why one of the cache files has an incorrect size; maybe it's a version issue?

root@worker-a800-3:~# find /usr/local/vgpu/containers -name "*.cache" | xargs ls -al
-rw-rw-rw- 1 ubuntu ubuntu 1197897 May 24 06:37 /usr/local/vgpu/containers/5c8316be-0c88-479c-85fc-475f3874d05f_stable-diffusion/f2df89ca-28bd-4b23-83c0-212835ac1724.cache
-rw-rw-rw- 1 root   root   1197897 May 23 06:21 /usr/local/vgpu/containers/ac28ea1c-3f9e-40cd-ae86-034722d70f0f_dataset-loader/baaab5f6-b630-4895-bc3a-5f10a5a16cab.cache
-rw-rw-rw- 1 root   root    804681 May  8 09:23 /usr/local/vgpu/containers/f41d2f6f-fc03-487f-8e7a-cf14ee5ec701_pytorch/2aa77ef1-a879-4302-99a0-1206b30b234f.cache
-rw-rw-rw- 1 root   root   1197897 May 22 06:36 /usr/local/vgpu/containers/facba0d3-d961-4d83-a8c2-13fb4f648911_inference/b6ceaf0f-6399-460b-90a1-e7ba9fe459db.cache
-rw-rw-rw- 1 ubuntu users  1197897 May 23 09:57 /usr/local/vgpu/containers/fc54d454-00f4-4ee5-8f59-c33bca2630b0_kebe-dev/32acbce1-aa23-46bc-8096-e455d8d40de8.cache
-rw-rw-rw- 1 root   root   1197897 May 19 14:37 /usr/local/vgpu/containers/fce0818f-4ec3-44ad-b8c0-d498f75e184b_inference/15c69785-3ba4-4279-894c-f49611ef0269.cache


chaunceyjiang commented 3 months ago

This is because a backward-incompatible change was introduced in a previous version after community discussion: https://github.com/Project-HAMi/HAMi-core/pull/4

We deliberately left this panic unhandled, simply to make it easier for users to identify the problem.

Restarting all Pods that use vGPU resolves the problem.

kebe7jun commented 3 months ago

> This is because a backward-incompatible change was introduced in a previous version after community discussion: Project-HAMi/HAMi-core#4
>
> We deliberately left this panic unhandled, simply to make it easier for users to identify the problem.
>
> Restarting all Pods that use vGPU resolves the problem.

I understand, but I think we could provide a solution that stays compatible with older versions instead of letting the entire monitoring component panic; that behavior may confuse someone who is not familiar with this project. If similar problems arise in the future, we shouldn't handle them this way, right?