Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

vgpu-scheduler-extender terminated with exit code 2 #584

Open jeonghyunkeem opened 3 weeks ago

jeonghyunkeem commented 3 weeks ago

What happened: The vgpu-scheduler-extender container (part of the hami-scheduler pod) keeps terminating with exit code 2.

What you expected to happen: vgpu-scheduler-extender stays alive without termination

How to reproduce it (as minimally and precisely as possible): I'm not sure as it happens randomly

Anything else we need to know?:

I'm using multiple GPU nodes in my cluster, and each node has a hami.io/node-nvidia-register annotation like the following:

hami.io/node-nvidia-register=GPU-80c9c145-7ed8-5261-305e-72044d835856,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-7cbd3046-f3e2-dbc2-95dd-a77b1de5639f,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-eac3c055-a9e3-f967-5255-cb1234c78133,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-8f8d7649-0174-5b7a-4499-c93f4f4c1301,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-1a9db261-1fc5-0b0f-da59-636a3e97850b,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-53e5a370-8700-37f5-f10a-00c8ff829794,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-4dcb8085-4e5e-0462-9d96-794c903503ce,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-ea6f48d4-12de-4481-4fb5-883341efecf4,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:
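For readers unfamiliar with the annotation, here is a minimal Go sketch of how a value in this shape could be split into per-device records. It assumes, based only on the sample above, that records are terminated by ":" and that each record has seven comma-separated fields. The type deviceEntry, the function parseRegisterAnnotation, and the field names (Count, MemMiB, Cores, and so on) are hypothetical labels guessed from the values; they are not taken from the HAMi source.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// deviceEntry is a hypothetical view of one record in the annotation.
// Field meanings are guesses from the sample values, not confirmed HAMi semantics.
type deviceEntry struct {
	UUID    string // e.g. "GPU-80c9c145-..."
	Count   int    // assumed: how many ways the device may be shared
	MemMiB  int    // assumed: device memory in MiB (81920 for an 80 GB card)
	Cores   int    // assumed: compute capacity as a percentage
	Type    string // e.g. "NVIDIA-NVIDIA A100-SXM4-80GB"
	Numa    int    // assumed: NUMA node index
	Healthy bool
}

// parseRegisterAnnotation splits the value on ':' (records) and ',' (fields).
func parseRegisterAnnotation(value string) ([]deviceEntry, error) {
	var devices []deviceEntry
	for _, record := range strings.Split(value, ":") {
		if record == "" {
			continue // the trailing ':' produces one empty record
		}
		fields := strings.Split(record, ",")
		if len(fields) != 7 {
			return nil, fmt.Errorf("record %q: want 7 fields, got %d", record, len(fields))
		}
		// Fields 1, 2, 3 and 5 are numeric in the sample.
		nums := make([]int, 0, 4)
		for _, idx := range []int{1, 2, 3, 5} {
			n, err := strconv.Atoi(fields[idx])
			if err != nil {
				return nil, fmt.Errorf("record %q: field %d: %w", record, idx, err)
			}
			nums = append(nums, n)
		}
		healthy, err := strconv.ParseBool(fields[6])
		if err != nil {
			return nil, fmt.Errorf("record %q: health flag: %w", record, err)
		}
		devices = append(devices, deviceEntry{
			UUID: fields[0], Count: nums[0], MemMiB: nums[1], Cores: nums[2],
			Type: fields[4], Numa: nums[3], Healthy: healthy,
		})
	}
	return devices, nil
}

func main() {
	// Two records copied from the annotation in this report.
	value := "GPU-80c9c145-7ed8-5261-305e-72044d835856,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:" +
		"GPU-7cbd3046-f3e2-dbc2-95dd-a77b1de5639f,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:"
	devices, err := parseRegisterAnnotation(value)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, d := range devices {
		fmt.Printf("%s type=%q mem=%d healthy=%t\n", d.UUID, d.Type, d.MemMiB, d.Healthy)
	}
}
```

Read this way, the annotation above describes eight devices on the node, all reported as healthy.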

Here are the final logs of the terminated vgpu-scheduler-extender container:

I1031 00:52:46.362711       1 util.go:146] Encoded container Devices: GPU-937336ae-3b44-8441-66f0-1df55e635668,NVIDIA,65536,0:
I1031 00:52:46.362717       1 util.go:146] Encoded container Devices: GPU-937336ae-3b44-8441-66f0-1df55e635668,NVIDIA,65536,0:
I1031 00:52:46.362722       1 util.go:169] Encoded pod single devices GPU-937336ae-3b44-8441-66f0-1df55e635668,NVIDIA,65536,0:;GPU-937336ae-3b44-8441-66f0-1df55e635668,NVIDIA,65536,0:;
fatal error: concurrent map iteration and map write

goroutine 150 [running]:
github.com/Project-HAMi/HAMi/pkg/scheduler.(*Scheduler).getNodesUsage(0xc000658000, 0xc00a3c9b40, 0x0)
    /k8s-vgpu/pkg/scheduler/scheduler.go:301 +0x356
github.com/Project-HAMi/HAMi/pkg/scheduler.(*Scheduler).RegisterFromNodeAnnotations(0xc000658000)
    /k8s-vgpu/pkg/scheduler/scheduler.go:244 +0x2c5
created by main.start in goroutine 1
    /k8s-vgpu/cmd/scheduler/main.go:75 +0xe5

goroutine 1 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0x7f0d74613eb0, 0x72)
    /usr/local/go/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0x3?, 0x1?, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc008737b80)
    /usr/local/go/src/internal/poll/fd_unix.go:611 +0x2ac
net.(*netFD).accept(0xc008737b80)
    /usr/local/go/src/net/fd_unix.go:172 +0x29
net.(*TCPListener).accept(0xc008ae0da0)
    /usr/local/go/src/net/tcpsock_posix.go:159 +0x1e
net.(*TCPListener).Accept(0xc008ae0da0)
    /usr/local/go/src/net/tcpsock.go:327 +0x30
crypto/tls.(*listener).Accept(0xc00884f998)
    /usr/local/go/src/crypto/tls/tls.go:66 +0x27
net/http.(*Server).Serve(0xc008aec000, {0x1d342a8, 0xc00884f998})
    /usr/local/go/src/net/http/server.go:3260 +0x33e
net/http.(*Server).ServeTLS(0xc008aec000, {0x1d34518, 0xc008ae0da0}, {0x7ffe446421da, 0xc}, {0x7ffe446421f2, 0xc})
    /usr/local/go/src/net/http/server.go:3330 +0x486
net/http.(*Server).ListenAndServeTLS(0xc008aec000, {0x7ffe446421da, 0xc}, {0x7ffe446421f2, 0xc})
    /usr/local/go/src/net/http/server.go:3487 +0x125
net/http.ListenAndServeTLS(...)
    /usr/local/go/src/net/http/server.go:3453
main.start()
    /k8s-vgpu/cmd/scheduler/main.go:90 +0x52d
main.init.func1(0xc000436100?, {0x1a9d158?, 0x4?, 0x1a9d15c?})
    /k8s-vgpu/cmd/scheduler/main.go:45 +0xf
github.com/spf13/cobra.(*Command).execute(0x2a757c0, {0xc000202a90, 0x1b, 0x1b})
    /go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:989 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0x2a757c0)
    /go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
    /go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041
main.main()
    /k8s-vgpu/cmd/scheduler/main.go:97 +0x1e

goroutine 143 [chan receive]:
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/shared_informer.go:969 +0x4b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000029f70, {0x1d28440, 0xc0003c4210}, 0x1, 0xc000620060)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00049e770, 0x3b9aca00, 0x0, 0x1, 0xc000620060)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0002e1cb0)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/shared_informer.go:968 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 195
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:70 +0x73

goroutine 127 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc000444188, 0xb733)
    /usr/local/go/src/runtime/sema.go:569 +0x159
sync.(*Cond).Wait(0xc004ef56c0?)
    /usr/local/go/src/sync/cond.go:70 +0x85
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc000444160, 0xc00061a070)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/delta_fifo.go:575 +0x236
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc000440460)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/controller.go:188 +0x30
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0001b6e48, {0x1d28440, 0xc0005b01e0}, 0x1, 0xc00051e060)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0001b6e48, 0x3b9aca00, 0x0, 0x1, 0xc00051e060)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*controller).Run(0xc000440460, 0xc00051e060)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/controller.go:159 +0x35e
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run(0xc0002dcdc0, 0xc00051e060)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/shared_informer.go:504 +0x2c8
k8s.io/client-go/informers.(*sharedInformerFactory).Start.func1()
    /go/pkg/mod/k8s.io/client-go@v0.28.3/informers/factory.go:150 +0x5c
created by k8s.io/client-go/informers.(*sharedInformerFactory).Start in goroutine 1
    /go/pkg/mod/k8s.io/client-go@v0.28.3/informers/factory.go:148 +0x205

goroutine 128 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0002dd158, 0xbb157)
    /usr/local/go/src/runtime/sema.go:569 +0x159
sync.(*Cond).Wait(0xc005f1b8a0?)
    /usr/local/go/src/sync/cond.go:70 +0x85
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc0002dd130, 0xc00044e5a0)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/delta_fifo.go:575 +0x236
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc0003d2f00)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/controller.go:188 +0x30
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00002de48, {0x1d28440, 0xc000615440}, 0x1, 0xc00051e060)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00002de48, 0x3b9aca00, 0x0, 0x1, 0xc00051e060)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*controller).Run(0xc0003d2f00, 0xc00051e060)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/controller.go:159 +0x35e
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run(0xc0002dd080, 0xc00051e060)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/shared_informer.go:504 +0x2c8
k8s.io/client-go/informers.(*sharedInformerFactory).Start.func1()
    /go/pkg/mod/k8s.io/client-go@v0.28.3/informers/factory.go:150 +0x5c
created by k8s.io/client-go/informers.(*sharedInformerFactory).Start in goroutine 1
    /go/pkg/mod/k8s.io/client-go@v0.28.3/informers/factory.go:148 +0x205

goroutine 148 [chan receive]:
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/shared_informer.go:969 +0x4b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc008d0ff70, {0x1d28440, 0xc0002e39b0}, 0x1, 0xc000342000)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00049ff70, 0x3b9aca00, 0x0, 0x1, 0xc000342000)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0002e1b90)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/shared_informer.go:968 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 179
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:70 +0x73

goroutine 112 [IO wait]:
internal/poll.runtime_pollWait(0x7f0d74613db8, 0x72)
    /usr/local/go/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0xc0003de080?, 0xc008e76000?, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0003de080, {0xc008e76000, 0xa000, 0xa000})
    /usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a
net.(*netFD).Read(0xc0003de080, {0xc008e76000?, 0x7f0d6450f588?, 0xc0025ebcb0?})
    /usr/local/go/src/net/fd_posix.go:55 +0x25
net.(*conn).Read(0xc000610038, {0xc008e76000?, 0xc00002a938?, 0x4136bb?})
    /usr/local/go/src/net/net.go:185 +0x45
crypto/tls.(*atLeastReader).Read(0xc0025ebcb0, {0xc008e76000?, 0x0?, 0xc0025ebcb0?})
    /usr/local/go/src/crypto/tls/conn.go:806 +0x3b
bytes.(*Buffer).ReadFrom(0xc00024c630, {0x1d26d40, 0xc0025ebcb0})
    /usr/local/go/src/bytes/buffer.go:211 +0x98
crypto/tls.(*Conn).readFromUntil(0xc00024c388, {0x1d270c0, 0xc000610038}, 0xc00002a980?)
    /usr/local/go/src/crypto/tls/conn.go:828 +0xde
crypto/tls.(*Conn).readRecordOrCCS(0xc00024c388, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:626 +0x3cf
crypto/tls.(*Conn).readRecord(...)
    /usr/local/go/src/crypto/tls/conn.go:588
crypto/tls.(*Conn).Read(0xc00024c388, {0xc0005cf000, 0x1000, 0x94c471?})
    /usr/local/go/src/crypto/tls/conn.go:1370 +0x156
bufio.(*Reader).Read(0xc0005c89c0, {0xc0005c42e0, 0x9, 0x0?})
    /usr/local/go/src/bufio/bufio.go:241 +0x197
io.ReadAtLeast({0x1d262a0, 0xc0005c89c0}, {0xc0005c42e0, 0x9, 0x9}, 0x9)
    /usr/local/go/src/io/io.go:335 +0x90
io.ReadFull(...)
    /usr/local/go/src/io/io.go:354
golang.org/x/net/http2.readFrameHeader({0xc0005c42e0, 0x9, 0x2adc0?}, {0x1d262a0?, 0xc0005c89c0?})
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/frame.go:237 +0x65
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0005c42a0)
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/frame.go:501 +0x85
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc00002afa8)
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/transport.go:2358 +0xda
golang.org/x/net/http2.(*ClientConn).readLoop(0xc0002d4180)
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/transport.go:2254 +0x8b
created by golang.org/x/net/http2.(*Transport).newClientConn in goroutine 111
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/transport.go:869 +0xd1b

goroutine 179 [chan receive, 2172 minutes]:
k8s.io/client-go/tools/cache.(*sharedProcessor).run(0xc00062e460, 0xc00051e120)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/shared_informer.go:803 +0x4d
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run.(*Group).StartWithChannel.func4()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:55 +0x1b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 127
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:70 +0x73

goroutine 180 [chan receive, 2172 minutes]:
k8s.io/client-go/tools/cache.(*controller).Run.func1()
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/controller.go:132 +0x25
created by k8s.io/client-go/tools/cache.(*controller).Run in goroutine 127
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/controller.go:131 +0xa9

goroutine 195 [chan receive, 2172 minutes]:
k8s.io/client-go/tools/cache.(*sharedProcessor).run(0xc00062e4b0, 0xc000216660)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/shared_informer.go:803 +0x4d
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run.(*Group).StartWithChannel.func4()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:55 +0x1b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 128
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:70 +0x73

goroutine 196 [chan receive, 2172 minutes]:
k8s.io/client-go/tools/cache.(*controller).Run.func1()
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/controller.go:132 +0x25
created by k8s.io/client-go/tools/cache.(*controller).Run in goroutine 128
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/controller.go:131 +0xa9

goroutine 197 [select]:
k8s.io/client-go/tools/cache.watchHandler({0x0?, 0x0?, 0x2a8c200?}, {0x1d2dde8, 0xc006ca99c0}, {0x7f0d743423c0, 0xc0002dd130}, {0x1d559a8, 0x1a6d060}, 0x0, ...)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:714 +0x187
k8s.io/client-go/tools/cache.(*Reflector).watch(0xc0002b0380, {0x0?, 0x0?}, 0xc00051e060, 0xc000380120)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:433 +0x545
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc0002b0380, 0xc00051e060)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:358 +0x377
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:291 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x2a8c900?)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005bbf50, {0x1d28460, 0xc00062e5a0}, 0x1, 0xc00051e060)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc0002b0380, 0xc00051e060)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:290 +0x1c5
k8s.io/client-go/tools/cache.(*controller).Run.(*Group).StartWithChannel.func2()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:55 +0x1b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 128
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:70 +0x73

goroutine 181 [select]:
k8s.io/client-go/tools/cache.watchHandler({0x0?, 0x0?, 0x2a8c200?}, {0x1d2dde8, 0xc008eb4400}, {0x7f0d743423c0, 0xc000444160}, {0x1d559a8, 0x1a6db40}, 0x0, ...)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:714 +0x187
k8s.io/client-go/tools/cache.(*Reflector).watch(0xc00052a000, {0x0?, 0x0?}, 0xc00051e060, 0xc007d9daa0)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:433 +0x545
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc00052a000, 0xc00051e060)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:358 +0x377
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:291 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x2a8c900?)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc008637f50, {0x1d28460, 0xc0003b6370}, 0x1, 0xc00051e060)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc00052a000, 0xc00051e060)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:290 +0x1c5
k8s.io/client-go/tools/cache.(*controller).Run.(*Group).StartWithChannel.func2()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:55 +0x1b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 127
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:70 +0x73

goroutine 144 [select]:
k8s.io/client-go/tools/cache.(*processorListener).pop(0xc0002e1cb0)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/shared_informer.go:939 +0x107
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 195
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:70 +0x73

goroutine 149 [select]:
k8s.io/client-go/tools/cache.(*processorListener).pop(0xc0002e1b90)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/shared_informer.go:939 +0x107
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 179
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/wait/wait.go:70 +0x73

goroutine 201 [select, 12 minutes]:
k8s.io/client-go/tools/cache.(*Reflector).startResync(0xc0002b0380, 0xc00051e060, 0xc0006211a0, 0xc000380120)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:370 +0x10f
created by k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch in goroutine 197
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:357 +0x34d

goroutine 226 [select, 12 minutes]:
k8s.io/client-go/tools/cache.(*Reflector).startResync(0xc00052a000, 0xc00051e060, 0xc0080f9b60, 0xc007d9daa0)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:370 +0x10f
created by k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch in goroutine 181
    /go/pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:357 +0x34d

goroutine 151 [IO wait, 2171 minutes]:
internal/poll.runtime_pollWait(0x7f0d74613cc0, 0x72)
    /usr/local/go/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0x8?, 0x10?, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc00879e380)
    /usr/local/go/src/internal/poll/fd_unix.go:611 +0x2ac
net.(*netFD).accept(0xc00879e380)
    /usr/local/go/src/net/fd_unix.go:172 +0x29
net.(*TCPListener).accept(0xc0089fed60)
    /usr/local/go/src/net/tcpsock_posix.go:159 +0x1e
net.(*TCPListener).Accept(0xc0089fed60)
    /usr/local/go/src/net/tcpsock.go:327 +0x30
net/http.(*Server).Serve(0xc008a2a000, {0x1d34518, 0xc0089fed60})
    /usr/local/go/src/net/http/server.go:3260 +0x33e
net/http.(*Server).ListenAndServe(0xc008a2a000)
    /usr/local/go/src/net/http/server.go:3189 +0x71
net/http.ListenAndServe(...)
    /usr/local/go/src/net/http/server.go:3443
main.initMetrics({0x7ffe44642236, 0x5})
    /k8s-vgpu/cmd/scheduler/metrics.go:239 +0x225
created by main.start in goroutine 1
    /k8s-vgpu/cmd/scheduler/main.go:76 +0x14b

goroutine 348153 [select]:
net/http.(*http2serverConn).serve(0xc0096db040)
    /usr/local/go/src/net/http/h2_bundle.go:4757 +0x897
net/http.(*http2Server).ServeConn(0xc008ae6fa0, {0x1d482b8, 0xc009e82a88}, 0xc009da9b30)
    /usr/local/go/src/net/http/h2_bundle.go:4345 +0xbad
net/http.http2ConfigureServer.func1(0xc008aec000, 0xc009e82a88, {0x1d266c0, 0xc009e7db80})
    /usr/local/go/src/net/http/h2_bundle.go:4135 +0x125
net/http.(*conn).serve(0xc009e850e0, {0x1d41a90, 0xc008aded80})
    /usr/local/go/src/net/http/server.go:1952 +0x12f3
created by net/http.(*Server).Serve in goroutine 1
    /usr/local/go/src/net/http/server.go:3290 +0x4b4

goroutine 347327 [select, 6 minutes]:
golang.org/x/net/http2.(*clientStream).writeRequest(0xc001ff6180, 0xc00733c6c0, 0x0)
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/transport.go:1536 +0xa85
golang.org/x/net/http2.(*clientStream).doRequest(0xc001ff6180, 0x0?, 0xc0056e6480?)
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/transport.go:1414 +0x56
created by golang.org/x/net/http2.(*ClientConn).roundTrip in goroutine 181
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/transport.go:1319 +0x3e5

goroutine 347328 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc001ff61c8, 0x6f)
    /usr/local/go/src/runtime/sema.go:569 +0x159
sync.(*Cond).Wait(0x1a?)
    /usr/local/go/src/sync/cond.go:70 +0x85
golang.org/x/net/http2.(*pipe).Read(0xc001ff61b0, {0xc0006e0001, 0x7dff, 0x7dff})
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/pipe.go:76 +0xdf
golang.org/x/net/http2.transportResponseBody.Read({0x374f?}, {0xc0006e0001?, 0xc00a3cdce0?, 0x4136bb?})
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/transport.go:2641 +0x65
encoding/json.(*Decoder).refill(0xc0065788c0)
    /usr/local/go/src/encoding/json/stream.go:165 +0x188
encoding/json.(*Decoder).readValue(0xc0065788c0)
    /usr/local/go/src/encoding/json/stream.go:140 +0x85
encoding/json.(*Decoder).Decode(0xc0065788c0, {0x185f3c0, 0xc002d479b0})
    /usr/local/go/src/encoding/json/stream.go:63 +0x75
k8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc009184570, {0xc0026d8000, 0x8000, 0xa000})
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/framer/framer.go:152 +0x19c
k8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc00393c6e0, 0x0, {0x1d2d050, 0xc007adb380})
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/runtime/serializer/streaming/streaming.go:77 +0xa3
k8s.io/client-go/rest/watch.(*Decoder).Decode(0xc0063126c0)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/rest/watch/decoder.go:49 +0x4b
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc008eb4400)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/watch/streamwatcher.go:105 +0xdb
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher in goroutine 181
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/watch/streamwatcher.go:76 +0x105

goroutine 348319 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0007a6948, 0xc7)
    /usr/local/go/src/runtime/sema.go:569 +0x159
sync.(*Cond).Wait(0x1a?)
    /usr/local/go/src/sync/cond.go:70 +0x85
golang.org/x/net/http2.(*pipe).Read(0xc0007a6930, {0xc00342e001, 0x7dff, 0x7dff})
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/pipe.go:76 +0xdf
golang.org/x/net/http2.transportResponseBody.Read({0x5893?}, {0xc00342e001?, 0xc000067ce0?, 0x4136bb?})
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/transport.go:2641 +0x65
encoding/json.(*Decoder).refill(0xc0002f3180)
    /usr/local/go/src/encoding/json/stream.go:165 +0x188
encoding/json.(*Decoder).readValue(0xc0002f3180)
    /usr/local/go/src/encoding/json/stream.go:140 +0x85
encoding/json.(*Decoder).Decode(0xc0002f3180, {0x185f3c0, 0xc002d95410})
    /usr/local/go/src/encoding/json/stream.go:63 +0x75
k8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc005514510, {0xc00344c000, 0x8000, 0xa000})
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/util/framer/framer.go:152 +0x19c
k8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc00341afa0, 0x0, {0x1d2d050, 0xc0096cf480})
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/runtime/serializer/streaming/streaming.go:77 +0xa3
k8s.io/client-go/rest/watch.(*Decoder).Decode(0xc00494bee0)
    /go/pkg/mod/k8s.io/client-go@v0.28.3/rest/watch/decoder.go:49 +0x4b
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc006ca99c0)
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/watch/streamwatcher.go:105 +0xdb
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher in goroutine 197
    /go/pkg/mod/k8s.io/apimachinery@v0.28.3/pkg/watch/streamwatcher.go:76 +0x105

goroutine 347503 [IO wait]:
internal/poll.runtime_pollWait(0x7f0d74613120, 0x72)
    /usr/local/go/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0xc007e3cf80?, 0xc004a84a00?, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc007e3cf80, {0xc004a84a00, 0x2500, 0x2500})
    /usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a
net.(*netFD).Read(0xc007e3cf80, {0xc004a84a00?, 0x7f0d6450f588?, 0xc0025ebc80?})
    /usr/local/go/src/net/fd_posix.go:55 +0x25
net.(*conn).Read(0xc009c634d0, {0xc004a84a00?, 0xc003ec9788?, 0x4136bb?})
    /usr/local/go/src/net/net.go:185 +0x45
crypto/tls.(*atLeastReader).Read(0xc0025ebc80, {0xc004a84a00?, 0x0?, 0xc0025ebc80?})
    /usr/local/go/src/crypto/tls/conn.go:806 +0x3b
bytes.(*Buffer).ReadFrom(0xc0044bd7b0, {0x1d26d40, 0xc0025ebc80})
    /usr/local/go/src/bytes/buffer.go:211 +0x98
crypto/tls.(*Conn).readFromUntil(0xc0044bd508, {0x1d270c0, 0xc009c634d0}, 0xc003ec97d0?)
    /usr/local/go/src/crypto/tls/conn.go:828 +0xde
crypto/tls.(*Conn).readRecordOrCCS(0xc0044bd508, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:626 +0x3cf
crypto/tls.(*Conn).readRecord(...)
    /usr/local/go/src/crypto/tls/conn.go:588
crypto/tls.(*Conn).Read(0xc0044bd508, {0xc0044f4000, 0x1000, 0x0?})
    /usr/local/go/src/crypto/tls/conn.go:1370 +0x156
net/http.(*connReader).Read(0xc003f87680, {0xc0044f4000, 0x1000, 0x1000})
    /usr/local/go/src/net/http/server.go:789 +0x14b
bufio.(*Reader).fill(0xc0043a42a0)
    /usr/local/go/src/bufio/bufio.go:110 +0x103
bufio.(*Reader).Peek(0xc0043a42a0, 0x4)
    /usr/local/go/src/bufio/bufio.go:148 +0x53
net/http.(*conn).serve(0xc0044d50e0, {0x1d41a90, 0xc008aded80})
    /usr/local/go/src/net/http/server.go:2079 +0x749
created by net/http.(*Server).Serve in goroutine 1
    /usr/local/go/src/net/http/server.go:3290 +0x4b4

goroutine 348156 [IO wait]:
internal/poll.runtime_pollWait(0x7f0d74613bc8, 0x72)
    /usr/local/go/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0xc000c21e80?, 0xc003df0000?, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000c21e80, {0xc003df0000, 0x6000, 0x6000})
    /usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a
net.(*netFD).Read(0xc000c21e80, {0xc003df0000?, 0x7f0d745a63d8?, 0xc0007c5ab8?})
    /usr/local/go/src/net/fd_posix.go:55 +0x25
net.(*conn).Read(0xc009e13420, {0xc003df0000?, 0xc000026a58?, 0x4136bb?})
    /usr/local/go/src/net/net.go:185 +0x45
crypto/tls.(*atLeastReader).Read(0xc0007c5ab8, {0xc003df0000?, 0x0?, 0xc0007c5ab8?})
    /usr/local/go/src/crypto/tls/conn.go:806 +0x3b
bytes.(*Buffer).ReadFrom(0xc009e82d30, {0x1d26d40, 0xc0007c5ab8})
    /usr/local/go/src/bytes/buffer.go:211 +0x98
crypto/tls.(*Conn).readFromUntil(0xc009e82a88, {0x1d270c0, 0xc009e13420}, 0xc000026aa0?)
    /usr/local/go/src/crypto/tls/conn.go:828 +0xde
crypto/tls.(*Conn).readRecordOrCCS(0xc009e82a88, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:626 +0x3cf
crypto/tls.(*Conn).readRecord(...)
    /usr/local/go/src/crypto/tls/conn.go:588
crypto/tls.(*Conn).Read(0xc009e82a88, {0xc001de3ee0, 0x9, 0x453186?})
    /usr/local/go/src/crypto/tls/conn.go:1370 +0x156
io.ReadAtLeast({0x7f0d74555118, 0xc009e82a88}, {0xc001de3ee0, 0x9, 0x9}, 0x9)
    /usr/local/go/src/io/io.go:335 +0x90
io.ReadFull(...)
    /usr/local/go/src/io/io.go:354
net/http.http2readFrameHeader({0xc001de3ee0, 0x9, 0x0?}, {0x7f0d74555118?, 0xc009e82a88?})
    /usr/local/go/src/net/http/h2_bundle.go:1638 +0x65
net/http.(*http2Framer).ReadFrame(0xc001de3ea0)
    /usr/local/go/src/net/http/h2_bundle.go:1905 +0x85
net/http.(*http2serverConn).readFrames(0xc0096db040)
    /usr/local/go/src/net/http/h2_bundle.go:4637 +0x87
created by net/http.(*http2serverConn).serve in goroutine 348153
    /usr/local/go/src/net/http/h2_bundle.go:4749 +0x56a

goroutine 348448 [runnable]:
fmt.(*pp).printArg(0xc005e90000?, {0x1742900?, 0xc009d18320?}, 0x73?)
    /usr/local/go/src/fmt/print.go:681 +0x5bd
fmt.(*pp).doPrintf(0xc005e90000, {0x1af4138, 0x37}, {0xc001d6e6c0, 0x4, 0x4})
    /usr/local/go/src/fmt/print.go:1075 +0x37e
fmt.Fprintf({0x1d26ac0, 0xc0096bc8c0}, {0x1af4138, 0x37}, {0xc001d6e6c0, 0x4, 0x4})
    /usr/local/go/src/fmt/print.go:224 +0x71
k8s.io/klog/v2.(*loggingT).printfDepth(0x2a8c900, 0x0, 0x0, {0x0, 0x0}, 0x1, {0x1af4138, 0x37}, {0xc001d6e6c0, 0x4, ...})
    /go/pkg/mod/k8s.io/klog/v2@v2.120.1/klog.go:763 +0x165
k8s.io/klog/v2.(*loggingT).printf(...)
    /go/pkg/mod/k8s.io/klog/v2@v2.120.1/klog.go:744
k8s.io/klog/v2.Infof(...)
    /go/pkg/mod/k8s.io/klog/v2@v2.120.1/klog.go:1525
github.com/Project-HAMi/HAMi/pkg/scheduler.(*podManager).addPod(0xc000658020, 0xc001100908, {0xc00869e080, 0xe}, 0xc005013980)
    /k8s-vgpu/pkg/scheduler/pods.go:63 +0x338
github.com/Project-HAMi/HAMi/pkg/scheduler.(*Scheduler).Filter(0xc000658000, {0xc001100908?, 0x0?, 0xc000ad40c0?})
    /k8s-vgpu/pkg/scheduler/scheduler.go:486 +0xb38
github.com/Project-HAMi/HAMi/pkg/scheduler/routes.PredicateRoute.func1({0x1d341b8, 0xc0093ff138}, 0xc006228000, {0x0?, 0x0?, 0x0?})
    /k8s-vgpu/pkg/scheduler/routes/route.go:59 +0x33b
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc00884dbc0, {0x1d341b8, 0xc0093ff138}, 0xc006228000)
    /go/pkg/mod/github.com/julienschmidt/httprouter@v1.3.0/router.go:387 +0x7eb
net/http.serverHandler.ServeHTTP({0xc00509e480?}, {0x1d341b8?, 0xc0093ff138?}, 0xc0002d42a0?)
    /usr/local/go/src/net/http/server.go:3142 +0x8e
net/http.initALPNRequest.ServeHTTP({{0x1d41a90?, 0xc009eb8300?}, 0xc009e82a88?, {0xc008aec000?}}, {0x1d341b8, 0xc0093ff138}, 0xc006228000)
    /usr/local/go/src/net/http/server.go:3750 +0x231
net/http.(*http2serverConn).runHandler(0x952ba8?, 0xc007482900?, 0x0?, 0xc00a9037d0?)
    /usr/local/go/src/net/http/h2_bundle.go:6192 +0xbb
created by net/http.(*http2serverConn).scheduleHandler in goroutine 348153
    /usr/local/go/src/net/http/h2_bundle.go:6127 +0x21d

goroutine 348318 [select]:
golang.org/x/net/http2.(*clientStream).writeRequest(0xc0007a6900, 0xc0063ccd80, 0x0)
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/transport.go:1536 +0xa85
golang.org/x/net/http2.(*clientStream).doRequest(0xc0007a6900, 0x6ea845?, 0xc009971830?)
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/transport.go:1414 +0x56
created by golang.org/x/net/http2.(*ClientConn).roundTrip in goroutine 197
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/transport.go:1319 +0x3e5

Environment:

Nimbus318 commented 3 weeks ago

Could you please provide the exact HAMi image version so we can trace the specific code line? It looks like certain map-type fields in the scheduler are being accessed concurrently without locks, which causes the fatal error: concurrent map iteration and map write.

jeonghyunkeem commented 3 weeks ago

@Nimbus318 vgpu-scheduler-extender uses the following image: projecthami/hami:v2.3.13

Nimbus318 commented 3 weeks ago

@jeonghyunkeem Got it, I checked, and I know where the problem is. This issue has already been fixed in #418, so it should no longer occur if you use the latest version, 2.4.0.

jeonghyunkeem commented 2 weeks ago

@Nimbus318 Thanks. I'll test v2.4.0 and close this issue if it works.