Closed p53 closed 3 months ago
The crash that I can see is a duplicate of #13140. This issue does give a good way of reproducing this problem though, so thank you.
Slow queries are not discussed in #13140.
I was checking it, but there is a different panic message.
Sorry, yes, it is a different panic. I'd be surprised if the root cause wasn't the same, though: the sqlite code is dealing with corrupted data.
Yup, I guess it will be related.
@jiachengxu - tagging you to make sure you've seen this. If you're working on it maybe this can help you reproduce.
I can reproduce this with just putting enough workflows into a simple k3d single node cluster (started around 200 copies of examples/dag-diamond.yaml) and calling argo list. Occasionally that will crash in sqlite.
This stack trace implies we have a memory corruption problem in the server. Produced in the same way, using argo list with many dag-diamond.yaml (some running)
net.(*conn).Read(0xc0007b81e8, {0xc0009f4b00?, 0xc001501740?, 0xc002aecc38?})
/usr/local/go/src/net/net.go:179 +0x45 fp=0xc0015016d8 sp=0xc001501690 pc=0x5fe585
net.(*TCPConn).Read(0xc001501770?, {0xc0009f4b00?, 0xc002f14018?, 0x18?})
<autogenerated>:1 +0x25 fp=0xc001501708 sp=0xc0015016d8 pc=0x60f8c5
crypto/tls.(*atLeastReader).Read(0xc002f14018, {0xc0009f4b00?, 0xc002f14018?, 0x0?})
/usr/local/go/src/crypto/tls/conn.go:805 +0x3b fp=0xc001501750 sp=0xc001501708 pc=0x6567fb
bytes.(*Buffer).ReadFrom(0xc002aecd28, {0x3ce03a0, 0xc002f14018})
/usr/local/go/src/bytes/buffer.go:211 +0x98 fp=0xc0015017a8 sp=0xc001501750 pc=0x51c9f8
crypto/tls.(*Conn).readFromUntil(0xc002aeca80, {0x3ce1aa0?, 0xc0007b81e8}, 0x580?)
/usr/local/go/src/crypto/tls/conn.go:827 +0xde fp=0xc0015017e8 sp=0xc0015017a8 pc=0x6569de
crypto/tls.(*Conn).readRecordOrCCS(0xc002aeca80, 0x0)
/usr/local/go/src/crypto/tls/conn.go:625 +0x250 fp=0xc001501b88 sp=0xc0015017e8 pc=0x653fb0
crypto/tls.(*Conn).readRecord(...)
/usr/local/go/src/crypto/tls/conn.go:587
crypto/tls.(*Conn).Read(0xc002aeca80, {0xc000980000, 0x8000, 0x1060100000000?})
/usr/local/go/src/crypto/tls/conn.go:1369 +0x158 fp=0xc001501bf8 sp=0xc001501b88 pc=0x65a278
github.com/soheilhy/cmux.(*bufferedReader).Read(0xc00017c010, {0xc000980000, 0xc001501c90?, 0x8000})
/go/pkg/mod/github.com/soheilhy/cmux@v0.1.5/buffer.go:53 +0x12f fp=0xc001501c48 sp=0xc001501bf8 pc=0x1f8812f
github.com/soheilhy/cmux.(*MuxConn).Read(0x0?, {0xc000980000?, 0xc001501ca0?, 0x45d10d?})
/go/pkg/mod/github.com/soheilhy/cmux@v0.1.5/cmux.go:297 +0x1e fp=0xc001501c78 sp=0xc001501c48 pc=0x1f8965e
bufio.(*Reader).Read(0xc0035ff980, {0xc0006da4a0, 0x9, 0xc1921ef224b271f3?})
/usr/local/go/src/bufio/bufio.go:244 +0x197 fp=0xc001501cb0 sp=0xc001501c78 pc=0x696c77
io.ReadAtLeast({0x3ce05c0, 0xc0035ff980}, {0xc0006da4a0, 0x9, 0x9}, 0x9)
/usr/local/go/src/io/io.go:335 +0x90 fp=0xc001501cf8 sp=0xc001501cb0 pc=0x4b9cf0
io.ReadFull(...)
/usr/local/go/src/io/io.go:354
golang.org/x/net/http2.readFrameHeader({0xc0006da4a0, 0x9, 0xc003120120?}, {0x3ce05c0?, 0xc0035ff980?})
/go/pkg/mod/golang.org/x/net@v0.23.0/http2/frame.go:237 +0x65 fp=0xc001501d48 sp=0xc001501cf8 pc=0x779945
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0006da460)
/go/pkg/mod/golang.org/x/net@v0.23.0/http2/frame.go:498 +0x85 fp=0xc001501df0 sp=0xc001501d48 pc=0x77a085
google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc000d891e0, 0x1?)
/go/pkg/mod/google.golang.org/grpc@v1.59.0/internal/transport/http2_server.go:636 +0x145 fp=0xc001501f00 sp=0xc001501df0 pc=0xf84325
google.golang.org/grpc.(*Server).serveStreams(0xc00023e000, {0x3d1cf40?, 0xc000d891e0})
/go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:979 +0x1c2 fp=0xc001501f80 sp=0xc001501f00 pc=0xfd5702
google.golang.org/grpc.(*Server).handleRawConn.func1()
/go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:920 +0x45 fp=0xc001501fe0 sp=0xc001501f80 pc=0xfd4f65
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc001501fe8 sp=0xc001501fe0 pc=0x4712e1
created by google.golang.org/grpc.(*Server).handleRawConn in goroutine 656
/go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:919 +0x185
goroutine 487 [select]:
runtime.gopark(0xc001505f90?, 0x2?, 0xe0?, 0x5d?, 0xc001505f1c?)
/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc001505db8 sp=0xc001505d98 pc=0x43e26e
runtime.selectgo(0xc001505f90, 0xc001505f18, 0xc0007c0180?, 0x0, 0xc0031288a0?, 0x1)
/usr/local/go/src/runtime/select.go:327 +0x725 fp=0xc001505ed8 sp=0xc001505db8 pc=0x44e6a5
net/http.(*persistConn).writeLoop(0xc00178c120)
/usr/local/go/src/net/http/transport.go:2421 +0xe5 fp=0xc001505fc8 sp=0xc001505ed8 pc=0x72d605
net/http.(*Transport).dialConn.func6()
/usr/local/go/src/net/http/transport.go:1777 +0x25 fp=0xc001505fe0 sp=0xc001505fc8 pc=0x72a405
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc001505fe8 sp=0xc001505fe0 pc=0x4712e1
created by net/http.(*Transport).dialConn in goroutine 517
/usr/local/go/src/net/http/transport.go:1777 +0x16f1
goroutine 486 [IO wait]:
runtime.gopark(0xbf97d9ec25bb9557?, 0xb?, 0x0?, 0x0?, 0xd?)
/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000ad15c8 sp=0xc000ad15a8 pc=0x43e26e
runtime.netpollblock(0x4c5158?, 0x407de6?, 0x0?)
/usr/local/go/src/runtime/netpoll.go:564 +0xf7 fp=0xc000ad1600 sp=0xc000ad15c8 pc=0x436cf7
internal/poll.runtime_pollWait(0x7fca5da77148, 0x72)
/usr/local/go/src/runtime/netpoll.go:343 +0x85 fp=0xc000ad1620 sp=0xc000ad1600 pc=0x46b905
internal/poll.(*pollDesc).wait(0xc0019ac680?, 0xc0009f4000?, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000ad1648 sp=0xc000ad1620 pc=0x4e2ec7
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0019ac680, {0xc0009f4000, 0x580, 0x580})
/usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a fp=0xc000ad16e0 sp=0xc000ad1648 pc=0x4e41ba
net.(*netFD).Read(0xc0019ac680, {0xc0009f4000?, 0xc0009f4005?, 0x3e6?})
/usr/local/go/src/net/fd_posix.go:55 +0x25 fp=0xc000ad1728 sp=0xc000ad16e0 pc=0x5ec9a5
net.(*conn).Read(0xc0007b8128, {0xc0009f4000?, 0xc000295a01?, 0xc002aec538?})
/usr/local/go/src/net/net.go:179 +0x45 fp=0xc000ad1770 sp=0xc000ad1728 pc=0x5fe585
net.(*TCPConn).Read(0xc000ad1808?, {0xc0009f4000?, 0xc002f140d8?, 0x18?})
<autogenerated>:1 +0x25 fp=0xc000ad17a0 sp=0xc000ad1770 pc=0x60f8c5
crypto/tls.(*atLeastReader).Read(0xc002f140d8, {0xc0009f4000?, 0xc002f140d8?, 0x0?})
/usr/local/go/src/crypto/tls/conn.go:805 +0x3b fp=0xc000ad17e8 sp=0xc000ad17a0 pc=0x6567fb
bytes.(*Buffer).ReadFrom(0xc002aec628, {0x3ce03a0, 0xc002f140d8})
/usr/local/go/src/bytes/buffer.go:211 +0x98 fp=0xc000ad1840 sp=0xc000ad17e8 pc=0x51c9f8
crypto/tls.(*Conn).readFromUntil(0xc002aec380, {0x3ce1aa0?, 0xc0007b8128}, 0x580?)
/usr/local/go/src/crypto/tls/conn.go:827 +0xde fp=0xc000ad1880 sp=0xc000ad1840 pc=0x6569de
crypto/tls.(*Conn).readRecordOrCCS(0xc002aec380, 0x0)
/usr/local/go/src/crypto/tls/conn.go:625 +0x250 fp=0xc000ad1c20 sp=0xc000ad1880 pc=0x653fb0
crypto/tls.(*Conn).readRecord(...)
/usr/local/go/src/crypto/tls/conn.go:587
crypto/tls.(*Conn).Read(0xc002aec380, {0xc00098a000, 0x1000, 0xd?})
/usr/local/go/src/crypto/tls/conn.go:1369 +0x158 fp=0xc000ad1c90 sp=0xc000ad1c20 pc=0x65a278
net/http.(*persistConn).Read(0xc00178c120, {0xc00098a000?, 0xc000868540?, 0xc000ad1d38?})
/usr/local/go/src/net/http/transport.go:1954 +0x4a fp=0xc000ad1cf0 sp=0xc000ad1c90 pc=0x72ae4a
bufio.(*Reader).fill(0xc0013c1380)
/usr/local/go/src/bufio/bufio.go:113 +0x103 fp=0xc000ad1d28 sp=0xc000ad1cf0 pc=0x696743
bufio.(*Reader).Peek(0xc0013c1380, 0x1)
/usr/local/go/src/bufio/bufio.go:151 +0x53 fp=0xc000ad1d48 sp=0xc000ad1d28 pc=0x696873
net/http.(*persistConn).readLoop(0xc00178c120)
/usr/local/go/src/net/http/transport.go:2118 +0x1b9 fp=0xc000ad1fc8 sp=0xc000ad1d48 pc=0x72bc39
net/http.(*Transport).dialConn.func5()
/usr/local/go/src/net/http/transport.go:1776 +0x25 fp=0xc000ad1fe0 sp=0xc000ad1fc8 pc=0x72a465
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000ad1fe8 sp=0xc000ad1fe0 pc=0x4712e1
created by net/http.(*Transport).dialConn in goroutine 517
/usr/local/go/src/net/http/transport.go:1776 +0x169f
Pre-requisites
:latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened/what did you expect to happen?
We have several hundred workflows in our environment. When listing workflows at 20 req/s to check memory utilization, I get container restarts with a panic in the argo-server pod. Prior to this, I see slow query warnings.
argo-trace.zip
Version
v3.5.7
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter workflows that use private images.
Logs from the workflow controller
Logs from in your workflow's wait container