argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

3.5.7 Server keeps restarting, panicking #13154

Closed p53 closed 3 months ago

p53 commented 3 months ago

What happened/what did you expect to happen?

We have several hundred workflows in our environment. While listing workflows at 20 req/s to check memory utilization, I get container restarts with a panic in the argo-server pod. Prior to the panic I see slow query warnings. Trace attached: argo-trace.zip

Version

v3.5.7

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Any simple workflow will do: create 1000 workflows and then list them at 20 req/s, e.g. with a Firefox tab reloader.
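The listing load described here could also be generated with a small script instead of a browser tab reloader. The following is only a sketch under stated assumptions: `LIST_URL` (an argo-server reachable over plain HTTP on localhost:2746, namespace `argo`, no auth) and the helper names `list_workflows` and `run_at_rate` are hypothetical and not taken from this issue; TLS and authentication depend on the deployment.

```python
import time
import urllib.request

# Assumption: argo-server is reachable at this address without TLS or auth
# and the workflows live in the "argo" namespace. Adjust for your setup.
LIST_URL = "http://localhost:2746/api/v1/workflows/argo"


def list_workflows(url=LIST_URL):
    """Issue one list request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.status


def run_at_rate(send, rate_per_sec, total, sleep=time.sleep, clock=time.monotonic):
    """Call send() `total` times, paced to at most rate_per_sec.

    An exception raised by send() is recorded in the result list instead of
    aborting the run, so a server restart shows up as failed requests.
    """
    interval = 1.0 / rate_per_sec
    results = []
    start = clock()
    for i in range(total):
        # Wait until the scheduled start time of request i.
        delay = start + i * interval - clock()
        if delay > 0:
            sleep(delay)
        try:
            results.append(send())
        except Exception as exc:  # e.g. connection refused while the pod restarts
            results.append(exc)
    return results


# Example: outcomes = run_at_rate(list_workflows, rate_per_sec=20, total=200)
```

The loop is sequential, so the effective rate drops below 20 req/s whenever a single list call takes longer than 50 ms. A concurrent client would be needed to hold the rate against a slow server.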

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

Joibel commented 3 months ago

The crash that I can see is a duplicate of #13140. This issue does give a good way of reproducing this problem though, so thank you.

Slow queries are not discussed in #13140.

p53 commented 3 months ago

I checked it, but the panic message there is different.

Joibel commented 3 months ago

Sorry, yes, it is a different panic. I'd be surprised if the root cause wasn't the same: the sqlite code is dealing with corrupted data.

p53 commented 3 months ago

Yup, I guess it is related.

Joibel commented 3 months ago

@jiachengxu - tagging you to make sure you've seen this. If you're working on it maybe this can help you reproduce.

Joibel commented 3 months ago

I can reproduce this just by putting enough workflows into a simple k3d single-node cluster (I started around 200 copies of examples/dag-diamond.yaml) and calling argo list. Occasionally that crashes in sqlite.
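A rough way to script this reproduction is sketched below. It assumes the `argo` CLI is on the PATH and already pointed at the test cluster, and that the workflows go into the `argo` namespace. The function name `reproduce` and its parameters are hypothetical, and the number of list attempts is arbitrary.

```python
import subprocess


def reproduce(copies=200, list_attempts=20, manifest="examples/dag-diamond.yaml",
              namespace="argo", run=subprocess.run):
    """Submit `copies` workflows, then call `argo list` repeatedly.

    Returns the number of `argo list` calls that exited non-zero.
    """
    for _ in range(copies):
        # Stop early if a submit fails; the cluster is probably misconfigured.
        run(["argo", "submit", "-n", namespace, manifest], check=True)
    failed = 0
    for _ in range(list_attempts):
        result = run(["argo", "list", "-n", namespace], capture_output=True)
        if result.returncode != 0:
            failed += 1
    return failed
```

A non-zero return value only shows that some list calls failed on the client side. Whether the server actually panicked still has to be confirmed from the argo-server pod's restart count and logs.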

Joibel commented 3 months ago

This stack trace implies we have a memory corruption problem in the server. It was produced in the same way, by calling argo list with many dag-diamond.yaml workflows present (some still running):

net.(*conn).Read(0xc0007b81e8, {0xc0009f4b00?, 0xc001501740?, 0xc002aecc38?})                                                                                                                                                                  
    /usr/local/go/src/net/net.go:179 +0x45 fp=0xc0015016d8 sp=0xc001501690 pc=0x5fe585                                                                                                                                                         
net.(*TCPConn).Read(0xc001501770?, {0xc0009f4b00?, 0xc002f14018?, 0x18?})                                                                                                                                                                      
    <autogenerated>:1 +0x25 fp=0xc001501708 sp=0xc0015016d8 pc=0x60f8c5                                                                                                                                                                        
crypto/tls.(*atLeastReader).Read(0xc002f14018, {0xc0009f4b00?, 0xc002f14018?, 0x0?})                                                                                                                                                           
    /usr/local/go/src/crypto/tls/conn.go:805 +0x3b fp=0xc001501750 sp=0xc001501708 pc=0x6567fb                                                                                                                                                 
bytes.(*Buffer).ReadFrom(0xc002aecd28, {0x3ce03a0, 0xc002f14018})                                                                                                                                                                              
    /usr/local/go/src/bytes/buffer.go:211 +0x98 fp=0xc0015017a8 sp=0xc001501750 pc=0x51c9f8                                                                                                                                                    
crypto/tls.(*Conn).readFromUntil(0xc002aeca80, {0x3ce1aa0?, 0xc0007b81e8}, 0x580?)                                                                                                                                                             
    /usr/local/go/src/crypto/tls/conn.go:827 +0xde fp=0xc0015017e8 sp=0xc0015017a8 pc=0x6569de                                                                                                                                                 
crypto/tls.(*Conn).readRecordOrCCS(0xc002aeca80, 0x0)                                                                                                                                                                                          
    /usr/local/go/src/crypto/tls/conn.go:625 +0x250 fp=0xc001501b88 sp=0xc0015017e8 pc=0x653fb0                                                                                                                                                
crypto/tls.(*Conn).readRecord(...)                                                                                                                                                                                                             
    /usr/local/go/src/crypto/tls/conn.go:587                                                                                                                                                                                                   
crypto/tls.(*Conn).Read(0xc002aeca80, {0xc000980000, 0x8000, 0x1060100000000?})                                                                                                                                                                
    /usr/local/go/src/crypto/tls/conn.go:1369 +0x158 fp=0xc001501bf8 sp=0xc001501b88 pc=0x65a278                                                                                                                                               
github.com/soheilhy/cmux.(*bufferedReader).Read(0xc00017c010, {0xc000980000, 0xc001501c90?, 0x8000})                                                                                                                                           
    /go/pkg/mod/github.com/soheilhy/cmux@v0.1.5/buffer.go:53 +0x12f fp=0xc001501c48 sp=0xc001501bf8 pc=0x1f8812f                                                                                                                               
github.com/soheilhy/cmux.(*MuxConn).Read(0x0?, {0xc000980000?, 0xc001501ca0?, 0x45d10d?})                                                                                                                                                      
    /go/pkg/mod/github.com/soheilhy/cmux@v0.1.5/cmux.go:297 +0x1e fp=0xc001501c78 sp=0xc001501c48 pc=0x1f8965e                                                                                                                                 
bufio.(*Reader).Read(0xc0035ff980, {0xc0006da4a0, 0x9, 0xc1921ef224b271f3?})                                                                                                                                                                   
    /usr/local/go/src/bufio/bufio.go:244 +0x197 fp=0xc001501cb0 sp=0xc001501c78 pc=0x696c77                                                                                                                                                    
io.ReadAtLeast({0x3ce05c0, 0xc0035ff980}, {0xc0006da4a0, 0x9, 0x9}, 0x9)                                                                                                                                                                       
    /usr/local/go/src/io/io.go:335 +0x90 fp=0xc001501cf8 sp=0xc001501cb0 pc=0x4b9cf0                                                                                                                                                           
io.ReadFull(...)                                                                                                                                                                                                                               
    /usr/local/go/src/io/io.go:354                                                                                                                                                                                                             
golang.org/x/net/http2.readFrameHeader({0xc0006da4a0, 0x9, 0xc003120120?}, {0x3ce05c0?, 0xc0035ff980?})                                                                                                                                        
    /go/pkg/mod/golang.org/x/net@v0.23.0/http2/frame.go:237 +0x65 fp=0xc001501d48 sp=0xc001501cf8 pc=0x779945                                                                                                                                  
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0006da460)                                                                                                                                                                                       
    /go/pkg/mod/golang.org/x/net@v0.23.0/http2/frame.go:498 +0x85 fp=0xc001501df0 sp=0xc001501d48 pc=0x77a085                                                                                                                                  
google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc000d891e0, 0x1?)                                                                                                                                                     
    /go/pkg/mod/google.golang.org/grpc@v1.59.0/internal/transport/http2_server.go:636 +0x145 fp=0xc001501f00 sp=0xc001501df0 pc=0xf84325                                                                                                       
google.golang.org/grpc.(*Server).serveStreams(0xc00023e000, {0x3d1cf40?, 0xc000d891e0})                                                                                                                                                        
    /go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:979 +0x1c2 fp=0xc001501f80 sp=0xc001501f00 pc=0xfd5702                                                                                                                                
google.golang.org/grpc.(*Server).handleRawConn.func1()                                                                                                                                                                                         
    /go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:920 +0x45 fp=0xc001501fe0 sp=0xc001501f80 pc=0xfd4f65                                                                                                                                 
runtime.goexit()                                                                                                                                                                                                                               
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc001501fe8 sp=0xc001501fe0 pc=0x4712e1                                                                                                                                                
created by google.golang.org/grpc.(*Server).handleRawConn in goroutine 656                                                                                                                                                                     
    /go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:919 +0x185                                                                                                                                                                            

goroutine 487 [select]:                                                                                                                                                                                                                        
runtime.gopark(0xc001505f90?, 0x2?, 0xe0?, 0x5d?, 0xc001505f1c?)                                                                                                                                                                               
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc001505db8 sp=0xc001505d98 pc=0x43e26e                                                                                                                                                    
runtime.selectgo(0xc001505f90, 0xc001505f18, 0xc0007c0180?, 0x0, 0xc0031288a0?, 0x1)                                                                                                                                                           
    /usr/local/go/src/runtime/select.go:327 +0x725 fp=0xc001505ed8 sp=0xc001505db8 pc=0x44e6a5                                                                                                                                                 
net/http.(*persistConn).writeLoop(0xc00178c120)                                                                                                                                                                                                
    /usr/local/go/src/net/http/transport.go:2421 +0xe5 fp=0xc001505fc8 sp=0xc001505ed8 pc=0x72d605                                                                                                                                             
net/http.(*Transport).dialConn.func6()                                                                                                                                                                                                         
    /usr/local/go/src/net/http/transport.go:1777 +0x25 fp=0xc001505fe0 sp=0xc001505fc8 pc=0x72a405                                                                                                                                             
runtime.goexit()                                                                                                                                                                                                                               
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc001505fe8 sp=0xc001505fe0 pc=0x4712e1                                                                                                                                                
created by net/http.(*Transport).dialConn in goroutine 517                                                                                                                                                                                     
    /usr/local/go/src/net/http/transport.go:1777 +0x16f1                                                             

goroutine 486 [IO wait]:                                                                                                                                                                                                                       
runtime.gopark(0xbf97d9ec25bb9557?, 0xb?, 0x0?, 0x0?, 0xd?)                                                                                                                                                                                    
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000ad15c8 sp=0xc000ad15a8 pc=0x43e26e                                                                                                                                                    
runtime.netpollblock(0x4c5158?, 0x407de6?, 0x0?)                                                                                                                                                                                               
    /usr/local/go/src/runtime/netpoll.go:564 +0xf7 fp=0xc000ad1600 sp=0xc000ad15c8 pc=0x436cf7                                                                                                                                                 
internal/poll.runtime_pollWait(0x7fca5da77148, 0x72)                                                                                                                                                                                           
    /usr/local/go/src/runtime/netpoll.go:343 +0x85 fp=0xc000ad1620 sp=0xc000ad1600 pc=0x46b905                                                                                                                                                 
internal/poll.(*pollDesc).wait(0xc0019ac680?, 0xc0009f4000?, 0x0)                                                                                                                                                                              
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000ad1648 sp=0xc000ad1620 pc=0x4e2ec7                                                                                                                                    
internal/poll.(*pollDesc).waitRead(...)                                                                                                                                                                                                        
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:89                                                                                                                                                                                      
internal/poll.(*FD).Read(0xc0019ac680, {0xc0009f4000, 0x580, 0x580})                                                                                                                                                                           
    /usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a fp=0xc000ad16e0 sp=0xc000ad1648 pc=0x4e41ba                                                                                                                                          
net.(*netFD).Read(0xc0019ac680, {0xc0009f4000?, 0xc0009f4005?, 0x3e6?})                                                                                                                                                                        
    /usr/local/go/src/net/fd_posix.go:55 +0x25 fp=0xc000ad1728 sp=0xc000ad16e0 pc=0x5ec9a5                                                                                                                                                     
net.(*conn).Read(0xc0007b8128, {0xc0009f4000?, 0xc000295a01?, 0xc002aec538?})                                                                                                                                                                  
    /usr/local/go/src/net/net.go:179 +0x45 fp=0xc000ad1770 sp=0xc000ad1728 pc=0x5fe585                                                                                                                                                         
net.(*TCPConn).Read(0xc000ad1808?, {0xc0009f4000?, 0xc002f140d8?, 0x18?})                                                                                                                                                                      
    <autogenerated>:1 +0x25 fp=0xc000ad17a0 sp=0xc000ad1770 pc=0x60f8c5                                                                                                                                                                        
crypto/tls.(*atLeastReader).Read(0xc002f140d8, {0xc0009f4000?, 0xc002f140d8?, 0x0?})                                                                                                                                                           
    /usr/local/go/src/crypto/tls/conn.go:805 +0x3b fp=0xc000ad17e8 sp=0xc000ad17a0 pc=0x6567fb                                                                                                                                                 
bytes.(*Buffer).ReadFrom(0xc002aec628, {0x3ce03a0, 0xc002f140d8})                                                                                                                                                                              
    /usr/local/go/src/bytes/buffer.go:211 +0x98 fp=0xc000ad1840 sp=0xc000ad17e8 pc=0x51c9f8                                                                                                                                                    
crypto/tls.(*Conn).readFromUntil(0xc002aec380, {0x3ce1aa0?, 0xc0007b8128}, 0x580?)                                                                                                                                                             
    /usr/local/go/src/crypto/tls/conn.go:827 +0xde fp=0xc000ad1880 sp=0xc000ad1840 pc=0x6569de                                                                                                                                                 
crypto/tls.(*Conn).readRecordOrCCS(0xc002aec380, 0x0)                                                                                                                                                                                          
    /usr/local/go/src/crypto/tls/conn.go:625 +0x250 fp=0xc000ad1c20 sp=0xc000ad1880 pc=0x653fb0                                                                                                                                                
crypto/tls.(*Conn).readRecord(...)                                                                                                                                                                                                             
    /usr/local/go/src/crypto/tls/conn.go:587                                                                                                                                                                                                   
crypto/tls.(*Conn).Read(0xc002aec380, {0xc00098a000, 0x1000, 0xd?})                                                                                                                                                                            
    /usr/local/go/src/crypto/tls/conn.go:1369 +0x158 fp=0xc000ad1c90 sp=0xc000ad1c20 pc=0x65a278                                                                                                                                               
net/http.(*persistConn).Read(0xc00178c120, {0xc00098a000?, 0xc000868540?, 0xc000ad1d38?})                                                                                                                                                      
    /usr/local/go/src/net/http/transport.go:1954 +0x4a fp=0xc000ad1cf0 sp=0xc000ad1c90 pc=0x72ae4a                                                                                                                                             
bufio.(*Reader).fill(0xc0013c1380)                                                                                                                                                                                                             
    /usr/local/go/src/bufio/bufio.go:113 +0x103 fp=0xc000ad1d28 sp=0xc000ad1cf0 pc=0x696743                                                                                                                                                    
bufio.(*Reader).Peek(0xc0013c1380, 0x1)                                                                                                                                                                                                        
    /usr/local/go/src/bufio/bufio.go:151 +0x53 fp=0xc000ad1d48 sp=0xc000ad1d28 pc=0x696873                                                                                                                                                     
net/http.(*persistConn).readLoop(0xc00178c120)                                                                                                                                                                                                 
    /usr/local/go/src/net/http/transport.go:2118 +0x1b9 fp=0xc000ad1fc8 sp=0xc000ad1d48 pc=0x72bc39                                                                                                                                            
net/http.(*Transport).dialConn.func5()                                                                                                                                                                                                         
    /usr/local/go/src/net/http/transport.go:1776 +0x25 fp=0xc000ad1fe0 sp=0xc000ad1fc8 pc=0x72a465                                                                                                                                             
runtime.goexit()                                                                                                                                                                                                                               
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000ad1fe8 sp=0xc000ad1fe0 pc=0x4712e1                                                                                                                                                
created by net/http.(*Transport).dialConn in goroutine 517                                                                                                                                                                                     
    /usr/local/go/src/net/http/transport.go:1776 +0x169f