brigadecore / brigade

Event-driven scripting for Kubernetes
https://brigade.sh/
Apache License 2.0

bug? seemingly random job abortion #1899

Open · krancour opened this issue 2 years ago

krancour commented 2 years ago

Twice today, I saw a job aborted for no readily apparent reason, and the associated worker failed.

The first occurrence was with v2.3.1 and the second was after upgrading the dogfood cluster to v2.4.0-rc.2.

Here's the data I gathered from the second occurrence:

$ brig event get -i 61a4332c-c26e-4fa8-a3a0-4c5bd16cc107

ID                                      PROJECT SOURCE              TYPE                    AGE WORKER PHASE
61a4332c-c26e-4fa8-a3a0-4c5bd16cc107    brigade brigade.sh/github   ci:pipeline_requested   1h  FAILED      

Event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" jobs:

NAME                STARTED ENDED   PHASE    
test-unit-js        1h      1h      SUCCEEDED
yarn-audit          1h      1h      SUCCEEDED
lint-chart          1h      1h      SUCCEEDED
validate-examples   1h      1h      SUCCEEDED
lint-js                             PENDING  
validate-schemas    1h      1h      SUCCEEDED
style-check-js      1h      1h      SUCCEEDED
test-unit-go        1h              ABORTED  
lint-go                             PENDING

Worker logs reveal this:

...

2022-03-30T23:54:23.354Z ERROR: Error creating job "test-unit-js": An internal server error occurred.
/var/brigade-worker/worker/dist/prestart.js:204
        throw new Error(`error executing node brigade.js:\n\n${e.output}`);
        ^

...

An internal server error should have been logged by the API server, so I looked at its logs next:

2022/03/30 23:54:23 error scheduling event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "test-unit-js" on the substrate: error submitting execution task for event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "test-unit-js": error sending amqp message for queue "jobs.brigade": unexpected frame: &amqp.performAttach{Name:"mwYiuAiQm58ubLB3bQ77krRjFxXwde2eDQTkYl1RObbAOV6rvi73nw", Handle:0x0, Role:true, SenderSettleMode:(*amqp.SenderSettleMode)(0xc0009a4610), ReceiverSettleMode:(*amqp.ReceiverSettleMode)(0xc0009a4611), Source:(*amqp.source)(nil), Target:(*amqp.target)(0xc00008dd10), Unsettled:amqp.unsettled(nil), IncompleteUnsettled:false, InitialDeliveryCount:0x0, MaxMessageSize:0x0, OfferedCapabilities:amqp.multiSymbol(nil), DesiredCapabilities:amqp.multiSymbol(nil), Properties:map[amqp.symbol]interface {}(nil)}
2022/03/30 23:54:23 error scheduling event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "validate-schemas" on the substrate: error submitting execution task for event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "validate-schemas": error sending amqp message for queue "jobs.brigade": unexpected frame: &amqp.performAttach{Name:"mwYiuAiQm58ubLB3bQ77krRjFxXwde2eDQTkYl1RObbAOV6rvi73nw", Handle:0x0, Role:true, SenderSettleMode:(*amqp.SenderSettleMode)(0xc0009a4610), ReceiverSettleMode:(*amqp.ReceiverSettleMode)(0xc0009a4611), Source:(*amqp.source)(nil), Target:(*amqp.target)(0xc00008dd10), Unsettled:amqp.unsettled(nil), IncompleteUnsettled:false, InitialDeliveryCount:0x0, MaxMessageSize:0x0, OfferedCapabilities:amqp.multiSymbol(nil), DesiredCapabilities:amqp.multiSymbol(nil), Properties:map[amqp.symbol]interface {}(nil)}
2022/03/30 23:54:23 error scheduling event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "test-unit-go" on the substrate: error submitting execution task for event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "test-unit-go": error sending amqp message for queue "jobs.brigade": unexpected frame: &amqp.performAttach{Name:"mwYiuAiQm58ubLB3bQ77krRjFxXwde2eDQTkYl1RObbAOV6rvi73nw", Handle:0x0, Role:true, SenderSettleMode:(*amqp.SenderSettleMode)(0xc0009a4610), ReceiverSettleMode:(*amqp.ReceiverSettleMode)(0xc0009a4611), Source:(*amqp.source)(nil), Target:(*amqp.target)(0xc00008dd10), Unsettled:amqp.unsettled(nil), IncompleteUnsettled:false, InitialDeliveryCount:0x0, MaxMessageSize:0x0, OfferedCapabilities:amqp.multiSymbol(nil), DesiredCapabilities:amqp.multiSymbol(nil), Properties:map[amqp.symbol]interface {}(nil)}
2022/03/30 23:54:24 http: panic serving 10.240.0.104:39214: runtime error: invalid memory address or nil pointer dereference
goroutine 128736 [running]:
net/http.(*conn).serve.func1()
    /usr/local/go/src/net/http/server.go:1825 +0xbf
panic({0x1704b20, 0x2788530})
    /usr/local/go/src/runtime/panic.go:844 +0x258
github.com/Azure/go-amqp.(*Sender).Close(0xc000952660?, {0x1c0ea80?, 0xc000114008?})
    /go/pkg/mod/github.com/!azure/go-amqp@v0.13.1/client.go:458 +0x1e
github.com/brigadecore/brigade/v2/apiserver/internal/lib/queue/amqp.(*writerFactory).NewWriter.func1()
    /src/v2/apiserver/internal/lib/queue/amqp/writer.go:109 +0x79
github.com/brigadecore/brigade-foundations/retries.ManageRetries({0x1c0ea80, 0xc000114000}, {0x191f5b3, 0xd}, 0xa, 0x0?, 0xc0007e00c0)
    /go/pkg/mod/github.com/brigadecore/brigade-foundations@v0.3.0/retries/retries.go:50 +0x79
github.com/brigadecore/brigade/v2/apiserver/internal/lib/queue/amqp.(*writerFactory).NewWriter(0xc00008d950, {0xc00059c6d0, 0xc})
    /src/v2/apiserver/internal/lib/queue/amqp/writer.go:100 +0x24a
github.com/brigadecore/brigade/v2/apiserver/internal/api/kubernetes.(*substrate).ScheduleJob(_, {_, _}, {{{0xc00059c5f6, 0x7}, 0xc0009a8f60}, {0xc000a43fa0, 0x1f}, {{0xc0002e8bd0, 0x2, ...}, ...}, ...}, ...)
    /src/v2/apiserver/internal/api/kubernetes/substrate.go:569 +0xec
github.com/brigadecore/brigade/v2/apiserver/internal/api.(*jobsService).Create(0xc00008dea0, {0x1c0eaf0, 0xc0003e9a70}, {0xc00047c7e0, _}, {{0xc000873828, 0x7}, 0xc0009a88d0, {{{{...}, {...}, ...}, ...}, ...}, ...})
    /src/v2/apiserver/internal/api/jobs.go:504 +0xeaf
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*JobsEndpoints).create.func1()
    /src/v2/apiserver/internal/api/rest/jobs_endpoints.go:76 +0xef
github.com/brigadecore/brigade/v2/apiserver/internal/lib/restmachinery.ServeRequest({{0x1c0e0e8, 0xc0006f2ee0}, 0xc0009e4000, {0x1c0ecb0, 0xc00003a7a0}, {0x16ce580, 0xc00051d040}, 0xc00003ba00, 0xc9})
    /src/v2/apiserver/internal/lib/restmachinery/endpoints.go:123 +0xb6
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*JobsEndpoints).create(0xc0000aba80, {0x1c0e0e8, 0xc0006f2ee0}, 0xc0009e4000)
    /src/v2/apiserver/internal/api/rest/jobs_endpoints.go:69 +0x194
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*tokenAuthFilter).Decorate.func1({0x1c0e0e8, 0xc0006f2ee0}, 0xc00096ef00)
    /src/v2/apiserver/internal/api/rest/token_auth_filter.go:154 +0xa6b
net/http.HandlerFunc.ServeHTTP(0xc00096ed00?, {0x1c0e0e8?, 0xc0006f2ee0?}, 0xb50000c0007cd740?)
    /usr/local/go/src/net/http/server.go:2084 +0x2f
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0000c5200, {0x1c0e0e8, 0xc0006f2ee0}, 0xc00096ec00)
    /go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210 +0x1cf
github.com/rs/cors.(*Cors).Handler.func1({0x1c0e0e8, 0xc0006f2ee0}, 0xc00096ec00)
    /go/pkg/mod/github.com/rs/cors@v1.7.0/cors.go:219 +0x1bd
net/http.HandlerFunc.ServeHTTP(0x40d687?, {0x1c0e0e8?, 0xc0006f2ee0?}, 0x4f1001?)
    /usr/local/go/src/net/http/server.go:2084 +0x2f
github.com/brigadecore/brigade/v2/apiserver/internal/lib/restmachinery.(*requestContextFilter).Decorate.func1({0x1c0e0e8, 0xc0006f2ee0}, 0xc00096eb00)
    /src/v2/apiserver/internal/lib/restmachinery/request_context_filter.go:38 +0x288
net/http.HandlerFunc.ServeHTTP(0x0?, {0x1c0e0e8?, 0xc0006f2ee0?}, 0x40d325?)
    /usr/local/go/src/net/http/server.go:2084 +0x2f
net/http.serverHandler.ServeHTTP({0x1c024f8?}, {0x1c0e0e8, 0xc0006f2ee0}, 0xc00096eb00)
    /usr/local/go/src/net/http/server.go:2916 +0x43b
net/http.(*conn).serve(0xc00040c1e0, {0x1c0eaf0, 0xc000263e30})
    /usr/local/go/src/net/http/server.go:1966 +0x5d7
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:3071 +0x4db
2022/03/30 23:54:27 http: panic serving 10.240.0.104:39216: runtime error: invalid memory address or nil pointer dereference
goroutine 128753 [running]:
net/http.(*conn).serve.func1()
    /usr/local/go/src/net/http/server.go:1825 +0xbf
panic({0x1704b20, 0x2788530})
    /usr/local/go/src/runtime/panic.go:844 +0x258
github.com/Azure/go-amqp.(*Session).Close(0x0, {0x1c0ea80, 0xc000114008})
    /go/pkg/mod/github.com/!azure/go-amqp@v0.13.1/client.go:256 +0x30
github.com/brigadecore/brigade/v2/internal/amqp.(*session).Close(0xc0006b4cc0?, {0x1c0ea80?, 0xc000114008?})
    /src/v2/internal/amqp/session.go:38 +0x26
github.com/brigadecore/brigade/v2/apiserver/internal/lib/queue/amqp.(*writerFactory).NewWriter.func1()
    /src/v2/apiserver/internal/lib/queue/amqp/writer.go:112 +0xbc
github.com/brigadecore/brigade-foundations/retries.ManageRetries({0x1c0ea80, 0xc000114000}, {0x191f5b3, 0xd}, 0xa, 0x0?, 0xc0007440c0)
    /go/pkg/mod/github.com/brigadecore/brigade-foundations@v0.3.0/retries/retries.go:50 +0x79
github.com/brigadecore/brigade/v2/apiserver/internal/lib/queue/amqp.(*writerFactory).NewWriter(0xc00008d950, {0xc0009a43c0, 0xc})
    /src/v2/apiserver/internal/lib/queue/amqp/writer.go:100 +0x24a
github.com/brigadecore/brigade/v2/apiserver/internal/api/kubernetes.(*substrate).ScheduleJob(_, {_, _}, {{{0xc000109502, 0x7}, 0xc0005a6210}, {0xc000a9c920, 0x1f}, {{0xc0001e7500, 0x2, ...}, ...}, ...}, ...)
    /src/v2/apiserver/internal/api/kubernetes/substrate.go:569 +0xec
github.com/brigadecore/brigade/v2/apiserver/internal/api.(*jobsService).Create(0xc00008dea0, {0x1c0eaf0, 0xc00061c210}, {0xc00047c600, _}, {{0xc00052d598, 0x7}, 0xc000a73668, {{{{...}, {...}, ...}, ...}, ...}, ...})
    /src/v2/apiserver/internal/api/jobs.go:504 +0xeaf
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*JobsEndpoints).create.func1()
    /src/v2/apiserver/internal/api/rest/jobs_endpoints.go:76 +0xef
github.com/brigadecore/brigade/v2/apiserver/internal/lib/restmachinery.ServeRequest({{0x1c0e0e8, 0xc0006f2d20}, 0xc000b17c00, {0x1c0ecb0, 0xc00003a7a0}, {0x16ce580, 0xc000110a90}, 0xc0008edf40, 0xc9})
    /src/v2/apiserver/internal/lib/restmachinery/endpoints.go:123 +0xb6
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*JobsEndpoints).create(0xc0000aba80, {0x1c0e0e8, 0xc0006f2d20}, 0xc000b17c00)
    /src/v2/apiserver/internal/api/rest/jobs_endpoints.go:69 +0x194
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*tokenAuthFilter).Decorate.func1({0x1c0e0e8, 0xc0006f2d20}, 0xc00096e600)
    /src/v2/apiserver/internal/api/rest/token_auth_filter.go:154 +0xa6b
net/http.HandlerFunc.ServeHTTP(0xc00096e500?, {0x1c0e0e8?, 0xc0006f2d20?}, 0x190000c0007cd3c0?)
    /usr/local/go/src/net/http/server.go:2084 +0x2f
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0000c5200, {0x1c0e0e8, 0xc0006f2d20}, 0xc00096e400)
    /go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210 +0x1cf
github.com/rs/cors.(*Cors).Handler.func1({0x1c0e0e8, 0xc0006f2d20}, 0xc00096e400)
    /go/pkg/mod/github.com/rs/cors@v1.7.0/cors.go:219 +0x1bd
net/http.HandlerFunc.ServeHTTP(0x40d687?, {0x1c0e0e8?, 0xc0006f2d20?}, 0x4f1001?)
    /usr/local/go/src/net/http/server.go:2084 +0x2f
github.com/brigadecore/brigade/v2/apiserver/internal/lib/restmachinery.(*requestContextFilter).Decorate.func1({0x1c0e0e8, 0xc0006f2d20}, 0xc00096e300)
    /src/v2/apiserver/internal/lib/restmachinery/request_context_filter.go:38 +0x288
net/http.HandlerFunc.ServeHTTP(0x0?, {0x1c0e0e8?, 0xc0006f2d20?}, 0x40d325?)
    /usr/local/go/src/net/http/server.go:2084 +0x2f
net/http.serverHandler.ServeHTTP({0x1c024f8?}, {0x1c0e0e8, 0xc0006f2d20}, 0xc00096e300)
    /usr/local/go/src/net/http/server.go:2916 +0x43b
net/http.(*conn).serve(0xc00040c320, {0x1c0eaf0, 0xc000263e30})
    /usr/local/go/src/net/http/server.go:1966 +0x5d7
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:3071 +0x4db

On the surface, this looks like the job failed to be placed on the queue, for reasons I cannot yet diagnose.
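For context on where those panics come from: both stack traces end in the writer factory's retry callback (writer.go:109 and writer.go:112) closing an AMQP sender or session. The snippet below is only my sketch of that general pattern, with made-up names, not the actual writer.go code; it illustrates how a retry loop that tears down partially created resources can hit a nil pointer dereference inside go-amqp's (*Sender).Close or (*Session).Close when the attach fails with an error like the "unexpected frame" above.

package sketch

import (
	"context"
	"fmt"

	amqp "github.com/Azure/go-amqp"
)

// newQueueSender is a hypothetical illustration (names are mine) of the
// retry-with-cleanup pattern the stack traces point at: the writer factory's
// callback, wrapped in retries.ManageRetries, creating an AMQP session and
// sender and tearing them down before the next attempt. If NewSession or
// NewSender fails, e.g. with the "unexpected frame" error above, the
// corresponding variable is still nil, and the cleanup calls below panic
// inside go-amqp, just like (*Sender).Close at client.go:458 and
// (*Session).Close at client.go:256 do in the traces.
func newQueueSender(ctx context.Context, client *amqp.Client, queue string) (*amqp.Sender, error) {
	var (
		session *amqp.Session
		sender  *amqp.Sender
		err     error
	)
	for attempt := 0; attempt < 10; attempt++ { // backoff omitted for brevity
		if session, err = client.NewSession(); err == nil {
			if sender, err = session.NewSender(
				amqp.LinkTargetAddress(queue),
			); err == nil {
				return sender, nil
			}
		}
		// Clean up before retrying. When the attach failed, sender (and
		// possibly session) is nil here, so these calls dereference a nil
		// receiver deep inside go-amqp and the whole request panics.
		sender.Close(ctx)
		session.Close(ctx)
	}
	return nil, fmt.Errorf("error creating sender for queue %q: %w", queue, err)
}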

However, logs for that job do exist. The job ran. In fact, it executed hundreds of unit tests before it was forcefully terminated. This means the job did make it onto the queue after all... and this explains a lot...

After the worker failed (or so it thought) to create a new job, it exited non-zero. The observer correctly recorded this as a worker failure. Eventual worker cleanup then forcefully killed the running job, which had in fact been created.

There's more to investigate here, but the fundamental question is probably how we ended up with an error that makes it appear as though the job never made it onto the queue even though it very clearly did.
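One hedged hypothesis, sketched below with go-amqp calls (the function is hypothetical, not our actual writer): a client-side send error is not proof that the broker never received the message. If the transfer reaches jobs.brigade but the link then fails, Send returns an error, the API server reports an internal server error, and the worker concludes job creation failed, even though the scheduler can still pick the message up and run the job, which matches what I observed.

package sketch

import (
	"context"
	"fmt"

	amqp "github.com/Azure/go-amqp"
)

// writeJob is hypothetical, not our actual writer. The point is the ambiguity
// of the error path: by the time Send returns an error, the broker may have
// already accepted the transfer, so reporting a hard failure to the caller
// can be a false negative. From the API server's point of view the job was
// never enqueued; from the scheduler's point of view there is a perfectly
// good message on "jobs.brigade" waiting to be run.
func writeJob(ctx context.Context, sender *amqp.Sender, body []byte) error {
	if err := sender.Send(ctx, amqp.NewMessage(body)); err != nil {
		// Outcome unknown: the link may have broken only after the broker
		// accepted the message. Surfacing this as a failure is what makes
		// an enqueued job look like it never made it onto the queue.
		return fmt.Errorf("error sending amqp message: %w", err)
	}
	return nil
}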

krancour commented 2 years ago

A closer look at the logs reveals that three jobs appeared not to have made it onto the queue, although they did. Two of them happened to succeed before the observer cleaned up the dead worker. The one of the three that was still running was aborted at the time of cleanup.