attestantio / vouch

[Bug] Segmentation fault during auction / SIGSEGV #196

Open amirderakh opened 7 months ago

amirderakh commented 7 months ago

In vouch 1.8.0 and 1.8.1 I have been observing this error several times in the last few days. It happens during a block auction. Luckily, the relay still publishes the block while vouch is crashing.

This log is from vouch-1.8.1-linux-amd64; the connected CLs are Nimbus and Lighthouse. Rebooting the machine had no effect. Update: The error seems to appear only when unblind-from-all-relays is true. The log is in reverse order, so the last message comes first. After it, vouch restarts:

/home/runner/work/vouch/vouch/services/beaconblockproposer/standard/propose.go:622 +0xad
created by github.com/attestantio/vouch/services/beaconblockproposer/standard.(*Service).unblindBlock in goroutine 171275
/home/runner/work/vouch/vouch/services/beaconblockproposer/standard/propose.go:633 +0x356
github.com/attestantio/vouch/services/beaconblockproposer/standard.(*Service).unblindBlock.func1({0x1b28690, 0xc001826540}, {0x7fd0ec203c98, 0xc002618b00}, 0xc02327a0c0)
/home/runner/go/pkg/mod/github.com/attestantio/go-builder-client@v0.4.3/http/unblindproposal.go:76 +0x656
github.com/attestantio/go-builder-client/http.(*Service).UnblindProposal(0xc002618b00, {0x1b28690, 0xc001826540}, 0xc02b7e8060)
/home/runner/go/pkg/mod/github.com/attestantio/go-builder-client@v0.4.3/http/unblindproposal.go:257 +0x214
github.com/attestantio/go-builder-client/http.(*Service).unblindDenebProposal(0xc002618b00, {0x1b28690, 0xc01d1a9170}, {0x1582b09?, 0xf?, 0x251cde0?}, 0xc003b76000)
/opt/hostedtoolcache/go/1.22.1/x64/src/encoding/json/stream.go:63 +0x75
encoding/json.(*Decoder).Decode(0xc00111e280, {0x1319220, 0xc020e1c0b8})
/opt/hostedtoolcache/go/1.22.1/x64/src/encoding/json/stream.go:140 +0x85
encoding/json.(*Decoder).readValue(0xc00111e280)
/opt/hostedtoolcache/go/1.22.1/x64/src/encoding/json/stream.go:165 +0x188
encoding/json.(*Decoder).refill(0xc00111e280)
/opt/hostedtoolcache/go/1.22.1/x64/src/io/io.go:628 +0x28
io.(*teeReader).Read(0xc0008880a0, {0xc003c54400, 0xc0009df858?, 0x200})
/opt/hostedtoolcache/go/1.22.1/x64/src/runtime/panic.go:770 +0x132
panic({0x13c3560?, 0x247ec30?})
/home/runner/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.21.0/trace/span.go:426 +0xa82
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0xc004077e00, {0x0, 0x0, 0xc031a89340?})
/home/runner/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.21.0/trace/span.go:388 +0x25
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.deferwrap1()
goroutine 172547 [running]:
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4edfc8]
panic: runtime error: invalid memory address or nil pointer dereference
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
{"level":"warn","strategy":"blindedbeaconblockproposal","impl":"first","provider":"xxxxxx","slot":xxxxxx,"error":"failed to request blinded beacon block proposal: failed to call GET endpoint: Get \"http://xxxxx/eth/v1/validator/blinded_blocks/xxxxx\": context canceled","time":"2024-03-28T22:31:49+10:00","message":"Failed to obtain blinded beacon block proposal"}
{"level":"info","service":"blockrelay","impl":"standard","slot":xxxxxxx,"provider":"https://relay.ultrasound.money/","value":"39717939211775936","delta":"0","selected":true,"time":"2024-03-28T22:31:48+10:00","message":"Auction participant"}

mcdee commented 7 months ago

Thank you for reporting this. I have run a few different scenarios but cannot find an obvious path as to why this is failing. My best guess is that you have a particular relay in your list that is returning some sort of non-standard response that is causing the failure.

If possible, could you update your Vouch configuration file with the following:

builderclient:
  log-level: 'trace'

and restart Vouch? This will log the information sent to the builder, and should provide more detail leading up to the failure, telling us which relay is responding with bad data and what data it is returning that causes the problem. If you are uncomfortable sharing the resulting log publicly, I can provide an email address to which you can send it.

amirderakh commented 7 months ago

Hi Jim, thanks for looking into this. However, since setting "unblind-from-all-relays" to false has fixed the problem for now, I would treat this as a lower priority. I should have enabled trace logging immediately. I would rather not risk another crash, as this is running on a production system.

In all instances, the crash occurred at the same point ("addr=0x18 pc=0x4edfc8" / "propose.go:622 +0xad"), so you may at least be able to see the problematic instruction. As you say, it is most likely a relay; alternatively, the onset of the problem coincided with the update to Lighthouse 5.1.2. If I find anything helpful I will post it here. Posting anonymised logs is fine with me.