envoyproxy / xds-relay

Caching, aggregation, and relaying for xDS compliant clients and origin servers
Apache License 2.0
132 stars 29 forks source link

Handle panic #180

Closed jyotimahapatra closed 4 years ago

jyotimahapatra commented 4 years ago

While rolling xdsrelay for. t1 services, we discovered that there are some edge cases during which send happens on closed channel.

  | Nov 5, 2020 @ 11:16:55.595 | child exited

  | Nov 5, 2020 @ 11:16:55.595 | panic stacktrace

  | Nov 5, 2020 @ 11:16:55.595 | child process exited with error

  | Nov 5, 2020 @ 11:16:55.595 | error waiting for child process to exit

  | Nov 5, 2020 @ 11:16:55.595 | wrap exited

  | Nov 5, 2020 @ 11:16:55.592 | ignoring signal: SIGCHLD

  | Nov 5, 2020 @ 11:16:55.576 | -

  | Nov 5, 2020 @ 11:16:55.576 | /code/xds-relay-private/xds-relay/internal/app/transport/watchv3.go:34 +0xd3

  | Nov 5, 2020 @ 11:16:55.576 | github.com/envoyproxy/xds-relay/internal/app/transport.(*watchV3).Send(0xc003e9c7a8, 0x1034080, 0xc0023f5e80, 0x1034080)

  | Nov 5, 2020 @ 11:16:55.576 | /code/xds-relay-private/xds-relay/internal/app/orchestrator/downstream.go:113 +0x90

  | Nov 5, 2020 @ 11:16:55.576 | panic: send on closed channel

  | Nov 5, 2020 @ 11:16:55.576 | /code/xds-relay-private/xds-relay/internal/app/orchestrator/orchestrator.go:199 +0x11a6

  | Nov 5, 2020 @ 11:16:55.576 | goroutine 11600 [running]:

  | Nov 5, 2020 @ 11:16:55.576 | github.com/envoyproxy/xds-relay/internal/app/orchestrator.(*responseChannel).addResponse(0xc003129000, 0x1034080, 0xc0023f5e80, 0x0, 0x0)

  | Nov 5, 2020 @ 11:16:55.576 | github.com/envoyproxy/xds-relay/internal/app/orchestrator.(*orchestrator).CreateWatch.func1(0xc003129000, 0xc002e7ff80, 0xc000481200, 0xc003b66a00, 0x17)

  | Nov 5, 2020 @ 11:16:55.576 | detected panic prefix

  | Nov 5, 2020 @ 11:16:55.576 | created by github.com/envoyproxy/xds-relay/internal/app/orchestrator.(*orchestrator).CreateWatch

  | Nov 5, 2020 @ 11:16:55.576 | /code/xds-relay-private/xds-relay/internal/app/orchestrator/orchestrator.go:200 +0x45

Upon investigation, we found that due to the highly ephemeral and concurrent nature of sidecar watches, its possible that cancelWatch is called while send is happening. We already have a waitGroup based locking, but its not providing protection. The lock should be as close as possible to the critical section.

I'm proposing that we depend on a mutex to coordinate between channel close and send. Go does not provide any guarantees around it[ref1, ref2].

Signed-off-by: Jyoti Mahapatra jmahapatra@lyft.com