While rolling xdsrelay for. t1 services, we discovered that there are some edge cases during which send happens on closed channel.
| Nov 5, 2020 @ 11:16:55.595 | child exited
| Nov 5, 2020 @ 11:16:55.595 | panic stacktrace
| Nov 5, 2020 @ 11:16:55.595 | child process exited with error
| Nov 5, 2020 @ 11:16:55.595 | error waiting for child process to exit
| Nov 5, 2020 @ 11:16:55.595 | wrap exited
| Nov 5, 2020 @ 11:16:55.592 | ignoring signal: SIGCHLD
| Nov 5, 2020 @ 11:16:55.576 | -
| Nov 5, 2020 @ 11:16:55.576 | /code/xds-relay-private/xds-relay/internal/app/transport/watchv3.go:34 +0xd3
| Nov 5, 2020 @ 11:16:55.576 | github.com/envoyproxy/xds-relay/internal/app/transport.(*watchV3).Send(0xc003e9c7a8, 0x1034080, 0xc0023f5e80, 0x1034080)
| Nov 5, 2020 @ 11:16:55.576 | /code/xds-relay-private/xds-relay/internal/app/orchestrator/downstream.go:113 +0x90
| Nov 5, 2020 @ 11:16:55.576 | panic: send on closed channel
| Nov 5, 2020 @ 11:16:55.576 | /code/xds-relay-private/xds-relay/internal/app/orchestrator/orchestrator.go:199 +0x11a6
| Nov 5, 2020 @ 11:16:55.576 | goroutine 11600 [running]:
| Nov 5, 2020 @ 11:16:55.576 | github.com/envoyproxy/xds-relay/internal/app/orchestrator.(*responseChannel).addResponse(0xc003129000, 0x1034080, 0xc0023f5e80, 0x0, 0x0)
| Nov 5, 2020 @ 11:16:55.576 | github.com/envoyproxy/xds-relay/internal/app/orchestrator.(*orchestrator).CreateWatch.func1(0xc003129000, 0xc002e7ff80, 0xc000481200, 0xc003b66a00, 0x17)
| Nov 5, 2020 @ 11:16:55.576 | detected panic prefix
| Nov 5, 2020 @ 11:16:55.576 | created by github.com/envoyproxy/xds-relay/internal/app/orchestrator.(*orchestrator).CreateWatch
| Nov 5, 2020 @ 11:16:55.576 | /code/xds-relay-private/xds-relay/internal/app/orchestrator/orchestrator.go:200 +0x45
Upon investigation, we found that due to the highly ephemeral and concurrent nature of sidecar watches, its possible that cancelWatch is called while send is happening. We already have a waitGroup based locking, but its not providing protection. The lock should be as close as possible to the critical section.
I'm proposing that we depend on a mutex to coordinate between channel close and send. Go does not provide any guarantees around it[ref1, ref2].
While rolling xdsrelay for. t1 services, we discovered that there are some edge cases during which send happens on closed channel.
Upon investigation, we found that due to the highly ephemeral and concurrent nature of sidecar watches, its possible that
cancelWatch
is called while send is happening. We already have awaitGroup
based locking, but its not providing protection. The lock should be as close as possible to the critical section.I'm proposing that we depend on a mutex to coordinate between channel close and send. Go does not provide any guarantees around it[ref1, ref2].
Signed-off-by: Jyoti Mahapatra jmahapatra@lyft.com