Netflix / dynomite

A generic dynamo implementation for different k-v storage engines
Apache License 2.0
4.2k stars 533 forks

Fix crash that occurs when same-DC peer goes down when query in flight #683

smukil closed 5 years ago

smukil commented 5 years ago

We were incorrectly forwarding errors when unable to connect to cross-rack peers under DC_ONE. Under DC_ONE, such errors should just be logged and ignored. This ultimately caused a crash because the forwarded error is serviced before any other replica can respond, which drops the request from the client's out-q and frees it. Once the replicas start responding, their server in-q's still hold a reference to the original request, which the client has already freed. The responses therefore access garbage or repurposed memory, which caused crashes in a couple of places.

This patch just ignores such errors and does not forward them.
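
To illustrate the failure mode and the fix described above, here is a minimal sketch in C. It is a toy model, not Dynomite's actual code: `request_t`, `on_peer_connect_error`, `replica_response_is_safe`, and the `forward_error` flag are all hypothetical names introduced for illustration, and a `freed` flag stands in for the real deallocation so the example stays well-defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of a client request; not Dynomite's real struct. */
typedef struct {
    int  pending_replicas; /* replica in-queues that still reference it */
    bool freed;            /* true once the client side has released it */
} request_t;

/* Handle a failed connection to a peer while serving under DC_ONE.
 * forward_error == true models the old behavior, false the patched one.
 * Returns true if the error was forwarded to the client. */
bool on_peer_connect_error(request_t *req, const char *peer, bool forward_error)
{
    assert(req != NULL && peer != NULL);

    if (!forward_error) {
        /* Patched behavior: log the error and leave the request alone. */
        fprintf(stderr, "DC_ONE: cannot reach peer %s, ignoring\n", peer);
        return false;
    }

    /* Old behavior: the error is serviced first, so the client drops the
     * request and releases it while replicas still reference it. */
    req->freed = true;
    return true;
}

/* A replica response may only touch the request if it is still alive. */
bool replica_response_is_safe(const request_t *req)
{
    return !req->freed;
}
```

With `forward_error` set to `false`, the request survives the connection error and later replica responses remain safe to process. With `forward_error` set to `true`, the model reproduces the hazard: the request is marked released while `pending_replicas` is still non-zero, so any later response would touch freed memory.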

smukil commented 5 years ago

Thanks to @shailesh33 for help with debugging.