Shopify / toxiproxy

:alarm_clock: :fire: A TCP proxy to simulate network and system conditions for chaos and resiliency testing
https://github.com/shopify/toxiproxy
MIT License
10.62k stars 445 forks source link

Toxiproxy control port hangs on removing a toxic #558

Open mikael-carlstedt opened 7 months ago

mikael-carlstedt commented 7 months ago

We sometimes observe that Toxiproxy becomes unresponsive when we try to delete a toxic and a server restart is required to recover. This is what it looks like in the server side logs:

Feb 21 09:06:07 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarpvp8i8ksfn6ab1k0","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:170","time":"2024-02-21T09:06:07Z","message":"Acquire locking..."}
Feb 21 09:06:07 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarpvp8i8ksfn6ab1k0","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:174","time":"2024-02-21T09:06:07Z","message":"Getting toxic by name..."}
Feb 21 09:06:07 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarpvp8i8ksfn6ab1k0","component":"ToxicCollection","method":"chainRemoveToxic","toxic":"latency","direction":"downstream","links":["Link[0xc003b11e50] downstream","Link[0xc0017443c0] downstream","Link[0xc00313b770] downstream","Link[0xc003842730] downstream","Link[0xc003059270] downstream","Link[0xc00313a910] downstream","Link[0xc003bf5e50] downstream","Link[0xc003b119a0] downstream","Link[0xc002c78c30] downstream","Link[0xc003059810] downstream","Link[0xc00313b4f0] downstream","Link[0xc0030b36d0] downstream","Link[0xc002f66370] downstream","Link[0xc0002b54f0] downstream","Link[0xc0000b4910] downstream","Link[0xc003842190] downstream","Link[0xc0030b2a00] downstream","Link[0xc003bf4730] downstream","Link[0xc0002225f0] downstream","Link[0xc0003ee5f0] downstream","Link[0xc003843bd0] downstream","Link[0xc003b119f0] downstream","Link[0xc002f67770] downstream","Link[0xc003843720] downstream","Link[0xc000222640] downstream","Link[0xc0028c0c80] downstream","Link[0xc0000b4870] downstream","Link[0xc0022f5f40] downstream","Link[0xc003bf4e10] downstream","Link[0xc002f66190] downstream","Link[0xc00313b310] downstream","Link[0xc002f66a00] downstream","Link[0xc003b116d0] downstream","Link[0xc003bf4910] downstream","Link[0xc003842410] downstream","Link[0xc0038432c0] downstream","Link[0xc0038428c0] downstream","Link[0xc0030b2640] downstream","Link[0xc0030594f0] downstream","Link[0xc002f67ef0] downstream","Link[0xc0022f4b40] downstream","Link[0xc001745d10] downstream","Link[0xc0003eff40] downstream","Link[0xc002c78a50] downstream","Link[0xc0022f4e60] downstream","Link[0xc0002b5a40] downstream","Link[0xc003bf4410] downstream","Link[0xc0002228c0] downstream","Link[0xc003bf5810] downstream","Link[0xc0003ee230] downstream","Link[0xc002f665a0] downstream","Link[0xc003bf4b90] downstream","Link[0xc000223040] downstream","Link[0xc003b11310] downstream","Link[0xc00313bc20] downstream","Link[0xc002f674f0] downstream","Link[0xc002f67d60] downstream","Link[0xc0015fc190] downstream","Link[0xc003059c70] downstream","Link[0xc002740730] downstream","Link[0xc002c78870] downstream","Link[0xc002c785a0] downstream","Link[0xc0000b4410] downstream","Link[0xc0022f4b90] downstream","Link[0xc0028c1b80] downstream","Link[0xc002ae8460] downstream","Link[0xc003bf5ae0] downstream","Link[0xc001745310] downstream","Link[0xc0002b4
dc0] downstream","Link[0xc00313a140] downstream","Link[0xc00313a3c0] downstream","Link[0xc0022f4690] downstream","Link[0xc00313a730] downstream","Link[0xc002f66690] downstream","Link[0xc002c78ff0] downstream","Link[0xc002ae9630] downstream","Link[0xc0000b4ff0] downstream","Link[0xc0015fda40] downstream","Link[0xc0030b3d10] downstream","Link[0xc000222c80] downstream"],"caller":"toxic_collection.go:295","time":"2024-02-21T09:06:07Z","message":"Waiting to update links"}
Feb 21 09:06:07 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarpvp8i8ksfn6ab1k0","component":"ToxicLink","method":"RemoveToxic","toxic":"latency","toxic_type":"latency","toxic_index":1,"link_addr":"0xc000222c80","toxic_stub_addr":"0xc001b0db00","prev_toxic_stub_addr":"0xc003904b80","caller":"link.go:228","time":"2024-02-21T09:06:07Z","message":"Interrupting the previous toxic to update its output"}
...
Feb 21 09:06:07 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarpvp8i8ksfn6ab1k0","component":"ToxicLink","method":"RemoveToxic","toxic":"latency","toxic_type":"latency","toxic_index":1,"link_addr":"0xc003059810","toxic_stub_addr":"0xc000121440","prev_toxic_stub_addr":"0xc000121380","caller":"link.go:228","time":"2024-02-21T09:06:07Z","message":"Interrupting the previous toxic to update its output"}
Feb 21 09:06:32 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"debug","request_id":"cnarpvp8i8ksfn6ab1k0","client":"10.96.162.30:38426","method":"DELETE","url":"/proxies/10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net/toxics/latency","status":503,"size":77,"duration":25000.402663,"handler":"ToxicDelete","caller":"api.go:78","time":"2024-02-21T09:06:32Z"}
Feb 21 09:06:33 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarq698i8ksfn6ab1kg","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:170","time":"2024-02-21T09:06:33Z","message":"Acquire locking..."}
Feb 21 09:06:58 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"debug","request_id":"cnarq698i8ksfn6ab1kg","client":"10.96.162.30:38426","method":"DELETE","url":"/proxies/10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net/toxics/latency","status":503,"size":77,"duration":25000.548837,"handler":"ToxicDelete","caller":"api.go:78","time":"2024-02-21T09:06:58Z"}
Feb 21 09:06:59 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarqcp8i8ksfn6ab1l0","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:170","time":"2024-02-21T09:06:59Z","message":"Acquire locking..."}
Feb 21 09:07:24 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"debug","request_id":"cnarqcp8i8ksfn6ab1l0","client":"10.96.162.30:38426","method":"DELETE","url":"/proxies/10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net/toxics/latency","status":503,"size":77,"duration":25000.688535,"handler":"ToxicDelete","caller":"api.go:78","time":"2024-02-21T09:07:24Z"}
Feb 21 09:07:25 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarqj98i8ksfn6ab1lg","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:170","time":"2024-02-21T09:07:25Z","message":"Acquire locking..."}
Feb 21 09:07:50 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"debug","request_id":"cnarqj98i8ksfn6ab1lg","client":"10.96.162.30:38426","method":"DELETE","url":"/proxies/10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net/toxics/latency","status":503,"size":77,"duration":25000.826157,"handler":"ToxicDelete","caller":"api.go:78","time":"2024-02-21T09:07:50Z"}
Feb 21 09:07:51 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarqpp8i8ksfn6ab1m0","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:170","time":"2024-02-21T09:07:51Z","message":"Acquire locking..."}
Feb 21 09:08:16 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"debug","request_id":"cnarqpp8i8ksfn6ab1m0","client":"10.96.162.30:38426","method":"DELETE","url":"/proxies/10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net/toxics/latency","status":503,"size":77,"duration":25000.353073,"handler":"ToxicDelete","caller":"api.go:78","time":"2024-02-21T09:08:16Z"}
Feb 21 09:08:17 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarr098i8ksfn6ab1mg","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:170","time":"2024-02-21T09:08:17Z","message":"Acquire locking..."}
Feb 21 09:08:42 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"debug","request_id":"cnarr098i8ksfn6ab1mg","client":"10.96.162.30:38426","method":"DELETE","url":"/proxies/10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net/toxics/latency","status":503,"size":77,"duration":25000.462223,"handler":"ToxicDelete","caller":"api.go:78","time":"2024-02-21T09:08:42Z"}
Feb 21 09:08:43 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarr6p8i8ksfn6ab1n0","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:170","time":"2024-02-21T09:08:43Z","message":"Acquire locking..."}
Feb 21 09:09:08 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"debug","request_id":"cnarr6p8i8ksfn6ab1n0","client":"10.96.162.30:38426","method":"DELETE","url":"/proxies/10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net/toxics/latency","status":503,"size":77,"duration":25000.868655,"handler":"ToxicDelete","caller":"api.go:78","time":"2024-02-21T09:09:08Z"}
Feb 21 09:09:09 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarrd98i8ksfn6ab1ng","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:170","time":"2024-02-21T09:09:09Z","message":"Acquire locking..."}
Feb 21 09:09:34 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"debug","request_id":"cnarrd98i8ksfn6ab1ng","client":"10.96.162.30:38426","method":"DELETE","url":"/proxies/10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net/toxics/latency","status":503,"size":77,"duration":25000.544839,"handler":"ToxicDelete","caller":"api.go:78","time":"2024-02-21T09:09:34Z"}
Feb 21 09:09:35 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarrjp8i8ksfn6ab1o0","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:170","time":"2024-02-21T09:09:35Z","message":"Acquire locking..."}
Feb 21 09:10:00 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"debug","request_id":"cnarrjp8i8ksfn6ab1o0","client":"10.96.162.30:38426","method":"DELETE","url":"/proxies/10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net/toxics/latency","status":503,"size":77,"duration":25000.123947,"handler":"ToxicDelete","caller":"api.go:78","time":"2024-02-21T09:10:00Z"}
Feb 21 09:10:01 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarrq98i8ksfn6ab1og","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:170","time":"2024-02-21T09:10:01Z","message":"Acquire locking..."}
Feb 21 09:10:26 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"debug","request_id":"cnarrq98i8ksfn6ab1og","client":"10.96.162.30:38426","method":"DELETE","url":"/proxies/10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net/toxics/latency","status":503,"size":77,"duration":25000.077873,"handler":"ToxicDelete","caller":"api.go:78","time":"2024-02-21T09:10:26Z"}

Compared with a previous successful delete toxic operation, this one is missing the terminating log line:

Feb 21 08:48:08 us1tst-cnxms004.int.clxnetworks.net toxiproxy-server[1104]: {"level":"trace","request_id":"cnarhi18i8ksfn6ab0t0","component":"ToxicCollection","method":"RemoveToxic","toxic":"latency","proxy":"10.96.147.189:9094@us1tst-cnxms004.int.clxnetworks.net","caller":"toxic_collection.go:182","time":"2024-02-21T08:48:08Z","message":"Finished"}

Assuming that is when it releases the mentioned lock, and the reason why the subsequent retries from the client are stuck on acquiring the lock.

The log excerpt has been cleaned from proxy traffic logging.

mikael-carlstedt commented 7 months ago

I can add that the toxicity that we are trying to remove is a 6 s downstream delay on all requests, if that could be a clue to the root cause.

An observation of interest is that the message "Interrupting the previous toxic to update its output" is missing for the nine first links listed with the message "Waiting to update links", which would seem to point at the tenth link never gets completed (it appears to start with the last link and iterate backwards in the list).

mikael-carlstedt commented 7 months ago

Could this be the same issue as described in #427 ?