Closed dcarley closed 1 year ago
@ADD-SP what do ya think?
@flrgh We've run into this type of issue, which is essentially timer abuse, and I think we should fix this issue like https://github.com/Kong/kong/pull/9521 did.
While we will investigate this issue and put some limits on the amount of memory and timers that zipkin consumes, I must point out that this setting is not realistic for production environments:
sample_ratio: 1
The intended usage for the Zipkin plugin is that tracing is done on a sample of all the requests - that is why the default value for `sample_ratio` is `0.001`. It means that 1 out of 1000 requests will produce traces. Setting it to 1 is akin to setting the log level to `debug`: something that would be done on a production system temporarily, in an emergency, in order to debug a critical failure.
I will make sure that we point this out in our docs.
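For reference, a minimal sketch of what this looks like in a declarative config (the collector URL is a placeholder, and the keys assume the standard Zipkin plugin configuration):

```yaml
plugins:
  - name: zipkin
    config:
      http_endpoint: http://zipkin.example.com:9411/api/v2/spans  # placeholder collector URL
      sample_ratio: 0.001  # default: trace 1 in 1000 requests
      # sample_ratio: 1    # trace every request -- comparable to debug-level
      #                    # logging, only appropriate temporarily while
      #                    # diagnosing a critical failure
```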
We use a sampling rate of 100% because we do dynamic sampling in Refinery. We've been using this setup with 2.x in production mostly without problems since we added timeouts in #8735. We are planning to migrate to the OpenTelemetry plugin, with the same sample rate, once we've validated that the upgrade path to 3.x is safe. Which is why I'm most interested in:
However I think there's also an underlying problem where timer performance appears to be worse under certain circumstances in 3.x, and these kinds of issues are likely to recur without some safety guards, which is what I'm most interested in focusing on.
I'm not able to view the issues FTI-4367 and FTI-4269 that were referenced in #9521. Are you able to provide some more information about what originally prompted that change?
Is there anything else that I can do to help?
@dcarley I'm curious in which cases the performance of the timer becomes bad. In 3.x we introduced lua-resty-timer-ng to reduce the overhead of the timer, and it shouldn't get any worse.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We've observed that 3.x performance is worse under load when a plugin makes calls to an external network dependency and those calls take longer than normal.
Are you able to provide any information from FTI-4367, FTI-4269, and FT-3464? Were those issues prompted by timer performance?
Another issue related to timer abuse: https://github.com/Kong/kong/issues/9959
Good, thank you.
Hi @dcarley , could you please try 3.2.2 to check if this issue exists now?
3.2.2 appears to be worse than 3.0.1 for the reproduction case where `rate-limiting` "fails open" when the `zipkin` endpoint is timing out. I've updated the results in the gist here: https://gist.github.com/dcarley/cc3c8959fd8f6a811d0b3c0ddf458a5c#gojira-322
Dear contributor,

We're closing this issue as there hasn't been any update to it for a long time. If the issue is still relevant in the latest version, please feel free to reopen it. We're more than happy to revisit it again. Your contribution is greatly appreciated! Please have a look at our pledge to the community for more information.

Sincerely,
Kong Gateway Team
Is there an existing issue for this?
Kong version (`$ kong version`)
3.0.1 (also tested against 3.1.0)
Current Behavior
(sorry, this is long)
We have a set of integration tests that run against our installation of Kong. Our rate limiting tests have been failing intermittently, with more requests allowed than there should be, when testing an upgrade from Kong 2.8.3 to Kong 3.x (initially 3.0.0 but we've more recently tried 3.0.1 and 3.1.0). When this happens it appears to take a while to recover and sometimes doesn't recover at all. That's not a terribly useful report without a complete reproduction of how we deploy and test Kong though so I've attempted to narrow it down.
One way that I've been able to reproduce the same symptoms is by configuring the Zipkin plugin (which we normally use) with an endpoint that times out (e.g. an unreachable IP address). When generating requests against 3.x this immediately causes more requests to be allowed than there should be. When increasing the request rate it eventually causes other plugins and operations that depend on timers to also fail and not recover:
The significant difference between 2.x and 3.x is that it appears to fail at lower request rates, earlier on, and doesn't recover. I think that rate limiting is acting as an "early warning" of this because delays in running timers mean that usage counters aren't being incremented quickly enough within each "second" bucket.
Expected Behavior
The performance of one plugin and its network dependencies shouldn't adversely affect the performance of other plugins and operations.
I expect that changes like https://github.com/Kong/kong/pull/9538 (rate-limiting) and https://github.com/Kong/kong/pull/9521 (Datadog) will alleviate some of the symptoms of this. I wasn't able to reproduce the same problem with the OpenTelemetry plugin, which already batches+intervals submissions. I suspect that applying the same changes to the Zipkin plugin would help.
However I think there's also an underlying problem where timer performance appears to be worse under certain circumstances in 3.x, and these kinds of issues are likely to recur without some safety guards, which is what I'm most interested in focusing on.
Steps To Reproduce
This is the simplest way that I could reproduce the symptom. I wasn't able to push it hard enough to cause timer errors in Gojira, like we can in Minikube, because Docker appeared to fall over first.
1. Install:
2. Create `egg.yaml`:
3. Create `kong.yaml`:
4. Start the container:
5. Run a load test:
6. Check the results, ideally:
   - `jaggr.out` should show roughly 50 responses with 200 statuses each second
   - `vegeta report vegeta.out` should show a success rate of 25% (50/200)

Anything else?
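For illustration, the reproduction's `kong.yaml` might look roughly like the following. This is a sketch, not the exact file from the gist: the upstream, route, and unreachable IP are placeholders, and the rate limit of 50/second matches the expected results above.

```yaml
_format_version: "3.0"
services:
  - name: example
    url: http://httpbin.org  # placeholder upstream
    routes:
      - name: example
        paths:
          - /
plugins:
  - name: rate-limiting
    config:
      second: 50       # allow ~50 requests/second, so a 200 req/s load test
      policy: local    # should see roughly a 25% success rate
  - name: zipkin
    config:
      http_endpoint: http://10.255.255.1:9411/api/v2/spans  # unreachable IP, so reports time out
      sample_ratio: 1  # trace every request, maximising timer pressure
```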
I've put the results of my tests in a Gist so as not to clutter the issue: https://gist.github.com/dcarley/cc3c8959fd8f6a811d0b3c0ddf458a5c