TykTechnologies / tyk

Tyk Open Source API Gateway written in Go, supporting REST, GraphQL, TCP and gRPC protocols

Uptime tests: `failure_trigger_sample_size` not honoured #2698

Closed alephnull closed 4 years ago

alephnull commented 4 years ago

Branch/Environment/Version: Gateway v2.9.1, Dashboard v1.9.1

Describe the bug: According to the documentation, `failure_trigger_sample_size` defines the number of failures to wait for before a HostDown event is triggered.

However, the HostDown event is not triggered even if the number of failures is exceeded.

Reproduction steps

  1. Define an API that returns a failure or times out.
  2. Define uptime checks as follows:
    "uptime_tests": {
      "disable": false,
      "config": {
        "failure_trigger_sample_size": 4,
        "time_wait": 300,
        "checker_pool_size": 50
      }
    }
  3. In the gateway logs you should see messages of the form: [HOST CHECKER] [HOST DOWN BUT NOT REACHED LIMIT]: <url>

Actual behavior: The HostDown event is never triggered.

Expected behavior: The HostDown event should be triggered when the number of failures set in `failure_trigger_sample_size` is exceeded.

Logs (debug mode or log file):

time="Nov 26 17:45:14" level=warning msg="[HOST CHECKER] [HOST DOWN BUT NOT REACHED LIMIT]: http://localhost:8181/status/400"
time="Nov 26 17:46:20" level=warning msg="[HOST CHECKER] [HOST DOWN BUT NOT REACHED LIMIT]: http://localhost:8181/status/400"
time="Nov 26 17:47:22" level=warning msg="[HOST CHECKER] [HOST DOWN BUT NOT REACHED LIMIT]: http://localhost:8181/status/400"
time="Nov 26 17:48:36" level=warning msg="[HOST CHECKER] [HOST DOWN BUT NOT REACHED LIMIT]: http://localhost:8181/status/400"
time="Nov 26 17:49:49" level=warning msg="[HOST CHECKER] [HOST DOWN BUT NOT REACHED LIMIT]: http://localhost:8181/status/400"
time="Nov 26 17:50:50" level=warning msg="[HOST CHECKER] [HOST DOWN BUT NOT REACHED LIMIT]: http://localhost:8181/status/400"
time="Nov 26 17:52:02" level=warning msg="[HOST CHECKER] [HOST DOWN BUT NOT REACHED LIMIT]: http://localhost:8181/status/400"

Additional context: I suspect the problem is in the check `if count, found := h.sampleCache.Get(failedHost.CheckURL); found {` in `HostReporter`. `h.sampleCache` appears to be local to the goroutine, so each call reads the failure count from a different goroutine's cache, which would lead to the behaviour seen.

maciejwojciechowski commented 4 years ago

verified