Hi, it's me again trying to improve route update performance for our absurd number of routes. 😁
There are two major changes here:
The first is adding "filtering" so that only Consul Health Checks with the applicable Service Tag Prefix (and serf/maintenance checks) are considered during the rest of the process. In our Production environments this brings down the number of checks that Fabio has to look at from ~40k down to ~12k.
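
As a rough illustration of the filtering idea (not the code in this PR), the sketch below keeps only checks that belong to a service carrying the tag prefix, plus the node-level serf and maintenance checks. The `Check` struct and the `keepCheck` and `filterChecks` names are hypothetical stand-ins. The check IDs `serfHealth`, `_node_maintenance` and `_service_maintenance:` are the ones Consul uses, and `urlprefix-` is assumed as the tag prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// Check is a hypothetical stand-in for the few health check fields used
// here; it is not Fabio's or Consul's actual type.
type Check struct {
	Node        string
	CheckID     string
	ServiceID   string
	ServiceTags []string
	Status      string
}

// keepCheck reports whether a check matters for routing: serf and
// maintenance checks always do, other checks only when their service
// carries a tag starting with tagPrefix.
func keepCheck(c Check, tagPrefix string) bool {
	if c.CheckID == "serfHealth" ||
		c.CheckID == "_node_maintenance" ||
		strings.HasPrefix(c.CheckID, "_service_maintenance:") {
		return true
	}
	for _, tag := range c.ServiceTags {
		if strings.HasPrefix(tag, tagPrefix) {
			return true
		}
	}
	return false
}

// filterChecks returns the subset of checks that keepCheck accepts.
func filterChecks(checks []Check, tagPrefix string) []Check {
	kept := make([]Check, 0, len(checks))
	for _, c := range checks {
		if keepCheck(c, tagPrefix) {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	checks := []Check{
		{Node: "n1", CheckID: "serfHealth", Status: "passing"},
		{Node: "n1", CheckID: "service:web", ServiceID: "web", ServiceTags: []string{"urlprefix-/web"}, Status: "passing"},
		{Node: "n1", CheckID: "service:db", ServiceID: "db", ServiceTags: []string{"primary"}, Status: "passing"},
	}
	kept := filterChecks(checks, "urlprefix-")
	fmt.Printf("kept %d of %d checks\n", len(kept), len(checks))
}
```

Filtering up front means every later step iterates over the smaller slice, which is where a drop such as ~40k to ~12k checks pays off.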
The second change is simplifying the `passingServices` function from an $O(n^5)$ process to $O(n^2)$. Each of the "helper" functions (`countChecks`, `isAgentCritical`, `isNodeInMaintenance`, `isServiceInMaintenance`) ran its own loop over every Consul Health Check, which added up in time very quickly.
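
To make the cost of the repeated loops concrete, here is a minimal sketch of the general technique: record every disqualifying condition in one pass, then answer each question with a map lookup instead of another scan. This is a map-based variant for illustration only, not the implementation in this PR. The `Check` and `nodeService` types, the `passing` function and its simplified rule (a service passes when neither its node nor the service itself has a critical or maintenance check) are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Check is a hypothetical stand-in for a Consul health check.
type Check struct {
	Node, CheckID, ServiceID, Status string
}

// nodeService identifies one service instance on one node.
type nodeService struct{ node, service string }

// passing returns the service instances with no disqualifying check,
// in first-seen order. It walks the checks twice, never once per helper.
func passing(checks []Check) []nodeService {
	badNode := map[string]bool{}
	badService := map[nodeService]bool{}

	// Pass 1: note critical agents, maintenance modes and failing checks.
	for _, c := range checks {
		key := nodeService{c.Node, c.ServiceID}
		switch {
		case c.CheckID == "serfHealth" && c.Status == "critical":
			badNode[c.Node] = true
		case c.CheckID == "_node_maintenance":
			badNode[c.Node] = true
		case strings.HasPrefix(c.CheckID, "_service_maintenance:"):
			badService[key] = true
		case c.ServiceID != "" && c.Status == "critical":
			badService[key] = true
		}
	}

	// Pass 2: emit each service instance once if nothing disqualified it.
	seen := map[nodeService]bool{}
	var out []nodeService
	for _, c := range checks {
		key := nodeService{c.Node, c.ServiceID}
		if c.ServiceID == "" || seen[key] {
			continue
		}
		seen[key] = true
		if !badNode[c.Node] && !badService[key] {
			out = append(out, key)
		}
	}
	return out
}

func main() {
	checks := []Check{
		{"n1", "serfHealth", "", "passing"},
		{"n1", "service:web", "web", "passing"},
		{"n1", "service:db", "db", "critical"},
		{"n2", "serfHealth", "", "critical"},
		{"n2", "service:web", "web", "passing"},
	}
	fmt.Println(passing(checks))
}
```

The point of the sketch is the shape of the work: when each helper rescans the full list for every check, the per-check cost grows with the list length, whereas precomputed lookups keep it constant.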
Both of these changes drastically reduced the amount of time spent processing responses from Consul. In a test using real Consul data but not serving traffic we saw a reduction from around 90 seconds to less than 10 seconds. We've been testing Fabio with these changes in a non-production environment for about a week now, and have not detected any issues.
(The increase in `makeConfig` shown on the graph is due to how I measured it: the measurement includes time spent waiting to write on the channel back to the main Fabio goroutine. This wasn't an issue before because processing the Consul data took longer than the time needed to build the actual route table (~10s), but the scales have now flipped and building the route table takes longer than processing the data from Consul.)
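
The measurement effect described above can be reproduced in isolation. The sketch below is an assumption about the general pattern, not Fabio's code: `timedSend` is a hypothetical helper that times the build step and the blocking channel send separately, so a slow consumer shows up as send wait rather than as build time.

```go
package main

import (
	"fmt"
	"time"
)

// timedSend runs build, sends the result on out, and reports how long
// each step took. The name and signature are hypothetical.
func timedSend(build func() string, out chan<- string) (buildTime, sendWait time.Duration) {
	start := time.Now()
	result := build()
	buildTime = time.Since(start)

	start = time.Now()
	out <- result // blocks until the receiver is ready to read
	sendWait = time.Since(start)
	return buildTime, sendWait
}

func main() {
	out := make(chan string) // unbuffered: the send waits for the consumer
	done := make(chan struct{})

	go func() {
		defer close(done)
		// Simulate a consumer that is still busy with the previous update.
		time.Sleep(50 * time.Millisecond)
		<-out
	}()

	buildTime, sendWait := timedSend(func() string { return "config" }, out)
	<-done
	fmt.Printf("build: %v, waiting on channel: %v\n", buildTime, sendWait)
}
```

Timing the two steps separately avoids attributing the consumer's work to the producer.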
I also had profiling running during my tests. Here's stock Fabio:
![Stock Fabio CPU profile](https://user-images.githubusercontent.com/82290/227013546-261791b4-1ea6-44df-882b-59bb3f193a47.png)
After adding the Check "filtering":
![Consul Health Check filtering CPU profile](https://user-images.githubusercontent.com/82290/227013754-c8182057-cc23-497c-8bcd-ff72b4308672.png)
After simplifying `passingServices`:
![Filtering plus simplified passingServices](https://user-images.githubusercontent.com/82290/227013671-3868501a-2105-42c1-8334-a3ccd1c71378.png)
I'm very interested in any feedback about these changes and will do what I can to help satisfy any concerns. Thanks!