Kong / lua-resty-dns-client

Lua DNS client, load balancer, and utility library
http://kong.github.io/lua-resty-dns-client/topics/README.md.html
Apache License 2.0

fix(base) reduce the amount of timers used #102

Closed kikito closed 4 years ago

kikito commented 4 years ago

The current implementation creates one recurring timer per balancer.

This is ok in many circumstances, but can exhaust nginx's allocated resources.

This PR replaces them with a single recurring timer that checks all balancers sequentially. When a host needs updating, that work is delegated to an ephemeral (non-recurring) timer.

This way only one recurring timer is ever created; ephemeral timers are created only when hosts need to be updated, and are released afterwards.
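A minimal sketch of the pattern (not the actual PR code; `balancers`, `host.expired`, and `host:queryDns` are stand-ins for the real internals):

```lua
local balancers = {}  -- all registered balancers

local function update_host(premature, host)
  if premature then return end
  host:queryDns()  -- potentially slow; runs in its own ephemeral timer
end

local function check_all(premature)
  if premature then return end
  for _, balancer in pairs(balancers) do
    for _, host in ipairs(balancer.hosts) do
      if host.expired then
        -- delegate the slow work; this loop stays non-blocking
        assert(ngx.timer.at(0, update_host, host))
      end
    end
  end
end

-- the one recurring timer for the whole worker
assert(ngx.timer.every(1, check_all))
```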

Tieske commented 4 years ago

why create the ephemeral timer? why not execute in-line?

A more structural comment: we should not be registering balancers, but hosts. The hosts should be put in a binary tree, ordered by TTL, and once per second we should check the tree. With large sets of balancers that will be far more efficient, since we only need to check the few hosts that have expired, instead of looping over all of them (assume 1600 balancers, each with 5 hosts; that is 8000 entries we would iterate over every second).
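A rough sketch of the idea, with a hypothetical `resty.minheap` priority-queue module (any binary-heap implementation ordered by absolute expiry time would do; the assumed API is `new`, `insert(priority, payload)`, `peek`, `pop`):

```lua
local minheap = require "resty.minheap"  -- hypothetical module

local expiry_heap = minheap.new()

local function register_host(host)
  -- priority = when the record expires, payload = the host itself
  expiry_heap:insert(host.expires_at, host)
end

local function check_expired(premature)
  if premature then return end
  local now = ngx.now()
  -- only the hosts at the top of the heap can have expired; with
  -- 1600 balancers x 5 hosts we touch a handful of entries, not 8000
  while true do
    local expires_at, host = expiry_heap:peek()
    if not host or expires_at > now then break end
    expiry_heap:pop()
    host:queryDns()                           -- refresh the record
    expiry_heap:insert(host.expires_at, host) -- re-queue with the new ttl
  end
end

assert(ngx.timer.every(1, check_expired))
```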

kikito commented 4 years ago

> why create the ephemeral timer? why not execute in-line?

My assumption is that host:queryDns is not immediate - that it takes some time. If it takes a hundredth of a second to execute, then with more than 100 hosts needing updates, doing them "in-line" would mean that one execution of the main timer has not finished when the next one starts.

By delegating each call to host:queryDns to an ephemeral timer, the main timer runs pure non-blocking Lua. It can scan a very large number of hosts per second (hundreds of thousands?), so this should avoid the "step over itself" problem.

There can still be some wasteful timers if host:queryDns takes longer than 1 second for a given host: the first main timer run will create one ephemeral timer to queryDns, the second run will create another, and so on, until the first one finishes. But they will eventually clean up.

> A more structural comment: we should not be registering balancers, but hosts. The hosts should be put in a binary tree, ordered by TTL, and once per second we should check the tree. With large sets of balancers that will be far more efficient, since we only need to check the few hosts that have expired, instead of looping over all of them (assume 1600 balancers, each with 5 hosts; that is 8000 entries we would iterate over every second).

My intent for this PR was to fix the issue we are currently having with the client, not to find a general solution.

Tieske commented 4 years ago

> My assumption is that host:queryDns is not immediate - that it takes some time. If it takes a hundredth of a second to execute, then with more than 100 hosts needing updates, doing them "in-line" would mean that one execution of the main timer has not finished when the next one starts.

lua-resty-timer reschedules the next invocation after the work is done, so this would not happen. At least that was my thinking, but that is only the case with the most recent version; the previous version would actually reschedule before starting the work.
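A minimal sketch of the "reschedule after the work is done" behaviour, written directly against ngx.timer.at; `do_work` is a placeholder for the actual update pass:

```lua
local INTERVAL = 1  -- seconds between runs

local function do_work()
  -- refresh whatever needs refreshing; may take longer than INTERVAL
end

local handler
handler = function(premature)
  if premature then return end
  do_work()
  -- schedule the next run only now, so a slow do_work merely delays
  -- the next invocation instead of letting runs pile up on each other
  assert(ngx.timer.at(INTERVAL, handler))
end

assert(ngx.timer.at(INTERVAL, handler))
```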

Anyway, I think this should be closed in favour of a better solution.

kikito commented 4 years ago

Ok, closing for now.

kikito commented 4 years ago

I still think that "a single recurring timer which checks everything and delegates jobs to ephemeral timers" is the way to go here. Optimizing the checks with a tree data structure would certainly help, but if we make the loop fast enough, it might not be needed. "Marking" the hosts which are already being processed, so no wasteful ephemeral timers are created, would also be important in order to make this work for everyone; see the sketch below.
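A sketch of the "marking" idea, assuming a simple `updating` flag on the host object; a host that is already being refreshed is skipped, so a slow queryDns no longer spawns one wasteful timer per tick:

```lua
local function update_host(premature, host)
  if premature then return end
  host:queryDns()
  host.updating = false  -- done; eligible for the next refresh
end

local function check_host(host)
  if host.expired and not host.updating then
    host.updating = true  -- claim the host before the timer even starts
    assert(ngx.timer.at(0, update_host, host))
  end
end
```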

I also forgot to mention one main advantage of this approach over a more "sequential" one: doing things in ephemeral timers allows parallel processing of the "queue", provided that there are workers available.

I would advise trying not to rely on a "global" var (_entities, in my example). I would much rather have that "list of entities" living somewhere in the user's space. But this would need an extra entity, like a "connection". I don't know if such a thing exists in this library; if not, that would be a breaking change.
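A sketch of how that could look, with a hypothetical `Connection` handle owning the entity list instead of a module-level `_entities` table:

```lua
local Connection = {}
Connection.__index = Connection

function Connection.new()
  -- each handle carries its own entity list in user space
  local self = setmetatable({ entities = {} }, Connection)
  assert(ngx.timer.every(1, function(premature)
    if premature then return end
    self:check_all()
  end))
  return self
end

function Connection:register(entity)
  self.entities[#self.entities + 1] = entity
end

function Connection:check_all()
  for _, entity in ipairs(self.entities) do
    -- refresh expired entities here
  end
end

return Connection
```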