cannatag / ldap3

a strictly RFC 4510 conforming LDAP V3 pure Python client. The same codebase works with Python 2, Python 3, PyPy and PyPy3.

timed pooling strategy for geo-redundant setups #965

Open cornelinux opened 3 years ago

cornelinux commented 3 years ago

Imagine a geo-redundant application using one configuration of LDAP connectors. The LDAP server pool might be configured as ['ldaps://server1', 'ldaps://server2', 'ldaps://server3'].

The problem is that each redundant node has the same pool configuration, but each node is located on a different continent. NodeA might be located in Europe, NodeB might be located in North America.

Imagine I use the pooling strategy FIRST. Then I get the following problem:

NodeA might perform great, since it starts by querying ldap-server1, which is also located in Europe. But NodeB, which is located in North America, will also start by querying ldap-server1, which is located in Europe. It would be better if NodeB (located in North America) started with ldap-server2, since that one would also be located in North America.

OK, you could argue that the redundant application should take care of a distinct configuration per node. But wouldn't it be great if ldap3 came with another, more sophisticated pooling strategy like:

Another sophisticated pooling strategy

TIME_WEIGHTED_FIRST: Records the response time of the servers in the pool continuously
                     and always asks the server that is currently the quickest.

We might start with a kind of round robin. Each time a query is sent, the response time is added to calculate an average response time per server. The server pool is ordered by response time. This way, servers that are slow (like servers located on another continent) are queried last, and you will always get the quickest server first.
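Just to illustrate the idea, something like this could already be approximated outside the library today. This is only a rough sketch: the host names, credentials, the anonymous probe and the probe search are placeholders, and there is no TIME_WEIGHTED_FIRST in ldap3 yet.

```python
import time
from ldap3 import Server, ServerPool, Connection, FIRST, BASE

HOSTS = ['ldaps://server1', 'ldaps://server2', 'ldaps://server3']

def average_response_time(host, samples=3):
    """Time a trivial root DSE read a few times and return the mean in seconds."""
    timings = []
    for _ in range(samples):
        conn = Connection(Server(host, connect_timeout=5), auto_bind=True)  # anonymous bind for the probe
        start = time.perf_counter()
        conn.search('', '(objectClass=*)', search_scope=BASE)  # no attributes requested
        timings.append(time.perf_counter() - start)
        conn.unbind()
    return sum(timings) / len(timings)

# Order the hosts by measured latency, fastest first, then keep using the existing FIRST strategy.
ordered = sorted(HOSTS, key=average_response_time)
pool = ServerPool([Server(h) for h in ordered], FIRST, active=True, exhaust=True)
conn = Connection(pool, user='cn=admin,dc=example,dc=com', password='secret', auto_bind=True)
```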

What do you think?

How can I assist?

cannatag commented 3 years ago

Hi Cornelius, this is a great idea! I work on this project just for bug fixing, but I’ll try to find some time for extending the pooling strategy with your suggestion.

The problem is that I don’t have a lab for the project anymore, so if you can help on this, it would be easier to test the strategy.

Bye, Giovanni


cornelinux commented 3 years ago

Hi Giovanni,

that sounds great. I can set up two or three domain controllers in my network and try to slow down some of them to test any code. Does this sound like a plan?

Regards Cornelius

cannatag commented 3 years ago

It could be great, but we have to decide the right strategy for the timing. You said "We might start with a kind of round robin. Each time a query is sent, the response time is added to calculate an average response time per server. The server pool is ordered by response time". But this works only if the queries are always the same, because different queries take different times to execute. The timing can have two meanings: how fast (or slow) the connection is, or how fast (or busy) the server is. In the first case we should measure how long it takes to send the message, in the latter the time of the response.

Or we could measure the time of a heartbeat message sent to all servers (something like an Abandon(0) operation), or the timing of a standard query on all servers, but the response would probably be cached by the server.

What do you think?
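Just to make the distinction concrete, the two timings could be taken separately with plain ldap3 calls. A rough sketch only; the host name and the anonymous probe are placeholders:

```python
import time
from ldap3 import Server, Connection, BASE

server = Server('ldaps://server1', connect_timeout=5)

t0 = time.perf_counter()
conn = Connection(server, auto_bind=True)              # connection/TLS/bind cost: how fast the link is
connect_time = time.perf_counter() - t0

t0 = time.perf_counter()
conn.search('', '(objectClass=*)', search_scope=BASE)  # request/response cost: how busy the server is
search_time = time.perf_counter() - t0
conn.unbind()

print(f'connect: {connect_time:.3f}s  search: {search_time:.3f}s')
```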


zorn96 commented 3 years ago

querying the root DSE (and not asking for attributes) is often referred to as an "LDAP ping" and is used by windows clients to find the fastest/closest controllers in a domain, so that could work as a good heartbeat.

I guess the right approach also depends on what exactly we're measuring for. if the goal is to find the server that's got the shortest RTT on the network, then making a query that's very likely to be cached (like the root DSE) is good because it largely eliminates server load as a factor. but if you also want server processing power and load to be things that get factored into choosing the "fastest" server to some extent (i.e. a more holistic evaluation) then it gets more complicated.

I'd wager that for the average use case, the former is a better approach than the latter, since network factors are more independent of the use case for which the LDAP protocol is being used, whereas the types of queries and stuff like server caching/indexing can heavily impact the performance of different queries on different servers.

if we went the LDAP ping approach, we could make the strategy have a default lifetime for caching the RTT to each server in the pool, and add a constructor parameter for instances of the strategy so a caller can override that lifetime if they know intimate details of their network, like how often nodes on it reassess routes or how often traffic patterns/load change
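something like this rough sketch of the cached ping (the class name, the default lifetime and the anonymous probe are placeholders, not ldap3 API):

```python
import time
from ldap3 import Server, Connection, BASE

class LdapPingCache:
    """Orders a list of hosts by root DSE round-trip time, re-measuring only
    after `lifetime` seconds so the pings don't add overhead to every query."""

    def __init__(self, hosts, lifetime=300):
        self.hosts = list(hosts)
        self.lifetime = lifetime      # seconds before a cached RTT is considered stale
        self._rtts = {}               # host -> (measured_at, rtt_in_seconds)

    def _ping(self, host):
        # "LDAP ping": anonymous read of the root DSE, no attributes requested
        conn = Connection(Server(host, connect_timeout=5), auto_bind=True)
        start = time.perf_counter()
        conn.search('', '(objectClass=*)', search_scope=BASE)
        rtt = time.perf_counter() - start
        conn.unbind()
        return rtt

    def fastest_first(self):
        now = time.monotonic()
        for host in self.hosts:
            cached = self._rtts.get(host)
            if cached is None or now - cached[0] > self.lifetime:
                self._rtts[host] = (now, self._ping(host))
        return sorted(self.hosts, key=lambda h: self._rtts[h][1])
```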

zorn96 commented 3 years ago

@cornelinux if you have access to an azure/aws/gcp setup, you could put controllers in different regions in order to guarantee slowness.

having some servers in the pool use TLS while others use plaintext LDAP is also a good way to be reasonably sure that queries to some (the TLS protected ones) will be slower than others, without needing to change the servers at all. using certs with really big key sizes (4096+ bits) can slow TLS connections even further
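e.g. a lab pool along these lines (host names and credentials are placeholders):

```python
from ldap3 import Server, ServerPool, Connection, FIRST

fast = Server('ldap://dc1.lab.example.com')            # plaintext LDAP, should answer quickest
slow1 = Server('dc2.lab.example.com', use_ssl=True)    # TLS handshake adds latency
slow2 = Server('dc3.lab.example.com', use_ssl=True)    # optionally give this one a 4096+ bit cert

pool = ServerPool([fast, slow1, slow2], FIRST, active=True, exhaust=True)
conn = Connection(pool, user='cn=admin,dc=lab,dc=example,dc=com', password='secret', auto_bind=True)
```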

cornelinux commented 3 years ago


I would also think that taking the server load into account is not relevant. My initial intention was to somehow find out which is the "nearest" server in the list, which usually depends mostly on network latency and RTT. In my case I am doing LDAP searches for one user object, so I would simply have measured this time. But if you are doing a lot of different searches with different contents and result lengths, then the time differs not because of the distance to the server but because of the request. So in this case measuring a heartbeat would make more sense.

However, I am not sure whether such a heartbeat request (aka timing request) would actually add overhead. So maybe this timing request would only be sent on every 5th or 10th search...

zorn96 commented 3 years ago

agreed, so the LDAP Ping approach (using the root DSE) would probably be the best bet for an actual strategy then.

people can always look at or copy-paste the code and then manage multiple connections if they want to take a different approach (e.g. search for a group in order to rule out servers that don't have groups indexed, or use a search for supported SASL mechanisms to measure RTT and also rule out servers that don't support ntlm)
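a quick sketch of that kind of probe (GSSAPI is just an arbitrary example mechanism, and this assumes the servers expose supportedSASLMechanisms on the root DSE to an anonymous bind):

```python
import time
from ldap3 import Server, Connection, BASE

def probe(host, required_mech='GSSAPI'):
    """Time a root DSE read and return the RTT, or None if the server
    doesn't advertise the SASL mechanism we need."""
    conn = Connection(Server(host, connect_timeout=5), auto_bind=True)
    start = time.perf_counter()
    conn.search('', '(objectClass=*)', search_scope=BASE,
                attributes=['supportedSASLMechanisms'])
    rtt = time.perf_counter() - start
    mechs = conn.entries[0].supportedSASLMechanisms.values if conn.entries else []
    conn.unbind()
    return rtt if required_mech in mechs else None

rtts = {}
for host in ['ldaps://server1', 'ldaps://server2', 'ldaps://server3']:
    rtt = probe(host)
    if rtt is not None:                # rule out servers without the mechanism
        rtts[host] = rtt
ordered = sorted(rtts, key=rtts.get)   # fastest usable server first
```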

However, I am not sure whether such a heartbeat request (aka timing request) would actually add overhead. So maybe this timing request would only be sent on every 5th or 10th search...

does it make sense to do it on top of the requests themselves? that sort of assumes some constant rate of requests, and I'd think you'd want to use time as your metric for when to reevaluate network speed/topology, rather than request count. also, if many requests are being made in parallel on a connection then using request count could produce some weird behavior. it might break outstanding searches/paginated searches if we try to change server while they're ongoing as a result of reevaluating every Nth query. or if a connection is using kerberos auth, it'll result in reaching out to the kdcs to get a new ticket at seemingly random intervals (whenever we switch servers), which might cause unexpected behavior or a performance impact, since we're not taking kdc RTT into account and that could add significant overhead if we end up switching servers too often.

so I'd think we'd want to assess based on a time interval. there would also need to either be a read/write lock set up for switching over the server, or a mechanism for paginated queries to remember which server they were querying, even once new queries switch over to a new server, so that they can continue iterating pages uninterrupted. the lock is probably simpler to implement, but the notion of memory for paginated queries would be more performant on a heavily utilized connection
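a minimal sketch of the time-interval idea with a single lock (the names and the 600 second default are made up):

```python
import threading
import time

class TimedReorderingPool:
    """Keeps a host list ordered fastest-first and re-measures at most once
    per `interval` seconds; the lock keeps a reorder from racing with callers."""

    def __init__(self, hosts, measure, interval=600):
        self._hosts = list(hosts)     # kept ordered fastest-first
        self._measure = measure       # callable: host -> seconds (e.g. the LDAP ping above)
        self._interval = interval     # seconds between reassessments
        self._last_check = None
        self._lock = threading.Lock()

    def current_server(self):
        with self._lock:              # a single lock instead of a full read/write lock, for simplicity
            now = time.monotonic()
            if self._last_check is None or now - self._last_check > self._interval:
                self._hosts.sort(key=self._measure)
                self._last_check = now
            return self._hosts[0]
```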

cornelinux commented 3 years ago

I think I personally would not need support for paginated queries. Honestly, I do not expect that a server list will change its ordering too often. As mentioned, my simple thought was that an administrator does not need to think about where his domain controllers are located. The admin would simply drop in the list of DCs and then let the logic find out which LDAP server is located in the same country (fast) and which one is located on the dark side of the moon (slow and dark).

So going with a time interval is totally fine. We could even rule out some spikes in the average response time. But on the other hand I would not overcomplicate things in the first step.

The administrator or the programmer can still decide if he wants to try the new experimental without-any-field-experience pooling strategy TIME_WEIGHTED_FIRST or go with the old reliable FIRST.

This would be my pragmatic approach.