gchq / stroom

Stroom is a highly scalable data storage, processing and analysis platform.
https://gchq.github.io/stroom-docs/
Apache License 2.0

Slow DNS name resolution delays back-end process completion #2025

Open p-kimberley opened 3 years ago

p-kimberley commented 3 years ago

Summary

In an environment where DNS lookups are slow to resolve, or fail after a timeout, users experience delays when performing various actions via the front-end, such as exporting an object or listing configuration properties.

The delay is caused by the logging subsystem, which attempts to perform a reverse DNS lookup to resolve the hostname of the request client.

Reproduction

Deploy Stroom to an environment like Kubernetes where:

  1. Processing nodes see request client IPs as a cluster-local IP address (such as 10.42.5.1) that is not resolvable to a hostname.
  2. Upstream DNS servers are configured to wait a period of time before responding (e.g. with NXDOMAIN or SERVFAIL).

Where the delay in reverse DNS lookup at the upstream server is significant (e.g. 3 seconds), the result will be frequent, noticeable delays to actions performed by the user.
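To confirm that an environment is affected, the cost of a single reverse lookup can be measured from a processing node. The following is a minimal sketch, not Stroom code: the class `ReverseLookupTimer` and its methods are hypothetical names, and only the JDK's `InetAddress` API is assumed.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.Function;

/** Hypothetical diagnostic helper (not part of Stroom). */
class ReverseLookupTimer {

    /** Reverse-resolves an IP literal using the JDK resolver. */
    static String jdkReverseLookup(String ip) {
        try {
            // For an IP literal, getByName() only parses the address;
            // getCanonicalHostName() performs the reverse lookup and
            // returns the textual IP if no name can be found.
            return InetAddress.getByName(ip).getCanonicalHostName();
        } catch (UnknownHostException e) {
            return ip; // malformed literal: fall back to the input
        }
    }

    /** Returns the wall-clock time, in milliseconds, taken by one lookup. */
    static long timeMillis(Function<String, String> lookup, String ip) {
        long start = System.nanoTime();
        lookup.apply(ip);
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Calling `ReverseLookupTimer.timeMillis(ReverseLookupTimer::jdkReverseLookup, "10.42.5.1")` on a node should return a value close to the upstream timeout if the environment matches the conditions above. The JDK may cache lookup results, so repeated calls in the same JVM can be faster than the first.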

Root causes:

  1. StroomEventLoggingService::getClient() attempts a DNS lookup each time it is invoked. As the result is not cached, a lookup is performed every time a user event is generated.
  2. ServletRequest.getRemoteAddr() returns the address of the immediate peer on the connection. Where Stroom nodes are behind a reverse proxy or load balancer, this is the proxy's address rather than the original client IP.
  3. DefaultEventLoggingService.log() is called synchronously, so the caller waits until the hostname lookup has completed and the event has been raised.
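Root cause 1 could be addressed by caching lookup results per IP address. The sketch below illustrates the idea only and is not Stroom's implementation: `CachingHostnameResolver` is a hypothetical class, and the resolver function passed in is assumed to wrap something like `InetAddress.getByName(ip).getCanonicalHostName()`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper (not part of Stroom): caches IP-to-hostname lookups. */
class CachingHostnameResolver {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> resolver;

    /**
     * @param resolver performs the real (potentially slow) reverse lookup;
     *                 assumed to return the IP itself when no name is found
     */
    CachingHostnameResolver(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Returns the cached hostname, invoking the resolver only on a miss. */
    String hostnameFor(String ip) {
        // computeIfAbsent stores the resolver's result, so later calls
        // for the same IP return without touching DNS.
        return cache.computeIfAbsent(ip, resolver);
    }
}
```

This sketch never evicts or expires entries, and a slow first lookup still blocks the thread that triggers it. A production version would probably want a bounded cache with a time-to-live.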

Recommendations:

stroomdev66 commented 3 years ago

Added caching for IP address to hostname lookups. X-FORWARDED-FOR is now used to try and get the client address. Asynchronous event logging will require more significant work as the event creation code would need to be asynchronous.
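For readers unfamiliar with the header, the following is a rough sketch of how a client address might be derived from `X-Forwarded-For`, falling back to the connection's remote address. It is not the code that was committed to Stroom: `ClientAddress.resolve` is a hypothetical helper, and it takes plain strings so that it does not depend on the servlet API.

```java
/** Hypothetical helper (not part of Stroom). */
class ClientAddress {

    /**
     * @param forwardedFor raw X-Forwarded-For header value, or null if absent
     * @param remoteAddr   address of the immediate peer, e.g. the value of
     *                     ServletRequest.getRemoteAddr()
     * @return the left-most non-empty header entry, otherwise remoteAddr
     */
    static String resolve(String forwardedFor, String remoteAddr) {
        if (forwardedFor != null) {
            // The header is a comma-separated list; by convention the
            // left-most entry is the original client and each proxy
            // appends the address it received the request from.
            for (String part : forwardedFor.split(",")) {
                String candidate = part.trim();
                if (!candidate.isEmpty()) {
                    return candidate;
                }
            }
        }
        return remoteAddr;
    }
}
```

With this sketch, `ClientAddress.resolve("203.0.113.7, 10.42.5.1", "10.42.5.1")` returns `203.0.113.7`, and a missing or blank header yields the remote address. The header is supplied by the client unless a trusted proxy overwrites it, so the value should only be relied on when the proxy chain is under the operator's control.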