
Explore using Jetty QoS handler to limit log message in the monitor #4877

Open keith-turner opened 2 months ago

keith-turner commented 2 months ago

Is your feature request related to a problem? Please describe.

When Accumulo server processes send log messages to the monitor process, it's possible the log messages will continually build up in monitor memory. One thing that leads to this is that Jetty has an unbounded queue on its thread pool, so when the threads processing log messages cannot keep up, the messages start to build up.

Describe the solution you'd like

The Jetty docs suggest using the QoSHandler to avoid the problem of too much data building up on the thread pool queue. It would be useful to know if the monitor could leverage this to avoid putting too many entries on the Jetty thread pool queue.
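
For illustration, roughly what the wiring might look like, assuming the Jetty 12 QoSHandler API (older Jetty lines only offer the servlet QoSFilter); the handler, port, and limits below are placeholders, not the monitor's actual code:

```java
import java.time.Duration;

import org.eclipse.jetty.server.Handler;
import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Response;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.QoSHandler;
import org.eclipse.jetty.util.Callback;

public class QoSHandlerSketch {
  public static void main(String[] args) throws Exception {
    Server server = new Server(9995); // placeholder port

    // Stand-in for the monitor's handler that accepts posted log events.
    Handler logHandler = new Handler.Abstract() {
      @Override
      public boolean handle(Request request, Response response, Callback callback) {
        // ... process the posted log event ...
        callback.succeeded();
        return true;
      }
    };

    // QoSHandler bounds concurrency: excess requests are suspended rather
    // than piling up as tasks on the thread pool's unbounded queue, and are
    // failed if suspended longer than maxSuspend.
    QoSHandler qos = new QoSHandler();
    qos.setMaxRequestCount(32);                // placeholder concurrency limit
    qos.setMaxSuspend(Duration.ofSeconds(10)); // placeholder suspend timeout
    qos.setHandler(logHandler);

    server.setHandler(qos);
    server.start();
    server.join();
  }
}
```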

Describe alternatives you've considered

If the QoSHandler is not workable, we would need to see what else Jetty has to offer.

Another thing to consider is client-side backpressure or dropping log messages. When something appends using AccumuloMonitorAppender.java, it creates a future which it ignores, so each log append has no concept of what happened with the previous append. I'm not really sure what happens when the previously ignored futures are still in progress because the monitor is not keeping up. Does the client just keep queuing up stuff to send, adding more HTTP requests?

keith-turner commented 2 months ago

Did not set a milestone for this because I was unsure what the changes would look like. If the complexity of the change is low, then we may want to consider fixing this in 2.1.

dlmarion commented 2 months ago

I remembered that I had made a comment somewhere about filtering in log4j. This is the comment. I wonder if this filter could be applied to the MonitorLoggerAppender in the configuration. If this works, then this would be a no-code solution that a user could apply on their own.

Update: It looks like Filters can be attached to Appenders, so this should be possible. Reference: https://logging.apache.org/log4j/2.x/javadoc/log4j-core/org/apache/logging/log4j/core/filter/BurstFilter.html
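
For illustration, a rough sketch of what that could look like in a log4j2 properties-style config; the appender name and plugin type here are placeholders, not Accumulo's shipped configuration:

```
# Hypothetical snippet; appender name/type are illustrative placeholders.
appender.monitor.type = AccumuloMonitorAppender
appender.monitor.name = MonitorLog

# BurstFilter: events at or below 'level' are rate limited once the allowance
# is exhausted; events above 'level' always pass. 'rate' is the average
# events/second allowed; 'maxBurst' is the momentary ceiling (defaults to
# 10x rate).
appender.monitor.filter.burst.type = BurstFilter
appender.monitor.filter.burst.level = WARN
appender.monitor.filter.burst.rate = 100
appender.monitor.filter.burst.maxBurst = 1000
```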

ctubbsii commented 2 months ago

I'm not opposed to pursuing the addition of the QoS handler, if it's a small amount of code (which I suspect it would be). But I also think that if this is only an issue with the logging, then users should really implement their own filters to control that. I would like to discourage the use of the monitor appender as the primary means of collecting and viewing logs, or alerting on log events. It's really just a convenience, not something suitable for reliable log watching, and there are much better appenders for proper log collection, monitoring, and alerting that users should configure for their specific use cases in their specific deployed environment. (Personally, I'm a big fan of just using a basic console appender and collecting logs via journald or rsyslog, like any other Linux daemon service, rather than doing any kind of complex logging in the Java application itself.)

ctubbsii commented 2 months ago

I'm not really sure what happens when the previously ignored futures are still in progress because the monitor is not keeping up. Does the client just keep queuing up stuff to send, adding more HTTP requests?

Client-side backpressure for logging to the monitor appender should be handled by the async logger feature (see -Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector in our conf/accumulo-env.sh script and https://logging.apache.org/log4j/2.x/manual/async.html). There are properties for setting the discard threshold.
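
For example, roughly like this (the property names are from the linked manual; the values are illustrative). The Discard policy drops events at or below the threshold level once the async queue is full:

```
# appended to JAVA_OPTS in conf/accumulo-env.sh; values illustrative
-Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
-Dlog4j2.asyncQueueFullPolicy=Discard
-Dlog4j2.discardThreshold=INFO
```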

keith-turner commented 2 months ago

Client-side backpressure for logging to the monitor appender should be handled by the async logger feature (see -Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector in our conf/accumulo-env.sh script and https://logging.apache.org/log4j/2.x/manual/async.html). There are properties for setting the discard threshold.

One thing I am uncertain about with the AsyncLoggerContextSelector is that AccumuloMonitorAppender will create an async HTTP task to send the data and not wait on the future. So maybe from log4j's perspective appending is always instant, and it just keeps queuing HTTP sends on the client side.

keith-turner commented 2 months ago

if this filter could be applied to the MonitorLoggerAppender in the configuration. If this works, then this would be a no-code solution that a user could apply on their own.

One thing to consider about rate limiting on the client side is that it may require adjusting the log4j config based on the number of servers. For example, if I want to limit to 100K log events per second and I have 100 servers, then I would need to configure log4j on each server with a max rate of 1,000. If I have 200 servers, then I would need to set the max rate in log4j to 500 to get the same overall rate. If the rate could be configured in the monitor on the server side, it would make configuring this simpler.

ctubbsii commented 2 months ago

So maybe from log4j's perspective appending is always instant, and it just keeps queuing HTTP sends on the client side.

I believe this is true, provided you have the appropriate LMAX Disruptor library on the classpath. However, the client-side queue should be configurable... I thought the discardThreshold configuration property would be helpful here, but it only applies when the queue is full, and it drops items of a certain log level... not a blanket drop of all items, unless we set the discard threshold to the max log severity, perhaps?

keith-turner commented 2 months ago

I dug into what this code is doing. The javadocs mention that an executor will be created if one is not supplied; this executor is used for the sendAsync request. The javadocs do not say anything about what kind of executor is created, but looking at the implementation of HttpClient in JDK 17, it calls Executors.newCachedThreadPool. So it seems the current code may create an HTTP client thread for each log4j append request. Since the future is ignored, this could be a lot of threads. It also means that log4j will have no resistance or backpressure until something breaks from too many threads in the server process. So I'm thinking the code should wait on that future, which would allow the log4j async mechanism to kick in.
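
For illustration, a minimal stand-alone sketch of the pattern being described (the endpoint and payload are made up; this is not the real AccumuloMonitorAppender code):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SendAsyncSketch {
  public static void main(String[] args) throws Exception {
    // No executor is supplied, so HttpClient builds its own; in the JDK 17
    // implementation that is Executors.newCachedThreadPool, which adds a
    // thread whenever all existing ones are busy.
    HttpClient client = HttpClient.newHttpClient();

    HttpRequest request = HttpRequest
        .newBuilder(URI.create("http://monitor:9995/logs")) // made-up endpoint
        .POST(HttpRequest.BodyPublishers.ofString("{\"message\":\"...\"}"))
        .build();

    // Fire-and-forget, as the appender does today: append() returns
    // immediately no matter how far behind the monitor is.
    client.sendAsync(request, HttpResponse.BodyHandlers.discarding());

    // Waiting on the future instead ties append() to the monitor's pace,
    // letting log4j's async queue fill and its discard policy kick in.
    client.sendAsync(request, HttpResponse.BodyHandlers.discarding()).get();
  }
}
```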

ctubbsii commented 2 months ago

So I'm thinking the code should wait on that future, which would allow the log4j async mechanism to kick in.

I'm having trouble understanding how waiting on the future would have anything to do with the number of threads. The future represents the executable task, not the thread, right? Since this is best effort, we shouldn't care whether the task finished or not. My expectation for the HttpClient is that it will maintain threads/connections internally, and that is largely abstracted from us. However, this is not the log4j async mechanism; that's just how the appender works when it has events to append to the connection. The log4j async mechanism works at a level above this, prior to entering the appender, and is handled by the LMAX Disruptor threads/executors for the log messages, not the HTTP messages they get translated into.

The log4j async mechanism is supposed to drop messages after a best-effort append. We could fail more often in this HTTP code in response to backpressure from the HTTP connection, to let it do that, but I think that would end up resulting in a lot of retries in the log4j code without additional config to limit them.

The other aspect I'm not sure about is whether the log4j async queues are actually filling up. If so, a user might want to more aggressively drop stuff from the async queues at the max log level threshold. I think by default it only drops stuff at INFO and lower. That should be a user configuration change, though, not a code change.

So, I think there's potential optimization in the client:

  1. User configuration of async logging options,
  2. HttpClient executor threads/future handling,
  3. Applying backpressure from the HttpClient-based Appender by throwing exceptions back out to the log4j framework, to be handled by the log4j async stuff (a sketch of this follows after this list).
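
For illustration, a hedged sketch of item 3: a hypothetical appender variant (not the real AccumuloMonitorAppender, and omitting the log4j plugin-factory wiring) that bounds each send and throws back into the framework:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.TimeUnit;

import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.appender.AbstractAppender;
import org.apache.logging.log4j.core.appender.AppenderLoggingException;

public final class BackpressureMonitorAppender extends AbstractAppender {

  private final HttpClient client = HttpClient.newHttpClient();
  private final URI monitorUri; // made-up monitor endpoint

  protected BackpressureMonitorAppender(String name, URI monitorUri) {
    // ignoreExceptions=false so throws from append() propagate to log4j
    super(name, null, null, false, null);
    this.monitorUri = monitorUri;
  }

  @Override
  public void append(LogEvent event) {
    HttpRequest request = HttpRequest.newBuilder(monitorUri)
        .POST(HttpRequest.BodyPublishers.ofString(
            event.getMessage().getFormattedMessage()))
        .build();
    try {
      // Bound the send: a slow monitor now surfaces as a failure instead
      // of another ignored future on a growing pile.
      client.sendAsync(request, HttpResponse.BodyHandlers.discarding())
          .get(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new AppenderLoggingException("interrupted sending to monitor", e);
    } catch (Exception e) {
      // Throwing hands the decision (retry, drop) back to the log4j
      // framework and its async machinery.
      throw new AppenderLoggingException("monitor not keeping up", e);
    }
  }
}
```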

The potential optimization in the server is:

  1. Reduce thread contention by using a better concurrency model or data structure, instead of a synchronized method that gets called by everybody and blocks even simple things like string concatenation/building (#4876),
  2. Implement QoS handler to rate limit the flood of incoming events (#4877, this issue),
  3. Drop the deduplication feature or alter it in some other way to remove the need for concurrency (also discussed on #4876).

keith-turner commented 2 months ago

The log4j async mechanism is supposed to drop messages after a best-effort append.

That sendAsync call is not to log4j, it's to an HttpClient object. The HttpClient sendAsync call happens in the log4j appender implementation.

keith-turner commented 2 months ago

I'm having trouble understanding how waiting on the future would have anything to do with the number of threads. The future represents the executable task, not the thread, right?

AFAICT each call to sendAsync in the HTTP client will create a task on an executor that was created using Executors.newCachedThreadPool. So the append method always returns instantly and always creates a thread.

ctubbsii commented 2 months ago

I'll look into the QoS Handler. The issue regarding the Future and its implications for the log4j2 async stuff in the logger client code is a separate issue, largely unrelated to the QoS Handler, and requires some extra investigation, I think.

ddanielr commented 1 month ago

The other aspect I'm not sure about is whether the log4j async queues are actually filling up. If so, a user might want to more aggressively drop stuff from the async queues at the max log level threshold. I think by default it only drops stuff at INFO and lower. That should be a user configuration change, though, not a code change.

I tested this by setting the asyncQueueFullPolicy to Discard and the discardThreshold to ERROR. https://logging.apache.org/log4j/2.x/manual/async.html#log4j2.asyncQueueFullPolicy

The monitor still fell over with OOM errors under a high volume of log messages, so I don't think the queue is filling up.

cshannon commented 1 month ago

Has anyone taken a heap dump to see which objects are actually using up all the space in memory? I usually use Eclipse MAT to view heap dumps and see what's going on.