Apache Accumulo
https://accumulo.apache.org
Apache License 2.0

Avoid spamming log with DefaultCompactionPlanner messages when exceeding max files and no files to compact found. #4409

Open EdColeman opened 7 months ago

EdColeman commented 7 months ago

The DefaultCompactionPlanner emits a warning message every time no set of files to compact can be found after the max file ratio has been exceeded. These messages can be generated in rapid succession.

The message that is causing the issue:

        log.warn(
            "Attempted to lower compaction ration from {} to {} for {} because there are {} files "
                + "and the max tablet files is {}, however no set of files to compact were found.",
            params.getRatio(), highRatio, params.getTableId(), params.getCandidates().size(),
            maxTabletFiles);
ddanielr commented 7 months ago

also update the "ration" to "ratio" spelling

dlmarion commented 6 months ago

I think there may be a no-code solution to this when using Log4J2. Log4J2 has the concept of Filters, which can be applied to a specific Logger (see bullet 2). A BurstFilter can be configured by the user to rate limit the number of messages that they are logging. An example would be something like:

<Logger name="com.example.SomeClass" level="INFO">
    <BurstFilter level="WARN" rate="10" maxBurst="20" />
</Logger>
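
For reference, a fuller configuration sketch (the logger name below assumes the message originates in org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner, and rate is messages per second):

    <Configuration>
      <Appenders>
        <Console name="stdout">
          <PatternLayout pattern="%d %-5p [%c] %m%n"/>
        </Console>
      </Appenders>
      <Loggers>
        <!-- rate limit WARN (and below) from the planner to ~10/sec, bursts up to 20 -->
        <Logger name="org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner" level="INFO">
          <BurstFilter level="WARN" rate="10" maxBurst="20"/>
        </Logger>
        <Root level="INFO">
          <AppenderRef ref="stdout"/>
        </Root>
      </Loggers>
    </Configuration>
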
keith-turner commented 6 months ago

I think there may be a no-code solution to this when using Log4J2. Log4J2 has the concept of Filters, which can be applied to a specific Logger (see bullet 2). A BurstFilter can be configured by the user to rate limit the number of messages that they are logging.

I think we could remove some of the suppression that has already been added to the code in favor of this. We may want to use more specifically named loggers for known bursty logging to allow targeted suppression that does not suppress other unrelated log messages.
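
For example (a sketch with hypothetical names), the known-bursty warning could get its own logger so a filter or level override can target just that name:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class DefaultCompactionPlanner {
      // normal class logger for everything else
      private static final Logger log =
          LoggerFactory.getLogger(DefaultCompactionPlanner.class);

      // hypothetical: a dedicated logger name (".nofiles" is made up) for the
      // known-bursty warning, so a BurstFilter can be attached to exactly this
      // name in the config without suppressing the planner's other messages
      private static final Logger noFilesLog = LoggerFactory
          .getLogger(DefaultCompactionPlanner.class.getName() + ".nofiles");
    }
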

dlmarion commented 6 months ago

CompositeFilters also exist; I wonder if you could combine a RegexFilter and a BurstFilter on a specific Logger.
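
A sketch of what that could look like (filter semantics assumed: a message matching the regex falls through to the BurstFilter, while everything else is logged normally):

    <Logger name="org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner" level="INFO">
      <Filters>
        <!-- only the spammy message continues on to the BurstFilter -->
        <RegexFilter regex=".*no set of files to compact.*" onMatch="NEUTRAL" onMismatch="ACCEPT"/>
        <BurstFilter level="WARN" rate="1" maxBurst="5"/>
      </Filters>
    </Logger>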

EdColeman commented 6 months ago

Are there any logging performance considerations?

Particularly when we know ahead of time that a message could be quite spammy in certain situations? Suppressing the message in code (at the source), when we know up front that suppression is desired, removes all of the work to format, output, collect, and then filter that would otherwise need to be done by the logging system.

In places where you cannot modify the source, a filter may be a useful thing to deploy. But when the situation is known up front and fully controlled by us, it may be better to just adjust the code to log reasonably in cases like this.

Keith had what appeared to be a good, general solution that we could likely use in multiple places.

dlmarion commented 6 months ago

Are there any logging performance considerations?

I'm not sure.

dlmarion commented 6 months ago

One issue with suppressing messages in the code vs. the configuration is that we could suppress a single message that might help the user track something down. Say an issue is happening with a specific tablet that the user is tracking, but they miss messages about that tablet because those messages were suppressed. I think there are pros and cons to both approaches.

dlmarion commented 6 months ago

@keith-turner @EdColeman - I created #4488 as another possible solution. This would allow you to create two Logger variables in a class, using one variable to always log and the other to log on some interval. This requires no configuration changes, avoids the possible performance penalties of using a BurstFilter, and does not require a Cache like in Keith's commit.
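
A minimal sketch of that pattern (class and field names are hypothetical; #4488 may differ in detail): one logger always logs, while the second only emits if the configured interval has elapsed since its last message.

    import java.time.Duration;
    import java.util.concurrent.atomic.AtomicLong;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class PlannerLogging {

      // always logs
      private static final Logger log = LoggerFactory.getLogger(PlannerLogging.class);

      // emits at most one message per period, silently dropping the rest
      private static final IntervalLogger intervalLog =
          new IntervalLogger(log, Duration.ofMinutes(5));

      static class IntervalLogger {
        private final Logger delegate;
        private final long periodMillis;
        private final AtomicLong lastEmit = new AtomicLong();

        IntervalLogger(Logger delegate, Duration period) {
          this.delegate = delegate;
          this.periodMillis = period.toMillis();
        }

        void warn(String format, Object... args) {
          long now = System.currentTimeMillis();
          long last = lastEmit.get();
          // log only if the period has elapsed and this thread wins the update race
          if (now - last >= periodMillis && lastEmit.compareAndSet(last, now)) {
            delegate.warn(format, args);
          }
        }
      }
    }
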

EdColeman commented 6 months ago

I can see #4488 being useful for something like status reporting - one example that I can think of would be a loop that checks status (say, tservers registered, compactions running, ...) and reports 'okay' at an interval less frequent than the loop itself. The loop could stay frequent (to catch changes and issues) while the logging interval reduces noise in the logs. Particularly if the log information is verbose (say it contains a snapshot of something) - the snapshot could be useful for knowing some state at a given time so that subsequent events can be pieced together.

What follows is me trying to understand what we are trying to solve, and explicitly stating it in case it helps others; apologies if this is just repeating the obvious.

For the original issue, I think we need to step back and outline the requirements better. Maybe along the lines of:

- the first occurrence of a message is always logged
- subsequent occurrences of the "same" message are suppressed and only logged at N occurrences
- if the condition persists, the message is repeated at some interval so it remains discoverable in the logs

We also need to mind what each message occurrence means. For example, if a "tablet un-hosted" message includes the tablet id, then 'tablet x un-hosted' and 'tablet y un-hosted' are distinct messages. In that case we would want to log the first occurrence of each, and not lump all un-hosted messages into the same pile. We also would likely not want to just print the first N un-hosted messages and then suppress the rest for a given period (burst suppression?).

The cache solution seems to meet most of the requirements as long as the cache sizes can be bounded. Caching seems most appropriate when a limited, bounded number of messages would be generated over the cache hold interval. If a message was generated for every tablet of a large table, then either the cache could grow very large, or, if we also limit the size (the number of messages held), the suppression could be bypassed: so many messages could be added that by the next round the first messages would have been evicted and considered new again.
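
To make the bounded-cache concern concrete, here is a minimal sketch of cache-based duplicate suppression (assuming a Caffeine cache; Keith's commit may differ in detail). The maximumSize bound is exactly where the bypass described above can occur:

    import java.time.Duration;

    import org.slf4j.Logger;

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public class DedupLogger {
      // a message key is logged only while it is absent from this
      // time- and size-bounded cache
      private final Cache<String,Boolean> recent = Caffeine.newBuilder()
          .expireAfterWrite(Duration.ofMinutes(5))
          // bounding the size keeps memory safe, but under heavy message volume
          // eviction can re-admit "old" keys, bypassing the suppression
          .maximumSize(1_000)
          .build();

      private final Logger delegate;

      public DedupLogger(Logger delegate) {
        this.delegate = delegate;
      }

      public void warn(String key, String format, Object... args) {
        // putIfAbsent returns null only for the first writer of the key
        if (recent.asMap().putIfAbsent(key, Boolean.TRUE) == null) {
          delegate.warn(format, args);
        }
      }
    }
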

In general, would we be better off looking at the error handling side of things rather than focusing on the message? The origin of the issue is something that is reported as a warning - so, at a minimum, the first message should be logged. Repeating at an interval would help make sure that, if the problem persists, it is easier to discover in the logs - especially if there is a large gap between when the issue occurred and when it is discovered to be an issue.

In this case, if the compaction ratio and the files available are incompatible - what, short of changing the ratio, would change the situation? How fast is that likely to change? Would it be appropriate to reduce the check frequency once the error has been detected - maybe with a back-off?

dlmarion commented 6 months ago

subsequent occurrences of the "same" message are suppressed and only logged at N occurrences

I think I lost sight of this, suppressing exact duplicates. There is a similar TODO in elasticity that talks about message spamming (below). I don't know if that requires duplicate suppression or not.

https://github.com/apache/accumulo/blob/elasticity/server/base/src/main/java/org/apache/accumulo/server/compaction/CompactionJobGenerator.java#L94-L98