Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Lookup Table Cache caching NULL values, never clears unless Expire After time elapses #13579

Closed drewmiranda-gl closed 12 months ago

drewmiranda-gl commented 1 year ago

The way the Lookup Table Cache works can lead to some unexpected (and unwanted) outcomes. For example, if a lookup value does NOT EXIST in the cache and the data adapter returns a null value (either because it's inaccessible or doesn't have a value to return), the Lookup Table cache saves that null. All subsequent lookups for that value return null/empty, even if the data adapter can now serve the request correctly. The entry is only cleared after one of the "Expire after" conditions is met (such as Expire after access, defaults to 60s, or Expire after write, defaults to Never).

This creates an unwanted condition where a momentary failure of the data adapter causes Graylog to be unable to look up that value, potentially forever. For example, if the cache is set to Expire after access: 60s and logs containing that lookup value keep sending requests to the lookup table (and thus its cache) more often than once every 60s, each request counts as an access and resets the timer, so the cache always returns null/empty.
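To make the timing concrete, here is a minimal simulation of expire-after-access caching with a cached null. It is only a sketch under my own assumptions: `AccessExpiringCache`, the `records` dict standing in for the DNS server, and the explicit `now` argument are all hypothetical and are not Graylog's implementation.

```python
from typing import Callable, Dict, Optional, Tuple


class AccessExpiringCache:
    """Toy cache with expire-after-access semantics (hypothetical, not Graylog code)."""

    def __init__(self, loader: Callable[[str], Optional[str]], ttl: float) -> None:
        self.loader = loader
        self.ttl = ttl
        # key -> (value, time of last access); the value may be None
        self.entries: Dict[str, Tuple[Optional[str], float]] = {}

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            value = entry[0]          # fresh hit, even if the value is None
        else:
            value = self.loader(key)  # miss or expired: ask the data adapter
        self.entries[key] = (value, now)  # every access resets the timer
        return value


# Hypothetical adapter state: the PTR record is missing at first.
records: Dict[str, str] = {}
cache = AccessExpiringCache(records.get, ttl=60.0)

first = cache.get("10.0.0.5", now=0.0)   # adapter has nothing, None is cached
records["10.0.0.5"] = "printer.lan"      # record appears on the DNS server

# One lookup every 30 s keeps resetting the timer, so the stale None survives.
steady = [cache.get("10.0.0.5", now=float(t)) for t in range(30, 301, 30)]

# Only a gap longer than the TTL lets the entry expire and reload.
after_gap = cache.get("10.0.0.5", now=361.0)
```

In this sketch every value in `steady` is still `None` although the record exists, and only `after_gap` picks up the new hostname.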

I observed this behavior in my Graylog lab with a reverse DNS lookup table: the lookup table would abruptly stop returning a value for a given IP, despite the DNS server (and thus the data adapter) having a valid value to return.

Expected Behavior

This is tricky because caching NULL/empty results is super helpful for performance. My expectation is that as soon as that lookup value is available again, the cache is populated with it.

Current Behavior

A null value is cached, and if logs continue to look up that key from the lookup table, the cache entry never expires, so the lookup always returns null.

Possible Solution

A workaround on the client/customer side is to enable "Expire after write". However, this makes the cache less useful, since it empties itself at a regular interval.
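The trade-off of the "Expire after write" workaround can be sketched as follows. This is an illustration under my own assumptions, not Graylog's cache: `WriteExpiringCache`, `adapter`, `adapter_calls` and the `records` dict are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Callable, Dict, List, Optional, Tuple


class WriteExpiringCache:
    """Toy cache with expire-after-write semantics (hypothetical, not Graylog code)."""

    def __init__(self, loader: Callable[[str], Optional[str]], ttl: float) -> None:
        self.loader = loader
        self.ttl = ttl
        # key -> (value, time the value was written)
        self.entries: Dict[str, Tuple[Optional[str], float]] = {}

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]            # reads do not extend the lifetime
        value = self.loader(key)       # miss or expired: ask the data adapter
        self.entries[key] = (value, now)
        return value


records: Dict[str, str] = {}
adapter_calls: List[str] = []


def adapter(key: str) -> Optional[str]:
    """Stand-in data adapter that records every query it receives."""
    adapter_calls.append(key)
    return records.get(key)


cache = WriteExpiringCache(adapter, ttl=60.0)
cache.get("10.0.0.5", now=0.0)        # None is cached at t=0
records["10.0.0.5"] = "printer.lan"   # record appears on the DNS server

# Same steady traffic as before: one lookup every 30 s for five minutes.
results = [cache.get("10.0.0.5", now=float(t)) for t in range(30, 301, 30)]
```

Here the stale null is gone once the entry is 60s old, but the valid hostname is also thrown away and reloaded every 60s: the adapter is queried 6 times over the run, which is the reduced usefulness described above.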

Not caching NULL values at all would also hurt performance, especially with a large number of DNS queries where the same IP repeatedly has no result to be found.

IMO it could be helpful (but I'm not sure if it's possible) for the cache to store a NULL value when the DNS server genuinely doesn't have a result, but not to cache NULL when the DNS server times out. It could also be helpful not to clear/overwrite an existing cache item if the current result is NULL but there was a valid previous value.
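A rough sketch of both ideas, assuming the adapter can signal "could not answer" separately from "no result". That separation is my assumption, as are all names here (`AdapterUnavailable`, `FailureAwareCache`, `refresh`); none of them are existing Graylog APIs.

```python
from typing import Callable, Dict, Optional


class AdapterUnavailable(Exception):
    """Hypothetical signal that the adapter could not answer (for example a timeout)."""


class FailureAwareCache:
    """Toy cache that stores genuine empty results but never caches failures."""

    def __init__(self, loader: Callable[[str], Optional[str]]) -> None:
        self.loader = loader
        self.entries: Dict[str, Optional[str]] = {}

    def get(self, key: str) -> Optional[str]:
        if key in self.entries:
            return self.entries[key]
        try:
            value = self.loader(key)
        except AdapterUnavailable:
            return None                # not cached, so the next lookup retries
        self.entries[key] = value      # real answers, including None, are cached
        return value

    def refresh(self, key: str) -> Optional[str]:
        """Reload a key, keeping the last good value if the reload comes back empty."""
        previous = self.entries.get(key)
        try:
            value = self.loader(key)
        except AdapterUnavailable:
            return previous
        if value is None and previous is not None:
            return previous            # do not overwrite a valid value with None
        self.entries[key] = value
        return value


state = {"down": True}
records: Dict[str, str] = {"10.0.0.5": "printer.lan"}


def adapter(key: str) -> Optional[str]:
    """Stand-in adapter that times out while state['down'] is set."""
    if state["down"]:
        raise AdapterUnavailable(key)
    return records.get(key)


cache = FailureAwareCache(adapter)
during_outage = cache.get("10.0.0.5")    # timeout: None returned, nothing cached
state["down"] = False
after_recovery = cache.get("10.0.0.5")   # adapter is back, value is loaded
missing = cache.get("10.0.0.99")         # genuine "no result" is cached as None
del records["10.0.0.5"]
kept = cache.refresh("10.0.0.5")         # empty reload keeps the previous value
```

Keeping the previous value on an empty reload means a record that was really deleted stays cached until it expires some other way, so this part is a trade-off rather than a free win.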

A more complex (and likely infeasible) approach could be to have a scheduler periodically check the contents of the cache to see if values have changed and invalidate the entries that did. This gets a bit more complex than a vanilla data adapter, though.
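For illustration only, the periodic check could look roughly like the function below, which a scheduler would call at some interval. `revalidate` and the plain dicts standing in for the cache and the DNS server are hypothetical; this is not how Graylog's cache is structured.

```python
from typing import Callable, Dict, List, Optional


def revalidate(
    cache: Dict[str, Optional[str]],
    loader: Callable[[str], Optional[str]],
) -> List[str]:
    """Drop entries whose cached value no longer matches the adapter.

    Returns the keys that were invalidated. Costs one adapter query per
    cached key on every pass.
    """
    stale = [key for key, cached in cache.items() if loader(key) != cached]
    for key in stale:
        del cache[key]
    return stale


# Hypothetical state: a null was cached before the PTR record existed.
cache: Dict[str, Optional[str]] = {"10.0.0.5": None, "10.0.0.6": "nas.lan"}
records: Dict[str, str] = {"10.0.0.5": "printer.lan", "10.0.0.6": "nas.lan"}

dropped = revalidate(cache, records.get)
```

After one pass only the stale null for `10.0.0.5` is dropped, so the next lookup reloads it. The cost is the catch: every pass re-queries the adapter for every cached key, which is a large part of why this looks infeasible for big caches.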

Steps to Reproduce (for bugs)

  1. Set up a lookup table with a DNS Lookup data adapter (Reverse lookup, PTR) and a Node-local, in-memory cache
  2. Using the edit page for the lookup table, look up a value that doesn't exist on the DNS server
  3. Add a record to the DNS server so the lookup will succeed if queried again (after the cache expires!)
  4. Using the data adapter edit page, query the IP again and validate that you get a DNS result
  5. Query from the lookup table edit page again. Observe that the lookup result is still NULL
  6. If you continue to query, the lookup result will always be null. Only after waiting a full 60s without querying (so that Expire after access elapses) do you get a result for the lookup table query
    • Alternatively, purging the key from the cache (or even purging all keys) does the same thing as waiting for the cache to expire

Context

In this specific context I'm using rDNS lookups to identify devices on my network that don't have hostnames assigned. This works great except when the DNS server occasionally times out: the NULL value is then cached, leading to repeated logs with no rDNS result.

Your Environment

As always, happy to discuss further.

patrickmann commented 12 months ago

Closing, since we also have #15200 and don't need 2 issues for tracking.