canonical / loki-k8s-operator

https://charmhub.io/loki-k8s
Apache License 2.0

Clean up logs based on disk usage #131

Closed Abuelodelanada closed 1 year ago

Abuelodelanada commented 2 years ago

Bug Description

To prevent Loki from crashing due to being out of disk space in the persistent volume, we need to add some way of cleaning out older log entries once we surpass a certain threshold (say 80%) of the PVC max.

There is nothing built into Loki to facilitate this, so it will have to be done either directly in the charm or through a sidecar container. See: https://github.com/grafana/loki/issues/2314

The property would be added to the config.yaml options as maximum_retention_size and would be expressed as a percentage, in string form, with a default value of "80%". In the charm init, when setting up the pebble layer, we would then use lightkube to get the max capacity of the PVC, calculate what that percentage translates to in actual size (i.e. MB, GB, TB), and use that as the threshold for the sidecar container.
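
A minimal sketch of that calculation, assuming we read the PVC capacity with lightkube and convert the percentage string into bytes; the PVC name, namespace, and helper names below are illustrative, not part of the charm:

```python
# Hypothetical sketch: derive a byte threshold for the cleanup sidecar from the
# PVC capacity and the proposed maximum_retention_size config option.
from lightkube import Client
from lightkube.resources.core_v1 import PersistentVolumeClaim

_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_k8s_quantity(quantity: str) -> int:
    """Convert a Kubernetes quantity such as '10Gi' into bytes (binary suffixes only)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain bytes

def retention_threshold_bytes(pvc_name: str, namespace: str, max_retention_size: str = "80%") -> int:
    """Return the byte count at which the sidecar should start deleting old logs."""
    client = Client()
    pvc = client.get(PersistentVolumeClaim, name=pvc_name, namespace=namespace)
    capacity = parse_k8s_quantity(pvc.status.capacity["storage"])
    fraction = float(max_retention_size.rstrip("%")) / 100
    return int(capacity * fraction)
```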

Abuelodelanada commented 2 years ago

Right now I see five alternatives to address this.

1. Use new Log Entry Deletion endpoint.

2. Add a script that deletes old logs (here and here we have an example) executed by cron, in a sidecar container.

3. Add a script that deletes old logs (here and here we have an example) executed by cron, in the charm or Loki container.

4. Add a method (could also be an action) in charm.py that is executed when an update-status event is fired.

5. Push a script/binary to the workload container that monitors disk usage and deletes logs, running as a pebble layer (a rough sketch follows this list).
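
For option 5, a pebble layer entry could look roughly like the sketch below; this is an illustration under assumed names, not the charm's actual code, and the script path, flag, and service name are placeholders:

```python
# Rough illustration of option 5: a pebble layer that runs a hypothetical
# disk-watcher script inside the Loki workload container.
CLEANUP_LAYER = {
    "summary": "log cleanup layer",
    "description": "monitors disk usage and trims old chunks",
    "services": {
        "log-cleanup": {
            "override": "replace",
            "summary": "disk usage watcher",
            "command": "/usr/bin/python3 /opt/log_cleanup.py --threshold 80",
            "startup": "enabled",
        }
    },
}

# In the charm, the script and layer would be pushed/added alongside the Loki
# layer, e.g.:
#   container.push("/opt/log_cleanup.py", script_source, make_dirs=True)
#   container.add_layer("log-cleanup", CLEANUP_LAYER, combine=True)
#   container.replan()
```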

@simskij @mmanciop @rbarry82 Fresh, new comments and ideas are welcome! ;-)

rbarry82 commented 2 years ago

I very much like option 2 (either as a simple Go binary or a Python script, though the Go binary is an easier sidecar), but I worry about what happens if we're messing with the database while Loki is "live", since concurrent access may have undefined/questionable behavior.

What about:

  1. Add a sidecar/service with a Go binary/Python script which uses the Log Entry Deletion endpoint, then sleeps on a timer.
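
A minimal sketch of that sidecar loop, assuming the compactor's log entry deletion endpoint (`/loki/api/v1/delete`) is enabled; the URL, stream selector, and retention period below are illustrative:

```python
# Sketch of a sidecar loop: ask Loki to delete anything older than a retention
# window via the log entry deletion endpoint, then sleep on a timer.
import time
import requests

LOKI_URL = "http://localhost:3100"
RETENTION_SECONDS = 7 * 24 * 3600   # delete anything older than a week (placeholder)
CHECK_INTERVAL = 15 * 60            # wake up every 15 minutes

def delete_old_entries() -> None:
    end = int(time.time()) - RETENTION_SECONDS
    requests.post(
        f"{LOKI_URL}/loki/api/v1/delete",
        params={"query": '{job=~".+"}', "start": 0, "end": end},
        timeout=30,
    ).raise_for_status()

if __name__ == "__main__":
    while True:
        delete_old_entries()
        time.sleep(CHECK_INTERVAL)
```
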
Abuelodelanada commented 2 years ago

Using Log Entry Deletion raises some concerns for me as this functionality is still experimental.

But let's suppose for now that it is stable enough to use: the deletion is based on start and end timestamps, so to implement size-based deletion we would have to find a way to know how many MB a given time range will delete. Reading the API endpoint docs, that does not look easy.

Maybe another approach could be to run a loop: delete logs over a small time range, check disk usage, and repeat until enough space has been freed.

Given that a pebble service doesn't give us a good way to propagate alerts back, what about running this function in update-status?

rbarry82 commented 2 years ago

Other than alerting on logs which it forwards to itself, which is... potentially an option, and doesn't risk long update-status intervals causing problems.

It certainly seems like we could adapt their chunk analyzer to directly map datetimes to sizes, and just run it as a daemon which, whenever it has to trim, sends log messages to Loki (which we can alert on) as a way of propagating the event.
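
As an illustration of the datetime-to-size idea only (not the chunk analyzer itself), a crude approximation would be to bucket the chunk files on disk by modification time and sum their sizes; the chunk directory below is a guess, and inspecting chunks with Loki's own tools would be far more accurate:

```python
# Very rough approximation of a datetime->size mapping: bucket on-disk chunk
# files by modification hour and sum their sizes. Ignores chunk internals.
import os
from collections import defaultdict
from datetime import datetime, timezone

CHUNK_DIR = "/loki/chunks"  # placeholder; depends on the storage config

def size_by_hour(chunk_dir: str = CHUNK_DIR) -> dict:
    buckets: dict = defaultdict(int)
    for root, _dirs, files in os.walk(chunk_dir):
        for name in files:
            path = os.path.join(root, name)
            stat = os.stat(path)
            hour = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).replace(
                minute=0, second=0, microsecond=0
            )
            buckets[hour] += stat.st_size
    return dict(buckets)
```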

sed-i commented 2 years ago

find a way to know how many Mb we are going to delete [...] run a loop

Yes, the deletion can be an iterative process. For example:

  1. $\Delta t = 1\,\mathrm{hr}$, initial guess for deletion period
  2. $[t_{\mathrm{oldest}},\ t_{\mathrm{oldest}} + \Delta t]$, deletion range
  3. $\Delta s$, calculated change in storage
  4. $\Delta t = 1.2 \cdot \frac{s}{\Delta s} \Delta t$, new guess for deletion period
  5. Repeat until enough space has been freed.
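
In code, that iteration might look something like the sketch below, where `delete_range` and `disk_used` are hypothetical helpers (e.g. wrapping the delete endpoint and a disk usage check) and $s$ is taken to be the space still left to free:

```python
# Sketch of the iterative deletion loop above; all helpers are hypothetical.
def free_space(target_bytes: int, oldest_ts: float, delete_range, disk_used) -> None:
    delta_t = 3600.0                     # step 1: 1 hr initial guess for the period
    t = oldest_ts
    freed = 0
    while freed < target_bytes:          # step 5: repeat until enough space is freed
        before = disk_used()
        delete_range(start=t, end=t + delta_t)   # step 2: delete one range
        delta_s = before - disk_used()           # step 3: change in storage
        freed += delta_s
        t += delta_t
        if delta_s > 0:
            # step 4: rescale the period toward what is still needed (s = remaining)
            delta_t = 1.2 * ((target_bytes - freed) / delta_s) * delta_t
```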

what about running this function in the update-status?

Iirc, the strongest argument against was that storage could be filling up much quicker than the update-status interval.

rbarry82 commented 2 years ago

find a way to know how many Mb we are going to delete [...] run a loop

Yes, the deletion can be an iterative process. For example:

  1. $\Delta t = 1\,\mathrm{hr}$, initial guess for deletion period
  2. $[t_{\mathrm{oldest}},\ t_{\mathrm{oldest}} + \Delta t]$, deletion range
  3. $\Delta s$, calculated change in storage
  4. $\Delta t = 1.2 \cdot \frac{s}{\Delta s} \Delta t$, new guess for deletion period
  5. Repeat until enough space has been freed.

$\Delta s$ and $\Delta t$ will be very hard to determine without knowing how much space the logs are actually taking (via chunk-util, probably, since it will be tied to how efficiently logs can be compressed and to their cardinality, which in turn depends on how many labels are in the logs and how unique they are). Unless logs are very homogeneous, we can't take a good guess at this, and the only "reliable" way will be to use Loki's own chunk tools to examine what's really on disk.

With a long-running script/binary which sleeps (rather than any kind of periodic re-execution window), $\vec{\dot{df}}$ can be calculated as long as the chunks are examined (even without deletion) at each iteration interval. Arguably, the sampling rate should be high at process startup, but it can back off.

Something like the following (a rough sketch in Python comes after the list):

  1. Startup: Calculate $\vec{\dot{df}}$ and available space on the PVC. Sleep 60 seconds. We don't know whether this is a new startup (there may not be chunk data at all on a "clean" Loki with no relations) or it may be a pod restart.
  2. Repeat 3-5 times. If $\vec{\dot{df}}$ is stable, check available space only, and flip a boolean (or send a message to a channel in Go) to stop calculating the rate.
  3. If avail_disk_space drops to some threshold (80%? 85%?), re-enable calculation of $\vec{\dot{df}}$. Determine a time interval which will free 5-10% of disk space and call the endpoint. Send a log to Loki which we can alert on.
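
A hedged sketch of that watcher loop, with placeholder thresholds, mount point, and a hypothetical free_oldest_chunks() callback standing in for the chunk-based deletion discussed above:

```python
# Long-running watcher: sample disk usage, back off when the fill rate is
# stable, and trim when the high watermark is crossed.
import shutil
import time

MOUNT = "/loki"            # PVC mount point (placeholder)
HIGH_WATERMARK = 0.80      # start trimming at 80% used
TRIM_TARGET = 0.10         # free roughly 10% of the disk per trim

def used_fraction(path: str = MOUNT) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def watch(free_oldest_chunks) -> None:
    interval = 60          # sample fast at startup, then back off
    previous = used_fraction()
    while True:
        time.sleep(interval)
        current = used_fraction()
        fill_rate = (current - previous) / interval   # crude rate-of-fill estimate
        previous = current
        if current >= HIGH_WATERMARK:
            free_oldest_chunks(target_fraction=TRIM_TARGET)  # hypothetical deletion call
            interval = 60                                    # re-enable fast sampling
        elif fill_rate < 1e-6:
            interval = min(interval * 2, 900)                # stable: back off sampling
```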

Iirc, the strongest argument against was that storage could be filling up much quicker than the update-status interval.

Exactly this. update-status is not reliable enough.

(as an aside, it's super cool that I can use LaTeX/MathML here. Thanks @sed-i!)

lucabello commented 1 year ago

Some information: https://github.com/grafana/loki/issues/2314 has been closed; that issue led to the proposal in https://github.com/grafana/loki/issues/6876, which is addressed by https://github.com/grafana/loki/pull/7927.

That being said, this issue can be closed :)