canonical / loki-k8s-operator

https://charmhub.io/loki-k8s
Apache License 2.0

Clean up logs based on disk usage #131

Closed Abuelodelanada closed 1 year ago

Abuelodelanada commented 2 years ago

Bug Description

To prevent Loki from crashing due to being out of disk space in the persistent volume, we need to add some way of cleaning out older log entries once we surpass a certain threshold (say 80%) of the PVC max.

There is nothing built into Loki to facilitate this, so it will have to be done either directly in the charm or through a sidecar container. See: https://github.com/grafana/loki/issues/2314

The property would be added to the config.yaml options as maximum_retention_size and would be expressed as a percentage, in string form, with a default value of "80%". In the charm init, when setting up the pebble layer, we would then use lightkube to get the max capacity of the PVC, calculate what that percentage translates to in actual size (i.e. MB, GB, TB), and use that as the threshold for the sidecar container.
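
A minimal sketch of that calculation, assuming we read the PVC capacity with lightkube and convert the percentage string into bytes; the PVC name, namespace, and helper names below are illustrative, not part of the charm:

```python
# Hypothetical sketch: derive a byte threshold for the cleanup sidecar from the
# PVC capacity and the proposed maximum_retention_size config option.
from lightkube import Client
from lightkube.resources.core_v1 import PersistentVolumeClaim

_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_k8s_quantity(quantity: str) -> int:
    """Convert a Kubernetes quantity such as '10Gi' into bytes (binary suffixes only)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain bytes

def retention_threshold_bytes(pvc_name: str, namespace: str, max_retention_size: str = "80%") -> int:
    """Return the byte count at which the sidecar should start deleting old logs."""
    client = Client()
    pvc = client.get(PersistentVolumeClaim, name=pvc_name, namespace=namespace)
    capacity = parse_k8s_quantity(pvc.status.capacity["storage"])
    fraction = float(max_retention_size.rstrip("%")) / 100
    return int(capacity * fraction)
```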

Abuelodelanada commented 2 years ago

Right now I see five alternatives to address this.

1. Use new Log Entry Deletion endpoint.

2. Add a script that deletes old logs (here and here we have an example) executed by cron, in a sidecar container.

3. Add a script that deletes old logs (here and here we have an example) executed by cron, in the charm or Loki container.

4. Add a method (could also be an action) in charm.py that is executed when an update-status event is fired.

5. Push a script/binary to the workload container that monitors disk usage and deletes logs, running as a pebble layer (a rough sketch follows this list).
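
For option 5, a pebble layer entry could look roughly like the sketch below; this is an illustration under assumed names, not the charm's actual code, and the script path, flag, and service name are placeholders:

```python
# Rough illustration of option 5: a pebble layer that runs a hypothetical
# disk-watcher script inside the Loki workload container.
CLEANUP_LAYER = {
    "summary": "log cleanup layer",
    "description": "monitors disk usage and trims old chunks",
    "services": {
        "log-cleanup": {
            "override": "replace",
            "summary": "disk usage watcher",
            "command": "/usr/bin/python3 /opt/log_cleanup.py --threshold 80",
            "startup": "enabled",
        }
    },
}

# In the charm, the script and layer would be pushed/added alongside the Loki
# layer, e.g.:
#   container.push("/opt/log_cleanup.py", script_source, make_dirs=True)
#   container.add_layer("log-cleanup", CLEANUP_LAYER, combine=True)
#   container.replan()
```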

@simskij @mmanciop @rbarry82 Fresh, new comments and ideas are welcome! ;-)

rbarry82 commented 2 years ago

I very much like option 2 (either as a simple Go binary or a Python script, though the Go binary is an easier sidecar), but I worry about what happens if we're messing with the database while Loki is "live", since concurrent access may have undefined/questionable behavior.

What about:

  1. Add a sidecar/service with a Go binary/Python script which uses the Log Entry Deletion endpoint, then sleeps on a timer.
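
A minimal sketch of that sidecar loop, assuming the compactor's log entry deletion endpoint (`/loki/api/v1/delete`) is enabled; the URL, stream selector, and retention period below are illustrative:

```python
# Sketch of a sidecar loop: ask Loki to delete anything older than a retention
# window via the log entry deletion endpoint, then sleep on a timer.
import time
import requests

LOKI_URL = "http://localhost:3100"
RETENTION_SECONDS = 7 * 24 * 3600   # delete anything older than a week (placeholder)
CHECK_INTERVAL = 15 * 60            # wake up every 15 minutes

def delete_old_entries() -> None:
    end = int(time.time()) - RETENTION_SECONDS
    requests.post(
        f"{LOKI_URL}/loki/api/v1/delete",
        params={"query": '{job=~".+"}', "start": 0, "end": end},
        timeout=30,
    ).raise_for_status()

if __name__ == "__main__":
    while True:
        delete_old_entries()
        time.sleep(CHECK_INTERVAL)
```
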
Abuelodelanada commented 2 years ago

Using Log Entry Deletion raises some concerns for me as this functionality is still experimental.

But let's suppose for now that it is stable enough to use: the deletion is based on start and end timestamps, so to implement size-based deletion we would have to find a way to know how many MB a given time range will delete. Reading the API endpoint docs, that does not look easy.

Maybe another approach could be to run a loop: delete logs over a small time range, check disk usage, and repeat until enough space has been freed.

Given that a pebble service doesn't give us a good way to propagate alerts back, what about running this function in update-status?

rbarry82 commented 2 years ago

Other than alerting on logs which it forwards to itself, which is... potentially an option, and doesn't risk long update-status intervals causing problems.

It certainly seems like we could adapt their chunk analyzer to directly map datetimes to sizes, and just run it as a daemon which, whenever it has to trim, sends log messages to Loki (which we can alert on) as a way of propagating the event.
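
As an illustration of the datetime-to-size idea only (not the chunk analyzer itself), a crude approximation would be to bucket the chunk files on disk by modification time and sum their sizes; the chunk directory below is a guess, and inspecting chunks with Loki's own tools would be far more accurate:

```python
# Very rough approximation of a datetime->size mapping: bucket on-disk chunk
# files by modification hour and sum their sizes. Ignores chunk internals.
import os
from collections import defaultdict
from datetime import datetime, timezone

CHUNK_DIR = "/loki/chunks"  # placeholder; depends on the storage config

def size_by_hour(chunk_dir: str = CHUNK_DIR) -> dict:
    buckets: dict = defaultdict(int)
    for root, _dirs, files in os.walk(chunk_dir):
        for name in files:
            path = os.path.join(root, name)
            stat = os.stat(path)
            hour = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).replace(
                minute=0, second=0, microsecond=0
            )
            buckets[hour] += stat.st_size
    return dict(buckets)
```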

sed-i commented 2 years ago

find a way to know how many Mb we are going to delete [...] run a loop

Yes, the deletion can be an iterative process. For example:

  1. $\Delta t = 1\,\mathrm{hr}$, initial guess for deletion period
  2. $[t_{\mathrm{oldest}},\ t_{\mathrm{oldest}} + \Delta t]$, deletion range
  3. $\Delta s$, calculated change in storage
  4. $\Delta t = 1.2 \cdot \frac{s}{\Delta s} \Delta t$, new guess for deletion period
  5. Repeat until enough space has been freed.
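
In code, that iteration might look something like the sketch below, where `delete_range` and `disk_used` are hypothetical helpers (e.g. wrapping the delete endpoint and a disk usage check) and $s$ is taken to be the space still left to free:

```python
# Sketch of the iterative deletion loop above; all helpers are hypothetical.
def free_space(target_bytes: int, oldest_ts: float, delete_range, disk_used) -> None:
    delta_t = 3600.0                     # step 1: 1 hr initial guess for the period
    t = oldest_ts
    freed = 0
    while freed < target_bytes:          # step 5: repeat until enough space is freed
        before = disk_used()
        delete_range(start=t, end=t + delta_t)   # step 2: delete one range
        delta_s = before - disk_used()           # step 3: change in storage
        freed += delta_s
        t += delta_t
        if delta_s > 0:
            # step 4: rescale the period toward what is still needed (s = remaining)
            delta_t = 1.2 * ((target_bytes - freed) / delta_s) * delta_t
```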

what about running this function in the update-status?

Iirc, the strongest argument against was that storage could be filling up much quicker than the update-status interval.

rbarry82 commented 2 years ago

find a way to know how many Mb we are going to delete [...] run a loop

Yes, the deletion can be an iterative process. For example:

  1. $\Delta t = 1\,\mathrm{hr}$, initial guess for deletion period
  2. $[t_{\mathrm{oldest}},\ t_{\mathrm{oldest}} + \Delta t]$, deletion range
  3. $\Delta s$, calculated change in storage
  4. $\Delta t = 1.2 \cdot \frac{s}{\Delta s} \Delta t$, new guess for deletion period
  5. Repeat until enough space has been freed.

$\Delta s$ and $\Delta t$ will be very hard to determine without knowing how much space the logs are actually taking (via chunk-util, probably, since it will be tied to how efficiently logs can be compressed and to their cardinality, which in turn depends on how many labels are in the logs and how unique they are). Unless logs are very homogeneous, we can't take a good guess at this, and the only "reliable" way will be to use Loki's own chunk tools to examine what's really on disk.

With a long-running script/binary which sleeps (rather than any kind of periodic re-execution window), $\vec{\dot{df}}$ can be calculated as long as the chunks are examined (even without deletion) at each iteration interval. Arguably, the sampling rate should be high at process startup, but it can back off.

Something like the following (a rough sketch in Python comes after the list):

  1. Startup: Calculate $\vec{\dot{df}}$ and available space on the PVC. Sleep 60 seconds. We don't know whether this is a new startup (there may not be chunk data at all on a "clean" Loki with no relations) or it may be a pod restart.
  2. Repeat 3-5 times. If $\vec{\dot{df}}$ is stable, check available space only, and flip a boolean (or send a message to a channel in Go) to stop calculating the rate.
  3. If avail_disk_space drops to some threshold (80%? 85%?), re-enable calculation of $\vec{\dot{df}}$. Determine a time interval which will free 5-10% of disk space and call the endpoint. Send a log to Loki which we can alert on.
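
A hedged sketch of that watcher loop, with placeholder thresholds, mount point, and a hypothetical free_oldest_chunks() callback standing in for the chunk-based deletion discussed above:

```python
# Long-running watcher: sample disk usage, back off when the fill rate is
# stable, and trim when the high watermark is crossed.
import shutil
import time

MOUNT = "/loki"            # PVC mount point (placeholder)
HIGH_WATERMARK = 0.80      # start trimming at 80% used
TRIM_TARGET = 0.10         # free roughly 10% of the disk per trim

def used_fraction(path: str = MOUNT) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def watch(free_oldest_chunks) -> None:
    interval = 60          # sample fast at startup, then back off
    previous = used_fraction()
    while True:
        time.sleep(interval)
        current = used_fraction()
        fill_rate = (current - previous) / interval   # crude rate-of-fill estimate
        previous = current
        if current >= HIGH_WATERMARK:
            free_oldest_chunks(target_fraction=TRIM_TARGET)  # hypothetical deletion call
            interval = 60                                    # re-enable fast sampling
        elif fill_rate < 1e-6:
            interval = min(interval * 2, 900)                # stable: back off sampling
```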

Iirc, the strongest argument against was that storage could be filling up much quicker than the update-status interval.

Exactly this. update-status is not reliable enough.

(as an aside, it's super cool that I can use LaTeX/MathML here. Thanks @sed-i!)

lucabello commented 1 year ago

Some information: https://github.com/grafana/loki/issues/2314 has been closed; that issue led to the proposal in https://github.com/grafana/loki/issues/6876, which is addressed by https://github.com/grafana/loki/pull/7927.

That being said, this issue can be closed :)