Right now I see five alternatives to address this, including:

- Using Loki's **Log Entry Deletion** endpoint.
- A function in `charm.py` that is executed when an `update-status` event is fired (by default every 5 min).

@simskij @mmanciop @rbarry82 Fresh, new comments and ideas are welcome! ;-)
I very much like option 2 (either as a simple Go binary or a Python script, though the Go binary is an easier sidecar), but I worry about what happens if we're messing with the database while Loki is "live", since concurrent access may have undefined/questionable behavior.
What about a script which calls the **Log Entry Deletion** endpoint, then sleeps on a timer? With `time.sleep()`, we don't even need to worry about an additional container, since it can just be another service in the pebble layer, packed as part of the charm and pushed on startup.
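A minimal sketch of what that could look like in the charm; the service name, script path, and interval here are placeholders, not an agreed design:

```python
# Hypothetical sketch: run a cleanup script as an extra Pebble service in the
# Loki workload container. Script name, path and interval are placeholders.
from ops.charm import CharmBase


class LokiCharm(CharmBase):
    def _push_and_start_cleanup(self, container):
        # Ship the cleanup script together with the charm and push it on startup.
        with open("src/loki_cleanup.py") as f:
            container.push(
                "/usr/local/bin/loki_cleanup.py", f, permissions=0o755, make_dirs=True
            )

        layer = {
            "summary": "loki log cleanup",
            "services": {
                "log-cleanup": {
                    "override": "replace",
                    "summary": "delete old log entries when disk usage is high",
                    "command": "python3 /usr/local/bin/loki_cleanup.py --interval 300",
                    "startup": "enabled",
                },
            },
        }
        container.add_layer("log-cleanup", layer, combine=True)
        container.replan()
```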
Using **Log Entry Deletion** raises some concerns for me, as this functionality is still experimental.
But let's suppose for now it is stable enough to use: the deletion is based on start and end timestamps, so to implement size-based deletion we would have to find a way to know how many MB we are going to delete for a given time range. Reading the API endpoint docs, this does not seem easy.
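For reference, a sketch of what a single call to the deletion endpoint could look like; the compactor address, the stream selector, and the assumption that deletion is enabled in Loki's config are all placeholders, not something this thread has settled:

```python
# Hypothetical sketch: request deletion of all entries in a time range via
# Loki's (experimental) Log Entry Deletion API. Assumes the API is reachable
# on localhost:3100 and deletion is enabled in Loki's config.
import requests

LOKI_URL = "http://localhost:3100"  # assumption: local Loki/compactor address


def delete_range(start_unix: int, end_unix: int) -> None:
    resp = requests.post(
        f"{LOKI_URL}/loki/api/v1/delete",
        params={
            "query": '{job=~".+"}',  # example selector; matches streams with a "job" label
            "start": str(start_unix),
            "end": str(end_unix),
        },
        headers={"X-Scope-OrgID": "fake"},  # default tenant in single-tenant setups
        timeout=10,
    )
    resp.raise_for_status()
```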
Maybe another approach could be: a pebble service doesn't give us a good way to propagate alerts back, so what about running this function in the `update-status` hook?
Other than by alerting on logs which it forwards to Loki itself, that is. Which is... potentially an option, and doesn't risk long `update-status` intervals causing problems.
It certainly seems like we could adapt their chunk analyzer to directly map datetimes->sizes, and just run it as a daemon which sends log messages to Loki (which we can alert on) if we have to trim, as a way of propagation.
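As a sketch of that propagation path (the labels and the local address are assumptions), the daemon could push a line straight into Loki, and a normal Loki alerting rule could match on it:

```python
# Hypothetical sketch: push a "we had to trim logs" message into the local Loki
# instance so an alerting rule can fire on it. Address and labels are placeholders.
import json
import time

import requests

LOKI_URL = "http://localhost:3100"  # assumption: local Loki API address


def log_trim_event(message: str) -> None:
    payload = {
        "streams": [
            {
                "stream": {"job": "loki-cleanup", "event": "retention_trim"},
                "values": [[str(time.time_ns()), message]],
            }
        ]
    }
    resp = requests.post(
        f"{LOKI_URL}/loki/api/v1/push",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
```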
> what about running this function in the update-status?

Iirc, the strongest argument against was that storage could be filling up much quicker than the update-status interval.

> find a way to know how many MB we are going to delete [...] run a loop

Yes, the deletion can be an iterative process. For example (see the sketch below the list):
- $\Delta t = 1\,\mathrm{hr}$, initial guess for deletion period
- $[t_{\mathrm{oldest}},\ t_{\mathrm{oldest}} + \Delta t]$, deletion range
- $\Delta s$, calculated change in storage
- $\Delta t \leftarrow 1.2 \cdot \frac{s}{\Delta s} \cdot \Delta t$, new guess for deletion period (where $s$ is the amount of space we still need to free)
- Repeat until enough space has been freed.
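A sketch of that loop, assuming we have some way of measuring chunk storage usage; `storage_used()` and `delete_range()` below are hypothetical helpers, and note that Loki processes delete requests asynchronously, so measuring freed space immediately is itself an approximation:

```python
# Hypothetical sketch of the iterative deletion loop described above.
# `storage_used()` (bytes used by Loki's chunks) and `delete_range(start, end)`
# (a call to the Log Entry Deletion endpoint) are placeholders.
import time

HOUR = 3600


def free_space(target_bytes, t_oldest, storage_used, delete_range):
    """Delete progressively adjusted time ranges until `target_bytes` is freed."""
    delta_t = HOUR  # initial guess for the deletion period
    freed_total = 0

    while freed_total < target_bytes:
        before = storage_used()
        delete_range(start=t_oldest, end=t_oldest + delta_t)
        time.sleep(30)  # give the compactor a moment to apply the deletion
        freed = max(before - storage_used(), 0)
        freed_total += freed
        t_oldest += delta_t

        remaining = target_bytes - freed_total
        if remaining <= 0:
            break
        if freed > 0:
            # New guess: scale the period by how much is still left to free,
            # with the 20% safety margin (the 1.2 factor) from the list above.
            delta_t = 1.2 * (remaining / freed) * delta_t
        else:
            delta_t *= 2  # nothing freed; widen the window
```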
$\Delta s$ and $\Delta t$ will be very hard to determine without knowing how much space the logs are actually taking on disk (via chunk-util, probably, since it will be tied to how efficiently logs can be compressed and to the cardinality, which will in turn depend on how many labels are in the logs and their relative uniqueness). Unless logs are very homogeneous, we can't take a good guess at this, and the only "reliable" way will be using Loki's own chunk tools to examine what's really on disk.
With a long-running script/binary which sleeps (rather than any kind of periodic re-execution window), $\vec{\dot{df}}$ can be calculated so long as the chunks are examined (even without deletion) at each iteration interval. Arguably, the sampling rate should be high at process startup, but it can back off.
Something like: when `avail_disk_space` drops to some threshold (80%? 85%?), re-enable calculation of $\vec{\dot{df}}$, determine a time interval which will free 5-10% of disk space, and call the endpoint. Send a log to Loki which we can alert on.
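A rough sketch of that daemon loop; the path, thresholds, and sampling interval are placeholders, `estimate_range_to_free()` stands in for the chunk analysis discussed above, and `delete_range()` / `log_trim_event()` could be something like the earlier sketches:

```python
# Hypothetical sketch of the long-running monitor: sample disk usage, compute
# the rate of change, and trigger a trim plus a log message when usage is high.
# Paths, thresholds and helper functions are placeholders.
import shutil
import time

CHUNKS_PATH = "/loki/chunks"   # assumption: Loki's storage mount point
HIGH_WATERMARK = 0.80          # start trimming at 80% usage
SAMPLE_INTERVAL = 60           # seconds; could back off when usage is low


def used_fraction(path=CHUNKS_PATH):
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def monitor(estimate_range_to_free, delete_range, log_trim_event):
    prev_used, prev_t = used_fraction(), time.monotonic()
    while True:
        time.sleep(SAMPLE_INTERVAL)
        used, now = used_fraction(), time.monotonic()
        df_dot = (used - prev_used) / (now - prev_t)  # fraction per second
        prev_used, prev_t = used, now

        if used >= HIGH_WATERMARK:
            # Pick a time range expected to free ~5-10% of the disk and delete it.
            start, end = estimate_range_to_free(target_fraction=0.05, growth_rate=df_dot)
            delete_range(start=start, end=end)
            log_trim_event(f"trimmed logs in [{start}, {end}]; disk usage was {used:.0%}")
```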
> Iirc, the strongest argument against was that storage could be filling up much quicker than the update-status interval.

Exactly this. `update-status` is not reliable enough.
(as an aside, it's super cool that I can use LaTeX/MathML here. Thanks @sed-i !)
Some information: https://github.com/grafana/loki/issues/2314 has been closed; that issue led to the proposal in https://github.com/grafana/loki/issues/6876, which is addressed by https://github.com/grafana/loki/pull/7927.
That being said, this issue can be closed :)
Bug Description
To prevent Loki from crashing due to being out of disk space in the persistent volume, we need to add some way of cleaning out older log entries once we surpass a certain threshold (say 80%) of the PVC max.
There is nothing built into Loki to facilitate this, so it will have to be done either directly in the charm or through a sidecar container. See: https://github.com/grafana/loki/issues/2314
The property would be added to the `config.yaml` options as `maximum_retention_size` and would be expressed as a percentage, in string form, with the default value being "80%". In the charm init, when setting up the pebble layer, we would then use lightkube to get the max capacity of the PVC, calculate what that percentage corresponds to in actual size units (i.e. megabytes, gigabytes, terabytes), and use that as the threshold for the sidecar container.
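A sketch of that threshold calculation; the PVC name, namespace, and the simple unit parsing are assumptions about how the charm would be wired up:

```python
# Hypothetical sketch: read the PVC capacity with lightkube and turn the
# "maximum_retention_size" percentage into an absolute byte threshold.
# PVC name/namespace and the unit table are placeholders.
from lightkube import Client
from lightkube.resources.core_v1 import PersistentVolumeClaim

_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def _to_bytes(quantity: str) -> int:
    """Parse a Kubernetes quantity such as '10Gi' (binary suffixes only)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain bytes


def retention_threshold_bytes(pvc_name: str, namespace: str, max_retention_size: str) -> int:
    client = Client()
    pvc = client.get(PersistentVolumeClaim, name=pvc_name, namespace=namespace)
    capacity = _to_bytes(pvc.status.capacity["storage"])
    percent = float(max_retention_size.rstrip("%")) / 100
    return int(capacity * percent)
```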