Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Ingest old data into appropriate index file and/or discard #8854

Open sjwk opened 4 years ago

sjwk commented 4 years ago

Expected Behavior

When importing old log data (e.g. after installing the sidecar on an existing server, or adding a new source of logs), there should be an option to route messages into the appropriately aged index for a given index set (or to discard them if they are older than the oldest index). Since data retention compliance is, as documented, handled by the rotation policy of the indices, this would be required to ensure that data is deleted/archived as policy requires. I don't know whether this is even possible, or whether only the active index is writeable.

Current Behavior

Currently all ingested data, regardless of age, goes into the active index. If, say, you ingest 5-month-old log entries from a server, and you have a daily rotation strategy retaining indices for 180 days (for a 6-month retention policy), those entries won't be deleted until they are roughly 11 months old, leading to a potential compliance breach if they contain personal information.

Possible Solution

This could be mitigated by client-side filtering of which logs to send, so that only current logs are ingested and old logs are ignored. But that might not always be desirable, or possible, depending on the software sending the logs. It may also be possible to use a pipeline rule to filter out old data, though again, discarding the data may not be desirable (compliance may require that data is retained for the specified period). An external tool could possibly search for and move data between indices, or simply apply retention policies to delete data, but that would likely be slow and complicated, as it would need to know how to identify the age of a record, which might not have consistent field names.
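
For the pipeline variant, a rough sketch of a rule that drops too-old messages on ingest might look like the following. The 180-day cutoff is an assumption matching the example above; `now()`, `days()`, and `drop_message()` are built-in pipeline functions, date arithmetic requires a recent Graylog version, and the rule would still need to be connected to a pipeline stage on the relevant streams:

```
rule "discard messages older than retention"
when
    // Compare the message timestamp against the retention cutoff;
    // 180 days matches the daily-rotation / 6-month policy above.
    $message.timestamp < now() - days(180)
then
    // Drop the message before it is written to any index.
    drop_message();
end
```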

Context

I'm building a new Graylog-based logging system, but want to identify and mitigate all of the compliance issues before I start putting real data in and creating a risk of a compliance breach.

dennisoelkers commented 4 years ago

Hey @sjwk,

thanks a lot for your input!

Just to understand the context around this request: What is the high-level compliance requirement that you want to implement using this feature?

sjwk commented 4 years ago

Hi @dennisoelkers, mostly GDPR compliance. We have published statements that identify how long we retain different types of data, and since my log sources include things such as door access control systems, printing logs, and network traffic (which we're required to keep in a form that identifies the individual user), there's personal data in there. As such, we would be in breach if we retained that data longer than the stated time. I'm also told by our records manager that it's not interpreted as an 'up to': if we say we're retaining data for 6 months, that's supposed to be 6 months, not 3 (and certainly not 10).

While the ideal would be to store messages in the appropriately aged indices, I imagine the easiest solution is to simply discard old data. In most cases it's only a one-off when adding a new server or source, so there's still the ability to go back to the source to look at the old log files. But not all client software has the ability to filter.

dennisoelkers commented 4 years ago

Hey @sjwk,

thanks for the clarification. We will most probably rework the way we are doing rotation/retention and keep that use case in mind, as you are most certainly not the only one with it. This will take a while though, so the question is what you could do in the meantime. If you are not ingesting chronologically in a continuous fashion, you would need some manual steps to prune your old data. E.g. writing a cron job that deletes outdated documents from your ES indices using ES's delete-by-query API should not be too difficult. Is that something that could work for you while we have not implemented this?
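
A minimal sketch of such a cleanup job, assuming Elasticsearch on localhost:9200, Graylog's default `graylog_` index prefix, and the 180-day retention from the example above (all three are assumptions to adapt; `_delete_by_query` with a `range` query on the `timestamp` field is standard Elasticsearch API):

```python
#!/usr/bin/env python3
"""Cron job sketch: delete documents older than the retention period."""
import requests

ES_URL = "http://localhost:9200"   # assumption: local Elasticsearch
INDEX_PATTERN = "graylog_*"        # assumption: default Graylog index prefix
RETENTION = "now-180d"             # assumption: 6-month retention policy

# Delete-by-query removes every document matching the query; here,
# anything whose timestamp predates the retention cutoff.
resp = requests.post(
    f"{ES_URL}/{INDEX_PATTERN}/_delete_by_query",
    json={"query": {"range": {"timestamp": {"lt": RETENTION}}}},
    timeout=300,
)
resp.raise_for_status()
print(f"Deleted {resp.json().get('deleted', 0)} outdated documents")
```

Run daily from cron, this would keep the live indices within policy until rotation-aware handling exists. Note that delete-by-query only marks documents as deleted; disk space is reclaimed when segments merge.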

sjwk commented 4 years ago

That sounds reasonable, thanks for considering it as an improvement for the future!