cityindex-attic / logsearch-purge-bot

Purges data from LogSearch
Apache License 2.0

delete_after = <some date in the future in UTC> #5

Closed: mrdavidlaing closed this issue 10 years ago

mrdavidlaing commented 10 years ago

Currently the purgebot deletes everything older than X days.

Whilst useful for general data (especially DEBUG data), there is some low volume data where keeping a longer history is very useful.

Tagging each record with a delete_after = <some date in the future in UTC> field would be a nicely decoupled way to implement this, since it puts the decision about when to delete in the shippers, which have the most knowledge of the data being shipped.
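
As a rough sketch of the shipper side, assuming direct HTTP indexing into Elasticsearch (the endpoint, index names, types, and keep_days values here are purely illustrative, not the actual shipper code):

    import datetime
    import json
    import requests  # assumed HTTP client; any other would do

    ES_URL = "http://localhost:9200"  # illustrative endpoint

    def ship(index, doc_type, record, keep_days):
        # stamp the record with the UTC timestamp after which it may be purged
        expiry = datetime.datetime.utcnow() + datetime.timedelta(days=keep_days)
        record["delete_after"] = expiry.strftime("%Y-%m-%dT%H:%M:%SZ")
        requests.post("%s/%s/%s" % (ES_URL, index, doc_type),
                      data=json.dumps(record))

    # low-volume metrics keep a long history; DEBUG noise does not
    ship("logstash-2013.10.03", "ci_appmetrics", {"metric": "tick"}, keep_days=365)
    ship("logstash-2013.10.03", "ci_log", {"level": "DEBUG"}, keep_days=7)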

mrdavidlaing commented 10 years ago

Related - https://www.flowdock.com/app/cityindexlabs/elasticsearch-poc/inbox/655505

mrdavidlaing Oct 3, 2013 11:22
@bitpusher, @dpb587 - The document TTL setting looks like an interesting alternative to our logsearch-purge-bot. Could you evaluate further please: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html

dpb587 Oct 3, 2013 14:06
We can provide an implementation with it, however I'd strongly advise against using it. For high volumes like our logstash usage, it's extremely inefficient to require elasticsearch to regularly search through all the data to find all the expirations before $now, even if we changed the default cleanup timer from 60s to 1d. In addition to the extra field on every message, it's extra disk space to store and index that field.

The document deletion process is two-phased (whether via TTL or regular deletes): documents are marked deleted, but won't necessarily be removed from disk until a sweep/optimize process runs. Also, the TTL expiration process doesn't get rid of the indices themselves; they stick around taking up runtime allocation memory in each of the nodes, since they're still delegated somewhere despite being empty. A purge bot, which deletes entire indices, is significantly more efficient and also takes care of removing the allocation and routing data from the cluster.

Another point, which may not be currently useful, is that a purge bot could use queries to purge data more selectively (e.g. delete all debug messages >7d, delete all info messages >14d, delete all debug messages >28d, delete indices >60d).

As an overall analogy, compare the TTL deletion process to a traditional database table where you would run DELETE FROM x WHERE expiration < TIME() against a table with 500m rows, whereas ALTER TABLE x DROP PARTITION '2013-10-03' is significantly cheaper and more efficient.
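
To make that analogy concrete in Elasticsearch terms, a rough sketch (the endpoint, index name, and query are illustrative; the delete-by-query URL shown is the pre-2.0 DELETE /{index}/_query API that was current at the time):

    import requests  # assumed HTTP client

    ES_URL = "http://localhost:9200"  # illustrative endpoint

    # roughly DELETE FROM x WHERE expiration < TIME(): a delete-by-query
    # (pre-2.0 endpoint) that has to visit every matching document
    requests.delete("%s/logstash-2013.10.03/_query" % ES_URL,
                    params={"q": "delete_after:[* TO 2013-10-03]"})

    # roughly ALTER TABLE x DROP PARTITION: drop the whole dated index,
    # which also frees its allocation and routing data in one shot
    requests.delete("%s/logstash-2013.10.03" % ES_URL)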

dpb587 commented 10 years ago

I'm still opposed to the ttl/delete_after approach for the reasons mentioned. Additionally, such document-based fields shouldn't be changed post-import, which makes it inconvenient to change rules on pre-existing data. So here's an additional (more complex) idea.

We could have a rule configuration file which describes a few different purge configurations to operate with. For example,

{
    "general" : {
        "method" : "delete_type",
        "keep" : "30d",
        "options" : {
            "exclude_type" : [
                "ci_appmetrics"
            ]
        }
    },
    "appmetrics_long" : {
        "method" : "delete_type",
        "keep" : "1y",
        "options" : {
            "type" : [
                "ci_appmetrics"
            ]
        }
    }
}

So, as in the example, there'd be a delete_type purge method which essentially enumerates the indices and their types to consider whether they need deletion (based on keep and the parsed index date).

A cleanup step at the end of the bot could enumerate which indices no longer have any types in them and officially delete those indices.
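
A minimal sketch of that flow, assuming the 0.90-era HTTP APIs in use at the time (GET /_aliases to enumerate indices, GET /{index}/_mapping for the types, and the since-removed delete-mapping call DELETE /{index}/{type}); the rule parsing is deliberately simplified and the endpoint is illustrative:

    import datetime
    import requests  # assumed HTTP client

    ES_URL = "http://localhost:9200"   # illustrative endpoint
    KEEP_DAYS = {"ci_appmetrics": 365} # appmetrics_long: "1y"
    DEFAULT_KEEP_DAYS = 30             # general: "30d"

    def index_date(name):
        # daily logstash index names look like logstash-2013.10.03
        return datetime.datetime.strptime(name, "logstash-%Y.%m.%d").date()

    def purge(today):
        for index in requests.get("%s/_aliases" % ES_URL).json():
            try:
                age_days = (today - index_date(index)).days
            except ValueError:
                continue  # not a daily logstash index; leave it alone
            mappings = requests.get("%s/%s/_mapping" % (ES_URL, index)).json()
            for doc_type in list(mappings.get(index, {})):
                if age_days > KEEP_DAYS.get(doc_type, DEFAULT_KEEP_DAYS):
                    # pre-2.0 delete-mapping API: removes the type and its docs
                    requests.delete("%s/%s/%s" % (ES_URL, index, doc_type))
            # cleanup step: officially delete indices left with no documents
            count = requests.get("%s/%s/_count" % (ES_URL, index)).json()["count"]
            if count == 0:
                requests.delete("%s/%s" % (ES_URL, index))

    purge(datetime.date.today())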

At a later point, we could add support for a query method which supports defining an elasticsearch query describing which records to delete. It could be used to selectively remove records (e.g. @fields.type:DEBUG) at particular ranges.
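
For illustration, a query rule might translate into something like the sketch below, again assuming the pre-2.0 delete-by-query endpoint and daily logstash-YYYY.MM.DD index names; the helper name and range values are hypothetical:

    import datetime
    import requests  # assumed HTTP client

    ES_URL = "http://localhost:9200"  # illustrative endpoint

    def purge_by_query(query, min_age_days, max_age_days):
        # hypothetical helper: run a delete-by-query against each daily
        # index in the given age range, one index at a time
        for offset in range(min_age_days, max_age_days):
            day = datetime.date.today() - datetime.timedelta(days=offset)
            index = day.strftime("logstash-%Y.%m.%d")
            # pre-2.0 delete-by-query endpoint; removed in Elasticsearch 2.0
            requests.delete("%s/%s/_query" % (ES_URL, index),
                            params={"q": query})

    purge_by_query("@fields.type:DEBUG", 7, 28)  # strip DEBUG noise at 7-28d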

Using a configuration file like this not only makes things a bit more flexible; it also makes it easier to grok what data is going to be deleted and when, and easier for cluster participants to send PRs proposing alternative timeframes for the data they're responsible for.

mrdavidlaing commented 10 years ago

> We could add support for a query method which supports defining an elasticsearch query describing which records to delete. It could be used to selectively remove records (e.g. @fields.type:DEBUG) at particular ranges.

@bitpusher - Does the existing bot not already delete via a query across all indexes? Or am I misunderstanding how the bot works?

dpb587 commented 10 years ago

Oh, you're right, the bot isn't functioning as I was thinking and is currently using query-based deletions instead of deleting types and indices.

bitpusher commented 10 years ago

@mrdavidlaing & @dpb587 - i had initially tried a mass query against '_all' and it crashed the machine. after talking with Danny we concluded that memory and performance issues made this a bad call, so i went to index enumeration + delete by query + delete empty index. the last two steps are in anticipation of discretionary purging, e.g. keeping critical errors or infrastructure meta longer than mundane IIS logs (purely contrived example).

i then noticed by accident that '_all' seemed to be working and gave it a shot, but got inconsistent results, so back to the enumeration, which, by the way, would have to be done anyway for cleanup.

also, i concur with danny's 'plan' approach

mrdavidlaing commented 10 years ago

Architecturally, I'm not in favour of "hardcoding" too much of the delete logic in the purge-bot; and whilst I appreciate your performance concerns, I'm not convinced they will actually become a problem, since we're already deleting by query without obvious performance problems. At 0.10c/GB/month, storage concerns are negligible.

Like the IoC and polymorphism principles in OO programming, I believe that keeping the purge-bot a simple executor of delete queries, while storing the specific parameter values for deletion alongside the data, will make for a more flexible system long term.

bitpusher commented 10 years ago

if @mrdavidlaing's last comment is directed at my last:

the performance issue i spoke of was memory related and brought the VM down, so i was forced to try a different tack.

later, i accidentally found that it no longer brought the machine down, so i revisited it. but that approach produced inconsistent results. unfortunately i neglected to document the perceived anomalies and returned to the enumeration strategy, which behaves as expected and, i assume, places a much lighter load on the server by breaking the purge into bite-sized snacks.

mrdavidlaing commented 10 years ago

@bitpusher - if I'm not mistaken, you are still doing a "delete by query", but just doing it at the "index" level, rather than "across all indexes" level.

If so, then this seems to be the ideal balance; we get the "feature" of "delete by query" without the load of an "all" data query.

Since we shard by index, it will also scale well when we start spreading indexes across the ES cluster.

bitpusher commented 10 years ago

@mrdavidlaing, 'ideal'. precisely how i characterize my design decisions on a daily basis. lol. so glad you finally came around to listening to the fact that i am a genius. ;)

thus my confusion at your comment. it seemed to be giving me mixed signals. lol. chalk it up to lack of sleep.

sopel commented 10 years ago

Closed as Won't Fix due to the project being retired to the CityIndex Attic.