chriso / gauged

A storage layer for numeric data that changes over time
MIT License

Cleanup by key #4

Closed niphlod closed 9 years ago

niphlod commented 10 years ago

Hi, once the data is inserted, is there a way to delete all the data (or part of it) belonging to only one key? I'm looking for a function that clears data older than x days for maintenance purposes, e.g. delete all 'requests' data older than 2014-01-01. There seems to be a clear_from(), but it wipes from a timestamp "onwards", and for ALL keys ...

chriso commented 10 years ago

Not at the moment.

The clear_from method is part of the writer and should be used if you're ingesting data from logs and ever need to go back in time to correct a mistake or combine logs (e.g. if you move servers). You could correct the logs, call clear_from(some_timestamp) and then start the importer again (which would start from the timestamp provided by writer.resume_from()). I'll document that side of the library at some stage.
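
The correct-and-reimport workflow described above can be sketched with a toy in-memory stand-in. This is not gauged's Writer: ToyWriter, import_log and the row layout are hypothetical, and the assumption that resume_from() returns the position just after the last stored timestamp is made only for illustration.

```python
class ToyWriter:
    """In-memory stand-in for illustration only; NOT gauged's Writer."""

    def __init__(self):
        self.rows = []  # (timestamp, key, value) tuples

    def add(self, key, value, timestamp):
        self.rows.append((timestamp, key, value))

    def clear_from(self, timestamp):
        # Drops data at or after the timestamp, for ALL keys.
        self.rows = [row for row in self.rows if row[0] < timestamp]

    def resume_from(self):
        # Assumption: resume just after the last stored timestamp.
        return max((row[0] for row in self.rows), default=-1) + 1


def import_log(writer, log):
    """Hypothetical importer: skips entries that were already written."""
    start = writer.resume_from()
    for timestamp, key, value in log:
        if timestamp >= start:
            writer.add(key, value, timestamp)


writer = ToyWriter()
bad_log = [(1, "requests", 5), (2, "requests", 7), (3, "requests", 999)]
import_log(writer, bad_log)

# The entry at timestamp 3 was wrong: fix the log, clear, then re-import.
fixed_log = [(1, "requests", 5), (2, "requests", 7), (3, "requests", 9)]
writer.clear_from(3)
import_log(writer, fixed_log)
rows = sorted(writer.rows)
```

After the second import, only the corrected entry at timestamp 3 is written again; the earlier entries are skipped because they fall before the resume position.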

niphlod commented 10 years ago

Uhm, ok. I'll try to work something up as soon as I get a little bit of time. Without being able to "focus" deletions on a key, I'd need to work on a "temporary" gauged store, then fetch all the data from there and insert it into a new "stable" store. I tried to work around it with namespaces (e.g. 0 = "temporary", 1 = "stable", 2 = "work in progress"), but I'd still have to go through all keys in that namespace:

seems a tad bit too much ;-D

chriso commented 10 years ago

It'd be easier to just export a new method to delete a key (and an optional timestamp, where it deletes <= timestamp), which then passes it down to the driver, e.g. DELETE FROM gauged_data WHERE key = %s AND offset <= %s. The only other issue here is that the gauged_cache table doesn't have a key column (it ends up being part of the hash column, along with the aggregate, etc.), so you can't delete cached counts for that key only. Adding a key column and then deleting from that table as well would solve the issue.

Something like

def clear(key, before_timestamp=None):
   # ...
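
To make the idea concrete, here is a minimal, self-contained sketch that uses sqlite3 rather than any of gauged's real drivers. The simplified table layouts, the conn parameter, and the "key" column on gauged_cache are assumptions for illustration (the cache column is the proposed addition, not the current schema), and timestamps are stored directly in offset, which may not match how gauged stores offsets.

```python
import sqlite3


def clear(conn, key, before_timestamp=None):
    """Delete one key's rows, optionally only those at or before a timestamp."""
    sql = 'DELETE FROM gauged_data WHERE "key" = ?'
    params = [key]
    if before_timestamp is not None:
        sql += ' AND "offset" <= ?'
        params.append(before_timestamp)
    deleted = conn.execute(sql, params).rowcount
    # Hypothetical "key" column on the cache table: drop every cached
    # aggregate for this key, since some may cover the deleted range.
    conn.execute('DELETE FROM gauged_cache WHERE "key" = ?', (key,))
    conn.commit()
    return deleted


# Simplified, assumed schema (not gauged's real one).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE gauged_data ("key" TEXT, "offset" INTEGER, value REAL)')
conn.execute('CREATE TABLE gauged_cache ("hash" TEXT, "key" TEXT, value REAL)')
conn.executemany(
    "INSERT INTO gauged_data VALUES (?, ?, ?)",
    [("requests", 10, 1.0), ("requests", 20, 2.0), ("errors", 10, 3.0)],
)
conn.executemany(
    "INSERT INTO gauged_cache VALUES (?, ?, ?)",
    [("h1", "requests", 3.0), ("h2", "errors", 3.0)],
)

deleted = clear(conn, "requests", before_timestamp=10)
remaining = conn.execute(
    'SELECT "key", "offset" FROM gauged_data ORDER BY "key", "offset"'
).fetchall()
cached = conn.execute('SELECT "key" FROM gauged_cache').fetchall()
```

Dropping all cached rows for the key is the conservative choice here: it avoids having to work out which cached aggregates overlap the deleted range, at the cost of recomputing them later.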

niphlod commented 10 years ago

that's exactly what I needed, but ALAS, I'm a newcomer to gauged ^_^

niphlod commented 9 years ago

ok. I guess better late than never....... I got a working implementation of

def clear(self, key, namespace=None, before_timestamp=None):

I still have to face the gauged_cache issue, but I'm facing a different one: how to properly update the statistics... I can't see a smart way to recompute statistics once a key is, e.g., dropped. @chriso: any ideas?

chriso commented 9 years ago

@niphlod the statistics are per namespace rather than per key so it's not possible to patch the statistics after removing a single key.

niphlod commented 9 years ago

is that going to be a problem, aside from the fact that statistics wouldn't be up to date?

chriso commented 9 years ago

I don't think it's a major issue, as long as it's documented.

niphlod commented 9 years ago

ok. can you take a look at this before I start iterating on other adapters?

https://github.com/niphlod/gauged/tree/fix/4

if you're good with the new methods "api", and there's nothing wrong with the code, I'll add tests and docs too.

chriso commented 9 years ago

That's looking great, thanks!

niphlod commented 9 years ago

you're the one who had some really great ideas. I'm just trying to add small bits to make it more manageable/flexible.

BTW: the change in schema would probably clash with the current one: I'm aware of that and we'll discuss it further (I saw that the layout is there already, but needs additional care)

I'll post a PR soon.

chriso commented 9 years ago

Fixed by #5.