Hmm... I could have sworn that we had an open issue for this, but I can't seem to find it. Others have requested it, but it hasn't formally made it onto a roadmap. Thanks for submitting! We'll leave this issue open to track the request.
We'd (Tierion) make use of this feature. We write lots of ephemeral data into CRDB which we later have to delete, and that deletion has been a source of performance issues.
@grempe thanks for sharing.
@dianasaur323 just to let you know, we're developing a new product and at the same time evaluating the possibility of using CockroachDB as our primary datastore for nearly everything:
-- acquire a lock
INSERT INTO dlocks VALUES (token, expire);
-- extend a lock we already hold
UPDATE dlocks SET expire = new_expire WHERE token = x;
-- purge expired locks
DELETE FROM dlocks WHERE expire < now();
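For reference, the backing table would be something along these lines (just a sketch; the column types are assumptions on my part):

CREATE TABLE dlocks (
    token  STRING PRIMARY KEY,    -- opaque lock identifier (assumed type)
    expire TIMESTAMPTZ NOT NULL   -- moment at which the lock stops being valid (assumed type)
);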
This in conjunction with JSON, row-level partitioning & the upcoming CDC support (#2656) would make CockroachDB a really awesome solution for us!
@grempe so sorry for not responding earlier - this must have just gotten lost in my github firehose. Thank you for using us! A nearer term option for TTL is to actually use our partitioning feature to support a fast drop of a partition. Does that sound like something that would work for you?
@glerchundi hello! I believe my co-worker has been talking to you about a couple other things as well. It's great to hear that you find a bunch of features useful to you. The same question applies to you - would a bulk delete of a partition work for you in terms of TTL? Also, we have a really simple prototype for CDC that will be merged into the master branch soon. Would you like to beta test that to see if it's inline with what you will need?
@glerchundi hello! I believe my co-worker has been talking to you about a couple other things as well. It's great to hear that you find a bunch of features useful to you.
Yep!
The same question applies to you - would a bulk delete of a partition work for you in terms of TTL?
In case I didn't understand correctly, let me rephrase in my own words. The idea would be to partition at the row level based on a column and a condition, something like: create a partition expired-keys on the dlocks table where the expiration column value is less than now()? And then periodically bulk delete every row in the expired-keys partition?
Also, we have a really simple prototype for CDC that will be merged into the master branch soon. Would you like to beta test that to see if it's inline with what you will need?
Sounds interesting, but we'll be very busy this month. If "soon" means more than a month, we'd definitely be interested; if it's less, we'll keep an eye on master's progress :)
Thanks for the update @dianasaur323, btw cool nickname :)
In case I didn't understand correctly, let me rephrase in my own words. The idea would be to partition at the row level based on a column and a condition, something like: create a partition expired-keys on the dlocks table where the expiration column value is less than now()? And then periodically bulk delete every row in the expired-keys partition?
So the actual implementation of this is still up in the air, but yes, kind of. The idea would be that you could set an expiration column value for every month or something, so CockroachDB would co-locate entries that occurred in a given month, and then you would be able to bulk delete that partition instead of iterating through every row in order to delete the data. That being said, this might cause a certain range to be really hot, so more thinking is needed here. I guess the question here is how hands off do you need TTL to be, and are you okay with an asynchronous delete job? We haven't begun development on this, although are exploring how we can leverage existing features to meet these use cases.
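To make that concrete, a rough sketch of the shape I have in mind (illustrative syntax and names, not a committed design):

CREATE TABLE dlocks (
    expire TIMESTAMPTZ NOT NULL,
    token  STRING NOT NULL,
    PRIMARY KEY (expire, token)
) PARTITION BY RANGE (expire) (
    PARTITION p2018_01 VALUES FROM ('2018-01-01') TO ('2018-02-01'),
    PARTITION p2018_02 VALUES FROM ('2018-02-01') TO ('2018-03-01')
);

-- once a whole month has expired, remove it with one bulk statement
-- (today this still deletes row by row; the hoped-for feature is a cheap drop of the partition)
DELETE FROM dlocks WHERE expire >= '2018-01-01' AND expire < '2018-02-01';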
Sounds interesting, but we'll be very busy this month. If "soon" means more than a month, we'd definitely be interested; if it's less, we'll keep an eye on master's progress :)
No worries - we are likely going to have something early to share within the next two weeks. Since you don't have time in this month, we can always ping you when we have a second iteration to see if you have time then!
I'm a huge fan of dinosaurs :)
hey guys, what is the status on this?
Zendesk ticket #3577 has been linked to this issue.
In pixeldrain.com I'm using the exact locking system which @glerchundi described for counting unique views on uploaded files. It works pretty well, but the removal query takes a few seconds to complete. I'm afraid this won't scale well if there are millions of rows to be deleted. A row TTL feature would be very useful for me.
Deletes don't scale well on massive tables, so if Cockroach implements this they are going to have to do something other than delete in my opinion. Cleaning up data on huge tables at the moment is very difficult and basically requires you to stop all traffic to the db first.
One way to handle this might be to support automatic deletion when the primary key is a timestamp, or is a multi-column primary key containing a timestamp. In that situation it might be possible to free whole ranges when they fall outside of the TTL window. The engine would also need to ignore "dead" rows within ranges that still contain "live" rows, which would create some probably negligible overhead from skipping the group of dead rows.
Disclaimer: I don't know enough about CRDB's K/V system to know if this actually could work or not. It would be awesome to hear some feedback on whether something like this approach would be feasible.
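For illustration, the kind of schema layout I have in mind (purely a sketch, names invented):

CREATE TABLE events (
    created_at TIMESTAMPTZ NOT NULL,
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    payload JSONB,
    PRIMARY KEY (created_at, id)
);
-- rows are ordered by time, so old data lives in contiguous ranges that could,
-- in principle, be released wholesale once they fall outside the TTL window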
One challenge I imagine here is foreign keys and secondary indexes. Ideally we’d push the TTL into the KV, perhaps via something like zone configs. Ideally we’d utilize MVCC timestamps to determine whether a value is dead due to TTL but that doesn’t really work because of the need to maintain referential integrity.
I don't see an obvious way to implement this TTL efficiently for rows which are indexed in secondary indexes or are part of FK relations. I imagine that's why you see this functionality in NoSQL databases like Cassandra and not in relational databases.
Perhaps one could do it by maintaining a separate index over timestamps for rows but that seems like a pretty big hammer.
I think STORING indexes (or something like them) could provide an elegant solution here. In brief, the idea is that if a table has a TTL, all of its indexes must STORE the crdb_internal_mvcc_timestamp column (probably not literally with the current STORING index code, due to the magic of that pseudo-column, but something equivalent). This means that every write to the table needs to update every index (not just the ones that store the changed columns). Then, because we know that the primary and all secondary indexes will have matching timestamps, we can drop all values retrieved from the KV layer that are older than the TTL just by looking at the query timestamp and table descriptor. That frees us to make the GC asynchronous and independent - we can delete the primary KVs separately from the secondary indexes, but the SQL layer code will never see one without the other.
Regarding foreign keys, I agree this seems tricky and I think we'll need to disallow certain combinations. For example, you can't have an FK where the referred table has a shorter TTL than the referring table. But I think this solution would let you have e.g. a set of tables with the same TTL and FK relationships between them.
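Spelled out with today's syntax, the STORING idea above would look something like this, with an explicit column standing in for the pseudo-column (a sketch of the shape, not the actual mechanism):

CREATE TABLE sessions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL,
    last_write TIMESTAMPTZ NOT NULL,   -- stand-in for crdb_internal_mvcc_timestamp
    payload JSONB,
    INDEX sessions_user_idx (user_id) STORING (last_write)
);

-- any read, against any index, can then filter out expired rows without consulting the primary index:
SELECT id FROM sessions@sessions_user_idx
WHERE user_id = $1 AND last_write > now() - INTERVAL '30 days';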
Just bumping this up - is there any roadmap to include this in a future release?
Dear users who are hoping for this feature: while we're working this out, there's a neat workaround, which is documented here.
As an alternative to @bdarnell's suggestion, we could avoid storing crdb_internal_mvcc_timestamp on every index, but continue to transparently include a WHERE crdb_internal_mvcc_timestamp > now() - TTL on all queries against the table via the optimizer. This causes an index join against the primary key, so you don't have to worry about not having the most up-to-date timestamp. The downside is that it prevents users from using covering indexes.
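Concretely, with a hypothetical sessions table and a 999-second TTL (matching the sample TTL in the snippet below), a query like SELECT * FROM sessions WHERE user_id = $1 would behave as if the optimizer had rewritten it to:

SELECT * FROM sessions
WHERE user_id = $1
  AND crdb_internal_mvcc_timestamp / 1000000000 > now()::INT8 - 999;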
As far as the cleanup process, check out this neat little SQL snippet:
CREATE TABLE b (a PRIMARY KEY, b) AS SELECT g, g FROM generate_series(1, 100000);

WITH y AS (SELECT a FROM b WHERE a > $1 LIMIT 1000),
     d AS (
       DELETE FROM b
       WHERE a IN (SELECT * FROM y)
         AND crdb_internal_mvcc_timestamp / 1000000000 < now()::INT8 - 999
       LIMIT 1000
       RETURNING 1
     )
SELECT last_value(a) OVER (ORDER BY a RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM y
LIMIT 1;
This scans at most 1000 rows, deletes up to 1000 of them if they're older than a sample 999-second TTL, and returns the PK of the last row scanned. The 1000 constants can be adjusted, and this would also work with composite primary keys with some tweaking.
You could imagine having a job that runs this, persists the result, and replaces the $1 placeholder with the result of the query for the next run.
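One way that bookkeeping could look (a sketch; the ttl_cursor table and its columns are made up for illustration):

CREATE TABLE ttl_cursor (
    table_name   STRING PRIMARY KEY,
    last_seen_pk INT NOT NULL DEFAULT 0
);

-- each run of the job, roughly:
--   1. SELECT last_seen_pk FROM ttl_cursor WHERE table_name = 'b';
--   2. run the batched DELETE above with $1 bound to that value;
--   3. UPSERT INTO ttl_cursor VALUES ('b', <PK returned by the query>),
--      wrapping back to 0 once the scan runs off the end of the table.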
There are at least three important requirements to work out here:
- Are TTLs related to row insert or row update? It's much easier and more natural to do the latter for us.
I'm not sure this is true -- however, implementing the former is easy enough without anything from us.
Other important questions:
1) Is it okay if the cleanup has roughly the overhead of writing the rows in the first place?
2) Do you need or want CDC events when rows are removed via TTL?
These two questions point in different directions for implementation. We've heard both. If you cannot stomach cleanup that costs as much as insertion, you almost definitely can't stomach a CDC event on cleanup. How important is the efficiency of the TTL here?
Regarding foreign keys, I agree this seems tricky and I think we'll need to disallow certain combinations. For example, you can't have an FK where the referred table has a shorter TTL than the referring table. But I think this solution would let you have e.g. a set of tables with the same TTL and FK relationships between them.
@bdarnell how did you envision dealing with later writes to the referencing table? @RaduBerinde raised the point that the referenced row may expire before the referencing row. Do we need a touch-write on the referenced row? That doesn't feel very intuitive.
I think I was getting my TTL semantics mixed up and thinking of our MVCC GC TTL (which keeps all versions newer than the TTL plus one older version) instead of what's being proposed here (which doesn't try to keep the version that was active as of the TTL). (is there another term we could use for one of these instead of saying "TTL" both times?)
I don't see a better solution offhand than touching the referenced row, which seems like a terrible potential bottleneck. I'd rather not get into that situation, so at least for v1 of this feature I'd disallow FKs that refer to tables with a TTL.
I think most of your customers (like me) will be using this feature on tables for things like log entries, where historical data just drops off. We have 700 GB tables where I wish old records would simply fall off instead of requiring complicated flows to delete them.
In our case these tables never have relationships, they're more like a heap than anything.
Hi again!👋
Just wanted to share that the TiDB crew is presenting this feature at KubeCon. Here are the slides, in case there is something valuable in them: https://static.sched.com/hosted_files/kccnceu2021/cb/TTL%20in%20TiKV.pdf
I suppose that removing the dependency on RocksDB has its benefits, but also the drawback of not getting the features it ships with? I don't know whether Pebble supports TTL-based auto-purges, though.
RocksDB's TTL support doesn't provide a foundation for implementing TTLs in CRDB. Doing compaction time reclamation of keys is at odds with other parts of the CRDB architecture, though we're working towards making this possible (e.g. for compaction time garbage collection of old keys). Also, the linked presentation doesn't talk about the difficulties of maintaining relational consistency between SQL tables which have TTLs. I wonder how TiDB is handling that.
Would still love to see this!
Friendly reference: Google Cloud Spanner today released TTL-based row deletion: https://cloud.google.com/blog/products/spanner/reduce-costs-and-simplify-compliance-with-managed-ttl-in-spanner
This is now available in preview in v22.1!
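For anyone landing here, usage in the preview looks roughly like the following; see the v22.1 row-level TTL docs for the authoritative syntax and options:

CREATE TABLE events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    payload JSONB
) WITH (ttl_expire_after = '30 days');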
Thanks! This is huge!
It would be interesting to have row-level TTLs so CockroachDB could be used as an auto-purgeable event store.
WDYT?
Jira issue: CRDB-5938