cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage+sql: Feature request: Row-level TTLs #20239

Closed glerchundi closed 2 years ago

glerchundi commented 6 years ago

It would be interesting to have row-level TTLs so that CockroachDB could be used as an auto-purging event store.

WDYT?

Jira issue: CRDB-5938

dianasaur323 commented 6 years ago

Hmm... I could have sworn that we had an open issue for this, but I can't seem to find it. Others have requested it, but it hasn't formally made it onto a roadmap. Thanks for submitting! We'll leave this issue open to track the request.

grempe commented 6 years ago

We'd (Tierion) make use of this feature. We write lots of ephemeral data into CRDB that we later have to delete, which has been a source of performance issues.

glerchundi commented 6 years ago

@grempe thanks for sharing.

@dianasaur323 just to let you know, we're developing a new product and at the same time evaluating the possibility of using CockroachDB as our primary datastore for nearly everything.

This, in conjunction with JSON support, row-level partitioning, and the upcoming CDC support (#2656), would make CockroachDB a really awesome solution for us!

dianasaur323 commented 6 years ago

@grempe so sorry for not responding earlier - this must have just gotten lost in my GitHub firehose. Thank you for using us! A nearer-term option for TTL is to use our partitioning feature to support a fast drop of a partition. Does that sound like something that would work for you?

@glerchundi hello! I believe my co-worker has been talking to you about a couple other things as well. It's great to hear that you find a bunch of features useful to you. The same question applies to you - would a bulk delete of a partition work for you in terms of TTL? Also, we have a really simple prototype for CDC that will be merged into the master branch soon. Would you like to beta test that to see if it's in line with what you will need?

glerchundi commented 6 years ago

> @glerchundi hello! I believe my co-worker has been talking to you about a couple other things as well. It's great to hear that you find a bunch of features useful to you.

Yep!

> The same question applies to you - would a bulk delete of a partition work for you in terms of TTL?

In case I didn't understand correctly, let me rephrase in my own words. The idea would be to partition at the row level based on a column and a condition like: create a partition expired-keys on the dlocks table where the expiration column value is less than now()? And then periodically bulk delete every row in the expired-keys partition?

> Also, we have a really simple prototype for CDC that will be merged into the master branch soon. Would you like to beta test that to see if it's in line with what you will need?

Sounds interesting, but we'll be very busy this month. If "soon" means more than a month out, we'll definitely be interested; if it's less, we'll keep an eye on master's progress :)

Thanks for the update @dianasaur323, btw cool nickname :)

dianasaur323 commented 6 years ago

> In case I didn't understand correctly, let me rephrase in my own words. The idea would be to partition at the row level based on a column and a condition like: create a partition expired-keys on the dlocks table where the expiration column value is less than now()? And then periodically bulk delete every row in the expired-keys partition?

So the actual implementation of this is still up in the air, but yes, kind of. The idea would be that you could set an expiration column value for every month or something, so CockroachDB would co-locate entries that occurred in a given month, and then you would be able to bulk delete that partition instead of iterating through every row in order to delete the data. That being said, this might cause a certain range to be really hot, so more thinking is needed here. I guess the question is how hands-off you need TTL to be, and whether you are okay with an asynchronous delete job. We haven't begun development on this, although we are exploring how we can leverage existing features to meet these use cases.
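
For illustration only, a minimal sketch of what the partition-per-month idea might look like (the table and column names are hypothetical, and CockroachDB table partitioning requires the partition columns to be a prefix of the primary key, plus an enterprise license):

CREATE TABLE events (
    month   STRING NOT NULL,
    id      UUID NOT NULL DEFAULT gen_random_uuid(),
    payload JSONB,
    PRIMARY KEY (month, id)
) PARTITION BY LIST (month) (
    PARTITION p2021_01 VALUES IN ('2021-01'),
    PARTITION p2021_02 VALUES IN ('2021-02')
);

-- Each month's data is co-located, so purging a month is one contiguous delete
-- rather than a scan that touches every live row:
DELETE FROM events WHERE month = '2021-01';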

> Sounds interesting, but we'll be very busy this month. If "soon" means more than a month out, we'll definitely be interested; if it's less, we'll keep an eye on master's progress :)

No worries - we are likely going to have something early to share within the next two weeks. Since you don't have time this month, we can always ping you when we have a second iteration to see if you have time then!

I'm a huge fan of dinosaurs :)

tim-o commented 5 years ago

Zendesk ticket #3232 has been linked to this issue.

orangecoding commented 5 years ago

hey guys, what is the status on this?

RoachietheSupportRoach commented 5 years ago

Zendesk ticket #3577 has been linked to this issue.

Fornax96 commented 4 years ago

On pixeldrain.com I'm using exactly the locking system that @glerchundi described for counting unique views on uploaded files. It works pretty well, but the removal query takes a few seconds to complete. I'm afraid this won't scale well once there are millions of rows to delete. A row TTL feature would be very useful for me.

bladefist commented 4 years ago

Deletes don't scale well on massive tables, so if Cockroach implements this they are going to have to do something other than delete in my opinion. Cleaning up data on huge tables at the moment is very difficult and basically requires you to stop all traffic to the db first.

sbward commented 4 years ago

One way to handle this might be to support automatic deletion when the primary key is a timestamp, or is a multi-column primary key containing a timestamp. In that situation it might be possible to free whole ranges when they fall outside of the TTL window. The engine would also need to ignore "dead" rows within ranges that still contain "live" rows, which would create some probably negligible overhead from skipping the group of dead rows.
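
As a rough, hypothetical sketch of that layout (the table and column names are invented here), a primary key that leads with a timestamp keeps expired rows in contiguous key ranges:

CREATE TABLE events (
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    id         UUID NOT NULL DEFAULT gen_random_uuid(),
    payload    JSONB,
    PRIMARY KEY (created_at, id)
);

-- Rows older than the TTL occupy a contiguous prefix of the keyspace, so in
-- principle whole ranges could be freed once they fall entirely outside the
-- TTL window, with only the boundary range needing row-by-row skipping.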

Disclaimer: I don't know enough about CRDB's K/V system to know if this actually could work or not. It would be awesome to hear some feedback on whether something like this approach would be feasible.

ajwerner commented 4 years ago

One challenge I imagine here is foreign keys and secondary indexes. Ideally we’d push the TTL into the KV, perhaps via something like zone configs. Ideally we’d utilize MVCC timestamps to determine whether a value is dead due to TTL but that doesn’t really work because of the need to maintain referential integrity.

I don’t see an obvious way to implement this TTL efficiently for rows that are indexed in secondary indexes or are part of FK relations. I imagine that’s why you see this functionality in NoSQL databases like Cassandra and not in relational databases.

ajwerner commented 4 years ago

Perhaps one could do it by maintaining a separate index over timestamps for rows but that seems like a pretty big hammer.

ericharmeling commented 4 years ago

https://github.com/cockroachdb/docs/issues/5647#issuecomment-672213978

bdarnell commented 3 years ago

I think STORING indexes (or something like them) could provide an elegant solution here. In brief, the idea is that if a table has a TTL, all of its indexes must STORE the crdb_internal_mvcc_timestamp column (probably not literally with the current STORING index code due to the magic of that pseudo-column, but something equivalent). This means that every write to the table needs to update every index (not just ones that store the changed columns). Then, because we know that the primary and all secondary indexes will have matching timestamps, we can drop all values retrieved from the KV layer that are older than the TTL just by looking at the query timestamp and table descriptor. That frees us to make the GC asynchronous and independent - we can delete the primary KVs separate from the secondary indexes, but the SQL layer code will never see one without the other.
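
As a loose approximation of that idea (using an ordinary timestamp column where the real design would need something equivalent to the MVCC pseudo-column; the table, column, and index names are hypothetical):

CREATE TABLE events (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id      UUID NOT NULL,
    payload      JSONB,
    last_updated TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Every secondary index also stores the timestamp, so a read served from any
-- index can discard rows older than the TTL without visiting the primary index.
CREATE INDEX events_by_user ON events (user_id) STORING (last_updated);

-- A query could then filter expired rows directly from the secondary index:
SELECT id FROM events@events_by_user
 WHERE user_id = $1 AND last_updated > now() - INTERVAL '90 days';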

Regarding foreign keys, I agree this seems tricky and I think we'll need to disallow certain combinations. For example, you can't have an FK where the referred table has a shorter TTL than the referring table. But I think this solution would let you have e.g. a set of tables with the same TTL and FK relationships between them.

parminder-garcha commented 3 years ago

Just bumping this up: is there any roadmap to include this in future releases?

jordanlewis commented 3 years ago

Dear users who are hoping for this feature: while we're working this out, there's a neat workaround, which is documented here.


As an alternative to @bdarnell's suggestion, we could avoid storing crdb_internal_mvcc_timestamp on every index and instead have the optimizer transparently add a WHERE crdb_internal_mvcc_timestamp > now() - TTL filter to all queries against the table. This causes an index join against the primary index, so you don't have to worry about a secondary index not having the most up-to-date timestamp. The downside is that it prevents users from using covering indexes.
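
Illustratively (this is not syntax the optimizer would expose, and the table and column names are made up), the implicit rewrite would behave as if every query carried a predicate like the one below, using the same epoch-seconds convention as the cleanup snippet further down and an example 900-second TTL:

SELECT *
  FROM events
 WHERE user_id = $1
   AND crdb_internal_mvcc_timestamp / 1000000000 > now()::INT8 - 900;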

As far as the cleanup process, check out this neat little SQL snippet:

CREATE TABLE b (a PRIMARY KEY, b) AS SELECT g, g FROM generate_series(1, 100000);

  WITH y AS (SELECT a FROM b WHERE a > $1 LIMIT 1000),
       d AS (
            DELETE FROM b
                  WHERE a IN (SELECT * FROM y)
                    AND crdb_internal_mvcc_timestamp / 1000000000 < now()::INT8 - 999
                  LIMIT 1000
              RETURNING 1
         )
SELECT last_value(a) OVER (ORDER BY a RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
  FROM y
 LIMIT 1;

This scans at most 1000 rows, deletes at most 1000 rows if they're older than a sample 999-second TTL, and returns the PK of the last row scanned. The 1000 constants can be adjusted, and this would also work with composite primary keys with some tweaking.

You could imagine having a job that runs this, persists the result, and replaces the $1 placeholder with the result of the query for the next run.
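
A minimal sketch of that bookkeeping, assuming a hypothetical ttl_progress table (not an existing feature): persist the key returned by each batch and feed it back in as $1 on the next run, resetting to zero once a batch comes back empty:

CREATE TABLE IF NOT EXISTS ttl_progress (
    table_name STRING PRIMARY KEY,
    last_key   INT8 NOT NULL DEFAULT 0
);

-- After each batch, store the key returned by the cleanup query...
UPSERT INTO ttl_progress (table_name, last_key) VALUES ('b', $1);

-- ...and read it back as the starting point ($1) for the next batch.
SELECT last_key FROM ttl_progress WHERE table_name = 'b';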


There are at least three important requirements to work out here:

  1. Does setting a TTL on a table require that no data be shown to the user from outside of the TTL duration, or is it just a hint?
  2. Can users set a true row-level TTL like in Cassandra, or is a table-level TTL sufficient? This issue is called "row-level TTL", but my assumption has been that the primary use cases are actually "table-level TTL": setting the TTL to a given duration for all data in a table.
  3. Are TTLs related to row insert or row update? It's much easier and more natural to do the latter for us.

ajwerner commented 3 years ago

> Are TTLs related to row insert or row update? It's much easier and more natural to do the latter for us.

I'm not sure this is true -- however, implementing the former is easy enough without anything from us.


Other important questions:

1) Is it okay if the cleanup has roughly the overhead of writing rows in the first place?
2) Do you need or want CDC events when rows are removed via TTL?

These two questions point in different directions for implementation. We've heard both. If you cannot stomach cleanup that costs as much as insertion, you almost definitely can't stomach a CDC event on cleanup. How important is the efficiency of the TTL here?

ajwerner commented 3 years ago

> Regarding foreign keys, I agree this seems tricky and I think we'll need to disallow certain combinations. For example, you can't have an FK where the referred table has a shorter TTL than the referring table. But I think this solution would let you have e.g. a set of tables with the same TTL and FK relationships between them.

@bdarnell how did you envision dealing with later writes to the referencing table? @RaduBerinde raised the point that the referenced row may expire before the referencing row. Do we need a touch-write on the referenced row? That doesn't feel very intuitive.

bdarnell commented 3 years ago

I think I was getting my TTL semantics mixed up and thinking of our MVCC GC TTL (which keeps all versions newer than the TTL plus one older version) instead of what's being proposed here (which doesn't try to keep the version that was active as of the TTL). (is there another term we could use for one of these instead of saying "TTL" both times?)

I don't see a better solution offhand than touching the referenced row, which seems like a terrible potential bottleneck. I'd rather not get into that situation, so at least for v1 of this feature I'd disallow FKs that refer to tables with a TTL.

bladefist commented 3 years ago

I think most of your customers (like me) will be using this feature on tables for things like log entries, where historical data should just drop off. We have 700 GB tables where I wish old records would simply fall off rather than requiring complicated flows to delete them.

In our case these tables never have relationships; they're more like a heap than anything.

glerchundi commented 3 years ago

Hi again! 👋

Just wanted to share that the TiDB crew is presenting this feature at KubeCon. Here are the slides, in case there is something valuable in them: https://static.sched.com/hosted_files/kccnceu2021/cb/TTL%20in%20TiKV.pdf

I suppose that removing the dependency on RocksDB has its benefits, but also the drawback of not getting the features it supports? I don't know whether Pebble supports TTL-based auto-purging, though.

petermattis commented 3 years ago

RocksDB's TTL support doesn't provide a foundation for implementing TTLs in CRDB. Doing compaction time reclamation of keys is at odds with other parts of the CRDB architecture, though we're working towards making this possible (e.g. for compaction time garbage collection of old keys). Also, the linked presentation doesn't talk about the difficulties of maintaining relational consistency between SQL tables which have TTLs. I wonder how TiDB is handling that.

danthegoodman1 commented 3 years ago

Would still love to see this!

glerchundi commented 3 years ago

Friendly reference: Google Cloud Spanner released TTL-based row deletion today: https://cloud.google.com/blog/products/spanner/reduce-costs-and-simplify-compliance-with-managed-ttl-in-spanner

otan commented 2 years ago

This is now in preview for v22.1!

https://www.cockroachlabs.com/docs/dev/row-level-ttl.html
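
Per the linked docs, the table-level setting looks roughly like the following (table and column names here are made up, and the exact options may differ between versions):

CREATE TABLE events (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    payload    JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
) WITH (ttl_expire_after = '3 months');

-- An existing table can also be opted in:
ALTER TABLE events SET (ttl_expire_after = '3 months');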

glerchundi commented 2 years ago

Thanks! This is huge!