cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Investigate using a high UserPriority for row-level TTL delete jobs #117694

Open annrpom opened 10 months ago

annrpom commented 10 months ago

When a table (table A) has a foreign key referencing a table with row-level TTL enabled (table B), users have to avoid data contention themselves by filtering out expired rows, in application logic, in their queries on table A. This ensures that the rows touched in table A do not contend with the TTL job (contention in which the job could block workload txns).

When no such filtering is done and queries are running against table A, we see our TTL jobs leave user queries hanging, as in https://github.com/cockroachlabs/support/issues/2786.

This is due to the UserPriority set on TTL deletes; we should investigate workarounds. One that was discussed is using a high (or otherwise different) UserPriority.
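
For illustration, here is a rough sketch of what that application-side filtering could look like. The table and column names (`table_a`, `table_b`, `b_id`, `id`) are hypothetical, and it assumes table B uses `ttl_expire_after`, which adds the hidden `crdb_internal_expiration` column; a table that uses `ttl_expiration_expression` would filter on that expression instead.

```go
package main

import "fmt"

// liveRowsQuery builds a query on a child table that skips rows whose parent
// row in the TTL-enabled table has already expired.
// The identifiers are interpolated as-is, so they must be trusted constants,
// never user input. All names used with this helper are hypothetical.
func liveRowsQuery(child, parent, fkCol, pkCol string) string {
	return fmt.Sprintf(
		"SELECT c.* FROM %s AS c JOIN %s AS p ON c.%s = p.%s "+
			"WHERE p.crdb_internal_expiration > now()",
		child, parent, fkCol, pkCol)
}

func main() {
	// Table A references table B; only rows whose parent is unexpired are read.
	fmt.Println(liveRowsQuery("table_a", "table_b", "b_id", "id"))
}
```

The point is only that the expiration predicate has to be added by hand to every query that could touch rows tied to expired data, which is easy to forget.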

Jira issue: CRDB-35311

nvanbenschoten commented 10 months ago

> This is due to the priority set on TTL delete txns; we should investigate whether or not setting the transaction UserPriority to high would affect user queries in an unideal way.

Does this mean that the TTL deletes would be able to block and abort the workload transactions?

annrpom commented 10 months ago

Yes, we would retry the workload txn(s) when this happens, correct?

If so, here is one convo I had:

Q: Wouldn't this just cause the reverse situation if the TTL job gets stuck somehow?

A: It might be possible, but in general, not expected - once the TTL operation goes through, the row will be deleted and nothing can contend on it anymore.

What happens if a high-priority user txn needs to access the same expired row that a high-priority TTL job wants to delete?

nvanbenschoten commented 10 months ago

> Yes, we would retry the workload txn(s) when this happens, correct?

Either retry or just block the workload txns. That's probably not the right choice though, is it? The availability of workload txns is significantly more important than the availability of background TTL txns.

annrpom commented 10 months ago

> The availability of workload txns is significantly more important than the availability of background TTL txns.

Hm. This is true. Perhaps this issue should change to: "investigate ways to improve row-level TTL in the case of retries".

annrpom commented 10 months ago

Instead of application-side filtering, we could filter expired rows out of queries that reference TTL-enabled tables ourselves.

rafiss commented 9 months ago

There is an issue for that last idea: https://github.com/cockroachdb/cockroach/issues/80217 Without completing that, we need to find a good way to communicate the issues that can result from foreign keys.

I would agree that we shouldn't make a change that could cause the TTL job to abort user-initiated queries.