Open annrpom opened 10 months ago
This is due to the priority set on TTL delete txns; we should investigate whether setting the transaction UserPriority to high would adversely affect user queries.
Does this mean that the TTL deletes would be able to block and abort the workload transactions?
Yes, we would retry the workload txn(s) when this happens, correct?
If so, here is one convo I had:
Q: Wouldn't this just cause the reverse situation if the TTL job gets stuck somehow?
A: It might be possible, but in general, not expected - once the TTL operation goes through, the row will be deleted and nothing can contend on it anymore.
What happens if a high-priority user txn needs to access the same expired row that a high-priority TTL job wants to delete?
> Yes, we would retry the workload txn(s) when this happens, correct?
Either retry or just block the workload txns. That's probably not the right choice though, is it? The availability of workload txns is significantly more important than the availability of background TTL txns.
> The availability of workload txns is significantly more important than the availability of background TTL txns.
Hm. This is true. Perhaps this issue should change to: "investigate ways to improve row-level ttl in the case of retries"
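For reference while weighing the retry cost on the workload side: a rough client-side retry loop, assuming the driver surfaces CockroachDB's retryable errors as exceptions with a `pgcode` attribute equal to SQLSTATE `40001` (psycopg2 does this). `run_with_retry` and its parameters are hypothetical names for illustration, not anything that exists in the codebase.

```python
import random
import time

# SQLSTATE that CockroachDB uses for transaction retry errors.
RETRY_SQLSTATE = "40001"


def run_with_retry(txn_fn, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call txn_fn, retrying with jittered exponential backoff on 40001.

    txn_fn is assumed to run one whole transaction (begin, statements,
    commit) and to raise an exception carrying `pgcode` on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except Exception as exc:
            retryable = getattr(exc, "pgcode", None) == RETRY_SQLSTATE
            if not retryable or attempt == max_attempts:
                # Non-retryable error, or out of attempts: surface it.
                raise
            # Back off before retrying so contending txns can make progress.
            sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))
```

This only papers over aborts from the client's point of view; it does nothing for workload txns that block behind the TTL delete rather than getting aborted.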
Instead of relying on application-side filtering, we could filter expired rows out of queries that reference TTL-enabled tables ourselves.
There is an issue for that last idea: https://github.com/cockroachdb/cockroach/issues/80217
Without completing that, we need to find a good way to communicate the issues that can result from foreign keys.
I would agree that we shouldn't make a change that could cause the TTL job to abort user-initiated queries.
When a table (table A) has a foreign key referencing a row-level TTL enabled table (table B), users have to avoid data contention by filtering out expired rows in the application logic of their queries on table A. This ensures that the rows touched in table A do not contend with the TTL job, which could otherwise block workload txns.
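A minimal sketch of what that application-side filtering could look like, assuming table B uses `ttl_expire_after` (so its expiration lives in the hidden `crdb_internal_expiration` column) and that table A joins to it on a single foreign key column. The helper name and the `table_a` / `table_b` / `b_id` identifiers are hypothetical; with `ttl_expiration_expression`, the expiration column would need to be swapped accordingly.

```python
def unexpired_children_query(
    child_table,
    parent_table,
    fk_column,
    parent_pk="id",
    expiration_column="crdb_internal_expiration",
):
    """Build a SELECT on the child table that skips rows whose parent is expired."""
    # Identifiers are interpolated into SQL, so only accept plain names.
    for name in (child_table, parent_table, fk_column, parent_pk, expiration_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe SQL identifier: {name!r}")
    return (
        f"SELECT c.* FROM {child_table} AS c "
        f"JOIN {parent_table} AS p ON c.{fk_column} = p.{parent_pk} "
        f"WHERE p.{expiration_column} > now()"
    )


# Example: table A references the TTL-enabled table B through column b_id.
print(unexpired_children_query("table_a", "table_b", "b_id"))
```

The point is only that the predicate on the parent's expiration has to appear in every query on table A that could touch rows about to be deleted, which is easy to forget and is why filtering on our side would be preferable.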
When no such filtering is applied and queries run against table A, we see our TTL jobs leave user queries hanging, as in https://github.com/cockroachlabs/support/issues/2786.
This is due to the UserPriority set on TTL deletes; we should investigate any workarounds. One workaround discussed was using a high (or otherwise different) UserPriority.
Jira issue: CRDB-35311