Make MyRocks transactions never expire additional rows with TTL while running

laurynas-biveinis commented 4 months ago

The TTL row filtering is governed by Rdb_transaction::m_snapshot_timestamp field. This field sometimes get reset to zero when a snapshot is released, indicating that it should be reinitialized at the next snapshot, or to the current timestamp (when a single statement in a multi-statement transaction gets rolled back). The timestamp going forward means that additional rows may be hidden in the same transaction if they happened to expire between the two snapshots. Change it so that additional rows are never expired:

Only initialize the field once per transaction, when the first snapshot is created.
Do not touch it for single-statement rollbacks in multi-statement transactions.
Rename the field to m_earliest_snapshot_ts to better reflect what it does.
Change the global transaction list to be sorted by the oldest snapshot timestamp, so that the oldest global one can be found by reading the transaction at the list head. The transactions without snapshot are inserted to the its tail, and reinserted as needed upon acquiring the first snapshot. To make reinsertions fast, a third list pointer is maintained that points to the newest transaction with a snapshot.
Extend the SHOW ENGINE ROCKSDB TRANSACTION STATUS output with transactions that currently do not have a snapshot, for both cases of having a snapshot before and never having had one.
Add a test to rocksdb.ttl_primary_read_filtering for the behavior difference introduced by this commit.

At the same time:

Initialize m_earliest_snapshot_ts from the binlog HLC instead of the current system time.

facebook-github-bot commented 4 months ago

@hermanlee has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

hermanlee commented 3 months ago

I think there is a potential issue in rdb_compact_filter.h with the calculate_expiration_timestamp() function. This function relies on RocksDB to return back the oldest snapshot time to be used for compaction. Even though each transaction keeps it own snapshot time, the compaction code would also need to figure out the oldest snapshot time across all transactions.

MyRocks currently relies on RocksDB to maintain this heap.

I wonder if this diff should be based on the RR/RC change to allow the read-only snapshot in RR to be created from the first read-only select. In that case, the snapshot never goes away, and the snapshot time never changes.

For RC, if we want the same behavior, we could also keep the read-only snapshot around just to maintain the timestamp for compaction, but each new read-only query would still acquire its own snapshot.

laurynas-biveinis commented 3 months ago

@hermanlee If I understand your comment correctly, currently there is a race condition between the compaction filter and the oldest transaction temporarily releasing its snapshot, thus maintaining only a per-transaction timestamp for the per-transaction filtering is not enough, and a transaction snapshot must be kept open constantly once it has been acquired.

laurynas-biveinis commented 3 months ago

@hermanlee , it seems that each transaction has to have three snapshot fields, even if it does not use them all in all cases:

the snapshot for the currently-executed statement;
the RR RO snapshot;
the snapshot that prevents expiration.

The 3rd one and the 1st one cannot be merged obviously. Likewise the 2nd one and the 1st one. The 3rd one and the 2nd one cannot be merged because an RR transaction may execute writes before its first read.

It could be possible to reduce complexity of transaction object maintaining three snapshots by moving the oldest timestamp tracking from RocksDB to MyRocks (have a heap/priority queue of timestamps in Rdb_compact_filter, insert/remove from it on transaction 1st snapshot create/commit&rollback), but a shared data structure that duplicates RocksDB internals might not be the best trade-off here.

facebook-github-bot commented 3 months ago

@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.

laurynas-biveinis commented 3 months ago

As discussed offline, will use the global transaction list instead. Step 1 of this approach is at https://github.com/facebook/mysql-5.6/pull/1471

facebook-github-bot commented 2 months ago

@hermanlee has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

laurynas-biveinis commented 1 month ago

Non-functional changes in this PR split off to https://github.com/facebook/mysql-5.6/pull/1487

sunshine-Chun commented 1 month ago

Non-functional changes in this PR split off to #1487

The change has been committed to the codebase.

facebook-github-bot commented 1 month ago

@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.

laurynas-biveinis commented 1 month ago

@sunshine-Chun , I have pushed my current version, which is as finished as possible, except for the binlog/RDB timestamp question above. Keeping in draft status until that is resolved.

facebook-github-bot commented 1 month ago

@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

laurynas-biveinis commented 1 month ago

MTR test seems to be failing. Have you rebased it to the lastest branch?

Which tests? Two debug MTR tests are failing with assert RDB timestamp > binlog timestamp, as I wrote before, and are the reason this PR is still draft.

sunshine-Chun commented 1 month ago

MTR test seems to be failing. Have you rebased it to the lastest branch? Which tests? Two debug MTR tests are failing with assert RDB timestamp > binlog timestamp, as I wrote before, and are the reason this PR is still draft.

Other than the assertions:

drop_cf_before_show_deadlock_info needs to be re-recorded.
performance_schema length mismatch: +Performance_schema_digest_lost 2374
rocksdb_concurrent_delete result content mismatch
sanitize_report, will sent the log via email.

sunshine-Chun commented 1 month ago

Just think aloud. Feel free to correct me if I miss anything here.

m_binlog_ttl_read_filtering_ts will only be only assigned once per transaction, so it basically behaves like the earliest_snapshot_ts in this change. If we refresh the snapshot during the long running transaction, the m_binlog_ttl_read_filtering_ts won't change its value. The rocksdb_binlog_ttl_compaction_timestamp is based on m_binlog_ttl_read_filtering_ts, so it won't change too. So it seems with binlog_ttl enabled, we already won't filter out more row when refreshing snapshot.

laurynas-biveinis commented 1 month ago

performance_schema length mismatch: +Performance_schema_digest_lost 2374

rocksdb_concurrent_delete result content mismatch

Cannot reproduce these two. For the latter, I intermittently get a pre-existing failure:

[ 71%] rocksdb.rocksdb_concurrent_delete 'innodb_ddse'  [ fail ]
        Test ended at 2024-08-27 12:24:08

CURRENT_TEST: rocksdb.rocksdb_concurrent_delete
--- /home/laurynas/vilniusdb/fb-mysql/mysql-test/suite/rocksdb/r/rocksdb_concurrent_delete.result   2024-03-19 14:53:38.570572735 +0300
+++ /home/laurynas/vilniusdb/fb-mysql/_build-llvm-debug/mysql-test/var/log/rocksdb_concurrent_delete.reject 2024-08-27 15:24:08.695347028 +0300
@@ -453,9 +453,9 @@
 set debug_sync='now SIGNAL go';
 select * from t1 where id1=1;
 id1    id2 value
+1  2   200
 1  4   1
 1  5   1
-1  2   200
 --End row delete with PRIMARY
 SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
 set debug_sync='rocksdb.get_row_by_rowid SIGNAL parked WAIT_FOR go';

mysqltest: Result content mismatch

sunshine-Chun commented 1 month ago

Just think aloud. Feel free to correct me if I miss anything here.

m_binlog_ttl_read_filtering_ts will only be only assigned once per transaction, so it basically behaves like the earliest_snapshot_ts in this change. If we refresh the snapshot during the long running transaction, the m_binlog_ttl_read_filtering_ts won't change its value. The rocksdb_binlog_ttl_compaction_timestamp is based on m_binlog_ttl_read_filtering_ts, so it won't change too. So it seems with binlog_ttl enabled, we already won't filter out more row when refreshing snapshot.

The current change and binlog ttl feature covers different scenarios. Both are needed. m_expiration_timestamp = min(oldest_transaction_ts, rocksdb_binlog_ttl_compaction_timestamp)

When m_expiration_timestamp = oldest_transaction_ts, the current change works. This happens when we have a long running transaction in secondary.
When m_expiration_timestamp = rocksdb_binlog_ttl_compaction_timestamp, the binlog ttl feature works. This happens when we have a long running transaction in both primary and secondary.

sunshine-Chun commented 1 month ago

performance_schema length mismatch: +Performance_schema_digest_lost 2374

rocksdb_concurrent_delete result content mismatch

Cannot reproduce these two. For the latter, I intermittently get a pre-existing failure:

[ 71%] rocksdb.rocksdb_concurrent_delete 'innodb_ddse'  [ fail ]
        Test ended at 2024-08-27 12:24:08

CURRENT_TEST: rocksdb.rocksdb_concurrent_delete
--- /home/laurynas/vilniusdb/fb-mysql/mysql-test/suite/rocksdb/r/rocksdb_concurrent_delete.result 2024-03-19 14:53:38.570572735 +0300
+++ /home/laurynas/vilniusdb/fb-mysql/_build-llvm-debug/mysql-test/var/log/rocksdb_concurrent_delete.reject   2024-08-27 15:24:08.695347028 +0300
@@ -453,9 +453,9 @@
 set debug_sync='now SIGNAL go';
 select * from t1 where id1=1;
 id1  id2 value
+1    2   200
 1    4   1
 1    5   1
-1    2   200
 --End row delete with PRIMARY
 SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
 set debug_sync='rocksdb.get_row_by_rowid SIGNAL parked WAIT_FOR go';

mysqltest: Result content mismatch

Then they are not related to this change.

laurynas-biveinis commented 1 month ago

sanitize_report, will sent the log via email.

This was caused by concurrent accesses to m_read_opts[TABLE_TYPE::USER_TABLE].snapshot, which appears to be a pre-existing issue and it seems that the current code may crash on SHOW ENGINE ROCKSDB STATUS concurrent to i.e. statement rollbacks. I am fixing it as follows, but overall the accesses do not appear to be synchronized properly and I'm making a TODO note to review it:

diff --git a/storage/rocksdb/ha_rocksdb.cc b/storage/rocksdb/ha_rocksdb.cc
index 3b5cf2da90c..9ba4860b03a 100644
--- a/storage/rocksdb/ha_rocksdb.cc
+++ b/storage/rocksdb/ha_rocksdb.cc
@@ -5503,6 +5503,10 @@ class Rdb_transaction_impl : public Rdb_transaction {
     if (m_rocksdb_tx[TABLE_TYPE::USER_TABLE]) {
       const rocksdb::Snapshot *const org_snapshot =
           m_rocksdb_tx[TABLE_TYPE::USER_TABLE]->GetSnapshot();
+
+      assert(org_snapshot == m_read_opts[TABLE_TYPE::USER_TABLE].snapshot);
+      m_read_opts[TABLE_TYPE::USER_TABLE].snapshot = nullptr;
+
       rollback_to_stmt_savepoint();

       const rocksdb::Snapshot *const cur_snapshot =

laurynas-biveinis commented 1 month ago

This patch is not enough. I'll remove the extra added accesses from other threads until the m_read_opts[USER_TABLE].snapshot concurrency is fixed

facebook-github-bot commented 1 month ago

@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.

laurynas-biveinis commented 1 month ago

@sunshine-Chun, addressed all comments as discussed above, fixed one linked list middle pointer manipulation logic error, and nanosecond to second conversion. Ready for review!

facebook-github-bot commented 1 month ago

@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot commented 1 month ago

@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.

laurynas-biveinis commented 1 month ago

Pushed the branch again. The only changes are an extra comment at incr_insert_count and an additional note in the commit message, no actual code changes, any reviews in progress should not be invalidated by this push.

facebook-github-bot commented 4 weeks ago

@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.

laurynas-biveinis commented 4 weeks ago

Addressed last review comments, ready for review again

facebook-github-bot commented 4 weeks ago

@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

sunshine-Chun commented 4 weeks ago

rocksdb_binlog_ttl.test is failing for std::int64_t myrocks::Rdb_transaction::get_snapshot_ts() const: Assertion `m_earliest_snapshot_ts <= result || result == 0' failed.

facebook-github-bot commented 4 weeks ago

@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.

laurynas-biveinis commented 4 weeks ago

rocksdb_binlog_ttl.test is failing for std::int64_t myrocks::Rdb_transaction::get_snapshot_ts() const: Assertion `m_earliest_snapshot_ts <= result || result == 0' failed.

@sunshine-Chun , division by a billion got lost again :( Fixed

facebook-github-bot commented 4 weeks ago

@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot commented 4 weeks ago

This pull request has been merged in facebook/mysql-5.6@8bf72fabd3b385ffe3f64dce1ba07b612559b1c5.

facebook / mysql-5.6

Make MyRocks transactions never expire additional rows with TTL while running #1458