laurynas-biveinis closed this pull request 4 weeks ago.
@hermanlee has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I think there is a potential issue in rdb_compact_filter.h with the calculate_expiration_timestamp() function. This function relies on RocksDB to return the oldest snapshot time to be used for compaction. Even though each transaction keeps its own snapshot time, the compaction code also needs to determine the oldest snapshot time across all transactions.
MyRocks currently relies on RocksDB to maintain this heap.
I wonder if this diff should be based on the RR/RC change to allow the read-only snapshot in RR to be created from the first read-only select. In that case, the snapshot never goes away, and the snapshot time never changes.
For RC, if we want the same behavior, we could also keep the read-only snapshot around just to maintain the timestamp for compaction, but each new read-only query would still acquire its own snapshot.
@hermanlee If I understand your comment correctly, there is currently a race condition between the compaction filter and the oldest transaction temporarily releasing its snapshot. Maintaining only a per-transaction timestamp for per-transaction filtering is therefore not enough; a transaction's snapshot must be kept open continuously once it has been acquired.
@hermanlee , it seems that each transaction has to have three snapshot fields, even if it does not use them all in all cases:
The 3rd one and the 1st one obviously cannot be merged, and likewise the 2nd one and the 1st one. The 3rd one and the 2nd one cannot be merged because an RR transaction may execute writes before its first read.
It could be possible to reduce the complexity of a transaction object maintaining three snapshots by moving the oldest-timestamp tracking from RocksDB to MyRocks (keep a heap/priority queue of timestamps in Rdb_compact_filter, inserting on a transaction's first snapshot creation and removing on commit or rollback), but a shared data structure that duplicates RocksDB internals might not be the best trade-off here.
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
As discussed offline, will use the global transaction list instead. Step 1 of this approach is at https://github.com/facebook/mysql-5.6/pull/1471
@hermanlee has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Non-functional changes in this PR split off to https://github.com/facebook/mysql-5.6/pull/1487
The change has been committed to the codebase.
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
@sunshine-Chun , I have pushed my current version, which is as finished as possible, except for the binlog/RDB timestamp question above. Keeping in draft status until that is resolved.
@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
MTR tests seem to be failing. Have you rebased onto the latest branch?
Which tests? Two debug MTR tests are failing with the assert RDB timestamp > binlog timestamp, as I wrote before, and they are the reason this PR is still a draft.
Other than the assertions:
- drop_cf_before_show_deadlock_info needs to be re-recorded.
- performance_schema: length mismatch: +Performance_schema_digest_lost 2374
- rocksdb_concurrent_delete: result content mismatch

Just thinking aloud; feel free to correct me if I miss anything here. m_binlog_ttl_read_filtering_ts will only be assigned once per transaction, so it basically behaves like the earliest_snapshot_ts in this change. If we refresh the snapshot during a long-running transaction, m_binlog_ttl_read_filtering_ts won't change its value. rocksdb_binlog_ttl_compaction_timestamp is based on m_binlog_ttl_read_filtering_ts, so it won't change either. So it seems that with binlog_ttl enabled, we already won't filter out more rows when refreshing the snapshot.
performance_schema: length mismatch: +Performance_schema_digest_lost 2374
rocksdb_concurrent_delete: result content mismatch
Cannot reproduce these two. For the latter, I intermittently get a pre-existing failure:
[ 71%] rocksdb.rocksdb_concurrent_delete 'innodb_ddse' [ fail ]
Test ended at 2024-08-27 12:24:08
CURRENT_TEST: rocksdb.rocksdb_concurrent_delete
--- /home/laurynas/vilniusdb/fb-mysql/mysql-test/suite/rocksdb/r/rocksdb_concurrent_delete.result 2024-03-19 14:53:38.570572735 +0300
+++ /home/laurynas/vilniusdb/fb-mysql/_build-llvm-debug/mysql-test/var/log/rocksdb_concurrent_delete.reject 2024-08-27 15:24:08.695347028 +0300
@@ -453,9 +453,9 @@
set debug_sync='now SIGNAL go';
select * from t1 where id1=1;
id1 id2 value
+1 2 200
1 4 1
1 5 1
-1 2 200
--End row delete with PRIMARY
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
set debug_sync='rocksdb.get_row_by_rowid SIGNAL parked WAIT_FOR go';
mysqltest: Result content mismatch
m_binlog_ttl_read_filtering_ts will only be assigned once per transaction, and rocksdb_binlog_ttl_compaction_timestamp is based on it, so neither changes when the snapshot is refreshed; with binlog_ttl enabled we already won't filter out more rows.

The current change and the binlog TTL feature cover different scenarios. Both are needed:
m_expiration_timestamp = min(oldest_transaction_ts, rocksdb_binlog_ttl_compaction_timestamp)
performance_schema: length mismatch: +Performance_schema_digest_lost 2374
rocksdb_concurrent_delete: result content mismatch
Cannot reproduce these two. For the latter, I intermittently get a pre-existing failure: a result content mismatch in rocksdb.rocksdb_concurrent_delete under 'innodb_ddse' (full output quoted above).
Then they are not related to this change.
- sanitize_report, will send the log via email.
This was caused by concurrent accesses to m_read_opts[TABLE_TYPE::USER_TABLE].snapshot, which appears to be a pre-existing issue: the current code may crash when SHOW ENGINE ROCKSDB STATUS runs concurrently with, e.g., statement rollbacks. I am fixing it as follows, but overall these accesses do not appear to be synchronized properly, so I'm adding a TODO note to review them:
diff --git a/storage/rocksdb/ha_rocksdb.cc b/storage/rocksdb/ha_rocksdb.cc
index 3b5cf2da90c..9ba4860b03a 100644
--- a/storage/rocksdb/ha_rocksdb.cc
+++ b/storage/rocksdb/ha_rocksdb.cc
@@ -5503,6 +5503,10 @@ class Rdb_transaction_impl : public Rdb_transaction {
if (m_rocksdb_tx[TABLE_TYPE::USER_TABLE]) {
const rocksdb::Snapshot *const org_snapshot =
m_rocksdb_tx[TABLE_TYPE::USER_TABLE]->GetSnapshot();
+
+ assert(org_snapshot == m_read_opts[TABLE_TYPE::USER_TABLE].snapshot);
+ m_read_opts[TABLE_TYPE::USER_TABLE].snapshot = nullptr;
+
rollback_to_stmt_savepoint();
const rocksdb::Snapshot *const cur_snapshot =
This patch is not enough. I'll remove the extra accesses added from other threads until the m_read_opts[USER_TABLE].snapshot concurrency is fixed.
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
@sunshine-Chun, addressed all comments as discussed above, fixed one logic error in the linked-list middle-pointer manipulation, and fixed a nanosecond-to-second conversion. Ready for review!
@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
Pushed the branch again. The only changes are an extra comment at incr_insert_count and an additional note in the commit message; there are no actual code changes, so any reviews in progress should not be invalidated by this push.
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
Addressed the last review comments, ready for review again.
@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
rocksdb_binlog_ttl.test is failing for std::int64_t myrocks::Rdb_transaction::get_snapshot_ts() const: Assertion `m_earliest_snapshot_ts <= result || result == 0' failed.
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
rocksdb_binlog_ttl.test is failing for std::int64_t myrocks::Rdb_transaction::get_snapshot_ts() const: Assertion `m_earliest_snapshot_ts <= result || result == 0' failed.

@sunshine-Chun , the division by a billion got lost again :( Fixed.
@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request has been merged in facebook/mysql-5.6@8bf72fabd3b385ffe3f64dce1ba07b612559b1c5.
TTL row filtering is governed by the Rdb_transaction::m_snapshot_timestamp field. This field sometimes gets reset to zero when a snapshot is released, indicating that it should be reinitialized at the next snapshot, or reset to the current timestamp (when a single statement in a multi-statement transaction is rolled back). The timestamp moving forward means that additional rows may become hidden within the same transaction if they happen to expire between the two snapshots. Change this so that additional rows are never expired:
At the same time: