facebook / mysql-5.6

Facebook's branch of the Oracle MySQL database. This includes MyRocks.
http://myrocks.io
Other
2.49k stars 712 forks source link

A corrupt rocksdb table case with TTL enabled and TTL read filtering disabled #1126

Open alston111111 opened 4 years ago

alston111111 commented 4 years ago

With the following table:

CREATE TABLE `t` (
`a` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT ‘id as PK’,
`ts` BIGINT UNSIGNED NOT NULL DEFAULT 0 COMMENT ‘ts as the TTL column’,
`b` BIGINT UNSIGNED NOT NULL DEFAULT 0 COMMENT ‘’,
`c` INT NOT NULL DEFAULT 0 COMMENT ‘’,
`d` LONG TEXT COMMENT ‘to make the row size large’,
PRIMARY KEY(`a`),
KEY `b` (`b`),
) ENGINE=ROCKSDB DEFAULT CHARSET utf8 COMMENT ‘ttl_duration=20;ttl_col=ts’;

Keep inserting into the table and the table will eventually get corrupted in that some of the index records of an TTL-expired row disappeard due to compactions and the row is inconsistent.

case:

SET rocksdb_enable_ttl = 1;
SET rocksdb_enable_ttl_read_filtering = 0;
# Keep inserting by using a loop.
INSERT INTO t (a,ts,b,c,d) VALUES(NULL, UNIX_TIMESTAMP(NOW()) - 1, CEIL(RAND()*10000000), CEIL(RAND()*10000000), REPEAT('a',10000));

The following configuration items are set to speed up the compaction frequency and thus increase the reproducing probability:

#### rocksdb - DB Options
rocksdb_max_background_jobs=1
rocksdb_max_subcompactions= 1
#### rocksdb - Column Family Options
rocksdb_default_cf_options=write_buffer_size=2m;target_file_size_base=128k;max_bytes_for_level_base=1m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=15;level0_stop_writes_trigger=20;max_write_buffer_number=4;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kNoCompression:kNoCompression:kNoCompression;bottommost_compression=kZSTD;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio;max_bytes_for_level_multiplier=2;target_file_size_multiplier=2
## 
level0_file_num_compaction_trigger=4;
max_bytes_for_level_base=1m;
target_file_size_base=128k;
compression_per_level=kNoCompression:kNoCompression:kNoCompression:kNoCompression:kNoCompression:kNoCompression;bottommost_compression=kZSTD;
max_bytes_for_level_multiplier=2;
target_file_size_multiplier=2; 
################################

An Example after several minites:

## Example: row exists, but pk-value or sk-value in the row does not exists in secondary index.
# Try to find such a row.
mysql> SELECT a, ts, b, c, LENGTH(d) FROM t t1 WHERE t1.a NOT IN (SELECT DISTINCT a FROM t FORCE INDEX(b)) LIMIT 1;
+--------+------------+---------+---------+-----------+
| a      | ts         | b       | c       | LENGTH(d) |
+--------+------------+---------+---------+-----------+
| 289202 | 1587113106 | 5719941 | 3322155 |     10000 |
+--------+------------+---------+---------+-----------+
1 row in set (0.01 sec)

# verify the example.
mysql> SELECT a, ts, b, c, LENGTH(d) FROM t FORCE INDEX(PRIMARY) WHERE a = 289202;
+--------+------------+---------+---------+-----------+
| a      | ts         | b       | c       | LENGTH(d) |
+--------+------------+---------+---------+-----------+
| 289202 | 1587113106 | 5719941 | 3322155 |     10000 |
+--------+------------+---------+---------+-----------+
1 row in set (0.00 sec)

mysql> SELECT a, b FROM t FORCE INDEX(b) WHERE b = 5719941;
Empty set (0.00 sec)

mysql> SELECT a, b FROM t FORCE INDEX(b) WHERE a = 289202;
Empty set (0.00 sec)

## Example: sk record exists, but the corresponding pk value does not exist in primary index.
# Try to find such an sk k-v pair.
mysql> SELECT a, b FROM t t2 FORCE INDEX(b) WHERE t2.a NOT IN (SELECT a FROM t FORCE INDEX(PRIMARY)) LIMIT 1;
+--------+--------+
| a      | b      |
+--------+--------+
| 338000 | 194407 |
+--------+--------+
1 row in set (0.00 sec)

# Verify the example:
mysql> SELECT a, b FROM t FORCE INDEX(b) WHERE b = 194407;
+--------+--------+
| a      | b      |
+--------+--------+
| 338000 | 194407 |
+--------+--------+
1 row in set (0.01 sec)

mysql> SELECT a, ts, b, c, LENGTH(d) FROM t FORCE INDEX(PRIMARY) WHERE a = 338000;
Empty set (0.00 sec)

mysql> SELECT a, ts, b, c, LENGTH(d) FROM t FORCE INDEX(PRIMARY) WHERE b = 194407;
Empty set (0.35 sec)

Possible Reasons:

  1. Data Distribution and Arrangement: (see picture at the end)

It can be deduced from MyRocks record format and rocksdb file orginization that

(1) Records of the same index in a rocksdb table are clustered together in each LSM level except L0, while they might be spanning multiple sst files;

(2) The different index records of a mysql row (of a rocksdb table) are mostly scattered, may be in different sst files and even at different levels in the default LSM;

  1. compaction is "some-to-some";

ref: some source code in compaction_picker.cc and https://github.com/facebook/rocksdb/wiki/Compaction

According to the above reasons, the index records of the same row might have a chance to get partly involved in one compactoin and if they are TTL expired, the row (in SQL/storage engine layer) would be inconsistent.

Note: this, I think, might come from two more deeper reasons: (this is just my point of view)

  1. Compactions are taken inside rocksdb who has no idea about the constraints in myrocks-handler layer or SQL layer. This is different from InnoDB (storage engine layer as a whole).

  2. That TTL read filtering = ON and TTL timestamp stored in an entry give the way of marking an entry as "can-be-purged" (can be compared with delete-marked records or useless old-version records in InnoDB). When TTL read filtering is set OFF, the can-be-purged state of an index record is not clear in the reasonable state transition: (entry) being-used (in-trx) => not-used (in-trx) => can-be-purged (not visible any more) => purged.

So, I met this case and just report it here. (don't know if it's appropriate :) )

Referenced Image:

image

hermanlee commented 4 years ago

Thanks for filing the issue. The read filtering functionality is meant present a consistent view of row data across indexes in a table with TTL enabled. If it's disabled, then inconsistencies are expected to show up. One could argue this type of setting shouldn't be made available because of the consistency issues it causes, but it has been useful in debugging issues. It's not meant to be disabled for queries that require consistent views on the indexes.

alston111111 commented 4 years ago

Thanks for your reply. Yes, I agree with the recommend. We disabled TTL read filtering for temporarily fixing a problem, because we met a problem that was possibly relevant to the issue #1024 .

With TTL read filtering enabled, executing "INSERT ON DUPLICATE KEY UPDATE" would return "Can't find record" error if there is a conflict unique key value which was TTL expired but not removed by compactions yet.

Should I report it in detail as a new issue and relate it to #1024 , or just transfer to issue #1024 and make comments ?

hermanlee commented 4 years ago

Referencing issue #1024 is sufficient for the TTL bug. I believe we understand issue at hand, that filtering checks is not applied during ha_rocksdb::check_and_lock_sk(). It should be applied in a manner similar to ha_rocksdb::check_and_lock_unique_pk(), but unfortunately, we have not been able to make any progress on fixing this yet.

alston111111 commented 4 years ago

OK. Thanks!