Tencent / Tendis

Tendis is a high-performance distributed storage system fully compatible with the Redis protocol.
http://tendis.cn

KV separation mode: written value data does not match the data on disk #204

Closed liupeidong0620 closed 1 year ago

liupeidong0620 commented 1 year ago

Description

Configuration

# tendisplus configuration for testing
bind 0.0.0.0
port 51002
daemon on
loglevel notice
logdir /apps/dbdat/tendis/log
dumpdir /apps/dbdat/tendis/dump
dir /apps/dbdat/tendis/db
pidfile /apps/dbdat/tendis/tendisplus.pid
slowlog /apps/dbdat/tendis/log/slowlog
rocks.blockcachemb 20480
executorThreadNum 48

cluster-enabled no

rocks.write_buffer_size 1073741824
rocks.target_file_size_base 268435456
rocks.max_bytes_for_level_base 268435456
rocks.max_background_compactions 8
rocks.max_write_buffer_number 2
rocks.min_write_buffer_number_to_merge 1

rocks.enable_blob_files 1
rocks.min_blob_size 1024
rocks.blob_file_size 268435456
rocks.enable_blob_garbage_collection 0

jeprof-auto-dump 0

kvstorecount 1
deljobcntindexmgr 1
scanjobcntindexmgr 1
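
The three rocks.* blob settings above are what switch on RocksDB's KV separation (they map onto RocksDB's enable_blob_files, min_blob_size, and blob_file_size options). An annotated excerpt as a reading aid, with explanatory comments added on separate lines in the config's own comment style:

# store large values in separate .blob files instead of the SSTs (KV separation)
rocks.enable_blob_files 1
# values >= 1 KiB go to blob files; smaller values stay inline in the SSTs
rocks.min_blob_size 1024
# roll over to a new blob file at 256 MiB
rocks.blob_file_size 268435456
# blob GC off: space held by overwritten or deleted values is never reclaimed
rocks.enable_blob_garbage_collection 0

Since the benchmark below writes 20 KiB values, every value exceeds min_blob_size and lands in a blob file.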

Version

# ./tendisplus -v
Tendisplus v=2.5.0-rocksdb-v6.23.3 sha=00000000 dirty=0 build=pika-build-test-e9f8r.vclound.com-1661846670

System information

# uname -a
Linux 108051 2.6.32-696.1.1.el6.x86_64 #1 SMP Tue Apr 11 17:13:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release
CentOS release 6.6 (Final)

RocksDB log

** Compaction Stats [binlog_cf] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0    1.03 MB   0.8      0.0     0.0      0.0       0.2      0.2       0.0   1.0      0.0    487.5    592.24            416.64       605    0.979       0      0       0.0     281.7
  L6    575/0   199.82 MB   0.0     26.1     0.0      0.0       0.0      0.0       0.2   0.0   2072.7      1.0     12.90             12.45        27    0.478   1358K      0      26.1       0.0
 Sum    578/0   200.85 MB   0.0     26.1     0.0      0.0       0.2      0.2       0.2   1.0     44.2    477.1    605.14            429.10       632    0.958   1358K      0      26.1     281.7
 Int      0/0    0.00 KB   0.0      3.3     0.0      0.0       0.0      0.0       0.0   1.0     32.8    468.9    101.78             75.34       103    0.988    169K      0       3.3      46.6

** Compaction Stats [binlog_cf] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Low      0/0    0.00 KB   0.0     26.1     0.0      0.0       0.0      0.0       0.0   0.0   2072.7      1.0     12.90             12.45        27    0.478   1358K      0      26.1       0.0
High      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.2      0.2       0.0   0.0      0.0    487.5    592.24            416.64       605    0.979       0      0       0.0     281.7

Blob file count: 1210, total size: 281.7 GB

Uptime(secs): 4203.0 total, 600.0 interval
Flush(GB): cumulative 281.945, interval 46.602
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 281.96 GB write, 68.69 MB/s write, 26.11 GB read, 6.36 MB/s read, 605.1 seconds
Interval compaction: 46.60 GB write, 79.54 MB/s write, 3.26 GB read, 5.57 MB/s read, 101.8 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache@0x2b87fe49c970 capacity: 20.00 GB collections: 8 last_copies: 1 last_secs: 6.7e-05 secs_since: 0
Block cache entry stats(count,size,portion): DataBlock(49,787.83 KB,0.00375666%) Misc(1,0.00 KB,0%)

On-disk data

#  du -sh * | grep blob | awk -F"M" '{sum += $1};END {print sum/1024"GB"}'
563.936GB

[root@10.189.108.49(QAstagingSlave) 0]# ls -l *.blob | wc -l
2418

Benchmark script

 /apps/svr/redis6/bin/redis-benchmark -t set -r 14680064 -n 14680064 -d 20480 -c 10 -h 10.189.108.49 -p 51002
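
A quick sanity check on the numbers (back-of-envelope shell arithmetic; the interpretation of the gap is a reading of this thread, not something the thread confirms outright):

# raw payload the benchmark writes: requests x value size
echo $(( 14680064 * 20480 / 1024 / 1024 / 1024 ))   # prints 280 (GiB of values)

That matches the cumulative flush of 281.9 GB in the RocksDB log, while the 563.9 GB of .blob files on disk is roughly double: consistent with the binlog also being written into blob files alongside the values (note Wblob = 281.7 GB for binlog_cf above).
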
liupeidong0620 commented 1 year ago

@qingping209

liupeidong0620 commented 1 year ago

@takenliu @tencent-adm Could you reply when you have a moment? Thanks!

takenliu commented 1 year ago

Because the binlog is retained, you can try lowering the related parameters: minBinlogKeepSec, maxBinlogKeepNum, slaveBinlogKeepNum.
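
For example, a minimal config excerpt with these knobs lowered (values are illustrative only, assuming the parameters behave as their names suggest):

# keep binlog entries for at least this many seconds
minBinlogKeepSec 10
# cap on the number of binlog entries retained
maxBinlogKeepNum 1
# binlog retained for replicas; only relevant in replication setups
slaveBinlogKeepNum 1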

liupeidong0620 commented 1 year ago

Because the binlog is retained, you can try lowering the related parameters: minBinlogKeepSec, maxBinlogKeepNum, slaveBinlogKeepNum.

Is there a priority order among the three? Also, can these three parameters be used for this in standalone mode, master-slave replication mode, and cluster mode alike? Before your reply I had already tried minBinlogKeepSec = 10 and maxBinlogKeepNum = 1, with no effect: after the blob files were generated (three binlog files were produced), I waited far longer than 10 s and saw no blob files being deleted. slaveBinlogKeepNum should not matter here (I am running a single standalone instance, neither cluster mode nor master-slave).

takenliu commented 1 year ago

Currently all three parameters take effect at the same time. Note, however, that cleanup only writes deletion markers for the binlog into the LSM; the binlog data inside the blob files has to wait for RocksDB's GC before it is removed. This still needs to be optimized.
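
To restate the mechanism: deleting a binlog key only writes a tombstone into the LSM, which reclaims the key/index space after compaction but not the value bytes sitting in .blob files; those are only reclaimed when blob GC rewrites old blob files during compaction. In the rocks.* naming this config uses, the knobs involved would look like the sketch below (whether tendisplus actually exposes the second key is an assumption, not confirmed in this thread):

# turn blob GC on so compactions rewrite live blobs out of old blob files
rocks.enable_blob_garbage_collection 1
# RocksDB option controlling how far GC reaches: by default only the oldest
# 25% of blob files are eligible for rewriting during a compaction
# rocks.blob_garbage_collection_age_cutoff 0.25   (hypothetical config key)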

liupeidong0620 commented 1 year ago

Currently all three parameters take effect at the same time. Note, however, that cleanup only writes deletion markers for the binlog into the LSM; the binlog data inside the blob files has to wait for RocksDB's GC before it is removed. This still needs to be optimized.

With GC enabled, the deletion time cannot be controlled (there is no way to know when all the garbage will be collected). In my test, writing 60 GB resulted in 120 GB on disk, and after about an hour the data size dropped to 103 GB. Is there any way to control this GC timing? Looking into RocksDB, I found that the parameter double blob_garbage_collection_age_cutoff = 0.25 controls the GC cutoff point, i.e. what proportion of files is eligible for GC. Is that what causes this behavior?
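
A plausible reading of that parameter (an interpretation, assuming the cutoff applies per compaction to the oldest fraction of blob files): with blob_garbage_collection_age_cutoff = 0.25, each pass only rewrites the oldest 25% of blob files, so roughly three quarters of the binlog garbage survives a pass:

# hypothetical back-of-envelope: 60 GB of values + ~60 GB of binlog in blobs,
# with only the oldest quarter of blob files eligible for GC per pass
echo $(( 60 + 60 * 3 / 4 ))   # prints 105 (GB) -- in the same ballpark as the observed 103 GB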

liupeidong0620 commented 1 year ago

GC test

# enable GC
# cannot be set dynamically here (the official RocksDB API marks this option as dynamically changeable)
#  // Dynamically changeable through the SetOptions() API
#  bool enable_blob_garbage_collection = false;
# thought: during a data migration, would disabling GC for the writes and re-enabling it
# once the migration completes improve performance? Unknown.
rocks.enable_blob_garbage_collection yes

minBinlogKeepSec 100
maxBinlogKeepNum 1

Tests

Test 1:
Wrote 60 GB of data; 105 GB landed on disk (as expected).

Test 2:
Re-wrote the same 60 GB of data from test 1; 182 GB on disk (I have not figured out how this number comes about).

After running the reshape command: 103 GB on disk (as expected).

Thoughts on reshape?

Running this command is quite time-consuming since it performs a compaction, and iostat shows high disk I/O while it runs; whether it affects normal reads and writes still needs testing.
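
For reference, a sketch of how the reshape run and its I/O impact could be observed (assuming reshape is issued over the Redis protocol like other tendisplus admin commands; the exact invocation is not shown in this thread):

# trigger a manual reshape (compaction) and watch disk I/O in another terminal
redis-cli -h 10.189.108.49 -p 51002 reshape
iostat -x 1
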
qingping209 commented 1 year ago

  • blob_garbage_collection_age_cutoff cannot be modified
  • blob_garbage_collection_age_cutoff = 0.25 (going by this parameter: 60 GB + binlog (60 / 4 * 3 = 45 GB) = 105 GB of actual on-disk data)

Thoughts on reshape?

reshape computes the size of each compaction pass.