phyorat opened this issue 1 year ago
Hi @phyorat,
thank you for posting the issue. Do you happen to still have the vdbench config you used for your test?
I came up with an fio config to mimic vdbench's behaviour:
[dc_repro]
filename=/dev/cas1-1
ioengine=libaio
iodepth=1
direct=1
numjobs=1
# Generate new offset for every second write
rw=randwrite:2
rw_sequencer=identical
bssplit=64k/50:256k/50
# This ensures that every 64K write will be followed by 256K write
number_ios=2
loops=10000
verify=md5
# Verify after every write
verify_backlog=1
# Stop FIO if DC (data corruption) is detected
verify_fatal=1
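If it helps with reproduction: saved as a job file, the config above can be run directly with fio. A minimal usage sketch, assuming the job is stored as dc_repro.fio and that /dev/cas1-1 is a scratch Open CAS exported device whose contents may be overwritten:

# Run the reproduction job; verify_fatal=1 makes fio exit on the first observed verify failure.
fio dc_repro.fio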
Description
An IO class can be used to pass big IO (e.g. larger than 128KB) through to the HDD directly, skipping the cache; this gives higher performance and better cache efficiency. IO class configuration file example:
IO class id,IO class name,Eviction priority,Allocation
0,unclassified,22,0
1,request_size:le:131072,1,1
After loading this IO class, 128K IO data is written to the HDD directly; likewise, if the data is not present in the cache, nothing is read from the cache and the data is read from the HDD instead.
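For reference, rules like the above would typically be saved to a CSV file and loaded with casadm. A minimal sketch, assuming cache id 1 and a file named ioclass.csv (please check the syntax against your Open CAS version):

# Load the IO class rules into cache 1.
casadm --io-class --load-config --cache-id 1 --file ioclass.csv
# List the currently active IO classes to confirm the rules took effect.
casadm --io-class --list --cache-id 1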
But on the other hand, if part of the requested data is cached, for example the first 64K of a 128K request, then the read should take 64K from the cache and the remaining 64K from the HDD:

64K cached <> 64K from HDD
|-------------------------------------|--------------------------------------|

This is the expected data "splicing". The actual IO pattern is:
1. A 64K write IO goes into the cache, the first time with key = 1.
2. Read 64K, verified OK.
3. A 64K write IO (with key = 2) is merged with another into one 128KB write, written to the HDD directly.
4. Read 64K, but verification fails (got key = 1, which indicates old data). (We guess the old 64K was wrongly read from the cache.)

Steps 3-4 may also be:
3. A 64K write IO (with key = 2), plus another write IO.
4. Two 64K read IOs are merged into one, reading 128K from the HDD directly (because of IO-class rule [1]).
5. Verification fails (got key = 1, which indicates old data). (We guess the old 128K was wrongly read directly from the HDD.)
We ran data validation on this scenario with vdbench and found that a data validation error occurred.
21:05:24.364 hd2-0: dvpost: /dev/vdb sd4 sd4 0x00000000 0x234520000 131072 0x0 0x5ecf4d1ed319c 0x11 0x2 0x70 0x0 0 36028797018963971
21:05:24.364 hd2-0:
21:05:24.364 hd2-0: Data Validation error for sd=sd4,lun=/dev/vdb
21:05:24.364 hd2-0: Block lba: 0x234520000; sector lba: 0x234520000; xfersize: 131072; relative sector in block: 0x00 ( 0)
21:05:24.364 hd2-0: ===> Data Validation Key miscompare.
21:05:24.364 hd2-0: ===> Data miscompare.
21:05:24.364 hd2-0: The sector below was written Tuesday, November 8, 2022 20:38:41.711 CST
21:05:24.364 hd2-0: 0x000   00000002 34520000 ........ ........   00000002 34520000 0005ecf4 d1ed319c
21:05:24.364 hd2-0: 0x010   02..0000 73643420 20202020 00000000   01700000 20346473 20202020 00000000
21:05:24.364 hd2-0: Key miscompare always implies Data miscompare. Remainder of data suppressed.
This error shows that the tool wrote data with key "02xxxx", but the data read back from the core was "01xxxx". The key point is that right after the error occurred, we read the data from the core device directly and it was correct: "02xxxx". So there appears to be a data alignment/validation issue between the cache and the HDD, within a very small time window (several milliseconds)?
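To illustrate that check, here is a rough sketch of how the same LBA can be compared on the exported device and on the core HDD right after a miscompare. The core device name (/dev/sdb) and the use of the failing offset 0x234520000 are assumptions for illustration; reads through /dev/cas1-1 go through the cache, while reads from the core device bypass it:

# Byte offset of the failing block reported by vdbench (128K-aligned).
LBA=$((0x234520000))
# Read 128K at that offset through the CAS exported device (goes through the cache).
dd if=/dev/cas1-1 of=/tmp/cas.bin bs=128K count=1 iflag=direct,skip_bytes skip=$LBA
# Read the same 128K directly from the core HDD (assumed /dev/sdb here), bypassing the cache.
dd if=/dev/sdb of=/tmp/core.bin bs=128K count=1 iflag=direct,skip_bytes skip=$LBA
# A mismatch here would point to stale data being served on the cached path.
cmp /tmp/cas.bin /tmp/core.bin && echo "cache and core agree" || echo "miscompare"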
After removing this IO class and testing again, no data validation error occurred any more.
In addition, the configuration <Sequential cutoff policy: always; --threshold 128KB> can also trigger the data validation error.
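For completeness, that sequential cutoff setting can be applied per core roughly as follows. This is a sketch based on the casadm --set-param syntax in the Open CAS documentation (threshold in KiB); please verify the exact option names against your casadm version:

# Set the sequential cutoff policy to 'always' with a 128 KiB threshold on cache 1 / core 1.
casadm --set-param --name seq-cutoff --cache-id 1 --core-id 1 --policy always --threshold 128
# Show the currently applied sequential cutoff parameters.
casadm --get-param --name seq-cutoff --cache-id 1 --core-id 1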
Expected Behavior
No data alignment/validation issue between the cache and the HDD when an IO class is set so that big IO skips the cache.
Actual Behavior
There is a data alignment/validation issue between the cache and the HDD when the data of a read IO is partially cached.
Steps to Reproduce
Context
Base block storage for a distributed block system; we need to guarantee that data validation is OK.
Possible Fix
Maybe metadata is not strictly synchronized, or has expired, between different IO stages (within milliseconds).
Logs
No direct evidence so far, but reverse verification (removing that IO class) can be a clue.
Your Environment