Closed by nixpanic 4 years ago
I have been experimenting with rbd create --image-shared and rbd config set ... rbd_cache false, but this does not seem to be sufficient. Maybe there are other options needed for rbd map ... or elsewhere.
An RBD image should be configured with very strict consistency when it is used with RWX access. At the moment I do not yet know how to do that.
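For reference, the commands I experimented with look roughly like this (pool and image names here are placeholders, not the ones from the actual tests):

```shell
# Placeholder pool/image names. --image-shared creates the image without
# the exclusive-lock feature; rbd_cache false disables the librbd cache
# (note: krbd does not use librbd, so this option may have no effect there).
rbd create --size 1G --image-shared replicapool/shared-test
rbd config image set replicapool/shared-test rbd_cache false
rbd map replicapool/shared-test
```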
cc @dillaman
Summary
I suspect that there is some client-side caching that does not get invalidated (in time). Disabling rbd_cache on the image does not make a difference. The only consistent success (with no failures) is when the 2nd node (the reader node in the tests) uses dd iflag=direct .. to bypass any caching.
While comparing checksums, I observed that repeated runs of sha512sum /dev/xvda on the reader node return the contents of an older, previous write. The same checksum is repeated for the particular RBD image used in a series of tests, and that checksum was never seen on the writer side. This may suggest that the kernel rbd module does not invalidate the reader's cache when new data is written, unless O_DIRECT is used for reading.
Conclusion
When using O_DIRECT for both writing and reading, the data is kept in sync correctly. An application using a raw block volume in a multi-node configuration should take care to use direct-io for all reads and writes.
It is not yet clear whether applications expect this behaviour. It would probably be preferable for the Ceph-CSI provisioner and/or attacher to force direct-io in the multi-node raw block volume case. (Can krbd be configured to do this, @dillaman?)
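As a sketch, the read/write pattern that kept the checksums consistent in the tests looks like this. A plain file stands in for the device here so the commands are runnable stand-alone; in the tests DEV was the mapped raw block device /dev/xvda:

```shell
# DEV stands in for the raw block device (/dev/xvda in the tests);
# a regular file is used here only to make the example self-contained.
DEV=./fake-blockdev.img

# writer node: write with O_DIRECT and flush data before dd exits
dd if=/dev/urandom of="$DEV" bs=512 count=2048 oflag=direct conv=fdatasync 2>/dev/null

# reader node: read with O_DIRECT so stale pages in the page cache
# are bypassed, then checksum the stream
dd if="$DEV" bs=512 iflag=direct 2>/dev/null | sha512sum
```

(On a single node the direct and cached reads trivially agree; the difference only shows up when a second node maps the same image.)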
Test Details
Deployment to set up (oc apply -f ..) and tear down (oc delete -f ..) the pods, and the script to run:
The changes on the RBD image were made by removing the PVC and re-creating it with a differently patched ceph-csi provisioner.
Legend: "-" = checksums differed (inconsistent data), "+" = checksums matched; "sha" = checksum via plain sha512sum on the device, "dd+sha" = read via dd with the direct flag, piped into sha512sum.
.--------.----------------------------.----------------------------------------.-------------.
| result | RBD config | writer node | reader node |
| |----------------.-----------+--------------.----------------.--------+-------------|
| | --image-shared | rbd_cache | oflag=direct | conv=fdatasync | dd+sha | dd+sha |
|--------+----------------+-----------+--------------+----------------+--------+-------------|
| - | false | true | no | no | sha | sha |
| + | false | true | no | no | sha | dd+sha |
| | | | | | | |
| - | false | true | no | no | dd+sha | sha |
| + | false | true | no | no | dd+sha | dd+sha |
| | | | | | | |
| - | false | true | no | yes | sha | sha |
| + | false | true | no | yes | sha | dd+sha |
| | | | | | | |
| - | false | true | no | yes | dd+sha | sha |
| + | false | true | no | yes | dd+sha | dd+sha |
| | | | | | | |
| - | false | true | yes | no | sha | sha |
| + | false | true | yes | no | sha | dd+sha |
| | | | | | | |
| - | false | true | yes | no | dd+sha | sha |
| + | false | true | yes | no | dd+sha | dd+sha |
| | | | | | | |
| - | false | true | yes | yes | sha | sha |
| + | false | true | yes | yes | sha | dd+sha |
| | | | | | | |
| - | false | true | yes | yes | dd+sha | sha |
| + | false | true | yes | yes | dd+sha | dd+sha |
|--------+----------------+-----------+--------------+----------------+--------+-------------|
| - | true | true | no | no | sha | sha |
| + | true | true | no | no | sha | dd+sha |
| | | | | | | |
| - | true | true | no | no | dd+sha | sha |
| + | true | true | no | no | dd+sha | dd+sha |
| | | | | | | |
| - | true | true | no | yes | sha | sha |
| + | true | true | no | yes | sha | dd+sha |
| | | | | | | |
| - | true | true | no | yes | dd+sha | sha |
| + | true | true | no | yes | dd+sha | dd+sha |
| | | | | | | |
| - | true | true | yes | no | sha | sha |
| + | true | true | yes | no | sha | dd+sha |
| | | | | | | |
| - | true | true | yes | no | dd+sha | sha |
| + | true | true | yes | no | dd+sha | dd+sha |
| | | | | | | |
| - | true | true | yes | yes | sha | sha |
| + | true | true | yes | yes | sha | dd+sha |
| | | | | | | |
| - | true | true | yes | yes | dd+sha | sha |
| + | true | true | yes | yes | dd+sha | dd+sha |
|--------+----------------+-----------+--------------+----------------+--------+-------------|
| - | true | false | no | no | sha | sha |
| + | true | false | no | no | sha | dd+sha |
| | | | | | | |
| - | true | false | no | no | dd+sha | sha |
| + | true | false | no | no | dd+sha | dd+sha |
| | | | | | | |
| - | true | false | no | yes | sha | sha |
| + | true | false | no | yes | sha | dd+sha |
| | | | | | | |
| - | true | false | no | yes | dd+sha | sha |
| + | true | false | no | yes | dd+sha | dd+sha |
| | | | | | | |
| - | true | false | yes | no | sha | sha |
| + | true | false | yes | no | sha | dd+sha |
| | | | | | | |
| - | true | false | yes | no | dd+sha | sha |
| + | true | false | yes | no | dd+sha | dd+sha |
| | | | | | | |
| - | true | false | yes | yes | sha | sha |
| + | true | false | yes | yes | sha | dd+sha |
| | | | | | | |
| - | true | false | yes | yes | dd+sha | sha |
| + | true | false | yes | yes | dd+sha | dd+sha |
|--------+----------------+-----------+--------------+----------------+--------+-------------|
| - | false | false | no | no | sha | sha |
| + | false | false | no | no | sha | dd+sha |
| | | | | | | |
| - | false | false | no | no | dd+sha | sha |
| + | false | false | no | no | dd+sha | dd+sha |
| | | | | | | |
| - | false | false | no | yes | sha | sha |
| + | false | false | no | yes | sha | dd+sha |
| | | | | | | |
| - | false | false | no | yes | dd+sha | sha |
| + | false | false | no | yes | dd+sha | dd+sha |
| | | | | | | |
| - | false | false | yes | no | sha | sha |
| + | false | false | yes | no | sha | dd+sha |
| | | | | | | |
| - | false | false | yes | no | dd+sha | sha |
| + | false | false | yes | no | dd+sha | dd+sha |
| | | | | | | |
| - | false | false | yes | yes | sha | sha |
| + | false | false | yes | yes | sha | dd+sha |
| | | | | | | |
| - | false | false | yes | yes | dd+sha | sha |
| + | false | false | yes | yes | dd+sha | dd+sha |
'--------'----------------'-----------'--------------'----------------'--------'-------------'
@nixpanic IMO, data consistency here is the application's responsibility, and most applications that need that level of consistency use O_DIRECT or otherwise avoid caches in between. RBD's responsibility should be to provide the block device and leave the rest to the application. Also, this is not something the CSI layer has to worry about: it is no different from using RBD in a similar fashion on a bare-metal or multi-node setup. That's my view here.
@humblec is correct.
Kernel caching for the block device is normal and expected. Applications are responsible for synchronization via O_DIRECT and/or sync calls. Applications sharing block devices have well-defined synchronization points when they flush data, and they should be left to handle that on their own, given the performance overhead of flushing. Additionally, yes, readers must take care to avoid the kernel block cache as well.
It would, however, be a bug if the driver wasn't obeying the sync or direct I/O calls, but that doesn't seem to be the case here.
Thanks! I was just surprised to see that there is a read-ahead cache that does not get disabled by --image-shared or rbd_cache. If there is nothing the CSI driver can configure to optimize access to shared block devices (or single-access block devices, for that matter), then this issue can be closed.
The vast majority of RBD configuration options are not applicable to krbd. In that respect the block device acts like every other block device under Linux (an rbd-nbd block device would have the same "issue"). The --image-shared option only disables the exclusive-lock feature (which ceph-csi never enables anyway).
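For what it's worth, this can be verified on an image (pool/image names hypothetical): an image created with --image-shared should simply not list exclusive-lock among its features.

```shell
# Hypothetical pool/image; inspect the feature list of a shared image
rbd create --size 1G --image-shared replicapool/shared-img
rbd info replicapool/shared-img | grep features
# the features line should not contain exclusive-lock
```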
@nixpanic @humblec can you close this as not a bug?
Describe the bug
When accessing a raw-block RBD volume on two nodes, the data is not consistent. There might be some caching playing a role here.
Environment details
Steps to reproduce
Steps to reproduce the behavior:
1. On one node: dd if=/dev/urandom of=/dev/xvda bs=512 ; sha512sum /dev/xvda
2. On the other node, run sha512sum /dev/xvda when dd has finished
Actual results
The checksums of the data on the device are different.
Expected behavior
The checksums of the data on the device should match.
Additional context
Data consistency is important for live-migrating virtual machines with KubeVirt. This problem does not occur when the two containers run on the same host.