ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

Data inconsistency danger when multiple nodes access a krbd raw-block volume without direct I/O #461

Closed nixpanic closed 4 years ago

nixpanic commented 5 years ago

Describe the bug

When accessing a raw-block RBD volume on two nodes, the data is not consistent. There might be some caching playing a role here.

Environment details

Steps to reproduce

Steps to reproduce the behavior:

  1. create a block-PVC with RWX properties
  2. start two containers on different nodes with the device attached to /dev/xvda
  3. in one container run dd if=/dev/urandom of=/dev/xvda bs=512 ; sha512sum /dev/xvda
  4. in the other container run sha512sum /dev/xvda when dd has finished

Actual results

The checksums of the data on the device are different.

Expected behavior

The checksums of the data on the device should match.

Additional context

Data consistency is important for live-migrating virtual-machines with KubeVirt. This problem does not occur when the two containers run on the same host.

nixpanic commented 5 years ago

I have been experimenting with rbd create --image-shared and rbd config set ... rbd_cache false, but this does not seem to be sufficient. Maybe there are other options needed for rbd map ... or something.

An RBD image should be configured with very strict consistency when it is used with RWX permissions. At the moment I do not know yet how to do that.

Madhu-1 commented 5 years ago

cc @dillaman

nixpanic commented 5 years ago

Summary

I suspect that there is some client-side caching that does not get invalidated (in time). Disabling rbd_cache on the image does not make a difference. The only success (and no failures) is when the 2nd node (reader node in the tests) uses dd iflag=direct .. to bypass any caching.

While comparing checksums, I observed that repeated runs of sha512sum /dev/xvda on the reader node return the contents of an older, earlier write. The same checksum is repeated for the particular RBD image used for a series of tests, and this checksum was never seen on the writer side. This may suggest that the kernel rbd module does not invalidate the read cache when O_DIRECT is used for reading.

Conclusion

When using O_DIRECT for writing and reading, the data is kept in sync correctly. The application using a raw block volume in multi-node configuration should take care of using direct-io for all reading and writing.
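
As an illustration of what "direct-io for all reading" can look like inside an application, here is a minimal Python sketch. It assumes Linux (for `os.O_DIRECT`) and a device path such as /dev/xvda from the reproduction steps. The function name `sha512_device` and the 4 MiB block size are illustrative choices, not anything provided by ceph-csi or krbd.

```python
import hashlib
import mmap
import os


def sha512_device(path, direct=True, block_size=4 * 1024 * 1024):
    """Return the SHA-512 hex digest of everything readable from path.

    With direct=True the file is opened with O_DIRECT (Linux only), so reads
    bypass the local page cache, similar to `dd iflag=direct`.
    """
    flags = os.O_RDONLY
    if direct:
        flags |= os.O_DIRECT
    # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, block_size)
    try:
        fd = os.open(path, flags)
        try:
            digest = hashlib.sha512()
            while True:
                n = os.readv(fd, [buf])  # read straight into the aligned buffer
                if n == 0:
                    break  # end of device or file
                digest.update(buf[:n])
            return digest.hexdigest()
        finally:
            os.close(fd)
    finally:
        buf.close()
```

On the reader node, comparing `sha512_device("/dev/xvda", direct=False)` with `sha512_device("/dev/xvda", direct=True)` while the writer is idle is one way to check for a stale cache: a difference would suggest that the buffered read was served from locally cached pages.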

It is not yet clear if applications expect this behaviour. It is probably preferable for the Ceph-CSI provisioner and/or attacher to force direct-io in the multi-node raw block volume case. (Can krbd be configured to do this, @dillaman?)

Test Details

Deployment to set up (oc apply -f ..) and tear down (oc delete -f ..) the pods, and the script to run:

The changes on the RBD image were done by removing the PVC, and re-creating it with a differently patched ceph-csi provisioner.

(raw test results)

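Legend: going by the summary above, + means the checksums on both nodes matched and - means they differed; sha is a plain sha512sum /dev/xvda (buffered read), dd+sha is dd with iflag=direct piped into sha512sum.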

    .--------.----------------------------.----------------------------------------.-------------.
    | result | RBD config                 | writer node                            | reader node |
    |        |----------------.-----------+--------------.----------------.--------+-------------|
    |        | --image-shared | rbd_cache | oflag=direct | conv=fdatasync | dd+sha | dd+sha      |
    |--------+----------------+-----------+--------------+----------------+--------+-------------|
    | -      | false          | true      | no           | no             | sha    | sha         |
    | +      | false          | true      | no           | no             | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | true      | no           | no             | dd+sha | sha         |
    | +      | false          | true      | no           | no             | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | true      | no           | yes            | sha    | sha         |
    | +      | false          | true      | no           | yes            | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | true      | no           | yes            | dd+sha | sha         |
    | +      | false          | true      | no           | yes            | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | true      | yes          | no             | sha    | sha         |
    | +      | false          | true      | yes          | no             | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | true      | yes          | no             | dd+sha | sha         |
    | +      | false          | true      | yes          | no             | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | true      | yes          | yes            | sha    | sha         |
    | +      | false          | true      | yes          | yes            | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | true      | yes          | yes            | dd+sha | sha         |
    | +      | false          | true      | yes          | yes            | dd+sha | dd+sha      |
    |--------+----------------+-----------+--------------+----------------+--------+-------------|
    | -      | true           | true      | no           | no             | sha    | sha         |
    | +      | true           | true      | no           | no             | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | true      | no           | no             | dd+sha | sha         |
    | +      | true           | true      | no           | no             | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | true      | no           | yes            | sha    | sha         |
    | +      | true           | true      | no           | yes            | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | true      | no           | yes            | dd+sha | sha         |
    | +      | true           | true      | no           | yes            | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | true      | yes          | no             | sha    | sha         |
    | +      | true           | true      | yes          | no             | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | true      | yes          | no             | dd+sha | sha         |
    | +      | true           | true      | yes          | no             | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | true      | yes          | yes            | sha    | sha         |
    | +      | true           | true      | yes          | yes            | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | true      | yes          | yes            | dd+sha | sha         |
    | +      | true           | true      | yes          | yes            | dd+sha | dd+sha      |
    |--------+----------------+-----------+--------------+----------------+--------+-------------|
    | -      | true           | false     | no           | no             | sha    | sha         |
    | +      | true           | false     | no           | no             | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | false     | no           | no             | dd+sha | sha         |
    | +      | true           | false     | no           | no             | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | false     | no           | yes            | sha    | sha         |
    | +      | true           | false     | no           | yes            | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | false     | no           | yes            | dd+sha | sha         |
    | +      | true           | false     | no           | yes            | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | false     | yes          | no             | sha    | sha         |
    | +      | true           | false     | yes          | no             | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | false     | yes          | no             | dd+sha | sha         |
    | +      | true           | false     | yes          | no             | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | false     | yes          | yes            | sha    | sha         |
    | +      | true           | false     | yes          | yes            | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | true           | false     | yes          | yes            | dd+sha | sha         |
    | +      | true           | false     | yes          | yes            | dd+sha | dd+sha      |
    |--------+----------------+-----------+--------------+----------------+--------+-------------|
    | -      | false          | false     | no           | no             | sha    | sha         |
    | +      | false          | false     | no           | no             | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | false     | no           | no             | dd+sha | sha         |
    | +      | false          | false     | no           | no             | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | false     | no           | yes            | sha    | sha         |
    | +      | false          | false     | no           | yes            | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | false     | no           | yes            | dd+sha | sha         |
    | +      | false          | false     | no           | yes            | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | false     | yes          | no             | sha    | sha         |
    | +      | false          | false     | yes          | no             | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | false     | yes          | no             | dd+sha | sha         |
    | +      | false          | false     | yes          | no             | dd+sha | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | false     | yes          | yes            | sha    | sha         |
    | +      | false          | false     | yes          | yes            | sha    | dd+sha      |
    |        |                |           |              |                |        |             |
    | -      | false          | false     | yes          | yes            | dd+sha | sha         |
    | +      | false          | false     | yes          | yes            | dd+sha | dd+sha      |
    '--------'----------------'-----------'--------------'----------------'--------'-------------'

humblec commented 5 years ago

@nixpanic IMO, data consistency here is the application's responsibility, and most applications that need that level of consistency use O_DIRECT or otherwise avoid any cache in between. RBD's responsibility should be to provide the block device and leave the rest to the application. Also, this is not something the CSI layer has to worry about, because it is no different from using RBD in the same way on a bare-metal or multi-node setup. That's my view here.

JohnStrunk commented 5 years ago

@humblec is correct.

Kernel caching for the block device is normal and expected. Applications are responsible for synchronization via O_DIRECT and/or sync calls. Applications sharing block devices have well-defined synchronization points when they flush data, and they should be left to handle that on their own due to the performance overhead of the flush. Additionally, yes, readers must be careful to avoid the kernel block cache as well.

It would, however, be a bug if the driver wasn't obeying the sync or direct I/O calls, but that doesn't seem to be the case here.
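
A writer-side synchronization point of the kind described above could be sketched in Python as follows. This is only an illustration under stated assumptions: `write_and_flush` is a hypothetical helper name, and the path would be the shared block device (for example /dev/xvda). It covers the writer only; a reader on another node still has to bypass or invalidate its own cache, for instance by reading with O_DIRECT.

```python
import os


def write_and_flush(path, offset, data):
    """Write data at offset, then block until the kernel has flushed it."""
    fd = os.open(path, os.O_WRONLY)
    try:
        written = 0
        while written < len(data):
            # pwrite may write fewer bytes than requested, so loop until done.
            written += os.pwrite(fd, data[written:], offset + written)
        # Synchronization point: fsync returns once the dirty pages for this
        # descriptor have been written out to the underlying device.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Keeping the flush in the application lets it decide when to pay the cost, which matches the point about performance overhead made above.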

nixpanic commented 5 years ago

Thanks! I was just surprised to see that there is a read-ahead cache that does not get disabled by --image-shared or rbd_cache. If there is nothing the CSI driver can configure to optimize access to shared block devices (or single access block devices for that matter), then this issue can be closed.

dillaman commented 5 years ago

The vast majority of RBD configuration options are not applicable to krbd. In that respect, the block device is therefore acting like every other block device under Linux (the rbd-nbd block device would have the same "issue"). The --image-shared option only disables the exclusive-lock feature (which ceph-csi never enables anyway).

Madhu-1 commented 5 years ago

@nixpanic @humblec can you close this as not a bug?