RamenDR / ocm-ramen-samples

OCM Stateful application samples, including Ramen resources
Apache License 2.0

Possible small data loss during failover of DR app #19

Closed mbukatov closed 1 year ago

mbukatov commented 2 years ago

There seems to be a problem with the demo DR app.

Steps to Reproduce

  1. Deploy and configure Metro DR clusters (OCP for ACM Hub, 2 OCF/ODF and a shared external Ceph).
  2. Deploy busybox-sample[1] as a DR app.
  3. Enable fencing of the active cluster via the DRCluster CR.
  4. Fail the app over to the second cluster via the DRPlacementControl CR.

[1] https://github.com/RamenDR/ocm-ramen-samples/tree/main/busybox-odr-metro

Actual results

After I enabled fencing, I checked what the app on the primary cluster was doing. It is still running, but the volume has moved to a read-only state and, as expected, the app is no longer writing data:

$ oc rsh -n busybox-sample busybox /bin/busybox sh
/ # ls -l /mnt/test/
total 492
drwx------    2 root     root         16384 Aug  4 12:03 lost+found
-rw-r--r--    1 root     root        484932 Aug  4 16:58 outfile
/ # date
Thu Aug  4 17:03:30 UTC 2022
/ # tail /mnt/test/outfile 
Thu Aug 4 16:57:57 UTC 2022
Thu Aug 4 16:57:58 UTC 2022
Thu Aug 4 16:57:59 UTC 2022
Thu Aug 4 16:58:00 UTC 2022
Thu Aug 4 16:58:01 UTC 2022
Thu Aug 4 16:58:02 UTC 2022
Thu Aug 4 16:58:03 UTC 2022
Thu Aug 4 16:58:04 UTC 2022
Thu Aug 4 16:58:05 UTC 2022
Thu Aug 4 16:58:06 UTC 2022
/ # touch /mnt/test/qe
touch: /mnt/test/qe: Read-only file system

After the failover, I checked the app pod again, this time on the second cluster, and the app is running fine again:

$ oc rsh -n busybox-sample busybox /bin/busybox sh 
/ # tail /mnt/test/outfile 
Thu Aug 4 17:16:20 UTC 2022
Thu Aug 4 17:16:21 UTC 2022
Thu Aug 4 17:16:22 UTC 2022
Thu Aug 4 17:16:23 UTC 2022
Thu Aug 4 17:16:24 UTC 2022
Thu Aug 4 17:16:25 UTC 2022
Thu Aug 4 17:16:26 UTC 2022
Thu Aug 4 17:16:27 UTC 2022
Thu Aug 4 17:16:28 UTC 2022
Thu Aug 4 17:16:29 UTC 2022

But there is a gap in the data file, and the last line written before the failover is missing (there is no 'Thu Aug 4 16:58:06 UTC 2022', which was the last line in the file before failover):

Thu Aug 4 16:58:02 UTC 2022
Thu Aug 4 16:58:03 UTC 2022
Thu Aug 4 16:58:04 UTC 2022
Thu Aug 4 16:58:05 UTC 2022
Thu Aug 4 17:13:32 UTC 2022
Thu Aug 4 17:13:33 UTC 2022
Thu Aug 4 17:13:34 UTC 2022
Thu Aug 4 17:13:35 UTC 2022
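
The same check can be scripted rather than done by eye. The following is a minimal sketch, assuming the file contains one `date`-style line per second as in the output above. The helper names (find_gaps, missing_lines) are made up for illustration and are not part of the sample app.

```python
from datetime import datetime, timedelta

# Timestamp layout seen in /mnt/test/outfile, e.g. "Thu Aug 4 16:58:05 UTC 2022".
FMT = "%a %b %d %H:%M:%S UTC %Y"


def parse(line):
    # Collapse runs of spaces so "Aug  4" and "Aug 4" parse the same way.
    return datetime.strptime(" ".join(line.split()), FMT)


def find_gaps(lines, max_step=timedelta(seconds=1)):
    """Return (previous, current) pairs that are more than max_step apart."""
    stamps = [parse(line) for line in lines if line.strip()]
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > max_step]


def missing_lines(before, after):
    """Return lines seen before failover that are absent after failover."""
    seen = {" ".join(line.split()) for line in after}
    return [line for line in before if " ".join(line.split()) not in seen]


# Lines copied from the output above.
before_failover = [
    "Thu Aug 4 16:58:05 UTC 2022",
    "Thu Aug 4 16:58:06 UTC 2022",
]
after_failover = [
    "Thu Aug 4 16:58:04 UTC 2022",
    "Thu Aug 4 16:58:05 UTC 2022",
    "Thu Aug 4 17:13:32 UTC 2022",
    "Thu Aug 4 17:13:33 UTC 2022",
]

for a, b in find_gaps(after_failover):
    print(f"gap: {a:%H:%M:%S} -> {b:%H:%M:%S} ({b - a})")
print("missing:", missing_lines(before_failover, after_failover))
```

A gap on its own is expected here, because the app cannot write while the volume is read-only. The missing_lines check is the one that shows a line recorded before failover is no longer present.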

Expected results

All data written by the app when running on the primary cluster is available on the secondary location.

nirs commented 1 year ago

Hey @mbukatov, this is expected. Failing over will drop data written since the last replication. If you want to move the application to another cluster without losing any data, you need to use the "Relocate" action instead of the "Failover" action.
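
As an illustration of the difference between the two actions, here is a rough sketch that builds the JSON merge patch one might apply to the DRPlacementControl CR. It is not taken from this thread. It assumes the Ramen DRPlacementControl spec fields `action`, `failoverCluster` and `preferredCluster`, and the cluster names are hypothetical placeholders. Check the field names against the CRD installed on your hub before using it.

```python
import json


def drpc_patch(action, cluster):
    """Build a JSON merge patch for a DRPlacementControl spec.

    Field names are an assumption based on the Ramen DRPlacementControl
    API; the cluster name is whatever your managed cluster is called.
    """
    if action == "Relocate":
        # Planned move: the target is given as the preferred cluster.
        spec = {"action": "Relocate", "preferredCluster": cluster}
    elif action == "Failover":
        # Unplanned move: the target is given as the failover cluster.
        spec = {"action": "Failover", "failoverCluster": cluster}
    else:
        raise ValueError(f"unsupported action: {action!r}")
    return json.dumps({"spec": spec})


# "cluster2" is a hypothetical managed cluster name.
print(drpc_patch("Relocate", "cluster2"))
```

The printed string could then be passed to something like `oc patch` with `--type merge` against the DRPlacementControl resource on the hub; the resource name and namespace depend on how the sample was deployed.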

nirs commented 1 year ago

Closing since the behavior is expected. Feel free to reopen if needed.