Lost file when the master host is down.

pengweichu commented 2 years ago

Hi, we use DRBD 9.14 to synchronize data (a partition) across our HA cluster (host1, host2, host3). When host1 goes down, the application starts running on host2. In our tests, a few files on the synchronized partition are sometimes reset to empty files (size 0).

Is this a DRBD issue, or have we missed some configuration?

Our configuration files are below:

# DRBD is the result of over a decade of development by LINBIT.
# In case you need professional services for DRBD or have
# feature requests visit http://www.linbit.com

global {
    usage-count yes;

    # Decide what kind of udev symlinks you want for "implicit" volumes
    # (those without explicit volume <vnr> {} block, implied vnr=0):
    # /dev/drbd/by-resource/<resource>/<vnr>   (explicit volumes)
    # /dev/drbd/by-resource/<resource>         (default for implicit)
    udev-always-use-vnr; # treat implicit the same as explicit volumes

    # minor-count dialog-refresh disable-ip-verification
    # cmd-timeout-short 5; cmd-timeout-medium 121; cmd-timeout-long 600;
}

common {
    handlers {
        # These are EXAMPLE handlers only.
        # They may have severe implications,
        # like hard resetting the node under certain circumstances.
        # Be careful when choosing your poison.

        # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
        # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
        # quorum-lost "/usr/lib/drbd/notify-quorum-lost.sh root";
        # quorum-lost "echo b > /proc/sysrq-trigger ; reboot -f";
    }

    startup {
        # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
    }

    options {
        # cpu-mask on-no-data-accessible

        # RECOMMENDED for three or more storage nodes with DRBD 9:
        # quorum majority;
        # on-no-quorum suspend-io | io-error;
        quorum majority;
        on-no-quorum io-error;
    }

    disk {
        # size on-io-error fencing disk-barrier disk-flushes
        # disk-drain md-flushes resync-rate resync-after al-extents
        # c-plan-ahead c-delay-target c-fill-target c-max-rate
        # c-min-rate disk-timeout
    }

    net {
        # protocol timeout max-epoch-size max-buffers
        # connect-int ping-int sndbuf-size rcvbuf-size ko-count
        # allow-two-primaries cram-hmac-alg shared-secret after-sb-0pri
        # after-sb-1pri after-sb-2pri always-asbp rr-conflict
        # ping-timeout data-integrity-alg tcp-cork on-congestion
        # congestion-fill congestion-extents csums-alg verify-alg
        # use-rle
        protocol C;
    }
}


resource pbxdata {

meta-disk internal;
device /dev/drbd1;
disk /dev/pbxvg/pbxlv;

syncer {
  verify-alg sha1;
}

net {
  after-sb-0pri discard-least-changes;
  after-sb-1pri discard-secondary;
  after-sb-2pri disconnect;
}

on pbx01 {
  address 192.168.1.91:7789;
  node-id 0;
}

on pbx02 {
  address 192.168.1.92:7789;
  node-id 1;
}

on pbx03 {
  address 192.168.1.93:7789;
  node-id 2;
}

connection-mesh {
  #node 1,2,3 name
  hosts pbx01 pbx02 pbx03;
  net {
      use-rle no;
  }
}

}

Thanks

johannesthoma commented 2 years ago

My first guess is that some caches above DRBD (for example the page cache) had not been flushed when host1 went down. Did you try running sync, or writing with something like dd's oflag=direct?
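To illustrate the point about unflushed caches: DRBD replicates only the writes that actually reach the block device, so data still sitting in the primary's page cache at the moment of failure never arrives on the peers. A file that was created or truncated but whose contents were not yet flushed can then show up with size 0 after failover. Below is a minimal sketch of a durable write in Python, assuming a Linux host and an application that can be changed to flush explicitly. The helper name `durable_write` is hypothetical and is not part of DRBD or of the reporter's application.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and flush it to the block device before returning.

    Hypothetical helper for illustration; assumes a Linux filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final
    # rename replaces the target in a single step.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # Python's userspace buffer -> kernel page cache
            os.fsync(f.fileno())   # kernel page cache -> block device (here, DRBD)
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With protocol C, as in the configuration above, a write is reported complete only after the connected peers have confirmed it, so once `os.fsync` returns the data should be on those peers as well. The write-then-rename pattern is a design choice on top of that: a reader after failover should see either the old file or the complete new one, not a truncated file.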

JoelColledge commented 2 years ago

Since there has been no response to @johannesthoma's perceptive comment, I'm assuming that this was the problem. Closing.

LINBIT / drbd

Lost file when the master host is down. #23