LINBIT / drbd-utils

DRBD userspace utilities (for 9.x, 8.4, 8.3)
GNU General Public License v2.0

DRBD disk fails to mount in case of lost connection #23

Closed: dberardo-com closed this issue 2 years ago

dberardo-com commented 2 years ago

Dear DRBD community,

I have been struggling for the past few days to debug a simple scenario: two DRBD nodes connected in a simple master-slave (single-primary) setup using protocol C.

The two nodes use Heartbeat together with DRBD.

Failover, disk mounting, and replication all work fine in every "controlled failover" scenario, i.e. when Heartbeat is stopped on one of the two machines or when a node is rebooted / killed.

The problem arises when one of the machines suddenly loses its network connection (Ethernet cable unplugged). In that scenario Heartbeat starts correctly on the other node, which is promoted to Primary/Unknown, but when Heartbeat then tries to mount the DRBD disk with mount -t jfs /dev/drbd0 /my_folder, it fails with this strange error:


ResourceManager(default)[22908]:        2022/05/20_17:16:01 debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd0 /my_folder jfs noatime start
2022/05/20_17:16:01 INFO: Running start for /dev/drbd0 on /my_folder
Filesystem(Filesystem_/dev/drbd0)[23231]:       2022/05/20_17:16:01 INFO: Running start for /dev/drbd0 on /my_folder
mount: wrong fs type, bad option, bad superblock on /dev/drbd0,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
2022/05/20_17:16:01 ERROR: Couldn't mount filesystem /dev/drbd0 on /my_folder
Filesystem(Filesystem_/dev/drbd0)[23231]:       2022/05/20_17:16:01 ERROR: Couldn't mount filesystem /dev/drbd0 on /my_folder

Again, the disks/filesystems are correctly partitioned and formatted, and the DRBD + HA setup works fine in all other cases.

Is this a known limitation of DRBD, or is there any other configuration I could test?

P.S. I am using drbd8-utils on a Debian 8, 32-bit machine. Both nodes are identical (same image + hardware).
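A quick way to rule out a promotion or disk-state problem before the mount is attempted is to query DRBD directly; a rough sketch, assuming the resource is named r0:

# on the surviving node, just before Heartbeat runs the Filesystem script
drbdadm role r0       # expected: Primary/Unknown while the peer is unreachable
drbdadm dstate r0     # expected: UpToDate/DUnknown
dmesg | tail -n 20    # jfs usually logs the actual reason the mount was refused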

dberardo-com commented 2 years ago

Additional information:

  1. If I manually mount the filesystem read-only on the node that has not failed (the new, real Primary), the mount succeeds. This demonstrates that the error log is wrong, since the formatted filesystem is the correct one. If I then remount the disk with -o remount,rw, the mount succeeds as well, but afterwards I see very strange kernel logs (hexadecimal strings) and the machine becomes practically unresponsive ...
  2. During this "unresolved failover" (i.e. while the second node still has network problems) I am not able to restart the drbd service on the new Primary; it fails with a 005/NOTINSTALLED error.
  3. When the old node comes back on the network, Heartbeat notices it and starts the failover again; everything gets back to normal, but the cluster stays in the Primary/Unknown and Secondary/Unknown states forever, until drbdadm connect all is run 3 times, in this order: on the Primary, then on the Secondary, then on the Primary again (see the sketch below).
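A sketch of the reconnect sequence from point 3, assuming the resource is named r0 (illustrative, not a recommendation):

# on the (new) Primary
drbdadm connect all
# on the Secondary
drbdadm connect all
# on the Primary again
drbdadm connect all
# verify on either node
cat /proc/drbd    # ideally shows cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate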
kermat commented 2 years ago

Does DRBD look healthy after you've reconnected it? I'm wondering if you're in an UpToDate/UpToDate state, or some degraded state. Can you share what your full output from cat /proc/drbd looks like when things are healthy? Then also what cat /proc/drbd looks like when there is a network split?
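For reference, on a healthy, connected single-primary 8.4 resource the output typically looks roughly like this (illustrative values):

version: 8.4.x (api:1/proto:86-101)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:1048576 nr:0 dw:1048576 dr:2048 al:12 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

During a network split, cs: drops to WFConnection or StandAlone and ds: becomes UpToDate/DUnknown on the surviving side.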

dberardo-com commented 2 years ago

Since one of the nodes got disconnected from the network, I have a Primary/Unknown situation. And for some reason, after the node comes back online, I cannot manage to restore the DRBD connection using drbdadm connect all on either one of the nodes.

How can I force a connection restore without rebooting either one of the nodes?

UPDATE: I am basically stuck in this situation; the only solution I have found to restore the connection is to reboot the primary. If I reboot just the secondary, nothing happens:

secondary_node ~# drbdadm status
<drbd-status version="8.9.2rc1" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="r0" cs="WFConnection" ro1="Secondary" ro2="Unknown" ds1="UpToDate" ds2="DUnknown" />
</resources>
</drbd-status>
primary_node ~# drbdadm status
<drbd-status version="8.9.2rc1" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="r0" cs="StandAlone" ro1="Primary" ro2="Unknown" ds1="UpToDate" ds2="DUnknown" />
</resources>
</drbd-status>
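When one side sits in StandAlone and the other in WFConnection, the kernel log on the StandAlone node usually says why the connection was dropped; a sketch of what to check, assuming resource r0:

# on the StandAlone node (here the primary)
dmesg | grep -i drbd | tail -n 20   # look for a "Split-Brain detected" style message
drbdadm connect r0                  # only the StandAlone side needs to be reconnected
cat /proc/drbd                      # check whether cs: leaves StandAlone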
kermat commented 2 years ago

Yes, but I'm asking for the disk states, not the roles. Please provide the information requested.

dberardo-com commented 2 years ago

Do you mean this?

ds1="UpToDate" ds2="DUnknown"

kermat commented 2 years ago

Yes, but I was hoping to see that when they were connected. I reread your original post and you said they do not connect. Chances are you are in a split-brain state. Run a drbdadm connect all and then check the logs on both nodes for messages from DRBD mentioning "split-brain". If you see that, follow the steps for resolution in section 6.3 of the DRBD user guide: https://linbit.com/drbd-user-guide/users-guide-drbd-8-4/
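For reference, the manual recovery described there boils down to picking a split-brain victim and discarding its changes; a sketch, assuming resource r0 and that the old primary is chosen as the victim:

# on the node whose changes are to be discarded (the split-brain victim)
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# on the surviving node, only if it is also StandAlone
drbdadm connect r0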

dberardo-com commented 2 years ago

OK, so I have managed to find a work-around; please let me know if there is a less hacky way to solve this issue.

I am still not clear what the issue could be, but the workaround involves 2 steps:

  1. Mount the disk on the new Primary.
  2. Fix the split-brain situation.

For the first point, it turned out to be enough to run fsck /dev/drbd0 before trying to mount the partition. That command reports no errors, which makes me wonder whether the kernel is somehow "not aware" that the disk has now become readable and that the filesystem is ready to be mounted. So I guess that running fsck somehow changes some bits in the background and makes Linux aware of this fact?!
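One way to narrow down what fsck is actually changing is to compare the device and DRBD state before and after running it; a sketch, assuming resource r0 (fsck.jfs options may differ):

drbdadm role r0               # the device is only writable while this node is Primary
blockdev --getro /dev/drbd0   # 1 means the kernel flags the block device read-only
dmesg | tail -n 20            # jfs prints the real mount failure reason here
fsck -n /dev/drbd0            # dry run first, to see what fsck would change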

The second point is a bit harder to fix because, as mentioned above, the resource r0 will not connect on either of the machines. If I run the connect command on either machine, it executes without errors a couple of times without changing anything, and from then on I get an error saying that the resource is already connected and that I should disconnect it first ... but this is not true, since the disk states and roles are still the same: Secondary/Unknown and Primary/Unknown. I realized that to fix this I have to force a reboot of the old Primary, and to do so I have implemented the following 1pri policy:


after-sb-0pri discard-least-changes;
after-sb-1pri call-pri-lost-after-sb;
...
pri-lost-after-sb "reboot";

This policy automatically kicks in when drbdadm connect r0 is called on the old Primary ... the connect, however, does not happen automatically after the network is restored (why?), so I am forced to use a cron job that calls that command periodically.
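For context, in the 8.4 configuration these options live in the net and handlers sections of a resource; a sketch of how the snippet above fits together (resource name and handler command are only examples):

resource r0 {
  net {
    after-sb-0pri discard-least-changes;
    after-sb-1pri call-pri-lost-after-sb;
    after-sb-2pri disconnect;
  }
  handlers {
    # runs on the node that is told to give up its Primary role after a split brain
    pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
  }
}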

So my doubts now are:

  1. Is there a better way to achieve this? Is there any configuration to make the reconnection attempt automatic in DRBD without having to rely on a cron job?
  2. And how about the fsck workaround? Is this normal, or is there something weird with the filesystem I chose (jfs)?

NOTE: I might be wrong here, but I think that in a 2-node setup, if a primary node is suddenly disconnected from the network, split brain is always unavoidable, right? After all, there is no way for either node to know which one had the network problem. My guess is that the "startup" block might change this, but again, I am not sure whether that could help.
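As far as I can tell, the startup section only controls how long the drbd init script waits for the peer when the service starts, so it probably would not prevent a split brain by itself; the relevant options, roughly:

startup {
  wfc-timeout      120;   # seconds to wait for the peer on a normal start
  degr-wfc-timeout  60;   # shorter wait when the cluster was already degraded
}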

kermat commented 2 years ago
  1. You really shouldn't implement automatic split brain recovery policies unless you're using proper fencing and dual-primary with a clustered file system. If you do, there will eventually be a time where you automatically discard the wrong data.
  2. You shouldn't need to fsck things to have a usable file system. I've no experience with jfs, so I can't say if it could have something to do with it.

In a 2-node cluster, yes, you'll always split-brain if you unplug the cluster network. I'm not sure about Heartbeat, as it's been deprecated for a long, long time as a CRM, but in Pacemaker clusters you can implement node-level fencing with fence delays and node preferences that works around these limitations in a much safer way.
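For reference, the usual way to tie DRBD 8.4 into cluster-level fencing is the fencing option plus the constraint scripts shipped with drbd-utils; a sketch (paths and resource name are examples, and this only applies when Pacemaker is the CRM):

resource r0 {
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    # place / remove a Pacemaker constraint so the cut-off peer
    # cannot be promoted while it is outdated
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}

Heartbeat in haresources mode has no equivalent mechanism, which is part of why moving to Pacemaker is the safer route.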

kermat commented 2 years ago

Seems like this isn't a DRBD issue as much as an implementation issue. To further discuss your setup feel free to join the Slack community: https://linbit.com/join-the-linbit-open-source-community/.