Additional information: if I remount with
-o remount,rw
then I get a successful mount, but I then see very strange log output from the kernel (hexadecimal strings) and the machine becomes practically unresponsive.
drbdadm connect all is run 3 times, in the order: Primary, then Secondary, then on the Primary again.
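For reference, a sketch of the remount step described above; the device and mount point are taken from the original post:
# remount the already-mounted DRBD device read-write
mount -o remount,rw /dev/drbd0 /my_folder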
Does DRBD look healthy after you've reconnected it? I'm wondering if you're in an UpToDate/UpToDate state, or some degraded state. Can you share what your full output from cat /proc/drbd looks like when things are healthy, and also what cat /proc/drbd looks like when there is a network split?
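For readers following along, a healthy two-node status in /proc/drbd looks roughly like this (illustrative DRBD 8.x output, not from the reporter's system):
cat /proc/drbd
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----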
Since one of the nodes got disconnected from the network, I have a Primary/Unknown situation. And for some reason, after the node comes back online, I cannot manage to restore the DRBD connection using drbdadm connect all on either one of the nodes.
How can I force a connection restore without rebooting either one of the nodes?
UPDATE: I am basically stuck in this situation on the secondary, and the only solution I have found to restore the connection is to reboot the primary. If I reboot just the secondary, nothing happens:
secondary_node ~# drbdadm status
<drbd-status version="8.9.2rc1" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="r0" cs="WFConnection" ro1="Secondary" ro2="Unknown" ds1="UpToDate" ds2="DUnknown" />
</resources>
</drbd-status>
primary_node ~# drbdadm status
<drbd-status version="8.9.2rc1" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="r0" cs="StandAlone" ro1="Primary" ro2="Unknown" ds1="UpToDate" ds2="DUnknown" />
</resources>
</drbd-status>
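(For context: a connection state of StandAlone on one node while its peer waits in WFConnection is the classic split-brain signature; DRBD drops the connection on the node that detected the conflict. A generic way to confirm this from the kernel log:)
dmesg | grep -i "split-brain"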
Yes, but I'm asking for the disk states, not the roles. Please provide the information requested.
Do you mean this?
ds1="UpToDate" ds2="DUnknown"
Yes, but I was hoping to see that when they were connected. I reread your original post and you said they do not connect. Chances are you are in a split-brain state. Run a drbdadm connect all
and then check the logs on both nodes for messages from DRBD mentioning "split-brain". If you see that, follow the steps for resolution in section 6.3 of the DRBD user guide: https://linbit.com/drbd-user-guide/users-guide-drbd-8-4/
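For reference, the manual resolution from that section boils down to picking a split-brain victim whose modifications are discarded (a sketch in DRBD 8.4 syntax, using the resource name r0 from this thread):
# on the node whose changes will be thrown away (the "victim"):
drbdadm secondary r0
drbdadm connect --discard-my-data r0
# on the surviving node, if it is StandAlone as well:
drbdadm connect r0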
OK, so I have managed to find a workaround; please let me know if there is a less hacky way to solve this issue.
I am still not clear about what the issue could be, but the workaround involves 2 steps: one to make the partition mountable again, and one to restore the DRBD connection.
For the first point, it turned out to be enough to run fsck /dev/drbd0 before trying to mount the partition. That command displays no errors, which makes me wonder whether the kernel is somehow "not aware" that the disk has now become readable and that the filesystem is ready to be mounted. So I guess that running fsck somehow changes some bits in the background and makes Linux aware of this fact?!?
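A minimal sketch of that workaround as a failover step, with the device and mount point used elsewhere in this thread:
# check the filesystem on the DRBD device, then mount it
fsck /dev/drbd0
mount -t jfs /dev/drbd0 /my_folder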
The second point is a bit harder to fix because, as mentioned above, the resource r0 will not connect on either of the machines. If I run the connect command on either one of the machines, it executes without errors a couple of times without changing anything, and from then on I get the error that the resource is already connected and I should disconnect it first ... but this is not true, since the disk states and roles are still the same: Secondary/Unknown and Primary/Unknown.
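(When drbdadm claims the resource is already connected while the status still shows StandAlone or WFConnection, cycling the connection is the usual first thing to try; these are generic DRBD commands, not something confirmed to work in this case:)
drbdadm disconnect r0
drbdadm connect r0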
I realized that to fix this I have to force a reboot on the old primary, and to do so I have implemented the following 1pri policy:
after-sb-0pri discard-least-changes;
after-sb-1pri call-pri-lost-after-sb;
...
pri-lost-after-sb "reboot";
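For context, in drbd.conf the after-sb-* options belong in the net section, while pri-lost-after-sb is a handler; a sketch of where these lines would live (resource name r0 assumed):
resource r0 {
  net {
    after-sb-0pri discard-least-changes;
    after-sb-1pri call-pri-lost-after-sb;
  }
  handlers {
    pri-lost-after-sb "reboot";
  }
}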
This policy automatically kicks in when drbdadm connect r0 is called on the old Primary ... the connect command, however, does not run automatically after the network is restored (why?), so I am forced to use a cronjob that calls that command periodically.
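A sketch of such a cron entry; the 5-minute interval and the file path are assumptions, not details from the original post:
# /etc/cron.d/drbd-reconnect -- hypothetical periodic reconnect attempt
*/5 * * * * root /usr/sbin/drbdadm connect r0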
So my doubts now are:
1. Why is the fsck workaround needed before the partition can be mounted? Is this normal, or is there something weird with the filesystem I chose (jfs)?
2. NOTE: I might be wrong here, but I think that in a 2-node setup, if the primary node is suddenly disconnected from the network, split-brain will always be unavoidable, right? In fact, there is no way for the nodes to know which of them had the network problem. My guess is that the "startup" block might change this, but again, I am not sure if that could help (see the sketch below).
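My understanding of the startup block is that it only controls how long DRBD waits for its peer at boot, e.g. (values are illustrative):
startup {
  wfc-timeout 120;      # seconds to wait for the peer at boot
  degr-wfc-timeout 60;  # shorter wait if the cluster was already degraded
}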
In a 2-node cluster, yes, you'll always split-brain if you unplug the cluster network. I'm not sure about Heartbeat; it's been deprecated for a long, long time as a CRM, but in Pacemaker clusters you can implement node-level fencing with fence delays and node preferences that can work around these limitations in a much safer way.
Seems like this isn't a DRBD issue as much as an implementation issue. To further discuss your setup, feel free to join the Slack community: https://linbit.com/join-the-linbit-open-source-community/.
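For the record, the DRBD side of such a Pacemaker fencing integration is usually configured along these lines (a sketch using the helper scripts shipped with drbd-utils, not a drop-in configuration for this setup):
resource r0 {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}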
Dear DRBD community,
I have been struggling in the past days while debugging a simple scenario where I have 2 DRBD nodes connected in simple master-slave (single-primary) mode, with protocol C.
The two nodes use Heartbeat together with DRBD.
Failover, disk mounting, and replication work fine in all "controlled failover" scenarios, meaning when either Heartbeat is stopped on one of the two machines or a node is rebooted / killed.
The problem is when one of the machines suddenly loses the connection (the ethernet cable is unplugged). In that scenario Heartbeat starts correctly on the other node, which is promoted to Primary/Unknown, but when Heartbeat then tries to mount the DRBD disk using this command:
mount -t jfs /dev/drbd0 /my_folder
it fails with this strange error:

Again, I have correctly partitioned and formatted the disks/filesystems, and the DRBD + HA setup works fine in all other cases.
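(A generic sanity check at that point is to confirm the surviving node really holds the Primary role and an UpToDate disk before mounting; these are standard drbdadm queries, not steps from the original report:)
drbdadm role r0     # expect Primary/Unknown after the failover
drbdadm dstate r0   # expect UpToDate/DUnknown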
Is this a known limitation of DRBD, or is there any other configuration I could test?
P.S. I am using drbd8-utils on a Debian 8, 32-bit machine. Both nodes are identical (same image + hardware).