ClusterLabs / crmsh

Command-line interface for High-Availability cluster management on GNU/Linux systems.
GNU General Public License v2.0

crm on openSUSE throws error on joining cluster, as wrong network interface is probed #1204

Closed: roopchansinghv closed this issue 4 months ago

roopchansinghv commented 1 year ago

Setup:

OS: openSUSE 15.5
HW: HP DL380s and Z6s, both with multiple network interfaces

I set up the main cluster node, a DL380, without issues. This machine has 12 hardware network interfaces, only 1 of which is configured during this setup phase.

When trying to join the cluster created on the DL380 from a Z6 machine (also with multiple network interfaces, of which 'eth2' is the main active one), using:

crm cluster join -c dl-380-host

the cluster initialization framework throws the following error:

ERROR: cluster.join: Failed to run su root -c 'ssh -o StrictHostKeyChecking=no root@dl-380-host sudo crm cluster init -i eth2 ssh_remote': b'\x1b[31mERROR\x1b[0m: cluster.init: Failed to detect IP address for eth2\n'

However, on the dl-380-host, 'eth2' is not yet configured. On the Z6 host, eth2 is set up and configured for network access.
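To illustrate why that matters (a minimal sketch, not crmsh's actual code, assuming the iproute2 `ip` tool is available): an interface name only resolves to an address on a node where that interface is actually configured, so a name that works on the joining node can fail on the seed node.

import subprocess

def detect_ipv4(nic):
    """Return the first IPv4 address configured on `nic` on this node, or None."""
    # Hypothetical sketch, not crmsh's implementation.
    out = subprocess.run(
        ["ip", "-o", "-4", "addr", "show", "dev", nic],
        capture_output=True, text=True
    ).stdout
    for line in out.splitlines():
        # example line: "4: eth2    inet 192.168.1.10/24 brd 192.168.1.255 scope global eth2"
        fields = line.split()
        if "inet" in fields:
            return fields[fields.index("inet") + 1].split("/")[0]
    return None

# On the Z6 (eth2 configured) this returns an address; on the dl-380-host,
# where eth2 is absent or unconfigured, it returns None -- roughly the
# condition reported as "Failed to detect IP address for eth2".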

I had a look at crmsh/bootstrap.py and found that if I changed lines 1719-1720 from:

"ssh {} {}@{} sudo crm cluster init -i {} ssh_remote".format(
             SSH_OPTION, seed_user, seed_host, _context.default_nic_list[0],

to:

"ssh {} {}@{} sudo crm cluster init ssh_remote".format(
             SSH_OPTION, seed_user, seed_host,

and changed line 1797 from:

cmd = "crm cluster init -i {} csync2_remote {}".format(_context.default_nic_list[0], utils.this_node())

to:

cmd = "crm cluster init csync2_remote {}".format(utils.this_node())

i.e. stopping the cluster initialization routines from probing a specific interface on the remote side, the Z6 can now join the cluster. I had also tried the above cluster join command while specifying eth0 and eth2 explicitly; both attempts failed with the same error message.
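A gentler variant, sketched below purely as an illustration (this is not crmsh's code, and not necessarily how the eventual fix handles it; it assumes iproute2 is available on every node), would be to let each node derive its own default interface from its own routing table instead of forwarding the joining node's NIC name:

import subprocess

def default_nic():
    """Return the interface carrying this node's default route, or None."""
    # Hypothetical helper, not part of crmsh.
    out = subprocess.run(
        ["ip", "-o", "route", "show", "default"],
        capture_output=True, text=True
    ).stdout
    for line in out.splitlines():
        # example line: "default via 192.168.1.1 dev eth0 proto dhcp metric 100"
        fields = line.split()
        if "dev" in fields:
            return fields[fields.index("dev") + 1]
    return None

# With something like this, the Z6 would resolve eth2 for itself while the
# DL380 would resolve whichever interface it actually routes through, so an
# unconfigured NIC name would never be probed on the remote side.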

Can anyone explain why probing the active network interface on the new machine(s) joining the cluster propagates back to the cluster's host/origin node like this, and what a fix might be?
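To spell out the propagation being asked about (a simplified sketch of the flow implied by the bootstrap.py lines quoted above; the function and variable names here are hypothetical, not crmsh identifiers): the joining node detects its own default NIC and then embeds that name in the init command it runs over SSH on the seed node, where the same name may not exist or be configured.

# Simplified, hypothetical sketch of the flow; not crmsh's actual code.
def build_join_command(seed_host, local_default_nic):
    # The NIC name comes from the *joining* node (e.g. eth2 on the Z6)...
    return "ssh root@{} sudo crm cluster init -i {} ssh_remote".format(
        seed_host, local_default_nic)

# ...but the resulting 'crm cluster init -i eth2 ssh_remote' then executes on
# the *seed* node (the DL380), which tries to detect an IP address for 'eth2'
# locally and fails because that interface is not configured there.
print(build_join_command("dl-380-host", "eth2"))
# ssh root@dl-380-host sudo crm cluster init -i eth2 ssh_remote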

liangxin1300 commented 1 year ago

Hi @roopchansinghv

Did you set up the first ("main", as you said) node with "crm cluster init", or with "crm cluster init -i <interface>" to specify the interface?

liangxin1300 commented 1 year ago

Hi @roopchansinghv

> However, on the dl-380-host, 'eth2' is not yet configured

For an HA cluster, I suggest that each node have the same network configuration, for example:

Node1:

roopchansinghv commented 1 year ago

Hi @liangxin1300 - thank you for your replies.

To follow up on your question: no, I did not use the '-i' option when I initialized the cluster's head/primary node. Would specifying it there propagate down and let the other nodes pick up the correct interface when they are added, even if they do not have the same network configuration?

Also, it is never guaranteed that all nodes/added hardware will have identical configurations. We are looking for options to keep services running and available, not only across hardware failures but also across upgrades, where newer hardware may be quite different from the current hardware.

nicholasyang2022 commented 4 months ago

@liangxin1300 Does #1347 fix this problem?

nicholasyang2022 commented 4 months ago

Fixed in #1347.