Seagate / cortx-hare

CORTX Hare configures Motr object store, starts/stops Motr services, and notifies Motr of service and device faults.
https://github.com/Seagate/cortx
Apache License 2.0

Problem: bootstrap fails with CentOS 7.6 and lustre-2.12 #488

Closed mssawant closed 4 years ago

mssawant commented 4 years ago

I'm trying to follow the EES HA HOWTO.

Initially, I couldn't add another IP to LNet with Lustre 2.12 and CentOS 7.6 — LNet failed to configure with multiple IP addresses.

I was able to add another IP to LNet after creating a stub of /etc/lnet.conf and downgrading Lustre:

[vagrant@pod-c1 ~]$ sudo lctl list_nids
172.28.128.4@tcp
172.28.128.101@tcp
[vagrant@pod-c2 ~]$ sudo lctl list_nids
172.28.128.5@tcp
172.28.128.102@tcp
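
(For context, adding the second NID with lnetctl might look roughly like the sketch below. The interface names are placeholders, and the idea that lnet.service re-imports /etc/lnet.conf on boot is my assumption, not something taken from this setup.)

sudo lnetctl lnet configure                # load and initialise LNet
sudo lnetctl net add --net tcp --if eth1   # first NID, e.g. 172.28.128.4@tcp
sudo lnetctl net add --net tcp --if eth2   # second NID, e.g. 172.28.128.101@tcp
sudo lnetctl export > /etc/lnet.conf       # persist; assumed to be re-imported by lnet.service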

But then bootstrap failed with another error:

Nov 22 07:26:53 pod-c1 hare-consul[6247]: 2019/11/22 07:26:53 [WARN] agent: Check "service:9" is now warning
Nov 22 07:26:53 pod-c1 kernel: m0mero: disagrees about version of symbol LNetEQAlloc
Nov 22 07:26:53 pod-c1 kernel: m0mero: Unknown symbol LNetEQAlloc (err -22)
Nov 22 07:26:53 pod-c1 kernel: m0mero: disagrees about version of symbol LNetGet
Nov 22 07:26:53 pod-c1 kernel: m0mero: Unknown symbol LNetGet (err -22)
Nov 22 07:26:53 pod-c1 mero-kernel[7065]: insmod: ERROR: could not insert module /data/mero/m0mero.ko: Invalid parameters
Nov 22 07:26:53 pod-c1 mero-kernel[7065]: Failed to load m0mero module
Nov 22 07:26:53 pod-c1 systemd[1]: mero-kernel.service: main process exited, code=exited, status=1/FAILURE
Nov 22 07:26:53 pod-c1 systemd[1]: Failed to start Mero kernel module.

Looks like

[vagrant@pod-c1 ~]$ rpm -qa | grep lustre
kmod-lustre-client-2.10.8-1.el7.x86_64
lustre-client-2.10.8-1.el7.x86_64

is incompatible with

[vagrant@pod-c1 ~]$ cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
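
(One way to confirm this kind of mismatch, sketched on the assumptions that modprobe --dump-modversions is available and that the Lustre kmod package installs lnet.ko somewhere under /lib/modules, is to compare the symbol CRCs m0mero.ko was built against with those exported by the installed LNet module:)

modprobe --dump-modversions /data/mero/m0mero.ko | grep -E 'LNetEQAlloc|LNetGet'
find /lib/modules/$(uname -r) -name 'lnet.ko*' -exec modprobe --dump-modversions {} \; | grep -E 'LNetEQAlloc|LNetGet'

Different CRCs for the same symbol would explain the "disagrees about version of symbol" messages and insmod failing with err -22 (EINVAL).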

mssawant commented 4 years ago

No. After the steps I mentioned above, one can add multiple IPs to LNet on CentOS 7.6 with Lustre 2.10.8. I have not checked with the new changes to m0vg, but I will close this issue as this particular problem has a solution.

vvv commented 4 years ago

@mandar.sawant Is this still a problem? @dmitriy.chumak has updated m0vg; see http://gitlab.mero.colo.seagate.com/mero/mero/commit/15e210fc5dee49f40e14b2d378c1b3fac44cb2c2.


mssawant commented 4 years ago

After rebuilding Mero with lustre-2.10.8 and downgrading the pods from lustre-2.12 to lustre-2.10 on CentOS 7.6, bootstrap of the HA setup succeeded with multiple IPs configured in LNet.

[vagrant@pod-c1 ~]$ hctl bootstrap --mkfs ees-cluster.yaml
2019-11-22 11:02:58: Generating cluster configuration... /usr/share/ruby/vendor_ruby/facter/core/execution/posix.rb:9: warning: Insecure world writable dir /data/mero/utils in PATH, mode 040777
Ok.
2019-11-22 11:03:00: Starting Consul server agent on this node......... Ok.
2019-11-22 11:03:08: Importing configuration into the KV Store... Ok.
2019-11-22 11:03:08: Starting Consul agents on remaining cluster nodes.... Ok.
2019-11-22 11:03:10: Update Consul agents configs from the KV Store... Ok.
2019-11-22 11:03:12: Waiting for the RC Leader to be elected..... Ok.
2019-11-22 11:03:16: Starting Mero (phase1)... Ok.
2019-11-22 11:03:25: Starting Mero (phase2)... Ok.
2019-11-22 11:03:34: Checking the health of the services... Ok.
[vagrant@pod-c1 ~]$ sudo lctl list_nids
172.28.128.4@tcp
172.28.128.101@tcp
[vagrant@pod-c1 ~]$ ps -aux | grep m0d
root     13286  0.2 10.1 3316444 294664 ?      SLsl 11:03   0:00 /data/mero/mero/.libs/lt-m0d -e lnet:172.28.128.101@tcp:12345:2:1 -f <0x7200000000000001:0x9> -T linux -S stobs -D db -A linuxstob:addb-stobs -m 65536 -q 16 -w 8 -c /etc/mero/confd.xc -H 172.28.128.101@tcp:12345:1:1 -U
root     13696  0.2 12.4 3447800 363936 ?      SLsl 11:03   0:00 /data/mero/mero/.libs/lt-m0d -e lnet:172.28.128.101@tcp:12345:2:2 -f <0x7200000000000001:0xc> -T ad -S stobs -D db -A linuxstob:addb-stobs -m 65536 -q 16 -w 8 -H 172.28.128.101@tcp:12345:1:1 -U
[vagrant@pod-c2 ~]$ ps -aux | grep m0d
root      7687  0.2 12.5 3447800 364648 ?      SLsl 11:03   0:00 /data/mero/mero/.libs/lt-m0d -e lnet:172.28.128.102@tcp:12345:2:1 -f <0x7200000000000001:0x28> -T ad -S stobs -D db -A linuxstob:addb-stobs -m 65536 -q 16 -w 8 -H 172.28.128.102@tcp:12345:1:1 -U
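
(A rough sketch of that downgrade path, assuming the 2.10.8 packages are available in a configured yum repo; package names follow the rpm -qa output above:)

sudo yum remove -y lustre-client kmod-lustre-client                  # drop the 2.12 client
sudo yum install -y lustre-client-2.10.8 kmod-lustre-client-2.10.8
# then rebuild Mero against the lustre-2.10.8 headers so m0mero.ko and the
# installed LNet modules agree on symbol versions, and re-run:
#   hctl bootstrap --mkfs ees-cluster.yaml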