Seagate / cortx-hare

CORTX Hare configures Motr object store, starts/stops Motr services, and notifies Motr of service and device faults.
https://github.com/Seagate/cortx
Apache License 2.0

Problem : Bootstrap fails with error "Failed to start hare-hax.service: Unit not found." #600

Closed: Kalpesh-Chhajed closed this issue 4 years ago

Kalpesh-Chhajed commented 4 years ago
[root@eosnode-1 ~]# hctl bootstrap --mkfs ~/CDF.yaml
2020-01-08 19:44:39: Generating cluster configuration... Ok.
2020-01-08 19:44:40: Starting Consul server agent on this node........... Ok.
2020-01-08 19:44:49: Importing configuration into the KV Store... Ok.
2020-01-08 19:44:49: Starting Consul agents on remaining cluster nodes... Ok.
2020-01-08 19:44:49: Update Consul agents configs from the KV Store... Ok.
2020-01-08 19:44:50: Install Mero configuration files... Ok.
2020-01-08 19:44:50: Waiting for the RC Leader to get elected........ Ok.
2020-01-08 19:44:56: Starting Mero (phase1, mkfs)... Failed to start hare-hax.service: Unit not found.
[root@eosnode-1 ~]# hctl status
Profile: 0x7000000000000001:0x22
Data Pools:
    0x6f00000000000001:0x23
Services:
    eosnode-1  (RC)
    [offline   ] hax                  0x7200000000000001:0x6         10.237.65.176@tcp:12345:1:1
    [offline   ] confd                0x7200000000000001:0x9         10.237.65.176@tcp:12345:2:1
    [offline   ] ioservice            0x7200000000000001:0xc         10.237.65.176@tcp:12345:2:2
    [offline   ] s3server             0x7200000000000001:0x16        10.237.65.176@tcp:12345:3:1
    [offline   ] s3server             0x7200000000000001:0x19        10.237.65.176@tcp:12345:3:2
    [unknown   ] m0_client            0x7200000000000001:0x1c        10.237.65.176@tcp:12345:4:1
    [unknown   ] m0_client            0x7200000000000001:0x1f        10.237.65.176@tcp:12345:4:2
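
A quick first check for a "Unit not found" failure like this is whether the hare-hax unit file was installed at all; a minimal sketch (it assumes the unit is meant to be shipped by the hare RPM, which the rpm -ql hare request later in this thread also suggests):

systemctl list-unit-files | grep hare-hax   # does systemd know about the unit?
rpm -ql hare | grep -i hax                  # does the hare package ship the unit file?
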
rajanikantchirmade commented 4 years ago

@kalpesh.chhajed As part of the fix for this issue, the README needs to be updated.

Kalpesh-Chhajed commented 4 years ago

Thanks for the clarification, @rajanikant.chirmade.

Shall we update the README file with the proper steps?

rajanikantchirmade commented 4 years ago

@kalpesh.chhajed Two things:

  1. Install all RPMs from the same repo (centos-7.7). There is no need to install kmod-lustre-client separately; installing hare pulls in all dependencies (lustre-client, Mero, etc.).
  2. LNet was not started; it needs to be configured with sudo lctl network configure (see the sketch below).
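
A minimal sketch of that sequence (the commands mirror the successful run shown later in this thread):

yum install -y --nogpgcheck hare        # pulls in lustre-client, Mero, etc. as dependencies
yum install -y --nogpgcheck s3server
modprobe lnet                           # load the LNet kernel module
sudo lctl network configure             # configure/start LNet
sudo lctl list_nids                     # verify: should print a NID such as 10.237.65.176@tcp
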
Kalpesh-Chhajed commented 4 years ago

Hello @rajanikant.chirmade: Do you see any delta in the versions or steps I used during installation?

rajanikantchirmade commented 4 years ago

Two problems I found:

  1. While uploading an object to the S3 server, the upload failed and the IO service crashed.

  2. After cluster shutdown, the re-bootstrap fails to elect a leader.

rajanikantchirmade commented 4 years ago

I am able to bootstrap the cluster with S3 servers on the H/W provided by @kalpesh.chhajed.

Installed RPMs from the centos-7.7 repo:

[root@eosnode-1 ~]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[root@eosnode-1 ~]# uname -a
Linux eosnode-1 3.10.0-1062.el7.x86_64 #23 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@eosnode-1 ~]#
[root@eosnode-1 ~]#  yum repolist enabled | grep ci-storage.mero.colo.seagate.com
ci-storage.mero.colo.seagate.com_releases_eos_integration_centos-7.7.1908_last_successful_      43
ci-storage.mero.colo.seagate.com_releases_eos_s3server_uploads                                  35
[root@eosnode-1 ~]# yum install -y --nogpgcheck hare
[root@eosnode-1 ~]# yum install -y --nogpgcheck s3server
[root@eosnode-1 ~]# /opt/seagate/hare/libexec/hare/s3auth-disable
[root@eosnode-1 ~]# modprobe lnet
[root@eosnode-1 ~]# sudo lctl network configure
[root@eosnode-1 ~]# sudo lctl list_nids
10.237.65.176@tcp
[root@eosnode-1 ~]# cat CDF.yaml
nodes:
  - hostname: eosnode-1
    data_iface: eno1
    m0_servers:
      - runs_confd: true
      - io_disks: { path_glob: "/dev/sda"}
    m0_clients:
        s3: 2
        other: 2
pools:
  - name: the pool
    disks: all
    data_units: 1
    parity_units: 0
    # allowed_failures: { site: 0, rack: 0, encl: 0, ctrl: 0, disk: 0 }
[root@eosnode-1 ~]#
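
For context, data_units and parity_units in the CDF describe the pool's N+K parity-group layout; with data_units: 1 and parity_units: 0 this pool has no parity redundancy, which keeps a single-disk setup simple. A purely illustrative sketch of a redundant pool (values are hypothetical, not from this run, and would need enough disks to hold N+K units):

pools:
  - name: the pool
    disks: all
    data_units: 4      # N data units per parity group (illustrative)
    parity_units: 2    # K parity units; tolerates up to K unit failures (illustrative)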

Bootstrap succeeded with the S3 server instances, but the S3 upload failed (see the snippet of s3server errors below).

[root@eosnode-1 ~]# hctl bootstrap --mkfs CDF.yaml
2020-01-09 09:02:29: Generating cluster configuration... Ok.
2020-01-09 09:02:30: Starting Consul server agent on this node.......... Ok.
2020-01-09 09:02:38: Importing configuration into the KV Store... Ok.
2020-01-09 09:02:38: Starting Consul agents on remaining cluster nodes... Ok.
2020-01-09 09:02:38: Update Consul agents configs from the KV Store... Ok.
2020-01-09 09:02:39: Install Mero configuration files... Ok.
2020-01-09 09:02:40: Waiting for the RC Leader to get elected..... Ok.
2020-01-09 09:02:42: Starting Mero (phase1, mkfs)... Ok.
2020-01-09 09:02:49: Starting Mero (phase1, m0d)... Ok.
2020-01-09 09:02:52: Starting Mero (phase2, mkfs)... Ok.
2020-01-09 09:02:57: Starting Mero (phase2, m0d)... Ok.
2020-01-09 09:03:01: Starting S3 servers (phase3)... Ok.
2020-01-09 09:03:02: Checking health of the services... Ok.
[root@eosnode-1 ~]# hctl status
Profile: 0x7000000000000001:0x22
Data Pools:
    0x6f00000000000001:0x23
Services:
    eosnode-1  (RC)
    [started   ] hax                  0x7200000000000001:0x6         10.237.65.176@tcp:12345:1:1
    [started   ] confd                0x7200000000000001:0x9         10.237.65.176@tcp:12345:2:1
    [started   ] ioservice            0x7200000000000001:0xc         10.237.65.176@tcp:12345:2:2
    [started   ] s3server             0x7200000000000001:0x16        10.237.65.176@tcp:12345:3:1
    [started   ] s3server             0x7200000000000001:0x19        10.237.65.176@tcp:12345:3:2
    [unknown   ] m0_client            0x7200000000000001:0x1c        10.237.65.176@tcp:12345:4:1
    [unknown   ] m0_client            0x7200000000000001:0x1f        10.237.65.176@tcp:12345:4:2
[root@eosnode-1 ~]# ps -xa | grep m0d
 30257 ?        SLsl   0:01 /usr/bin/m0d -e lnet:10.237.65.176@tcp:12345:2:1 -f <0x7200000000000001:0x9> -T linux -S stobs -D db -A linuxstob:addb-stobs -m 65536 -q 16 -w 8 -c /etc/mero/confd.xc -H 10.237.65.176@tcp:12345:1:1 -U
 31301 ?        SLsl   0:01 /usr/bin/m0d -e lnet:10.237.65.176@tcp:12345:2:2 -f <0x7200000000000001:0xc> -T ad -S stobs -D db -A linuxstob:addb-stobs -m 65536 -q 16 -w 8 -H 10.237.65.176@tcp:12345:1:1 -U
 32733 pts/0    S+     0:00 grep --color=auto m0d
[root@eosnode-1 ~]# ps -xa | grep s3server
 31957 ?        SLsl   0:00 s3server --s3pidfile /var/run/s3server.0x7200000000000001:0x16.pid --clovislocal 10.237.65.176@tcp:12345:3:1 --clovisha 10.237.65.176@tcp:12345:1:1 --clovisprofilefid <0x7000000000000001:0x22> --clovisprocessfid <0x7200000000000001:0x16> --s3port 8081 --log_dir /var/log/seagate/s3/s3server-0x7200000000000001:0x16 --disable_auth=true
 32125 ?        SLsl   0:00 s3server --s3pidfile /var/run/s3server.0x7200000000000001:0x19.pid --clovislocal 10.237.65.176@tcp:12345:3:2 --clovisha 10.237.65.176@tcp:12345:1:1 --clovisprofilefid <0x7000000000000001:0x22> --clovisprocessfid <0x7200000000000001:0x19> --s3port 8082 --log_dir /var/log/seagate/s3/s3server-0x7200000000000001:0x19 --disable_auth=true
 32831 pts/0    S+     0:00 grep --color=auto s3server
[root@eosnode-1 ~]# s3cmd ls
[root@eosnode-1 ~]# s3cmd mb s3://seagate
Bucket 's3://seagate/' created
[root@eosnode-1 ~]# s3cmd ls
2020-01-09 14:03  s3://seagate
[root@eosnode-1 ~]# s3cmd ls s3://seagate
[root@eosnode-1 ~]# vi ~/.s3cfg
[root@eosnode-1 ~]# s3cmd put s3server-1.0.0-B64731_git00f328b_el7.x86_64.rpm s3://seagate
upload: 's3server-1.0.0-B64731_git00f328b_el7.x86_64.rpm' -> 's3://seagate/s3server-1.0.0-B64731_git00f328b_el7.x86_64.rpm'  [1 of 1]
  7536640 of 11413320    66% in    0s    77.51 MB/s  failed
  7536640 of 11413320    66% in    0s    73.23 MB/s  done
WARNING: Upload failed: /s3server-1.0.0-B64731_git00f328b_el7.x86_64.rpm (500 (InternalError): We encountered an internal error. Please try again.)
WARNING: Waiting 3 sec...
upload: 's3server-1.0.0-B64731_git00f328b_el7.x86_64.rpm' -> 's3://seagate/s3server-1.0.0-B64731_git00f328b_el7.x86_64.rpm'  [1 of 1]
  7995392 of 11413320    70% in    1s     6.86 MB/s  failed
  7995392 of 11413320    70% in    1s     6.85 MB/s  done
WARNING: Upload failed: /s3server-1.0.0-B64731_git00f328b_el7.x86_64.rpm (500 (InternalError): We encountered an internal error. Please try again.)
WARNING: Waiting 6 sec...
upload: 's3server-1.0.0-B64731_git00f328b_el7.x86_64.rpm' -> 's3://seagate/s3server-1.0.0-B64731_git00f328b_el7.x86_64.rpm'  [1 of 1]
  8323072 of 11413320    72% in    0s    74.94 MB/s  failed
  8323072 of 11413320    72% in    0s    73.75 MB/s  done
WARNING: Upload failed: /s3server-1.0.0-B64731_git00f328b_el7.x86_64.rpm (500 (InternalError): We encountered an internal error. Please try again.)
WARNING: Waiting 9 sec...
^CSee ya!
[root@eosnode-1 ~]# hctl status
Profile: 0x7000000000000001:0x22
Data Pools:
    0x6f00000000000001:0x23
Services:
    eosnode-1  (RC)
    [started   ] hax                  0x7200000000000001:0x6         10.237.65.176@tcp:12345:1:1
    [started   ] confd                0x7200000000000001:0x9         10.237.65.176@tcp:12345:2:1
    [started   ] ioservice            0x7200000000000001:0xc         10.237.65.176@tcp:12345:2:2
    [started   ] s3server             0x7200000000000001:0x16        10.237.65.176@tcp:12345:3:1
    [started   ] s3server             0x7200000000000001:0x19        10.237.65.176@tcp:12345:3:2
    [unknown   ] m0_client            0x7200000000000001:0x1c        10.237.65.176@tcp:12345:4:1
    [unknown   ] m0_client            0x7200000000000001:0x1f        10.237.65.176@tcp:12345:4:2
[root@eosnode-1 ~]# cat /var/log/seagate/s3/s3server-0x7200000000000001\:0x16/s3server.ERROR
Log file created at: 2020/01/09 09:05:17
Running on machine: eosnode-1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0109 09:05:17.188256 31957 s3_clovis_writer.cc:451] [write_content_failed] [ReqID: 17f59308-bcd3-4de1-8d83-315df7427c15] Write to object failed after writing 0
E0109 09:05:21.308351 31957 s3_clovis_writer.cc:451] [write_content_failed] [ReqID: fcb2d449-a477-4971-8970-bc20eada3b24] Write to object failed after writing 0
E0109 09:05:27.424580 31957 s3_clovis_writer.cc:451] [write_content_failed] [ReqID: 9bcc52c3-17cf-4629-9881-70a30956cd39] Write to object failed after writing 0
[root@eosnode-1 ~]# hctl status
Profile: 0x7000000000000001:0x22
Data Pools:
    0x6f00000000000001:0x23
Services:
    eosnode-1  (RC)
    [started   ] hax                  0x7200000000000001:0x6         10.237.65.176@tcp:12345:1:1
    [started   ] confd                0x7200000000000001:0x9         10.237.65.176@tcp:12345:2:1
    [offline   ] ioservice            0x7200000000000001:0xc         10.237.65.176@tcp:12345:2:2
    [started   ] s3server             0x7200000000000001:0x16        10.237.65.176@tcp:12345:3:1
    [started   ] s3server             0x7200000000000001:0x19        10.237.65.176@tcp:12345:3:2
    [unknown   ] m0_client            0x7200000000000001:0x1c        10.237.65.176@tcp:12345:4:1
    [unknown   ] m0_client            0x7200000000000001:0x1f        10.237.65.176@tcp:12345:4:2
[root@eosnode-1 ~]# hctl shutdown
Stopping s3server@0x7200000000000001:0x16 at eosnode-1... done
Stopping s3server@0x7200000000000001:0x19 at eosnode-1... done
Stopping m0d@0x7200000000000001:0x9 (confd) at eosnode-1... done
Stopping hare-hax at eosnode-1... done
Stopping hare-consul-agent at eosnode-1... done
Killing RC Leader at eosnode-1... done
[root@eosnode-1 ~]# hctl status
Cluster is not running.
[root@eosnode-1 ~]#
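
To dig further into why the ioservice went offline mid-upload, one option is to inspect the journal of its m0d unit. The m0d@<fid> unit naming below is inferred from the hctl shutdown output above, and 0x7200000000000001:0xc is the ioservice fid from hctl status; the exact unit name may differ:

journalctl -u 'm0d@0x7200000000000001:0xc' --no-pager | tail -n 50   # recent ioservice (m0d) log lines
dmesg | tail -n 50                                                   # kernel-side LNet/Lustre messages
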
rajanikantchirmade commented 4 years ago

assigned to @rajanikant.chirmade and unassigned @kalpesh.chhajed

shailesh-vaidya commented 4 years ago

@vvv Hare RPM creation includes the following two stages:

  1. Preparing a Docker image with the required dependencies. The Dockerfile used to create the Docker image is located at http://gitlab.mero.colo.seagate.com/eos/re/docker/blob/master/hare/C7.7.1908/Dockerfile
  2. Running the Jenkins job; the job for Build #21 is at http://eos-jenkins.mero.colo.seagate.com/view/CentOS%207.7.1908/job/centos-7.7.1980/job/hare-dev-pipeline-centos7.7.1908/21/console

This job follows the steps below (a rough shell sketch follows the list):

  1. Clones the latest code from http://gitlab.mero.colo.seagate.com/mero/hare.git
  2. Installs the mero and mero-devel RPMs.
  3. Executes the 'make rpm' command.
  4. The generated RPMs are then copied to http://ci-storage.mero.colo.seagate.com/releases/eos/components/dev/centos-7.7.1908/hare/last_successful/
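
In shell terms, the pipeline is roughly the following (a sketch of the four steps above, not the actual Jenkinsfile):

git clone http://gitlab.mero.colo.seagate.com/mero/hare.git && cd hare
yum install -y mero mero-devel   # step 2: build dependencies
make rpm                         # step 3: produces the hare RPMs
# step 4: the generated RPMs are copied to
# http://ci-storage.mero.colo.seagate.com/releases/eos/components/dev/centos-7.7.1908/hare/last_successful/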
vvv commented 4 years ago

@shailesh.vaidya Can you refer me to the code that built hare-0.1.0-21_gitb9c8f51_m0git973035e25.el7.x86_64 rpm?

vvv commented 4 years ago

assigned to @kalpesh.chhajed

vvv commented 4 years ago

@kalpesh.chhajed Thanks for reporting the problem!

Can you show the output of the following shell snippet please?

rpm -ql hare                                # list the files installed by the hare package
for f in /etc/yum.repos.d/*; do
    # show any yum repo files that reference 'lustre-local'
    if grep -q lustre-local "$f"; then
        echo "### $f"
        cat "$f"
    fi
done

(The code should be executed at eosnode-1.)

Kalpesh-Chhajed commented 4 years ago

The CDF file I am using is:

nodes:
  - hostname: eosnode-1
    data_iface: eno1
    m0_servers:
      - runs_confd: true
      - io_disks: { path_glob: "/dev/sda"}
    m0_clients:
        s3: 2
        other: 2
pools:
  - name: the pool
    disks: all
    data_units: 1
    parity_units: 0
Kalpesh-Chhajed commented 4 years ago

changed title from "Problem : Bootstrap fails wtih 'Failed to start hare-hax.service: Unit not found.'" to "Problem : Bootstrap fails with error 'Failed to start hare-hax.service: Unit not found.'"

Kalpesh-Chhajed commented 4 years ago

@vvv @rajanikant.chirmade

Kalpesh-Chhajed commented 4 years ago

I was trying the installation on H/W with the below config:

[root@eosnode-1 ~]# uname -r
3.10.0-1062.el7.x86_64

[root@eosnode-1 ~]# cat /etc/system-release
CentOS Linux release 7.7.1908 (Core)
[root@eosnode-1 ~]# rpm -qa | grep lustre
lustre-client-dkms-2.10.4-1.el7.noarch
kmod-lustre-client-2.12.3-1.el7.x86_64

[root@eosnode-1 ~]# rpm -qa | grep mero
mero-1.4.0-11_git973035e25_3.10.0_1062.el7.x86_64

[root@eosnode-1 ~]# rpm -qa | grep hare
perl-threads-shared-1.43-6.el7.x86_64
shared-mime-info-1.8-4.el7.x86_64
hare-0.1.0-21_gitb9c8f51_m0git973035e25.el7.x86_64

[root@eosnode-1 ~]# rpm -qa | grep s3server
s3server-1.0.0-31_git0730db5_el7.x86_64
stale[bot] commented 4 years ago

There's been no activity on this issue for 345600 seconds (that's 4 days for you, hoomans).
Let me ping some Hare maintainers on your behalf... @mssawant, @vvv: Hello there! :wave: OK, done.
I've also set needs-attention label. Is this worth it? I don't know. But I'm keen to find out! (Oh, who am I kidding. I'm a stateless bot. All those moments will be lost in time, like tears in rain.)
Sorry for the delay. And thank you for contributing to CORTX! (Not bad for a human.)
