canonical / microceph

Ceph for a one-rack cluster and appliances
https://snapcraft.io/microceph
GNU Affero General Public License v3.0

Flaky RGW CI test #326

Open UtkarshBhatthere opened 4 months ago

UtkarshBhatthere commented 4 months ago

Issue report

What version of MicroCeph are you using?

Development versions built from active PRs.

What are the steps to reproduce this issue?

The failure is intermittent, so there are no deterministic steps, but I have seen it in many CI runs.

What happens (observed behaviour)?

shell: /usr/bin/bash -e {0}
+ lxc exec node-wrk0 -- sh -c '/mnt/actionutils.sh testrgw '
  cluster:
    id:     d6402f43-0b46-48ef-91a3-38cedd71aaa2
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum node-wrk0,node-wrk1,node-wrk2 (age 8s)
    mgr: node-wrk0(active, starting, since 11s), standbys: node-wrk1, node-wrk2
    osd: 3 osds: 3 up (since 2s), 3 in (since 76s)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   7 pools, 131 pgs
    objects: 195 objects, 454 KiB
    usage:   82 MiB used, 2.9 GiB / 3 GiB avail
    pgs:     1.527% pgs unknown
             3.053% pgs not active
             125 active+clean
             4   peering
             2   unknown

  progress:
    Global Recovery Event (0s)
      [............................] 

● snap.microceph.rgw.service - Service for snap application microceph.rgw
     Loaded: loaded (/etc/systemd/system/snap.microceph.rgw.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-03-06 15:33:08 UTC; 32s ago
   Main PID: 6183 (radosgw)
      Tasks: 52 (limit: 19169)
     Memory: 31.9M
        CPU: 121ms
     CGroup: /system.slice/snap.microceph.rgw.service
             └─6183 radosgw -f --cluster ceph --name client.radosgw.gateway -c /var/snap/microceph/x1/conf/radosgw.conf

Mar 06 15:33:08 node-wrk0 systemd[1]: Started Service for snap application microceph.rgw.
Mar 06 15:33:29 node-wrk0 microceph.rgw[6183]: 2024-03-06T15:33:29.201+0000 7f91bac4a0c0 -1 asok(0x55ec0e1ee000) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/snap/microceph/793/run/ceph-client.radosgw.gateway.6183.94472337561760.asok': (13) Permission denied

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    An unexpected error has occurred.
  Please try reproducing the error using
  the latest s3cmd code from the git master
  branch found at:
    https://github.com/s3tools/s3cmd
  and have a look at the known issues list:
    https://github.com/s3tools/s3cmd/wiki/Common-known-issues-and-their-solutions-(FAQ)
  If the error persists, please report the
  following lines (removing any private
  info as necessary) to:
   s3tools-bugs@lists.sourceforge.net

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Invoked as: /usr/bin/s3cmd --host localhost --host-bucket=localhost/%(bucket) --access_key=fooAccessKey --secret_key=fooSecretKey --no-ssl mb s3://testbucket
Problem: <class 'ConnectionRefusedError: [Errno 111] Connection refused
S3cmd:   2.2.0
python:   3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
environment LANG=C.UTF-8

Traceback (most recent call last):
  File "/usr/bin/s3cmd", line 3209, in <module>
    rc = main()
  File "/usr/bin/s3cmd", line 3106, in main
    rc = cmd_func(args)
  File "/usr/bin/s3cmd", line 260, in cmd_bucket_create
    response = s3.bucket_create(uri.bucket(), cfg.bucket_location, cfg.extra_headers)
  File "/usr/lib/python3/dist-packages/S3/S3.py", line 430, in bucket_create
    response = self.send_request(request)
  File "/usr/lib/python3/dist-packages/S3/S3.py", line 1480, in send_request
    conn = ConnMan.get(self.get_hostname(resource['bucket']))
  File "/usr/lib/python3/dist-packages/S3/ConnMan.py", line 284, in get
    conn.c.connect()
  File "/usr/lib/python3.10/http/client.py", line 942, in connect
    self.sock = self._create_connection(
  File "/usr/lib/python3.10/socket.py", line 845, in create_connection
    raise err
  File "/usr/lib/python3.10/socket.py", line 833, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    An unexpected error has occurred.
  Please try reproducing the error using
  the latest s3cmd code from the git master
  branch found at:
    https://github.com/s3tools/s3cmd
  and have a look at the known issues list:
    https://github.com/s3tools/s3cmd/wiki/Common-known-issues-and-their-solutions-(FAQ)
  If the error persists, please report the
  above lines (removing any private
  info as necessary) to:
   s3tools-bugs@lists.sourceforge.net
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

What were you expecting to happen?

The RGW client test should succeed.

sabaini commented 4 months ago

Ack, thanks for reporting -- I've seen this a few times as well. Maybe we need to increase the timeouts, or wait longer for RGW to come up before running the client test?
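
For illustration, here is a minimal sketch of what "wait longer for RGW to come up" could look like in a bash test helper. It is not taken from actionutils.sh: the names `wait_until` and `rgw_is_up` are hypothetical, and the endpoint (plain HTTP on localhost, default port 80) is only inferred from the `--host localhost --no-ssl` flags in the s3cmd invocation above. The idea is to poll until something accepts connections instead of calling s3cmd immediately after the service is reported as active.

```shell
#!/usr/bin/env bash

# Hypothetical helper: retry a probe command until it succeeds or the
# attempts run out.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
wait_until() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        # Running the probe inside `if` keeps `bash -e` from aborting on failure.
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if ((attempt < max_attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical probe: succeeds as soon as an HTTP server answers on the
# assumed RGW endpoint. Any HTTP response counts, even an error status;
# a refused connection makes curl exit non-zero.
rgw_is_up() {
    curl --silent --output /dev/null --max-time 2 "http://localhost:80/"
}

# Possible use before the s3cmd calls (assumed values: 30 attempts, 2 s apart):
#   wait_until 30 2 rgw_is_up || { echo "RGW did not come up" >&2; exit 1; }
```

Polling with a bounded number of attempts keeps the test from hanging forever if RGW never starts. It would not address the `Permission denied` on the admin socket seen in the log, which may or may not be related to the refused connection.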