@karasjoh000 that's probably because of this:
20 slow ops, oldest one blocked for 1609 sec, daemons [osd.36,osd.37,osd.46] have slow ops.
Could you check the output of:
ceph pg dump
ceph osd dump
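In addition to those dumps, two commands that can help pin down which operations are blocked (my suggestion, not part of the original reply; osd.36 is taken from the warning above, and the daemon command has to run on the host that OSD lives on):
ceph health detail
sudo ceph daemon osd.36 dump_ops_in_flight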
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
I encountered the same error.
group_vars/all.yml
ceph_origin: repository
ceph_repository: community
ceph_stable_release: pacific
public_network: "10.0.0.0/24"
radosgw_interface: eth0
monitor_interface: eth0
dashboard_enabled: true
containerized_deployment: false
devices:
- '/dev/vdb'
- '/dev/vdc'
- '/dev/vdd'
- '/dev/vde'
monitoring_group_name: monitoring
dashboard_admin_password: xxx
grafana_admin_password: xxx
/etc/os-release
NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
Full log: https://textbin.net/raw/1owjr4ovud
The solution was to run sudo ceph osd pool set (pool name) size 2 for each pool, repeating it each time a new pool was created. It seems ceph did not want to acknowledge the write until the required replication factor was decreased.
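For reference, a minimal loop that applies the same workaround to every existing pool (my sketch, not part of the original comment):
# set size 2 on every pool returned by ceph osd pool ls
for pool in $(sudo ceph osd pool ls); do
  sudo ceph osd pool set "$pool" size 2
done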
However, my ceph -s now looks like this:
sudo ceph -s
cluster:
id: 127318bb-031f-42c5-acc3-bc43e37ce230
health: HEALTH_WARN
mon is allowing insecure global_id reclaim
Degraded data redundancy: 192/384 objects degraded (50.000%), 46 pgs degraded, 161 pgs undersized
services:
mon: 1 daemons, quorum ceph-mon0 (age 59m)
mgr: ceph-mgr0(active, since 20m)
osd: 4 osds: 4 up (since 56m), 4 in (since 56m)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
pools: 6 pools, 161 pgs
objects: 192 objects, 6.9 KiB
usage: 28 MiB used, 400 GiB / 400 GiB avail
pgs: 192/384 objects degraded (50.000%)
115 active+undersized
46 active+undersized+degraded
io:
client: 1.2 KiB/s rd, 1 op/s rd, 0 op/s wr
Since I have 4 OSDs, shouldn't the writes be able to go through at the default size of 3? Is this a bug or am I misunderstanding something?
Could you share the output of ceph osd dump?
epoch 77
fsid 127318bb-031f-42c5-acc3-bc43e37ce230
created 2021-06-03T20:19:45.629002+0000
modified 2021-06-04T00:03:12.171511+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 9
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client jewel
require_osd_release pacific
stretch_mode_enabled false
pool 1 'device_health_metrics' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 77 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 '.rgw.root' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 50 flags hashpspool stripe_width 0 application rgw
pool 3 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 67 flags hashpspool stripe_width 0 application rbd
pool 4 'default.rgw.log' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 43 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.control' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 51 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 pg_num_target 8 pgp_num_target 8 autoscale_mode on last_change 60 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
max_osd 4
osd.0 up in weight 1 up_from 18 up_thru 67 down_at 0 last_clean_interval [0,0) [v2:10.0.0.47:6800/17411,v1:10.0.0.47:6801/17411] [v2:10.0.0.47:6802/17411,v1:10.0.0.47:6803/17411] exists,up ba392249-5421-4de0-971a-2233b17da392
osd.1 up in weight 1 up_from 18 up_thru 67 down_at 0 last_clean_interval [0,0) [v2:10.0.0.47:6808/18825,v1:10.0.0.47:6809/18825] [v2:10.0.0.47:6810/18825,v1:10.0.0.47:6811/18825] exists,up e476f828-cd73-40b7-a926-393c93a43558
osd.2 up in weight 1 up_from 18 up_thru 67 down_at 0 last_clean_interval [0,0) [v2:10.0.0.47:6816/20248,v1:10.0.0.47:6817/20248] [v2:10.0.0.47:6818/20248,v1:10.0.0.47:6819/20248] exists,up 1d7a1700-91f9-4822-9546-a54a26f908de
osd.3 up in weight 1 up_from 18 up_thru 68 down_at 0 last_clean_interval [0,0) [v2:10.0.0.47:6824/21663,v1:10.0.0.47:6825/21663] [v2:10.0.0.47:6826/21663,v1:10.0.0.47:6827/21663] exists,up 7f125387-6bae-4a4c-a188-511b0f0ba082
blocklist 10.0.0.9:0/678357123 expires 2021-06-04T21:53:06.125600+0000
blocklist 10.0.0.9:0/2528750619 expires 2021-06-04T21:53:06.125600+0000
blocklist 10.0.0.9:0/3826862751 expires 2021-06-04T21:52:34.408413+0000
blocklist 10.0.0.9:0/558637544 expires 2021-06-04T20:58:55.166815+0000
blocklist 10.0.0.9:0/3912413307 expires 2021-06-04T21:52:34.408413+0000
blocklist 10.0.0.9:0/137647918 expires 2021-06-04T20:21:26.797963+0000
blocklist 10.0.0.9:6800/15427 expires 2021-06-04T20:21:26.797963+0000
blocklist 10.0.0.9:6801/15427 expires 2021-06-04T20:21:26.797963+0000
blocklist 10.0.0.9:0/128005271 expires 2021-06-04T20:21:26.797963+0000
blocklist 10.0.0.9:0/908547403 expires 2021-06-04T20:28:16.266869+0000
blocklist 10.0.0.9:0/2663080188 expires 2021-06-04T20:58:55.166815+0000
blocklist 10.0.0.9:0/807399896 expires 2021-06-04T20:28:16.266869+0000
@HashFail does your ceph-ansible branch correspond with the ceph_stable_release: pacific config? See https://docs.ceph.com/projects/ceph-ansible/en/latest/index.html#releases.
@karasjoh000 yes, the commit was the head of stable-6.0 at the time.
I think I might have figured out what is behind this: as you can see from my ceph osd dump, all my OSDs are on the same host. Adding a second host seems to fix the issue.
Perhaps this is the expected behavior? Maybe when the pool size is greater than 2, ceph wants at least one of the OSDs to be on a different host?
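One way to confirm this (my addition, assuming the pools use the default replicated_rule) is to look at the CRUSH tree and the rule's failure domain:
sudo ceph osd tree
sudo ceph osd crush rule dump replicated_rule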
Yes, this is expected behavior: your pools are all size 2 and you only have 1 OSD node, hence the 161 pgs undersized. The default CRUSH rule places each replica on a different host, so a single OSD host can never hold both copies.
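For a single-node test cluster, a commonly used workaround (my suggestion, not something proposed in this thread) is a CRUSH rule whose failure domain is the OSD instead of the host:
# create a replicated rule that spreads copies across OSDs rather than hosts
sudo ceph osd crush rule create-replicated replicated_osd default osd
# point a pool at the new rule, e.g. the rbd pool
sudo ceph osd pool set rbd crush_rule replicated_osd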
Bug Report
What happened:
Once in a while, the installation hangs on the create dashboard admin user task and takes 40min+ to complete it (ceph-dashboard-no-grafana is a duplicate of ceph-dashboard with the grafana server installation removed):
ceph -s
What you expected to happen:
ceph mgr and rados-gw commands should not hang for over 40min each.
How to reproduce it (minimal and precise):
Install ceph and once in a while it will hang on that task; this happens to me roughly once every three installs. I am not sure whether it is specific to the environment or an installation issue. I am running 3 VMs on top of OpenStack with LVM mounts and installing ceph on those VMs and volumes.
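For context, the install itself is the standard ceph-ansible run; the inventory and playbook names below are assumptions on my part, not taken from the log:
# stock ceph-ansible invocation against the 3-VM inventory
ansible-playbook -i hosts site.yml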
Share your group_vars files, inventory and full ceph-ansible log
Environment:
Kernel (e.g. uname -a): Linux ceph-node-0 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Docker version (e.g. docker version): Docker version 20.10.2, build 20.10.2-0ubuntu1~20.04.2
Ansible version (e.g. ansible-playbook --version):
ceph-ansible version (e.g. git head or tag or stable branch): commit e6d8b058ba92fecdc78ee55b0dd8ce12c5120df0
Ceph version (e.g. ceph -v): ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)