dthpulse opened this issue 2 months ago
I want to report another violation of the CRUSH rule in a stretch mode (dual data center) setup. Note: stretch mode does not need to be enabled for this issue to occur; the CRUSH rule just has to be in use. The result is that some PGs live on 3 OSDs in one data center and on just one in the other, which leaves those PGs inactive when that data center is offline (I hit this in production before stretch mode was enabled, while min_size=2 was still enforced on the pool).
CRUSH rule:
rule stretch_replicated_rule {
    id 3
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
}
{
    "rule_id": 3,
    "rule_name": "stretch_replicated_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "datacenter"
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}
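For reference, the intent of the rule above is to pick both datacenters (`choose firstn 0 type datacenter`) and then place 2 replicas in each, so a size=4 pool should always end up 2+2 across the DCs. A minimal sketch (hypothetical helper, not part of Ceph or the balancer) that checks a PG's up set against that 2-per-DC constraint, using the OSD-to-datacenter mapping from the `ceph osd tree` output shown later in this issue:

```python
from collections import Counter

# OSD -> datacenter mapping, matching the `ceph osd tree` in this issue.
OSD_TO_DC = {0: "DC1", 1: "DC1", 4: "DC1", 5: "DC1",
             2: "DC2", 3: "DC2", 6: "DC2", 7: "DC2"}

def replicas_per_dc(up_set):
    """Count how many replicas of a PG land in each datacenter."""
    return Counter(OSD_TO_DC[osd] for osd in up_set)

def violates_stretch_rule(up_set, per_dc=2):
    """True if any datacenter holds a number of replicas != per_dc."""
    counts = replicas_per_dc(up_set)
    return len(counts) < 2 or any(n != per_dc for n in counts.values())

# A placement CRUSH itself would produce: 2 OSDs in each DC -> OK.
assert not violates_stretch_rule([0, 4, 2, 6])
# The placement seen after the upmap in this issue: [3, 6, 5, 2]
# puts 3 replicas in DC2 and only 1 in DC1 -> violation.
assert violates_stretch_rule([3, 6, 5, 2])
```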
Tested on Ceph 18.2.1 and 18.2.2. Example:
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.78394 root default
-10 0.39197 datacenter DC1
-3 0.39197 host host1
0 hdd 0.09798 osd.0 up 1.00000 1.00000
1 hdd 0.09798 osd.1 up 1.00000 1.00000
4 ssd 0.09798 osd.4 up 1.00000 1.00000
5 ssd 0.09798 osd.5 up 1.00000 1.00000
-11 0.39197 datacenter DC2
-5 0.39197 host host2
2 hdd 0.09798 osd.2 up 1.00000 1.00000
3 hdd 0.09798 osd.3 up 1.00000 1.00000
6 ssd 0.09798 osd.6 up 1.00000 1.00000
7 ssd 0.09798 osd.7 up 1.00000 1.00000
./placementoptimizer.py -v balance --max-pg-moves 10
...
output omitted
...
ceph osd pg-upmap-items 3.38 0 2
ceph pg ls |grep ^3.38
3.38 17 0 0 0 67375104 0 0 9306 0 active+clean 33s 1296'9306 1595:31340 [3,6,5,2]p3 [3,6,5,2]p3 2024-06-16T23:40:43.524716+0200 2024-06-14T15:41:41.842701+0200 1 periodic scrub scheduled @ 2024-06-18T00:44:32.626640+0200
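In that up set `[3,6,5,2]`, OSDs 3, 6 and 2 all sit in DC2 and only osd.5 is in DC1, so losing DC2 leaves a single replica. A quick sketch of the consequence (mapping assumed from the `ceph osd tree` above):

```python
# What happens to PG 3.38 with up set [3, 6, 5, 2] if one DC goes offline?
OSD_TO_DC = {0: "DC1", 1: "DC1", 4: "DC1", 5: "DC1",
             2: "DC2", 3: "DC2", 6: "DC2", 7: "DC2"}

def surviving_replicas(up_set, failed_dc):
    """Replicas that remain when every OSD in failed_dc is down."""
    return [osd for osd in up_set if OSD_TO_DC[osd] != failed_dc]

up = [3, 6, 5, 2]   # from the `ceph pg ls` output above
min_size = 2        # enforced on the pool before stretch mode

# DC1 down: osds 3, 6, 2 survive -> PG stays active.
assert len(surviving_replicas(up, "DC1")) >= min_size
# DC2 down: only osd.5 survives -> below min_size, PG goes inactive.
assert len(surviving_replicas(up, "DC2")) < min_size
```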
As @dthpulse mentions, AFAIK this upmap should be ignored by Ceph as it violates the CRUSH policy for the pool. I will verify this with Ceph developers.
Adding insult to injury (IMHO), this issue will go unnoticed when stretch mode is enabled, because min_size currently gets set to 1 in stretch degraded mode (I opened https://tracker.ceph.com/issues/64842 to fix that).
oh dear. i'll have to go over the placement constraints once again - could you please send me a state dump of your cluster as well to jj at sft dawt lol?
I have sent you the state file by mail. Thanks for looking into it.
Hi
On Ceph Quincy 17.2.7, with an EC pool using this CRUSH rule:
EC profile:
I originally had PGs distributed over 2 OSDs per DC, but after running this balancer I found that this distribution is broken for a lot of PGs: one DC now holds 3 OSDs and the other only 1.
It looks to me like it's ignoring the custom CRUSH rule for EC pools.
Also strange that pg-upmap-items allows this, since according to the docs it shouldn't be applied if it breaks the CRUSH rule. Let me know if you need more details to debug; for now I wrote a little script to fix this issue on my cluster.
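Such a cleanup script might look roughly like this sketch (the OSD-to-DC mapping is from this issue's example cluster, and the `pg_upmap_items` field layout is assumed from `ceph osd dump -f json`; verify it on your version). It finds upmap items that move a replica across datacenters and prints the `ceph osd rm-pg-upmap-items` commands to undo them:

```python
import json

# Assumed OSD -> datacenter mapping (example cluster from this issue).
OSD_TO_DC = {0: "DC1", 1: "DC1", 4: "DC1", 5: "DC1",
             2: "DC2", 3: "DC2", 6: "DC2", 7: "DC2"}

def bad_upmap_pgs(osd_dump):
    """PG ids with at least one upmap item remapping across datacenters."""
    bad = []
    for item in osd_dump.get("pg_upmap_items", []):
        for m in item["mappings"]:
            if OSD_TO_DC[m["from"]] != OSD_TO_DC[m["to"]]:
                bad.append(item["pgid"])
                break
    return bad

# Fragment shaped like `ceph osd dump -f json`, containing the upmap from
# this issue: osd.0 (DC1) remapped to osd.2 (DC2).
dump = json.loads('{"pg_upmap_items": [{"pgid": "3.38",'
                  ' "mappings": [{"from": 0, "to": 2}]}]}')

for pgid in bad_upmap_pgs(dump):
    print("ceph osd rm-pg-upmap-items", pgid)
```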
Thank you!