TheJJ / ceph-balancer

An alternative Ceph placement optimizer, aiming for maximum storage capacity through equal OSD utilization.
GNU General Public License v3.0

it's breaking CRUSH rule #41

Open dthpulse opened 2 months ago

dthpulse commented 2 months ago

Hi

On Ceph Quincy 17.2.7, with an EC pool using this CRUSH rule:

{
    "rule_id": 10,
    "rule_name": "ec33hdd_rule",
    "type": 3,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 3,
            "type": "datacenter"
        },
        {
            "op": "choose_indep",
            "num": 2,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}

EC profile:

crush-device-class=hdd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=3
plugin=jerasure
technique=reed_sol_van
w=8

I originally had PGs distributed across 2 OSDs per DC, but after running this balancer I found that for a lot of PGs this distribution is broken: some DCs now hold 3 OSDs of a PG and another only 1.

It looks to me like the balancer is ignoring custom CRUSH rules for EC pools.

It is also strange that pg-upmap-items allows this; according to the docs, an upmap should not be applied if it breaks the CRUSH rule.

Let me know if you need more details to debug. For now I wrote a little script to fix this issue on my cluster.
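For reference, a minimal sketch of such a per-datacenter check could look like the following. It is not the exact script I run; it assumes Quincy-style `-f json` output (field names may differ between releases) and takes the pool name as its only argument. The expected count of 2 OSDs per DC follows from the k=3, m=3 profile spread over 3 datacenters.

#!/usr/bin/env python3
# sketch: flag PGs whose acting set puts more than the expected
# number of OSDs into a single datacenter.
# assumptions: Quincy-era `-f json` output; pool name passed on the CLI.

import json
import subprocess
import sys
from collections import Counter

POOL = sys.argv[1]            # e.g. the EC pool using ec33hdd_rule
EXPECTED_PER_DC = 2           # 6 shards (k=3, m=3) over 3 datacenters

def ceph_json(*args):
    return json.loads(subprocess.check_output(("ceph",) + args + ("-f", "json")))

# map every OSD id to the datacenter bucket above it in the CRUSH tree
tree = ceph_json("osd", "tree")
nodes = {n["id"]: n for n in tree["nodes"]}

def osds_below(node_id):
    node = nodes[node_id]
    if node["type"] == "osd":
        yield node["id"]
    for child in node.get("children", []):
        yield from osds_below(child)

osd_to_dc = {}
for node in nodes.values():
    if node["type"] == "datacenter":
        for osd_id in osds_below(node["id"]):
            osd_to_dc[osd_id] = node["name"]

# check the acting set of every PG in the pool
result = ceph_json("pg", "ls-by-pool", POOL)
pg_stats = result["pg_stats"] if isinstance(result, dict) else result

for pg in pg_stats:
    per_dc = Counter(osd_to_dc.get(osd, "?") for osd in pg["acting"])
    if any(n > EXPECTED_PER_DC for n in per_dc.values()):
        print(f"{pg['pgid']}: {dict(per_dc)} acting={pg['acting']}")

Any PG it prints has more acting OSDs in one datacenter than the CRUSH rule intends.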

Thank you!

hydro-b commented 1 month ago

I want to add another CRUSH rule violation, in a stretch mode (dual data center) setup. Note: stretch mode does not need to be enabled for this issue to occur, only the CRUSH rule has to be in use. The result is that some PGs live on 3 OSDs in one data center and on just one in the other. This leads to inactive PGs when one data center is offline (I hit this issue in production before stretch mode was enabled, while min_size=2 was still enforced on the pool).

CRUSH rule:

rule stretch_replicated_rule {
    id 3
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
}
{
        "rule_id": 3,
        "rule_name": "stretch_replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "datacenter"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "osd"
            },
            {
                "op": "emit"
            }
        ]
    }

Tested on Ceph 18.2.1 and 18.2.2. Example:

ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                STATUS  REWEIGHT  PRI-AFF
 -1         0.78394  root default                                      
-10         0.39197      datacenter DC1                             
 -3         0.39197          host host1                           
  0    hdd  0.09798              osd.0            up   1.00000  1.00000
  1    hdd  0.09798              osd.1            up   1.00000  1.00000
  4    ssd  0.09798              osd.4            up   1.00000  1.00000
  5    ssd  0.09798              osd.5            up   1.00000  1.00000
-11         0.39197      datacenter DC2                             
 -5         0.39197          host host2                           
  2    hdd  0.09798              osd.2            up   1.00000  1.00000
  3    hdd  0.09798              osd.3            up   1.00000  1.00000
  6    ssd  0.09798              osd.6            up   1.00000  1.00000
  7    ssd  0.09798              osd.7            up   1.00000  1.00000
./placementoptimizer.py -v balance --max-pg-moves 10
...
output omitted
...
ceph osd pg-upmap-items 3.38 0 2
ceph pg ls |grep ^3.38
3.38       17         0          0        0  67375104            0           0   9306         0  active+clean    33s   1296'9306  1595:31340  [3,6,5,2]p3  [3,6,5,2]p3  2024-06-16T23:40:43.524716+0200  2024-06-14T15:41:41.842701+0200                    1  periodic scrub scheduled @ 2024-06-18T00:44:32.626640+0200

As @dthpulse mentions, AFAIK this upmap should be rejected by Ceph, as it violates the CRUSH policy for the pool (the acting set [3,6,5,2] puts osd.3, osd.6 and osd.2 in DC2 and only osd.5 in DC1). I will verify this with the Ceph developers.

Adding insult to injury (IMHO), this issue goes unnoticed when stretch mode is enabled, because as it stands min_size gets set to 1 in stretch degraded mode (I filed https://tracker.ceph.com/issues/64842 to fix that).
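As a workaround, the offending upmap entries can be dropped so that CRUSH recomputes the placement of the PG; for the PG from the example above that would be (check your own upmap list in `ceph osd dump` first):

ceph osd rm-pg-upmap-items 3.38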

TheJJ commented 4 weeks ago

oh dear. i'll have to go over the placement constraints once again - could you please send me a state dump of your cluster as well to jj at sft dawt lol?

hydro-b commented 4 weeks ago

> oh dear. i'll have to go over the placement constraints once again - could you please send me a state dump of your cluster as well to jj at sft dawt lol?

I have sent you the state file by mail. Thanks for looking into it.