akash-network / support

Akash Support and Issue Tracking

provider: persistent storage reporting should accurately reflect available Ceph space #146

Open andy108369 opened 7 months ago

andy108369 commented 7 months ago

The current reporting of persistent storage available space by the provider, based on Ceph's MAX AVAIL, is not accurate.

This is due to Ceph's MAX AVAIL being a dynamic value that represents MAX - USED, and it decreases as storage is used. Consequently, this results in the provider sometimes reporting less available space than actually exists.

A key point of confusion arises with Kubernetes' PV (Persistent Volume) system. In Kubernetes, when a PV or PVC (Persistent Volume Claim) is created, it doesn't immediately reserve physical space in Ceph. Therefore, Ceph's MAX AVAIL doesn't change upon the creation of these volumes, leading to a discrepancy. It's only when data is actually written to these volumes that Ceph's MAX AVAIL decreases accordingly.
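For illustration, here is a minimal way to observe this on a Rook-Ceph backed provider. The storage class name `beta3` is an assumption (Akash providers typically expose `beta1`/`beta2`/`beta3` classes), so substitute whatever your cluster actually uses:

```bash
# Assumption: the provider exposes a Rook-Ceph backed storage class named "beta3";
# replace it with your actual storage class name.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thin-provision-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: beta3
  resources:
    requests:
      storage: 10Gi
EOF

# MAX AVAIL stays the same until data is actually written to the volume,
# even though Kubernetes now accounts for the 10Gi claim.
kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph df
```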

To provide a more accurate view of the available space, the provider should change how it reports this metric. Instead of relying on Ceph's MAX AVAIL, it should calculate the actual available space as [Total MAX space of Ceph] - [Reserved space in K8s (PV/PVC)]. Here, Total MAX space of Ceph means the entire storage capacity of the Ceph cluster, without deducting Ceph's USED amount (as Ceph's MAX AVAIL currently does) or the space reserved by Kubernetes PV/PVC. This approach gives a more realistic representation of the available storage, since it accounts for the Kubernetes-reserved space.
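To make the proposal concrete, here is a minimal sketch of that calculation, following the `(AVAIL / REPLICAS) - CLAIMED_BY_PVC` variant worked through in the comments below. The toolbox pod label, the pool name `akash-deployments`, and the `ceph df -f json` field consumed by `jq` are assumptions to verify against your cluster, not a finished implementation:

```bash
#!/usr/bin/env bash
# Sketch only: compute (Ceph raw AVAIL / replica count) - total space claimed by PVCs.
# Assumptions to verify: the rook-ceph toolbox label, the pool name "akash-deployments",
# and the `ceph df -f json` field name used by jq.
set -euo pipefail

TOOLBOX=$(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}')

# Raw AVAIL in bytes (the "--- RAW STORAGE ---" AVAIL column of `ceph df`).
avail_bytes=$(kubectl -n rook-ceph exec -i "$TOOLBOX" -- ceph df -f json | jq '.stats.total_avail_bytes')

# Replica count of the deployments pool ("size: 2" in the examples below).
replicas=$(kubectl -n rook-ceph exec -i "$TOOLBOX" -- ceph osd pool get akash-deployments size | awk '{print $2}')

# Total storage requested by all PVCs, i.e. what Kubernetes has already promised out.
claimed_bytes=$(kubectl get pvc -A -o jsonpath='{range .items[*]}{.spec.resources.requests.storage}{"\n"}{end}' \
  | numfmt --from=auto | paste -sd+ - | bc)

echo "available persistent storage (GiB): $(( (avail_bytes / replicas - claimed_bytes) / 1024**3 ))"
```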

NOTE: Ceph's USED is STORED x No_Replicas, which means the available persistent storage reported this way can easily go negative once more than half of the space gets written to persistent storage (with two replicas), or a quarter of it (with three replicas). See the example from the Hurricane provider below (two replicas).


Tested Provider / Inventory Operator Versions

Scenario Illustration

  1. Initial Provider View (before deployment)

provider has 1 OSD (disk of 32Gi)

$ provider_info.sh provider.provider-02.sandbox-01.aksh.pw
type       cpu    gpu  ram                 ephemeral           persistent
used       0      0    0                   0                   0
pending    0      0    0                   0                   0
available  14.65  2    30.632808685302734  174.28484315704554  30.370879160240293
node       14.65  2    30.632808685302734  174.28484315704554  N/A
  2. After Creating a Deployment with 10Gi Persistent Storage
  3. Provider View Post-Deployment
$ provider_info.sh provider.provider-02.sandbox-01.aksh.pw
type       cpu    gpu  ram                 ephemeral           persistent
used       1      1    2                   5                   10
pending    0      0    0                   0                   0
available  13.65  1    28.632808685302734  169.28484315704554  30.0556707251817
node       13.65  1    28.632808685302734  169.28484315704554  N/A

ceph df also still reports MAX AVAIL => 30Gi; creating the 10Gi PVC did not reduce it.

  4. Writing 9Gi of Data to PV
dd if=/dev/urandom bs=1M count=9216 of=1
  5. Views After Writing 9Gi of Data

--- POOLS ---
POOL               ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
akash-nodes         1   32     19 B        1    4 KiB      0     21 GiB
akash-deployments   2   32  9.0 GiB    2.33k  9.0 GiB  29.71     21 GiB
.mgr                3    1  449 KiB        2  452 KiB      0     21 GiB


- Provider View: reflects Ceph's `MAX AVAIL - USED` calculation, i.e. `21 Gi - 9 Gi = 12 Gi`:

$ provider_info.sh provider.provider-02.sandbox-01.aksh.pw
type       cpu    gpu  ram                 ephemeral           persistent
used       1      1    2                   5                   10
pending    0      0    0                   0                   0
available  13.65  1    28.632808685302734  169.28484315704554  12.302638040855527
node       13.65  1    28.632808685302734  169.28484315704554  N/A
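For this scenario the two approaches diverge. A quick check with `bc` (the 20 Gi figure is what the proposed formula would report; it does not appear in any output above):

```bash
# current reporting: Ceph MAX AVAIL - USED (matches the ~12.3 Gi shown above)
echo '21 - 9' | bc    # 12
# proposed reporting: total usable pool space - space reserved by PV/PVC
echo '30 - 10' | bc   # 20
```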

andy108369 commented 7 months ago

Hurricane provider

This is also the reason the Hurricane provider reports -567 Gi of available persistent storage:

Akash-Provider currently reports MAX AVAIL - USED => 393 - 960 => -567 Gi of available persistent storage.

Clarification: USED here is STORED x No_Replicas, i.e. 480 x 2 = 960 Gi (the server has 2x 931.5G disks, each split into two 465.8G OSDs)

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    1.8 TiB  898 GiB  965 GiB   965 GiB      51.79
TOTAL  1.8 TiB  898 GiB  965 GiB   965 GiB      51.79

--- POOLS ---
POOL               ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                1    1  449 KiB        2  904 KiB      0    393 GiB
akash-deployments   2  256  480 GiB  123.31k  960 GiB  54.98    393 GiB

$ provider_info.sh provider.hurricane.akash.pub
type       cpu     gpu  ram                ephemeral           persistent
used       58.6    0    169.5              746.5               550
pending    0       0    0                  0                   0
available  34.295  1    4.681840896606445  1062.2646561246365  -567.2483718525618
node       34.295  1    4.681840896606445  1062.2646561246365  N/A

Ceph config - 2 replicas:

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd pool get akash-deployments all
size: 2
min_size: 2
...

PVC

$ kubectl get pvc -A |grep -E '[a-z0-9]{45}' | awk '{print $5}' | sed 's/i//g' | numfmt --from=iec | tr '\n' '+' | sed 's/+$/\n/g' | bc -l | numfmt --to-unit=$((1024**3))
550

The provider should have calculated its available persistent storage as (AVAIL / REPLICAS) - CLAIMED_BY_PVC = (898 / 2) - 550 = -101, meaning it has already been over-provisioned. It is still running well, though, because the PVCs aren't 100% filled yet:

Filesystem                         Size  Used Avail Use% Mounted on
/dev/rbd0                          492G  193G  299G  40% /root/.osmosisd
/dev/rbd1                           49G   44K   49G   1% /root

FWIW, the Used space reported by df doesn't mean that much: even after files are removed from these rbd devices, the data still occupies space on the underlying storage (data remanence), since only the inode (or its visibility flag) gets removed.
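Putting the two formulas side by side for this provider, using the `ceph df` and PVC figures above:

```bash
# what the provider currently reports: MAX AVAIL - USED
echo $(( 393 - 960 ))      # -567 GiB
# what this proposal would report: (raw AVAIL / replicas) - claimed by PVCs
echo $(( 898 / 2 - 550 ))  # -101 GiB, i.e. already over-provisioned
```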

$ kubectl -n rook-ceph exec -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- bash

[root@rook-ceph-tools-846b5c845b-qrf7c /]# ceph osd pool ls
.mgr
akash-deployments

[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd pool stats akash-deployments
Total Images: 2
Total Snapshots: 0
Provisioned Size: 550 GiB

[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd -p akash-deployments ls
csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863
csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657

[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd -p akash-deployments disk-usage csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657
warning: fast-diff map is not enabled for csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657. operation may be slow.
NAME                                          PROVISIONED  USED  
csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657       50 GiB  17 GiB

[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd -p akash-deployments disk-usage csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863
warning: fast-diff map is not enabled for csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863. operation may be slow.
NAME                                          PROVISIONED  USED   
csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863      500 GiB  469 GiB

469 + 17 = 486 GiB is the disk space actually used by these two PVCs (6 GiB more than when I issued ceph df above, as the apps keep writing data).

useful ceph commands

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- sh -c 'ceph osd pool ls | while read POOL; do echo "=== pool: $POOL ==="; rbd -p "$POOL" ls | while read VOL; do rbd -p "$POOL" disk-usage "$VOL"; done; done'

=== pool: .mgr ===
=== pool: akash-deployments ===
warning: fast-diff map is not enabled for csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863. operation may be slow.
NAME                                          PROVISIONED  USED   
csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863      500 GiB  469 GiB
...
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- sh -c 'ceph osd pool ls | while read POOL; do echo "=== pool: $POOL ==="; rbd -p "$POOL" ls | while read VOL; do ceph osd map "$POOL" "$VOL"; done; done'
=== pool: .mgr ===
=== pool: akash-deployments ===
osdmap e2192 pool 'akash-deployments' (2) object 'csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863' -> pg 2.d2caa6be (2.be) -> up ([3,1], p3) acting ([3,1], p3)
...
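On the same note, a slightly more robust way to total the PVC requests than grepping the fixed-width `kubectl get pvc` output is to read `spec.resources.requests.storage` directly. A sketch, assuming GNU `numfmt` and `jq` are installed:

```bash
# Sum the storage requested by PVCs belonging to Akash deployments
# (their namespaces are the 45-character IDs matched by the grep one-liner above).
kubectl get pvc -A -o json \
  | jq -r '.items[] | select(.metadata.namespace | test("^[a-z0-9]{45}$")) | .spec.resources.requests.storage' \
  | numfmt --from=auto | paste -sd+ - | bc | numfmt --to-unit=$((1024**3))
```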

Europlots provider

Akash Provider calculates the available persistent storage as MAX AVAIL - USED => (9.3 - 1.1) * 1024 = 8396.8 GiB, which matches the available persistent storage the provider reports.

However, the provider should in fact have (AVAIL / REPLICAS) - CLAIMED_BY_PVC => (20 * 1024 / 2) - 2420 => 7820 GiB (about 7.64 TiB) of available space.

# kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                        IMAGE
akash-hostname-operator-54854db4c5-c6wvl    ghcr.io/akash-network/provider:0.4.6
akash-inventory-operator-5ff867f6d9-cvx28   ghcr.io/akash-network/provider:0.4.6
akash-ip-operator-79cc857f7b-fj8hd          ghcr.io/akash-network/provider:0.4.6
akash-node-1-0                              ghcr.io/akash-network/node:0.26.2
akash-provider-0                            ghcr.io/akash-network/provider:0.4.7
# kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    21 TiB  20 TiB  1.2 TiB   1.2 TiB       5.55
TOTAL  21 TiB  20 TiB  1.2 TiB   1.2 TiB       5.55

--- POOLS ---
POOL               ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                1    1  6.6 MiB        3   20 MiB      0    6.2 TiB
akash-nodes         2   32     19 B        1    8 KiB      0    9.3 TiB
akash-deployments   3  512  588 GiB  152.59k  1.1 TiB   5.79    9.3 TiB
$ provider_info.sh provider.europlots.com
type       cpu      gpu  ram                 ephemeral           persistent
used       98.15    1    285.83948681596667  1550.8649163246155  2414.4313225746155
pending    0        0    0                   0                   0
available  283.585  1    650.6061556553468   10326.47616339475   8387.475747092627
node       136.015  1    307.38518168684095  4418.74621364288    N/A
node       122.015  0    309.32234382629395  5785.588328568265   N/A
node       25.555   0    33.898630142211914  122.141621183604    N/A

$ cat message\ (7).txt | grep -E '[a-z0-9]{45}' | awk '{print $5}' | sed 's/i//g' | numfmt --from=iec | tr '\n' '+' | sed 's/+$/\n/g' | bc -l | numfmt --to-unit=$((1024**3))
2420


Europlots is using `6 x 3.84TiB` disks (Ceph: 2 replicas), which gives it `(6*3.84)/2` = `11.52 TiB` of persistent storage space.

Ceph reports `20 TiB` as `AVAIL`, which corresponds to `6*3.84` = 23.04 TiB minus overhead (the exact disk sizes / Ceph metadata).

Taking the no. of replicas into account: `((20*1024)/2)` = `10240 GiB` is the amount of disk space his cluster has to offer for the deployments.

(`AVAIL` / replicas) - `PVC` => `10240 - 2420` = `7820 GiB` is the space the Ceph cluster can actually still offer.
However, the provider reports more, based on the previously mentioned formula `MAX AVAIL` - `USED` => `(9.3-1.1)*1024` = `8396.8` GiB. Because of that, the provider can over-allocate persistent storage to its deployments, as shown in the Hurricane provider example above.
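In other words, the current formula over-reports the space that can still safely be promised to new deployments by roughly:

```bash
echo '8396.8 - 7820' | bc   # ~576.8 GiB of potential over-allocation
```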

**UPDATE**:

Here is the actual space used by the PVCs on Europlots:

NAME                                          PROVISIONED  USED
csi-vol-1f2185ec-846d-11ee-8000-be1f38662678     1000 GiB   15 GiB
csi-vol-23ba52fa-3a22-11ee-8d82-6aa17b00d99b       16 GiB  348 MiB
csi-vol-23bae704-3a22-11ee-8d82-6aa17b00d99b       16 GiB  500 MiB
csi-vol-2b2c2567-4c63-11ed-8eaa-ce3b8929bf79       20 GiB   19 GiB
csi-vol-2f98a86f-52d1-11ee-8000-be1f38662678        1 GiB  176 MiB
csi-vol-2f9bd2a8-52d1-11ee-8000-be1f38662678        2 GiB  460 MiB
csi-vol-34de1928-4c05-11ee-8000-be1f38662678        3 GiB   36 MiB
csi-vol-72ac18c1-435f-11ee-8d82-6aa17b00d99b      100 MiB   28 MiB
csi-vol-72b1851c-435f-11ee-8d82-6aa17b00d99b        5 GiB   52 MiB
csi-vol-78845050-72f6-11ee-8000-be1f38662678      500 GiB   26 GiB
csi-vol-8f07e2f8-7822-11ee-8000-be1f38662678      250 GiB  137 GiB
csi-vol-b042d2fe-6a0f-11ee-8000-be1f38662678      512 MiB   20 MiB
csi-vol-d5cb3fc3-5fd5-11ee-8000-be1f38662678        1 GiB  296 MiB
csi-vol-d5cdee4c-5fd5-11ee-8000-be1f38662678        1 GiB  968 MiB
csi-vol-e754da41-6ce2-11ee-8000-be1f38662678      954 MiB   64 MiB
csi-vol-f911f06a-b62b-11ed-82f4-c22bed72523b       10 GiB  1.3 GiB
csi-vol-fb244441-0dff-11ee-8425-5af3b7a33171        1 GiB  344 MiB
csi-vol-fb286227-0dff-11ee-8425-5af3b7a33171        2 GiB  1.5 GiB
csi-vol-fe21fa44-4d7e-11ee-8000-be1f38662678      600 GiB  393 GiB



PROVISIONED: `1000+16+16+20+1+2+3+(100/1024)+5+500+250+(512/1024)+1+1+(954/1024)+10+1+2+600` = `2429 GiB`
USED: `15+(348/1024)+(500/1024)+19+(176/1024)+(460/1024)+(36/1024)+(28/1024)+(52/1024)+26+137+(20/1024)+(296/1024)+(968/1024)+(64/1024)+1.3+(344/1024)+1.5+393` = `596 GiB`
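The manual sums above can also be produced straight from `rbd disk-usage`. A sketch, assuming the pool is `akash-deployments` and that the JSON output exposes `provisioned_size` / `used_size` per image (the exact field names may differ between Ceph releases, so verify them with `--format json` on your cluster first):

```bash
# prints total provisioned GiB, then total actually-used GiB, across all images in the pool
kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- \
  rbd -p akash-deployments disk-usage --format json \
  | jq '([.images[].provisioned_size] | add / 1024 / 1024 / 1024), ([.images[].used_size] | add / 1024 / 1024 / 1024)'
```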

That explains the discrepancy between the actually available disk space `7820 GiB` (`AVAIL` - `PVC`) and what provider reports `8396.8 GiB` (`MAX AVAIL` - `USED`).

`[provider_reported_avail - used_by_PVC]` => `8396.8 - 596` = `7800.8 GiB` total available space (about `20 GiB` less than the `7820 GiB` above, as more space was consumed during the hour or two it took me to update this comment).