**andy108369** opened this issue 7 months ago
This is also the reason the Hurricane provider reports `-567 Gi` of persistent storage available:

Akash-Provider currently reports `MAX AVAIL` - `USED` => `393-960` => a negative `-567 Gi` of available persistent storage.

Clarification: `USED` is `STORED x No_Replicas` here, i.e. `480 x 2` = `960 Gi` (the server has 2x `931.5G` disks with two `465.8G` OSDs each).
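A quick sketch of the current (broken) arithmetic in plain shell; the numbers are the ones from the `ceph df` output below:

```shell
# Numbers taken from `ceph df` on the Hurricane provider.
STORED=480     # GiB stored in the akash-deployments pool
REPLICAS=2     # pool size (replication factor)
MAX_AVAIL=393  # GiB, Ceph's MAX AVAIL for the pool

USED=$((STORED * REPLICAS))                          # Ceph's USED = STORED x No_Replicas
echo "USED = ${USED} Gi"                             # 960 Gi
echo "MAX AVAIL - USED = $((MAX_AVAIL - USED)) Gi"   # 393 - 960 = -567 Gi
```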
```
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph df
--- RAW STORAGE ---
CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd     1.8 TiB  898 GiB  965 GiB    965 GiB      51.79
TOTAL   1.8 TiB  898 GiB  965 GiB    965 GiB      51.79

--- POOLS ---
POOL               ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                1    1  449 KiB        2  904 KiB      0    393 GiB
akash-deployments   2  256  480 GiB  123.31k  960 GiB  54.98    393 GiB
```
```
$ provider_info.sh provider.hurricane.akash.pub
type       cpu     gpu  ram                ephemeral           persistent
used       58.6    0    169.5              746.5               550
pending    0       0    0                  0                   0
available  34.295  1    4.681840896606445  1062.2646561246365  -567.2483718525618
node       34.295  1    4.681840896606445  1062.2646561246365  N/A
```
Ceph config - 2 replicas:
```
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd pool get akash-deployments all
size: 2
min_size: 2
...
```
PVC:

```
$ kubectl get pvc -A |grep -E '[a-z0-9]{45}' | awk '{print $5}' | sed 's/i//g' | numfmt --from=iec | tr '\n' '+' | sed 's/+$/\n/g' | bc -l | numfmt --to-unit=$((1024**3))
550
```
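The same total can be cross-checked from the PVC specs directly; this is just a sketch (it sums every PVC in the cluster rather than grepping for deployment namespaces, so filter if needed):

```shell
# Sum the storage requested by all PVCs, printed in GiB.
kubectl get pvc -A -o jsonpath='{range .items[*]}{.spec.resources.requests.storage}{"\n"}{end}' \
  | numfmt --from=auto \
  | awk '{sum += $1} END {printf "%d\n", sum / (1024^3)}'
```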
The provider should have calculated its available persistent storage as `(AVAIL / REPLICAS) - CLAIMED_BY_PVC` => `(898/2)-550` = `-101` Gi, meaning it has already been over-provisioned.
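That corrected formula, sketched with the numbers above (`AVAIL` from `ceph df`, replica count from `ceph osd pool get`):

```shell
AVAIL=898      # GiB, raw AVAIL from `ceph df`
REPLICAS=2     # pool replication factor
CLAIMED=550    # GiB requested by the PVCs

# (AVAIL / REPLICAS) - CLAIMED_BY_PVC
echo "available: $((AVAIL / REPLICAS - CLAIMED)) Gi"   # 449 - 550 = -101 Gi
```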
It is still running well though, because the PVCs weren't 100% full yet:
```
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0       492G  193G  299G  40% /root/.osmosisd
/dev/rbd1        49G   44K   49G   1% /root
```
FWIW, the `Used` space as reported by `df` doesn't mean as much, because even after files get removed from these rbd devices the data still occupies space on the disk (data remanence); only the inode (or its visibility flag) gets removed.
```
$ kubectl -n rook-ceph exec -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- bash
[root@rook-ceph-tools-846b5c845b-qrf7c /]# ceph osd pool ls
.mgr
akash-deployments
[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd pool stats akash-deployments
Total Images: 2
Total Snapshots: 0
Provisioned Size: 550 GiB
[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd -p akash-deployments ls
csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863
csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657
[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd -p akash-deployments disk-usage csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657
warning: fast-diff map is not enabled for csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657. operation may be slow.
NAME                                          PROVISIONED  USED
csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657       50 GiB  17 GiB
[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd -p akash-deployments disk-usage csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863
warning: fast-diff map is not enabled for csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863. operation may be slow.
NAME                                          PROVISIONED  USED
csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863      500 GiB  469 GiB
```
`469+17` = `486 GiB` is the disk space actually used by these two PVCs (`6 Gi` more than when I issued `ceph df` above, as the apps keep writing data).
```
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- sh -c 'ceph osd pool ls | while read POOL; do echo "=== pool: $POOL ==="; rbd -p "$POOL" ls | while read VOL; do rbd -p "$POOL" disk-usage "$VOL"; done; done'
=== pool: .mgr ===
=== pool: akash-deployments ===
warning: fast-diff map is not enabled for csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863. operation may be slow.
NAME                                          PROVISIONED  USED
csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863      500 GiB  469 GiB
...
```
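For what it's worth, the per-volume `USED` figures can also be totaled in one go with `rbd du` on the pool plus a small awk filter (a sketch; it assumes every volume reports its usage in GiB, as is the case here):

```shell
kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- \
  rbd -p akash-deployments du 2>/dev/null \
  | awk '$1 ~ /^csi-vol/ && $NF == "GiB" {total += $(NF-1)} END {print total " GiB"}'
```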
```
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- sh -c 'ceph osd pool ls | while read POOL; do echo "=== pool: $POOL ==="; rbd -p "$POOL" ls | while read VOL; do ceph osd map "$POOL" "$VOL"; done; done'
=== pool: .mgr ===
=== pool: akash-deployments ===
osdmap e2192 pool 'akash-deployments' (2) object 'csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863' -> pg 2.d2caa6be (2.be) -> up ([3,1], p3) acting ([3,1], p3)
...
```
Akash Provider calculates the available persistent storage as `MAX AVAIL` - `USED` => `(9.3-1.1)*1024` = `8396.8` GiB, which matches the available persistent storage the provider reports.

However, in fact the provider should have `(AVAIL / REPLICAS) - CLAIMED_BY_PVC` => `(20*1024/2)-2420` => `7820 GiB` (or `7.64 TiB`) of available space.
```
# kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                       IMAGE
akash-hostname-operator-54854db4c5-c6wvl   ghcr.io/akash-network/provider:0.4.6
akash-inventory-operator-5ff867f6d9-cvx28  ghcr.io/akash-network/provider:0.4.6
akash-ip-operator-79cc857f7b-fj8hd         ghcr.io/akash-network/provider:0.4.6
akash-node-1-0                             ghcr.io/akash-network/node:0.26.2
akash-provider-0                           ghcr.io/akash-network/provider:0.4.7
```
```
# kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    21 TiB  20 TiB  1.2 TiB   1.2 TiB       5.55
TOTAL  21 TiB  20 TiB  1.2 TiB   1.2 TiB       5.55

--- POOLS ---
POOL               ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                1    1  6.6 MiB        3   20 MiB      0    6.2 TiB
akash-nodes         2   32     19 B        1    8 KiB      0    9.3 TiB
akash-deployments   3  512  588 GiB  152.59k  1.1 TiB   5.79    9.3 TiB
```
```
$ provider_info.sh provider.europlots.com
type       cpu      gpu  ram                 ephemeral           persistent
used       98.15    1    285.83948681596667  1550.8649163246155  2414.4313225746155
pending    0        0    0                   0                   0
available  283.585  1    650.6061556553468   10326.47616339475   8387.475747092627
node       136.015  1    307.38518168684095  4418.74621364288    N/A
node       122.015  0    309.32234382629395  5785.588328568265   N/A
node       25.555   0    33.898630142211914  122.141621183604    N/A
```
His Ceph is configured with 2 replicas for the objects:
```
# kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd pool get akash-deployments all
size: 2
min_size: 2
...
```
The actual space reserved by the PVCs is `2420 GiB` (or `2.4 TiB`):
```
$ cat message\ \(7\).txt | grep -E '[a-z0-9]{45}'
01pil6i48e91fr0k3jlhdakid6sg2q2g2dm2muk343gc8   node-certs-node-0                      Bound   pvc-cee47c63-a161-48b1-a819-c77df64ab195   100Mi  RWO  beta3  82d
01pil6i48e91fr0k3jlhdakid6sg2q2g2dm2muk343gc8   postgres-data-postgres-0               Bound   pvc-26535f51-7205-4c78-b24f-6af3ef30aed9   5Gi    RWO  beta3  82d
18mh9uqn9n92nn165jveibikldaoksook0ioug62e0ejo   db-wordpress-db-db-0                   Bound   pvc-6fc61cc9-828c-4312-a880-03a7f5755bcd   1Gi    RWO  beta3  46d
18mh9uqn9n92nn165jveibikldaoksook0ioug62e0ejo   wordpress-wordpress-data-wordpress-0   Bound   pvc-95c7733d-4f3d-40f6-93bb-5ae723860603   1Gi    RWO  beta3  46d
...

$ cat message\ \(7\).txt | grep -E '[a-z0-9]{45}' | awk '{print $5}' | sed 's/i//g' | numfmt --from=iec | tr '\n' '+' | sed 's/+$/\n/g' | bc -l | numfmt --to-unit=$((1024**3))
2420
```
Europlots is using `6 x 3.84TiB` disks (ceph: 2 replicas) which gives him `(6*3.84)/2` = `11.52 TiB` of available persistent storage space.
Ceph reports `20 TiB` as `AVAIL`, which corresponds to `6*3.84` = `23.04 TiB` minus some overhead (exact disk sizes / Ceph metadata).
Taking the no. of replicas into account: `((20*1024)/2)` = `10240 GiB` is the amount of disk space his cluster has to offer for the deployments.
`AVAIL` - `PVC` => `10240-2420` = `7820 GiB` is the actually available space that the Ceph cluster can offer.
However, it reports more based on the previously mentioned formula: `MAX AVAIL` - `USED` => `(9.3-1.1)*1024` = `8396.8` GiB. Because of that, the provider can over-allocate persistent storage space to its deployments, as shown in the example with the Hurricane provider.
**UPDATE**:
Here is the actual used space by the PVC on Europlots:
```
NAME                                          PROVISIONED  USED
csi-vol-1f2185ec-846d-11ee-8000-be1f38662678     1000 GiB  15 GiB
csi-vol-23ba52fa-3a22-11ee-8d82-6aa17b00d99b       16 GiB  348 MiB
csi-vol-23bae704-3a22-11ee-8d82-6aa17b00d99b       16 GiB  500 MiB
csi-vol-2b2c2567-4c63-11ed-8eaa-ce3b8929bf79       20 GiB  19 GiB
csi-vol-2f98a86f-52d1-11ee-8000-be1f38662678        1 GiB  176 MiB
csi-vol-2f9bd2a8-52d1-11ee-8000-be1f38662678        2 GiB  460 MiB
csi-vol-34de1928-4c05-11ee-8000-be1f38662678        3 GiB  36 MiB
csi-vol-72ac18c1-435f-11ee-8d82-6aa17b00d99b      100 MiB  28 MiB
csi-vol-72b1851c-435f-11ee-8d82-6aa17b00d99b        5 GiB  52 MiB
csi-vol-78845050-72f6-11ee-8000-be1f38662678      500 GiB  26 GiB
csi-vol-8f07e2f8-7822-11ee-8000-be1f38662678      250 GiB  137 GiB
csi-vol-b042d2fe-6a0f-11ee-8000-be1f38662678      512 MiB  20 MiB
csi-vol-d5cb3fc3-5fd5-11ee-8000-be1f38662678        1 GiB  296 MiB
csi-vol-d5cdee4c-5fd5-11ee-8000-be1f38662678        1 GiB  968 MiB
csi-vol-e754da41-6ce2-11ee-8000-be1f38662678      954 MiB  64 MiB
csi-vol-f911f06a-b62b-11ed-82f4-c22bed72523b       10 GiB  1.3 GiB
csi-vol-fb244441-0dff-11ee-8425-5af3b7a33171        1 GiB  344 MiB
csi-vol-fb286227-0dff-11ee-8425-5af3b7a33171        2 GiB  1.5 GiB
csi-vol-fe21fa44-4d7e-11ee-8000-be1f38662678      600 GiB  393 GiB
```
PROVISIONED: `1000+16+16+20+1+2+3+(100/1024)+5+500+250+(512/1024)+1+1+(954/1024)+10+1+2+600` = `2429 GiB`
USED: `15+(348/1024)+(500/1024)+19+(176/1024)+(460/1024)+(36/1024)+(28/1024)+(52/1024)+26+137+(20/1024)+(296/1024)+(968/1024)+(64/1024)+1.3+(344/1024)+1.5+393` = `596 GiB`
That explains the discrepancy between the actually available disk space `7820 GiB` (`AVAIL` - `PVC`) and what provider reports `8396.8 GiB` (`MAX AVAIL` - `USED`).
`[provider_reported_avail - used_by_PVC]` => `8396.8 - 596` = `7800.8 GiB` total available space (less by about `20 GiB`, as that space was consumed over the hour or two I spent updating this comment).
The current reporting of available persistent storage by the provider, based on Ceph's `MAX AVAIL`, is not accurate. This is because Ceph's `MAX AVAIL` is a dynamic value that represents `MAX - USED`, and it decreases as storage is used. Consequently, the provider sometimes reports less available space than actually exists.

A key point of confusion arises with Kubernetes' PV (Persistent Volume) system. In Kubernetes, when a PV or PVC (Persistent Volume Claim) is created, it doesn't immediately reserve physical space in Ceph. Therefore, Ceph's `MAX AVAIL` doesn't change upon the creation of these volumes, leading to a discrepancy. It's only when data is actually written to these volumes that Ceph's `MAX AVAIL` decreases accordingly.

To provide a more accurate view of the available space, the provider should modify its display metrics. Instead of relying on Ceph's `MAX AVAIL`, it should calculate the actual available space as `[Total MAX space of Ceph] - [Reserved space in K8s (PV/PVC)]`. Here, `Total MAX space of Ceph` should be the entire storage capacity of the Ceph cluster, without deducting Ceph's `USED` amount (as Ceph's `MAX AVAIL` does now) or the space reserved by Kubernetes PV/PVC. This approach will give a more realistic representation of the available storage, accounting for the Kubernetes-reserved space.

NOTE: Ceph's `USED` is the `STORED x No_Replicas` in Ceph, which means the available persistent storage can easily go negative as soon as more than half of the space gets written to the persistent storage (with two replicas), or a quarter of it (with three replicas). See the example case from the Hurricane provider (two replicas).

**Tested Provider / Inventory Operator Versions**
- 0.4.6
- 0.4.7
- 0.4.8-rc0
**Scenario Illustration**

- A deployment requests `10Gi` of Persistent Storage.
- The provider reports `MAX AVAIL` from Ceph; it briefly drops by `10Gi` but quickly reverts (during bid/accepting bid/sending-manifest; so I presume some inner akash-provider mechanics). `ceph df` also reports `MAX AVAIL` => `30Gi`.
- Write `9Gi` of Data to the PV.
- After the `9Gi` of Data is written, the provider reports `MAX AVAIL` as `MAX - USED`.
```
--- POOLS ---
POOL               ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
akash-nodes         1   32     19 B        1    4 KiB      0     21 GiB
akash-deployments   2   32  9.0 GiB    2.33k  9.0 GiB  29.71     21 GiB
.mgr                3    1  449 KiB        2  452 KiB      0     21 GiB
```
```
$ provider_info.sh provider.provider-02.sandbox-01.aksh.pw
type       cpu    gpu  ram                 ephemeral           persistent
used       1      1    2                   5                   10
pending    0      0    0                   0                   0
available  13.65  1    28.632808685302734  169.28484315704554  12.302638040855527
node       13.65  1    28.632808685302734  169.28484315704554  N/A
```
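To tie it together, here is a rough sketch of how the proposed calculation (`[total Ceph capacity / replicas] - [space reserved by K8s PVCs]`) could be scripted against the rook-ceph toolbox. The pool name and the use of `jq` are assumptions; `ceph df -f json` and the PVC `requests.storage` fields are the data sources:

```shell
#!/bin/sh
# Sketch: available = (total raw capacity / replicas) - space reserved by PVCs.
TOOLS=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
POOL=akash-deployments

# Total raw capacity of the cluster in bytes (NOT MAX AVAIL).
TOTAL=$(kubectl -n rook-ceph exec -i "$TOOLS" -- ceph df -f json | jq -r '.stats.total_bytes')
# Replication factor of the deployments pool.
REPLICAS=$(kubectl -n rook-ceph exec -i "$TOOLS" -- ceph osd pool get "$POOL" size -f json | jq -r '.size')
# Bytes already promised to deployments via PVC requests.
CLAIMED=$(kubectl get pvc -A -o jsonpath='{range .items[*]}{.spec.resources.requests.storage}{"\n"}{end}' \
  | numfmt --from=auto | awk '{sum += $1} END {print sum}')

echo "$TOTAL $REPLICAS $CLAIMED" \
  | awk '{printf "available: %.1f GiB\n", ($1 / $2 - $3) / (1024^3)}'
```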