Closed andy108369 closed 7 months ago
might get solved by https://github.com/akash-network/support/issues/141 ?
This is already supported by the provider codebase as well as the clients.
The node must be labeled as follows (mind the ram token):
capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
Adding the missing ram token; it should look like so:
akash.network/capabilities.gpu.vendor.nvidia.model.a100
akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi
akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.80Gi
provider.yaml:
attributes:
  ...
  - key: capabilities/gpu/vendor/nvidia/model/a100
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
    value: true
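The node labels and the provider attribute keys above appear to differ only in the `akash.network/` prefix and in using `.` instead of `/` as the separator. A minimal sketch of that mapping, inferred purely from the examples in this thread (the function name is hypothetical and this is not taken from the provider codebase):

```python
# Assumption: a node label is the provider attribute key with "/" replaced
# by "." and prefixed with "akash.network/", as the examples above suggest.
LABEL_PREFIX = "akash.network/"


def attribute_key_to_node_label(key: str) -> str:
    """Turn a provider attribute key into the matching node label name."""
    return LABEL_PREFIX + key.replace("/", ".")


for key in (
    "capabilities/gpu/vendor/nvidia/model/a100",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi",
):
    print(attribute_key_to_node_label(key))
```

This can help to double-check that every `ram` attribute in provider.yaml has a corresponding label on the worker node.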
@troian Unfortunately, this didn't seem to work:
The requested GPU attribute still lacks the /ram token, as in the previous attempts:
# kubectl -n akash-services logs akash-provider-0 --tail=10000 | grep 13795882
I[2023-11-22|16:17:33.967] order detected module=bidengine-service cmp=provider order=order/akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
I[2023-11-22|16:17:33.972] group fetched module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
D[2023-11-22|16:17:33.972] unable to fulfill: incompatible attributes for resources requirements module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1 wanted="{Name:akash Requirements:{SignedBy:{AllOf:[] AnyOf:[]} Attributes:[{Key:host Value:akash} {Key:organization Value:foundrydigital}]} Resources:[{Resources:{ID:1 CPU:units:<val:\"1000\" > Memory:quantity:<val:\"1073741824\" > Storage:[{Name:default Quantity:{Val:1073741824} Attributes:[]}] GPU:units:<val:\"1\" > attributes:<key:\"vendor/nvidia/model/a100/40Gi\" value:\"true\" > Endpoints:[{Kind:RANDOM_PORT SequenceNumber:0} {Kind:SHARED_HTTP SequenceNumber:0}]} Count:1 Price:1000000.000000000000000000uakt}]}" have="[{Key:region Value:us-east} {Key:host Value:akash} {Key:tier Value:community} {Key:organization Value:foundrydigital} {Key:location-region Value:na-us-northeast} {Key:email Value:hello@foundrydigital.com} {Key:country Value:US} {Key:website Value:www.foundrydigital.com} {Key:timezone Value:UTC-4} {Key:location-type Value:office} {Key:capabilities/gpu/vendor/nvidia/model/rtx8000 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/v100 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi Value:true} {Key:capabilities/gpu Value:nvidia} {Key:capabilities/cpu Value:intel} {Key:capabilities/cpu/arch Value:x86-64} {Key:capabilities/memory Value:ddr4}]"
D[2023-11-22|16:17:33.973] declined to bid module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
I[2023-11-22|16:17:33.973] shutting down module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
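To make the mismatch in the `wanted=` / `have=` log line easier to see, here is a rough sketch that compares the requested GPU attribute key against the provider attributes. It assumes the provider looks up each requested key under the `capabilities/gpu/` prefix; that is a guess based on the keys in this log, not the actual bid engine logic, and the function name is hypothetical.

```python
# Assumption: requested GPU attribute keys are matched against provider
# attributes under the "capabilities/gpu/" prefix.
GPU_PREFIX = "capabilities/gpu/"


def unmatched_gpu_attributes(wanted_keys, provider_keys):
    """Return the requested GPU attribute keys the provider does not advertise."""
    advertised = set(provider_keys)
    return [key for key in wanted_keys if GPU_PREFIX + key not in advertised]


# GPU-related provider attribute keys copied from the have= part of the log.
provider_keys = [
    "capabilities/gpu/vendor/nvidia/model/rtx8000",
    "capabilities/gpu/vendor/nvidia/model/v100",
    "capabilities/gpu/vendor/nvidia/model/a100",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi",
]

# Key from the wanted= part of the log: there is no "ram" token in it.
print(unmatched_gpu_attributes(["vendor/nvidia/model/a100/40Gi"], provider_keys))
# The same request with the "ram" token present.
print(unmatched_gpu_attributes(["vendor/nvidia/model/a100/ram/40Gi"], provider_keys))
```

Under that assumption, the key from the order is reported as unmatched while the variant with the `ram` token matches, which would be consistent with the "incompatible attributes" message.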
provider attributes:
$ provider-services query provider get akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el -o text
attributes:
- key: region
  value: us-east
- key: host
  value: akash
- key: tier
  value: community
- key: organization
  value: foundrydigital
- key: location-region
  value: na-us-northeast
- key: email
  value: hello@foundrydigital.com
- key: country
  value: US
- key: website
  value: www.foundrydigital.com
- key: timezone
  value: UTC-4
- key: location-type
  value: office
- key: capabilities/gpu/vendor/nvidia/model/rtx8000
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/v100
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
  value: "true"
- key: capabilities/gpu
  value: nvidia
- key: capabilities/cpu
  value: intel
- key: capabilities/cpu/arch
  value: x86-64
- key: capabilities/memory
  value: ddr4
host_uri: https://provider.akash.foundrystaking.com:8443
info:
  email: ""
  website: ""
owner: akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el
Labels: akash.network/capabilities.gpu.vendor.nvidia.model.a100=true
akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi=true
Then we tried removing the akash.network/capabilities.gpu.vendor.nvidia.model.a100 label and leaving only the akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi one (and bouncing the akash-provider pod):
$ kubectl describe node/prd-stk-tsr-sdgx-32 | grep -A10 Label
Labels: akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi=true
didn't help :confused:
Goal
The provider has 40 GB A100s on the network and is adding 80 GB A100s too. They are the same model but with different VRAM. They are wondering whether to just label everything a100-80gb or similar.
Implementation (PoC)
CONFIG
Label the worker node with 40Gi & 80Gi a100's as follows:
provider.yaml:
provider.yaml for a100.40Gi
SDL test for requesting a100-40 GPU:
TEST1
Provider thinks there is insufficient capacity:
Somehow the akash-provider reads vendor/nvidia/model/a100/40Gi (notice, ram isn't there) based on the following client SDL and attempts to evaluate 40Gi as something it doesn't have (value:true ??), which is interesting: