akash-network / support

Akash Support and Issue Tracking

Feature Request: support GPU selection based on additional resources (e.g. available VRAM) #148

Closed andy108369 closed 7 months ago

andy108369 commented 10 months ago

The feature was requested by Zach from Foundry in October 2023.
I've tested the following on akash 0.26.1 and provider 0.4.6.

Goal

The provider has 40GB A100s on the network and is adding 80GB A100s as well. They are the same model but have different amounts of VRAM. The provider is wondering whether to just label everything a100-80gb or similar.

Implementation (PoC)

CONFIG

Label the worker nodes hosting the 40Gi and 80Gi A100s as follows:

akash.network/capabilities.gpu.vendor.nvidia.model.a100
akash.network/capabilities.gpu.vendor.nvidia.model.a100.40Gi
akash.network/capabilities.gpu.vendor.nvidia.model.a100.80Gi
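
A minimal sketch of applying these with kubectl (node name is a placeholder, and each node should only receive the labels matching the GPUs it actually hosts):

kubectl label node <gpu-worker-node> \
  akash.network/capabilities.gpu.vendor.nvidia.model.a100=true \
  akash.network/capabilities.gpu.vendor.nvidia.model.a100.40Gi=true

The matching provider attributes and GPU price mapping: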
attributes:
...
  - key: capabilities/gpu/vendor/nvidia/model/a100
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/40Gi
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/80Gi
    value: true
price_target_gpu_mappings:  "a100=950,a100.40Gi=900,v100=350,rtx-8000=450,*=950"
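
Roughly how this could be wired up; the update command and Helm value below are assumptions based on a typical provider setup, not the exact commands used here:

# provider.yaml holds the attributes block above (host_uri etc. omitted);
# standard --chain-id/--node/--gas flags are left out of this sketch
provider-services tx provider update provider.yaml --from <provider-key>

# if the provider is deployed from the Helm charts, the GPU price mapping is passed as a chart value
helm upgrade --install akash-provider akash/provider -n akash-services \
  -f provider-values.yaml \
  --set price_target_gpu_mappings="a100=950,a100.40Gi=900,v100=350,rtx-8000=450,*=950"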

SDL test for requesting an A100 40Gi GPU:

        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: a100
                  ram: 40Gi
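
For context, that gpu stanza sits inside a compute profile; a minimal (hypothetical) profile around it, with the profile name and CPU/memory/storage sizes as placeholders, would look like:

profiles:
  compute:
    gpu-test:
      resources:
        cpu:
          units: 1
        memory:
          size: 1Gi
        storage:
          size: 1Gi
        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: a100
                  ram: 40Gi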

TEST1

Provider thinks there is insufficient capacity:

D[2023-10-24|16:35:21.742] reservation requested                        module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13383046/1/1 resources[{resource:{id:1,cpu:{units:{val:1000}},memory:{size:{val:1073741824}},storage:[{name:default,size:{val:1073741824}}],gpu:{units:{val:1},attributes:[{key:vendor/nvidia/model/a100/40Gi,value:true}]},endpoints:[{kind:1,sequence_number:0},{sequence_number:0}]},count:1,price:{denom:uakt,amount:1000000.000000000000000000}}]=(MISSING)
I[2023-10-24|16:35:21.742] insufficient capacity for reservation        module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13383046/1/1
E[2023-10-24|16:35:21.742] reserving resources                          module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13383046/1/1 err="insufficient capacity"

Somehow the akash-provider derives vendor/nvidia/model/a100/40Gi (notice that the ram token isn't there) from the following client SDL and attempts to evaluate 40Gi as an attribute it doesn't have (value:true ??), which is interesting:

        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: a100
                  ram: 40Gi
andy108369 commented 10 months ago

Might this get solved by https://github.com/akash-network/support/issues/141?

troian commented 10 months ago

This is already supported by the provider codebase as well as the clients. The node must be labeled as follows (mind the ram token): capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi

andy108369 commented 10 months ago

> This is already supported by the provider codebase as well as the clients. The node must be labeled as follows (mind the ram token): capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi

Adding the missing ram token, the labels should look like this:

akash.network/capabilities.gpu.vendor.nvidia.model.a100
akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi
akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.80Gi
attributes:
...
  - key: capabilities/gpu/vendor/nvidia/model/a100
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
    value: true
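
A sketch of re-labeling the node accordingly (the trailing "-" removes a label; node name is a placeholder, and only the labels matching the GPUs actually present on that node should be applied):

kubectl label node <gpu-worker-node> \
  akash.network/capabilities.gpu.vendor.nvidia.model.a100.40Gi- \
  akash.network/capabilities.gpu.vendor.nvidia.model.a100.80Gi-
kubectl label node <gpu-worker-node> \
  akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi=true \
  akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.80Gi=true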
andy108369 commented 10 months ago

@troian Unfortunately, this didn't seem to work:

# kubectl -n akash-services logs akash-provider-0 --tail=10000 | grep 13795882
I[2023-11-22|16:17:33.967] order detected                               module=bidengine-service cmp=provider order=order/akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
I[2023-11-22|16:17:33.972] group fetched                                module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
D[2023-11-22|16:17:33.972] unable to fulfill: incompatible attributes for resources requirements module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1 wanted="{Name:akash Requirements:{SignedBy:{AllOf:[] AnyOf:[]} Attributes:[{Key:host Value:akash} {Key:organization Value:foundrydigital}]} Resources:[{Resources:{ID:1 CPU:units:<val:\"1000\" >  Memory:quantity:<val:\"1073741824\" >  Storage:[{Name:default Quantity:{Val:1073741824} Attributes:[]}] GPU:units:<val:\"1\" > attributes:<key:\"vendor/nvidia/model/a100/40Gi\" value:\"true\" >  Endpoints:[{Kind:RANDOM_PORT SequenceNumber:0} {Kind:SHARED_HTTP SequenceNumber:0}]} Count:1 Price:1000000.000000000000000000uakt}]}" have="[{Key:region Value:us-east} {Key:host Value:akash} {Key:tier Value:community} {Key:organization Value:foundrydigital} {Key:location-region Value:na-us-northeast} {Key:email Value:hello@foundrydigital.com} {Key:country Value:US} {Key:website Value:www.foundrydigital.com} {Key:timezone Value:UTC-4} {Key:location-type Value:office} {Key:capabilities/gpu/vendor/nvidia/model/rtx8000 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/v100 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi Value:true} {Key:capabilities/gpu Value:nvidia} {Key:capabilities/cpu Value:intel} {Key:capabilities/cpu/arch Value:x86-64} {Key:capabilities/memory Value:ddr4}]"
D[2023-11-22|16:17:33.973] declined to bid                              module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
I[2023-11-22|16:17:33.973] shutting down                                module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1

provider attributes:

$ provider-services query provider get akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el -o text
attributes:
- key: region
  value: us-east
- key: host
  value: akash
- key: tier
  value: community
- key: organization
  value: foundrydigital
- key: location-region
  value: na-us-northeast
- key: email
  value: hello@foundrydigital.com
- key: country
  value: US
- key: website
  value: www.foundrydigital.com
- key: timezone
  value: UTC-4
- key: location-type
  value: office
- key: capabilities/gpu/vendor/nvidia/model/rtx8000
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/v100
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
  value: "true"
- key: capabilities/gpu
  value: nvidia
- key: capabilities/cpu
  value: intel
- key: capabilities/cpu/arch
  value: x86-64
- key: capabilities/memory
  value: ddr4
host_uri: https://provider.akash.foundrystaking.com:8443
info:
  email: ""
  website: ""
owner: akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el
The GPU worker node's labels at this point:

Labels:             akash.network/capabilities.gpu.vendor.nvidia.model.a100=true
                    akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi=true

Then we tried removing the akash.network/capabilities.gpu.vendor.nvidia.model.a100 label, leaving only the akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi one (and bouncing the akash-provider pod):

$ kubectl describe node/prd-stk-tsr-sdgx-32 | grep -A10 Label
Labels:             akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi=true
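
(For reference, the label removal and provider bounce look roughly like this; the exact restart command is an assumption:)

kubectl label node prd-stk-tsr-sdgx-32 akash.network/capabilities.gpu.vendor.nvidia.model.a100-
kubectl -n akash-services delete pod akash-provider-0   # pod is recreated and picks up the new labels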

didn't help :confused: