akash-network / support

Akash Support and Issue Tracking

provider incorrectly defaults to the last (dict sorted) GPU model in the SDL model list when forming order request before handing it to the bid price script #139

Open andy108369 opened 11 months ago

andy108369 commented 11 months ago

Environment:

Issue Summary:

The provider, despite supporting the correct GPU model and bidding accordingly, erroneously sets an unsupported GPU model when forming the order request. This error occurs because the provider defaults to the last (dict sorted) GPU model listed in the SDL, which may not be supported or may even be non-existent.

This leads to the bid price script calculating bids based on this incorrect GPU model, resulting in either inaccurate bids or a failure to bid if the provider has not set pricing for this model.

Steps to Reproduce:

  1. Have a provider with some GPU (e.g., a100).
  2. Create an SDL file listing multiple GPU models, placing one or more non-existent or random models first (e.g., - model: akgjkajgksag) and the supported model (a100) further down the list.
  3. Broadcast the SDL to initiate bidding from the provider.
  4. Review the order request and observe that it incorrectly specifies the GPU model that comes last in the SDL list after dict sorting, e.g., "model": "akgjkajgksag", not the supported a100.
  5. Notice that the bid price script fails to calculate a price due to the absence of pricing for the non-existent model akgjkajgksag.

Expected Behavior:

The provider should identify and select the GPU model it actually supports when forming the order request. This correct model should then be used by the bid price script for price calculation, ignoring any models that are not supported.
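To make the expected behavior concrete, here is a minimal shell sketch of the selection logic. It is not the provider's actual code; the function name pick_supported_model and its argument layout are hypothetical. It returns the first requested model that the provider also advertises, and fails when there is no overlap.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the provider): print the first requested
# GPU model that the provider also supports; return 1 when none matches.
pick_supported_model() {
  local supported="$1"   # newline-separated models the provider advertises
  shift                  # remaining arguments: models requested in the SDL
  local model
  for model in "$@"; do
    # -F fixed string, -x whole-line match (so a10 does not match a100), -q quiet
    if printf '%s\n' "$supported" | grep -Fxq -- "$model"; then
      printf '%s\n' "$model"
      return 0
    fi
  done
  return 1
}

# Provider supports a100; the SDL lists several models.
pick_supported_model "a100" v100 h100 a100 a40 a16 t4
```

With the inputs shown, the function prints a100 regardless of where a100 appears in the requested list.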

Actual Behavior:

The provider incorrectly selects the last (dict sorted) GPU model listed in the SDL for the order request. This misstep leads to the bid price script either not calculating a price or calculating an incorrect price, as it encounters an unsupported or non-existent GPU model.

Example

Provider attributes: supported GPU - a100

$ provider-services query provider get akash1c6rsz4f59nkus3s5qauxxh969j2mtkkn2clk2e -o text
attributes:
...
- key: capabilities/gpu/vendor/nvidia/model/a100
  value: "true"

SDL Contents:

Note that v100 is the last model in this list when dict (alphabetically) sorted. a100 is also in the list, so a provider with a100 bids on the order.

        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: v100
                - model: h100
                - model: a100
                - model: a40
                - model: a16
                - model: t4
                - model: rtx5000
                - model: rtx6000
                - model: a4000
                - model: a5000
                - model: a6000
                - model: 3090
                - model: 3090ti
                - model: 4090
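
The sort order of this list can be checked from a shell. This is a small sketch: the function name last_dict_sorted is hypothetical, and LC_ALL=C is an assumption added here so that the result does not depend on the host locale.

```shell
#!/usr/bin/env bash
# Print the model that comes last after dictionary sorting.
# LC_ALL=C pins the collation (an assumption; the provider may sort differently).
last_dict_sorted() {
  printf '%s\n' "$@" | LC_ALL=C sort -d | tail -n 1
}

last_dict_sorted v100 h100 a100 a40 a16 t4 rtx5000 rtx6000 \
  a4000 a5000 a6000 3090 3090ti 4090
```

For the model list above, this prints v100.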

The deployment order the provider forms (before passing it to the bid price script):

As demonstrated, the received order request incorrectly specifies the v100 model (which would be the last when dict sorted from the SDL models list) instead of the a100 model that the provider supports.

{
  "resources": [
    {
      "memory": 107374182400,
      "cpu": 8000,
      "gpu": {
        "units": 1,
        "attributes": {
          "vendor": {
            "nvidia": {
              "model": "v100"
            }
          }
        }
      },
      "storage": [
        {
          "class": "ephemeral",
          "size": 214748364800
        }
      ],
      "count": 1,
      "endpoint_quantity": 1,
      "ip_lease_quantity": 0
    }
  ],
  "price": {
    "denom": "uakt",
    "amount": "100000.000000000000000000"
  },
  "price_precision": 6
}
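
To see which GPU model an order request carries, the model field can be pulled out with plain text matching. This is only a sketch: the function name order_gpu_models is hypothetical, and it matches text rather than parsing JSON, so a real JSON tool would be more robust if one is available on the host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every GPU model named in an order request
# read from stdin. Text matching only, not a JSON parser.
order_gpu_models() {
  grep -o '"model"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*:[[:space:]]*"\([^"]*\)"$/\1/'
}

# Trimmed-down version of the order request shown above.
order_gpu_models <<'EOF'
{
  "resources": [
    {
      "gpu": {
        "units": 1,
        "attributes": { "vendor": { "nvidia": { "model": "v100" } } }
      }
    }
  ]
}
EOF
```

For the trimmed order request in the example, this prints v100.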

Additional information

The model the provider picks is the last one after dict (alphabetical) sorting.

        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: rtx4000
                - model: a1
                - model: a11
                - model: b1
                - model: b11
                - model: z
                - model: z1
                - model: z11
                - model: zz1
                - model: zzz1
                - model: zzz11
                - model: y
                - model: yy
                - model: yyy
                - model: yyy0
                - model: yyyy
                - model: yyyy0
                - model: zzz0
                - model: zzz
                - model: 1
                - model: 11
                - model: 9
                - model: 99999

root@akash-provider-0:/tmp# grep -C3 model akash1nx9pr8jee9jx44tkgt62fmgt2hmgvru92td3hg.log
        "attributes": {
          "vendor": {
            "nvidia": {
              "model": "zzz11"
            }
          }
        }

dict (alphabetical) sorting:

$ cat m | sort -d
1
11
9
99999
a1
a11
b1
b11
rtx4000
y
yy
yyy
yyy0
yyyy
yyyy0
z
z1
z11
zz1
zzz
zzz0
zzz1
zzz11

andy108369 commented 11 months ago

Partial workaround

I developed a partial workaround for the bid price script: when GPU model detection fails because of this issue (#139), the script sets the GPU price to the highest of all prices configured by the provider owner via price_target_gpu_mappings.
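
The fallback idea can be sketched as follows. This is an illustration, not the code from price_script_generic.sh: the function name gpu_price_or_max is hypothetical, and the "model=price,model=price" mapping format with integer prices is an assumption made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: look up the price for a GPU model in a
# "model=price,model=price" mapping string (assumed format, integer prices).
# When the model has no entry, fall back to the highest price in the mapping.
gpu_price_or_max() {
  local model="$1" mappings="$2"
  local entry name price max=0 found=""
  local IFS=','
  for entry in $mappings; do        # unquoted on purpose: split on commas
    name="${entry%%=*}"
    price="${entry#*=}"
    if [ "$name" = "$model" ]; then
      found="$price"
    fi
    if [ "$price" -gt "$max" ]; then
      max="$price"
    fi
  done
  if [ -n "$found" ]; then
    printf '%s\n' "$found"
  else
    printf '%s\n' "$max"
  fi
}

mappings="a100=120,t4=80,h100=200"   # example values, not real pricing
gpu_price_or_max a100 "$mappings"    # model has a price: prints 120
gpu_price_or_max v100 "$mappings"    # unknown model: prints 200, the highest
```

Falling back to the highest price means the provider may overbid for a cheaper GPU, but it never underprices an expensive one, which is why the workaround is only partial.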

Follow these steps to upgrade your bid price script:

  1. Get the latest bid price script:

wget https://raw.githubusercontent.com/akash-network/helm-charts/main/charts/akash-provider/scripts/price_script_generic.sh

  2. Apply it:

Don't forget to include any extra flags you used previously.
You can run helm -n akash-services get values akash-provider to see your current values.

helm upgrade akash-provider akash/provider -n akash-services -f provider.yaml --set bidpricescript="$(cat ./price_script_generic.sh | openssl base64 -A)"