akash-network / support

Akash Support and Issue Tracking

Support for requesting max SHM size from SDL #179

Closed anilmurty closed 3 months ago

anilmurty commented 5 months ago

Is your feature request related to a problem? Please describe.

Customers (particularly those running AI/ML training workloads) frequently need multiple services to share storage. For example, a CPU-bound service that downloads and labels data and a GPU-bound service that trains on that data can run in parallel, but both need access to a large shared memory region. We currently don't let the user control the max SHM size, which makes such workloads hard to run.

Describe the solution you'd like

Support specifying and requesting the SHM size as part of the SDL.

Describe alternatives you've considered

  1. Manually applying it on the provider: This is the workaround we have been pursuing so far, but it's painful because it has to be repeated every time a new deployment is created or an existing deployment restarts. It also requires coordination with the provider, who may not be in the same time zone as the tenant. Note that we have tested applying these changes manually on the provider side during our work with Thumper training on the FoundryStaking provider.

Additional context

No response

anilmurty commented 4 months ago

Per Feb 20 call: We are leaning towards implementing full support for SHM (not just the workaround with bid attributes). @boz is planning to take this on (Thanks Adam!)

anilmurty commented 4 months ago

In the interim, @troian and @chainzero are going to look into a workaround that uses bid attributes plus a daemon running on the provider, which checks the attributes and applies SHM using kubectl commands.
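
For context, shared memory in a Kubernetes pod is commonly provided by mounting a memory-backed emptyDir volume at /dev/shm. The fragment below is only a sketch of the kind of pod spec such a provider-side workaround would need to end up with. The thread does not show the actual commands or manifests used, and the pod name, container name, and 2Gi limit are illustrative assumptions.

```yaml
# Hypothetical pod spec; names and sizes are illustrative, not taken from a provider.
apiVersion: v1
kind: Pod
metadata:
  name: shm-example
spec:
  containers:
    - name: trainer
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: shm            # must match a volume declared under spec.volumes
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory         # tmpfs-backed; usage counts against the container's memory limit
        sizeLimit: 2Gi
```

With a spec like this, running `df -h /dev/shm` inside the container is one way to check whether the mount is larger than the small container-runtime default.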

brewsterdrinkwater commented 4 months ago

March 4th, 2024:

brewsterdrinkwater commented 3 months ago

March 12th, 2024:

Does not need a network upgrade. No SDL changes.

andy108369 commented 3 months ago

Akash network 0.32.2, provider-services 0.5.9

shm doesn't seem to be working yet.

provider attributes

$ provider-services query provider get akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk -o text
attributes:
- key: host
  value: akash
- key: organization
  value: overclock
- key: datacenter
  value: hurricane
- key: capabilities/gpu/vendor/nvidia/model/t4
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/t4/ram/16Gi
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/t4/ram/16Gi/interface/pcie
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/t4/interface/pcie
  value: "true"
- key: capabilities/storage/1/class
  value: default
- key: capabilities/storage/1/persistent
  value: "true"
- key: capabilities/storage/2/class
  value: beta3
- key: capabilities/storage/2/persistent
  value: "true"
- key: capabilities/storage/3/class
  value: ram
- key: capabilities/storage/3/persistent
  value: "false"
- key: ip-lease
  value: "true"
host_uri: https://provider.hurricane.akash.pub:8443
info:
  email: hosting@ovrclk.com
  website: https://akash.network
owner: akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk

SDL

---
version: "2.0"

services:
  ssh:
    image: ubuntu:22.04
    env:
      - 'SSH_PUBKEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINNFxqDbY0BlEjJ2y9B2IKUUoimOq6oAC7WcsQT8qmII andrey@akash.network'
    command:
      - "sh"
      - "-c"
    args:
      - 'apt-get update;
      apt-get install -y --no-install-recommends -- ssh;
      mkdir -p -m0755 /run/sshd;
      mkdir -m700 ~/.ssh;
      echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys;
      chmod 0600 ~/.ssh/authorized_keys;
      ls -lad ~ ~/.ssh ~/.ssh/authorized_keys;
      md5sum ~/.ssh/authorized_keys;
      exec /usr/sbin/sshd -D'
    params:
      storage:
        shm:
          mount: /dev/shm
    expose:
      - port: 8080
        as: 80
        to:
          - global: true
      # SSH
      - port: 22
        as: 22
        to:
          - global: true

profiles:
  compute:
    ssh:
      resources:
        cpu:
          units: 1
        memory:
          size: 4Gi
        storage:
          - size: 10Gi
          - name: shm
            size: 2Gi
            attributes:
              class: ram
  placement:
    akash:
      attributes:
        host: akash
        #organization: someorg
      #signedBy:
      #  anyOf:
      #    - "akash1365yvmc4s7awdyj3n2sav7xfx76adc6dnmlx63"
      pricing:
        ssh:
          denom: uakt
          amount: 1000000

deployment:
  ssh:
    akash:
      profile: ssh
      count: 1

after send-manifest:

E[2024-03-30|21:36:59.567] applying deployment                          cmp=provider client=kube err="Deployment.apps "ssh" is invalid: spec.template.spec.containers[0].volumeMounts[0].name: Not found: "ssh-shm"" lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk service=ssh
E[2024-03-30|21:36:59.567] unable to deploy lid=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk. last known state:
cmp=provider client=kube
E[2024-03-30|21:36:59.567] deploying workload                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=akash err="Deployment.apps "ssh" is invalid: spec.template.spec.containers[0].volumeMounts[0].name: Not found: "ssh-shm""
E[2024-03-30|21:36:59.567] execution error                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=akash state=deploy-active err="Deployment.apps "ssh" is invalid: spec.template.spec.containers[0].volumeMounts[0].name: Not found: "ssh-shm""
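
The validation error above is the one Kubernetes returns when a container's volumeMounts entry names a volume that is not declared in the pod's volumes list. A minimal sketch of that shape follows. Apart from the ssh and ssh-shm names, which appear in the log, everything here is an assumption about what the generated Deployment looked like, not the provider's actual output.

```yaml
# Hypothetical excerpt of a Deployment that would fail with the "Not found" error.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ssh
spec:
  selector:
    matchLabels:
      app: ssh
  template:
    metadata:
      labels:
        app: ssh
    spec:
      containers:
        - name: ssh
          image: ubuntu:22.04
          volumeMounts:
            - name: ssh-shm        # referenced here ...
              mountPath: /dev/shm
      # ... but there is no matching entry under volumes, so validation fails.
      # Declaring the volume would resolve the reference, for example:
      # volumes:
      #   - name: ssh-shm
      #     emptyDir:
      #       medium: Memory
```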

@troian

andy108369 commented 3 months ago

SDL (pers.volume + /dev/shm)

With two volumes (a persistent volume plus shm (ram)), I'm getting "manifest version validation failed" from the provider.

SDL:

---
version: "2.0"

services:
  ssh:
    image: ubuntu:22.04
    env:
      - 'SSH_PUBKEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINNFxqDbY0BlEjJ2y9B2IKUUoimOq6oAC7WcsQT8qmII andrey@akash.network'
    command:
      - "sh"
      - "-c"
    args:
      - 'apt-get update;
      apt-get install -y --no-install-recommends -- ssh;
      mkdir -p -m0755 /run/sshd;
      mkdir -m700 ~/.ssh;
      echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys;
      chmod 0600 ~/.ssh/authorized_keys;
      ls -lad ~ ~/.ssh ~/.ssh/authorized_keys;
      md5sum ~/.ssh/authorized_keys;
      exec /usr/sbin/sshd -D'
    params:
      storage:
        data:
          mount: /root
        shm:
          mount: /dev/shm
    expose:
      - port: 8080
        as: 80
        to:
          - global: true
      # SSH
      - port: 22
        as: 22
        to:
          - global: true

profiles:
  compute:
    ssh:
      resources:
        cpu:
          units: 1
        memory:
          size: 4Gi
        storage:
          - size: 10Gi
          - name: data
            size: 5Gi
            attributes:
              persistent: true
              class: beta3
          - name: shm
            size: 2Gi
            attributes:
              class: ram
  placement:
    akash:
      attributes:
        host: akash
        #organization: someorg
      #signedBy:
      #  anyOf:
      #    - "akash1365yvmc4s7awdyj3n2sav7xfx76adc6dnmlx63"
      pricing:
        ssh:
          denom: uakt
          amount: 1000000

deployment:
  ssh:
    akash:
      profile: ssh
      count: 1

Client:

provider-services 0.5.9


arno@x1:~/git/akash-tools/cli-booster[https://rpc.akashnet.net:443][default][]$ akash_deploy ssh-shm-and-pers.yaml
INFO: Broadcasting 'provider-services deployment create -y --deposit 500000uakt -- ssh-shm-and-pers.yaml' transaction...
INFO: Waiting for the TX 1CCD212E8E216E23A168B32C922CDAED988C9710F2C15FF5FE601A61B6069BAB to get processed by the Akash network
INFO: Success

arno@x1:~/git/akash-tools/cli-booster[https://rpc.akashnet.net:443][default][15663077--1]$ akash_accept
   rate  monthly  usd    dseq/gseq/oseq  provider                                      host
0> 1.00  0.42     $2.05  15663077/1/1    akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk  provider.hurricane.akash.pub:8443
Choose your bid from the list [0]: 0
INFO: Accepting the bid offered by akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk provider for 15663077/1/1 deployment
INFO: Broadcasting 'provider-services market lease create -y' transaction...
INFO: Waiting for the TX 8C491ED0121C492912A0F38D57725758D66ABA16E9184531DF16EA4E70A976E4 to get processed by the Akash network
INFO: Success
8C491ED0121C492912A0F38D57725758D66ABA16E9184531DF16EA4E70A976E4

arno@x1:~/git/akash-tools/cli-booster[https://rpc.akashnet.net:443][default][15663077-1-1]$ akash_send_manifest ssh-shm-and-pers.yaml
Detected provider for 15663077/1/1: akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
[{"provider":"akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk","status":"FAIL","error":"remote server returned 500","errorMessage":"manifest version validation failed\n"}]
Error: submit manifest to some providers has been failed
ERROR: provider-services send-manifest failed with '1' code.


Provider (v0.5.9):

$ kubectl -n akash-services logs akash-provider-0 --tail=100 -f | grep -Evi 'check|result|IP|replicas|dump'
Defaulted container "provider" out of: provider, init (init)
I[2024-03-30|21:47:55.201] order detected module=bidengine-service cmp=provider order=order/akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
I[2024-03-30|21:47:55.203] group fetched module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
I[2024-03-30|21:47:55.203] requesting reservation module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
D[2024-03-30|21:47:55.203] reservation requested. order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1, resources=[{"resource":{"id":1,"cpu":{"units":{"val":"1000"}},"memory":{"size":{"val":"4294967296"}},"storage":[{"name":"shm","size":{"val":"2147483648"},"attributes":[{"key":"class","value":"ram"},{"key":"persistent","value":"false"}]},{"name":"data","size":{"val":"5368709120"},"attributes":[{"key":"class","value":"beta3"},{"key":"persistent","value":"true"}]},{"name":"default","size":{"val":"10737418240"}}],"gpu":{"units":{"val":"0"}},"endpoints":[{"kind":1,"sequence_number":0},{"sequence_number":0}]},"count":1,"price":{"denom":"uakt","amount":"1000000.000000000000000000"}}] module=provider-cluster cmp=provider cmp=service cmp=inventory-service
D[2024-03-30|21:47:55.203] reservation count module=provider-cluster cmp=provider cmp=service cmp=inventory-service cnt=1
I[2024-03-30|21:47:55.203] Reservation fulfilled module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
D[2024-03-30|21:47:55.205] submitting fulfillment module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1 price=1.000000000000000000uakt

I[2024-03-30|21:48:01.322] bid complete module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1

I[2024-03-30|21:48:13.520] lease won module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1 lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-03-30|21:48:13.520] shutting down module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
I[2024-03-30|21:48:13.520] lease won module=provider-manifest cmp=provider lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-03-30|21:48:13.520] new lease module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
D[2024-03-30|21:48:13.521] watchdog start module=provider-manifest cmp=provider leaseID=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-03-30|21:48:13.525] data received module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 version=9d34f853c8a02e32abd64aaec0900c67abbdf9be5584177c33753198638d8ab3

I[2024-03-30|21:48:20.899] watchdog done module=provider-manifest cmp=provider lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077
I[2024-03-30|21:48:20.899] manifest received module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077
I[2024-03-30|21:48:20.901] data received module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 version=9d34f853c8a02e32abd64aaec0900c67abbdf9be5584177c33753198638d8ab3
I[2024-03-30|21:48:20.901] deployment version mismatch module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 expected=9D34F853C8A02E32ABD64AAEC0900C67ABBDF9BE5584177C33753198638D8AB3 got=95A422D963420A7C974C8F8B8EC0569CDA801EFB2B1C3595634F5891B5030A4E
E[2024-03-30|21:48:20.901] invalid manifest: %s module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 manifestversionvalidationfailed=(MISSING)
D[2024-03-30|21:48:20.901] requests valid module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 num-requests=0
E[2024-03-30|21:48:20.901] manifest submit failed cmp=provider err="manifest version validation failed"

andy108369 commented 3 months ago

I've tested provider-services 0.5.11, and everything works there.

Details: https://github.com/akash-network/helm-charts/pull/268