SDL with GPU + no GPU deployment restarts akash-provider

zJuuu commented 1 year ago

Describe the bug: An SDL that combines a GPU service with a non-GPU service restarts akash-provider and all deployments.

SDL:

---
version: "2.0"

services:
  obtaingpu:
    image: ubuntu:22.04
    command:
      - "sh"
      - "-c"
    args:
      - 'uptime;
        nvidia-smi;
        sleep infinity'
    expose:
      - port: 8080
        as: 80
        to:
          - global: true
  nogpu:
    image: ubuntu:22.04
    command:
      - "sh"
      - "-c"
    expose:
      - port: 8080
        as: 80
        to:
          - global: true

profiles:
  compute:
    obtaingpu:
      resources:
        cpu:
          units: 1.0
        memory:
          size: 1Gi
        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
        storage:
          size: 1Gi
    nogpu:
      resources:
        cpu:
          units: 1.0
        memory:
          size: 1Gi
        storage:
          size: 1Gi
  placement:
    akash:
      pricing:
        obtaingpu: 
          denom: uakt
          amount: 10000000
        nogpu: 
          denom: uakt
          amount: 10000000

deployment:
  obtaingpu:
    akash:
      profile: obtaingpu
      count: 1
  nogpu:
    akash:
      profile: nogpu
      count: 1
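
For anyone triaging similar reports: the SDL above is notable for mixing one compute profile that requests a GPU with one that does not. The following is only a minimal sketch for spotting that combination, assuming the SDL has already been loaded into a plain dict (for example with a YAML parser). The function names `gpu_units` and `mixes_gpu_and_non_gpu` are hypothetical and not part of any Akash tooling, and the dict below mirrors only the `profiles.compute` part of the SDL.

```python
def gpu_units(profile: dict) -> int:
    """Return the GPU units requested by one compute profile (0 if none)."""
    # A profile without a "gpu" key, or with an empty one, requests no GPU.
    gpu = profile.get("resources", {}).get("gpu") or {}
    return int(gpu.get("units", 0))


def mixes_gpu_and_non_gpu(sdl: dict) -> bool:
    """True if at least one profile requests a GPU and at least one does not."""
    compute = sdl.get("profiles", {}).get("compute", {})
    units = [gpu_units(profile) for profile in compute.values()]
    return any(u > 0 for u in units) and any(u == 0 for u in units)


# Hand-written copy of the compute profiles from the SDL in this report.
sdl = {
    "profiles": {
        "compute": {
            "obtaingpu": {
                "resources": {
                    "cpu": {"units": 1.0},
                    "memory": {"size": "1Gi"},
                    "gpu": {"units": 1, "attributes": {"vendor": {"nvidia": None}}},
                    "storage": {"size": "1Gi"},
                }
            },
            "nogpu": {
                "resources": {
                    "cpu": {"units": 1.0},
                    "memory": {"size": "1Gi"},
                    "storage": {"size": "1Gi"},
                }
            },
        }
    }
}

print(mixes_gpu_and_non_gpu(sdl))
```

This only flags the shape of the SDL; it says nothing about why the provider restarts.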

To Reproduce

Steps to reproduce the behavior:

1. Deploy the SDL on your provider
2. Run kubectl get pods -A on your provider
3. See that the akash-provider restarted

Expected behavior: The SDL deploys successfully.

Additional context: Provider logs: https://transfer.sh/f7828tQGGg/provider-logs.tar.gz

troian commented 1 year ago

@zJuuu what provider version are you running?

zJuuu commented 1 year ago

@troian v0.3.1-rc0

zJuuu commented 1 year ago

Updated to v0.3.1-rc1, but the provider restarted again after deploying.

troian commented 1 year ago

Please upload the logs again, this time from v0.3.1-rc1.

zJuuu commented 1 year ago

https://transfer.sh/1PCPoeY9pj/provider-logs2.tar.gz

troian commented 1 year ago

@zJuuu would you mind dumping the manifest object for this lease, please?

chainzero commented 1 year ago

Confirmed that this issue only occurs when using Cloudmos to create deployments. Akash CLI and Console deployments do not encounter the issue.

Example problem deployment via Cloudmos:

DSEQ - 278612
Provider - akash12ayncdl3lmln06rcs6falgkh2y4c5l7lp5eftn
Deployer account - akash1f53fp8kk470f7k26yr5gztd9npzpczqv4ufud7
Provider version - 0.3.1-rc1
Deployed via Cloudmos
Provider logs - https://transfer.sh/okq6q9eo3c/provider-logs.tar.gz
The manifest is not available on the provider because the provider pod crashes:

root@node1:~/provider# kubectl -n lease get manifests --show-labels
No resources found in lease namespace.

Manifest file used to create the deployment:

https://gist.github.com/chainzero/9bc3c108b02987e53330a6ff94ce9ec7

akash-network / support

SDL with GPU + no GPU deployment restarts akash-provider #104