akash-network / support

Akash Support and Issue Tracking
Apache License 2.0

`lease-shell` breaks with `remote server returned 404` once provider service gets restarted (`.manifest.deployments` track breaks as well) #87

Closed (andy108369 closed 1 month ago)

andy108369 commented 1 year ago

`lease-shell` breaks with `remote server returned 404` once the provider service is restarted.

`.manifest.deployments` tracking breaks as well.

Tracked internally: https://github.com/ovrclk/engineering/issues/538

This issue appeared in akash 0.16.4 through provider-services 0.2.1.

This issue gets resolved if I revert this commit https://github.com/akash-network/node/commit/1ab8ee6ebd1321d98fb899c8661316cf182a4d4d

It looks like the `ctx` is not being updated with the active leases when the provider restarts, so `IsActive` does not work.


This commit might also be why `.manifest.deployments` now reports 0 (or that could be related to the mainnet4 upgrade [provider-services 0.1.0]):

$ curl -sk https://provider.provider-2.prod.ewr1.akash.pub:8443/status | jq '.manifest.deployments'
0

$ curl -sk https://provider.provider-2.prod.ewr1.akash.pub:8443/status | jq '.cluster.inventory.active | length'
60

$ curl -sk https://provider.provider-2.prod.ewr1.akash.pub:8443/status | jq '.cluster.leases'
60
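
The symptom above (zero tracked deployments while leases are still active) can be checked mechanically. The following is a minimal sketch, not an official tool: `check_manifest_tracking` is a hypothetical helper name, and it only looks for the specific combination reported here.

```shell
#!/bin/sh
# Sketch: flag the symptom reported above, i.e. .manifest.deployments is 0
# while the provider still reports active leases.
# check_manifest_tracking is a hypothetical helper, not part of any Akash tool.
#   $1 = value of .manifest.deployments
#   $2 = value of .cluster.leases
check_manifest_tracking() {
    deployments=$1
    leases=$2
    if [ "$deployments" -eq 0 ] && [ "$leases" -gt 0 ]; then
        echo "suspect: manifest.deployments=0 but cluster.leases=$leases"
        return 1
    fi
    echo "ok: manifest.deployments=$deployments, cluster.leases=$leases"
    return 0
}

# Example wiring (needs curl and jq; same endpoint as above):
#   status=$(curl -sk https://provider.provider-2.prod.ewr1.akash.pub:8443/status)
#   check_manifest_tracking \
#     "$(printf '%s' "$status" | jq '.manifest.deployments')" \
#     "$(printf '%s' "$status" | jq '.cluster.leases')"
```

With the numbers above (0 and 60) the helper prints the `suspect` line and returns a non-zero status, which makes it usable in a monitoring script.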

Update: 23 Jan 2023

Akash Provider reports:

andy108369 commented 11 months ago

workarounds

One can add an OpenSSH server and their public key to the deployment to keep permanent SSH access to it.

For an Ubuntu-based image

Make sure to set your public SSH key in `SSH_PUBKEY`

    image: ubuntu:22.04
    env:
      - 'SSH_PUBKEY=ssh-rsa AAAAB3NzaC1yc...'
    command:
      - sh
      - -c
      - |
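        # Install tini (init) and the OpenSSH client and server
        apt-get update
        apt-get install -y --no-install-recommends -- tini ssh
        # sshd needs its privilege-separation directory
        mkdir -p -m0755 /run/sshd
        # Authorize the public key passed in via SSH_PUBKEY
        mkdir -m700 ~/.ssh
        echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys
        chmod 0600 ~/.ssh/authorized_keys
        # Save the container's environment variables to /etc/environment
        cat /proc/1/environ |xargs -0 -n1 | tee -a /etc/environment
        # Start sshd in the background, then keep the container alive under tini
        /usr/sbin/sshd
        exec /usr/bin/tini -- tail -f /dev/null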
    expose:
      # HTTP/HTTPS port
      - port: 80
        as: 80
        to:
          - global: true
      # SSH port
      - port: 22
        as: 22
        to:
          - global: true

Ollama + SSHD example

https://gist.githubusercontent.com/andy108369/b633153179e08cae4115957a2d294643/raw/888e0b9ccb713d81c3e05d23a1e533323bc2a080/ollama-ssh.yaml

For an Alpine-based image

Make sure to set your public SSH key in `SSH_PUBKEY`

    image: alpine:3.18.4
    env:
      - 'SSH_PUBKEY=ssh-rsa AAAAB3NzaC1yc...'
    command:
      - sh
      - -c
      - |
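        # Install tini (init) and the OpenSSH server
        apk update
        apk add tini openssh-server
        # Generate the host keys sshd needs to start
        ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ""
        ssh-keygen -t ed25519 -f /etc/ssh/ssh_host_ed25519_key -N ""
        # Authorize the public key passed in via SSH_PUBKEY
        mkdir -m700 ~/.ssh
        echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys
        chmod 0600 ~/.ssh/authorized_keys
        # Save the container's environment variables to /etc/environment
        cat /proc/1/environ |xargs -0 -n1 | tee -a /etc/environment
        # Start sshd in the background, then keep the container alive under tini
        /usr/sbin/sshd
        exec /sbin/tini -- tail -f /dev/null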
    expose:
      # HTTP/HTTPS port
      - port: 80
        as: 80
        to:
          - global: true
      # SSH port
      - port: 22
        as: 22
        to:
          - global: true
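
Once either deployment is up, logging in is plain `ssh`. The sketch below rests on two assumptions: the container runs as root (the scripts above write the key to `~/.ssh` of whichever user runs them), and the provider maps container port 22 to some external port, which may differ from 22. `ssh_command` is a hypothetical helper, and the hostname, port and key path are placeholders to replace with the values from your own lease.

```shell
#!/bin/sh
# Sketch: build the ssh command line for a deployment that runs sshd.
# ssh_command is a hypothetical helper; host, port and key path are placeholders.
#   $1 = provider hostname serving the lease
#   $2 = external port mapped to the container's port 22
#   $3 = path to the private key matching SSH_PUBKEY
ssh_command() {
    printf 'ssh -i %s -p %s root@%s\n' "$3" "$2" "$1"
}

# Example with made-up values:
#   ssh_command provider.example.com 31022 ~/.ssh/id_rsa
```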

To combine the sshd daemon with running the app(s), one can start the apps in the background one by one and keep sshd in the foreground:

      app1 &
      app2 &
      exec /usr/sbin/sshd -D

To figure out what one has to run (and how) in a specific image:

docker pull <image>
docker image history <image> --no-trunc --format '{{.CreatedBy}}' | grep -E '^WORKDIR|^ENTRYPOINT|^CMD|^USER'

SGC41 commented 9 months ago

It would be nice to have a fix for this... a lot of customers have a bad experience because of it.

anilmurty commented 8 months ago

Added this to the "Up Next" list on the product/eng roadmap https://github.com/orgs/akash-network/projects/5/views/1

rekpero commented 8 months ago

Hey team, fixing this issue quickly would really help us out at Spheron. We've got a bunch of users struggling to open a shell to retrieve their keys or to check status, and it's becoming a bit of a headache. Could we get this sorted out as soon as possible? We're more than happy to give it a test run even before it goes live on the main provider code. Thanks a bunch for jumping on this quickly!

brewsterdrinkwater commented 6 months ago

April 2nd, 2024

andy108369 commented 1 month ago

Provider 0.6.4 fixed this issue! :rocket: We'll be rolling out the update ASAP.