akash-network / awesome-akash

Awesome List of Akash Deployment Examples

Grok deployment on Akash Network #507

Closed · cvpfus closed 3 months ago

cvpfus commented 3 months ago

Grok on Akash Network

Grok repository: https://github.com/xai-org/grok-1

This deployment uses 4 CPUs and 8 GPUs (an H100 for each). Trying to use only 1 GPU results in an error. Currently, this deployment requires /dev/shm to be enabled by the provider, or this error will occur:

OSError: [Errno 28] No space left on device: './checkpoints/ckpt-0/tensor00000_000' -> '/dev/shm/tmp238nenvh' error
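(From inside the container, you can check how much shared memory is actually available with `df -h /dev/shm`; on a Kubernetes-based provider it defaults to just 64MiB, far too small for these checkpoint files.)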

Some modifications:

gosuri commented 3 months ago

I'm testing this out

andy108369 commented 3 months ago

/dev/shm is part of the checkpoint.py code. I think it should be pretty straightforward to sed /dev/shm to /root/fake_shm in the SDL.

https://github.com/xai-org/grok-1/blob/e50578b5f50e4c10c6e7cff31af1ef2bedb3beb8/checkpoint.py#L43-L49

Alternatively, one could probably try mounting a persistent volume over the /dev/shm directory.

andy108369 commented 3 months ago

added a workaround for /dev/shm (=> /root/shm):

        mkdir /root/shm;
        sed -i "s;/dev/shm/;/root/shm/;g" /grok-1/checkpoint.py;
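
A quick sanity check after the sed, e.g. `grep -c /root/shm /grok-1/checkpoint.py`, confirms the paths were actually rewritten.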

[WIP] testing updated SDL right now [WIP]

---
version: "2.0"
services:
  app:
    image: nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04
    expose:
      - port: 8080
        as: 80
        proto: tcp
        to:
          - global: true
    command:
      - bash
      - "-c"
    args:
      - >-
        apt-get update ; apt-get upgrade -y ;
        apt-get install pip wget git -y;
        pip install dm_haiku==0.0.12;
        pip install jax[cuda12_pip]==0.4.25 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html;
        pip install numpy==1.26.4;
        pip install sentencepiece==0.2.0;
        pip install -U "huggingface_hub[cli]";
        git clone https://github.com/xai-org/grok-1;
        wget https://github.com/yudai/gotty/releases/download/v2.0.0-alpha.3/gotty_2.0.0-alpha.3_linux_amd64.tar.gz;
        tar -zxvf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; chmod +x gotty ; rm -rf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; mv gotty /usr/local/bin/;
        huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/tensor* --local-dir /grok-1/checkpoints --local-dir-use-symlinks False;
        mv /grok-1/checkpoints/ckpt /grok-1/checkpoints/ckpt-0;
        mkdir /root/shm;
        sed -i "s;/dev/shm/;/root/shm/;g" /grok-1/checkpoint.py;
        cd /grok-1 && gotty -w python3 ./run.py;
        sleep infinity
profiles:
  compute:
    app:
      resources:
        cpu:
          units: 58
        memory:
          size: 1280Gi
        storage:
          size: 1024Gi
        gpu:
          units: 8
          attributes:
            vendor:
              nvidia:
                - model: h100
  placement:
    akash:
      pricing:
        app:
          denom: uakt
          amount: 10000000
deployment:
  app:
    akash:
      profile: app
      count: 1
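
Once the lease is up, the gotty web terminal (started above via `gotty -w python3 ./run.py`) should be reachable over plain HTTP at the deployment's URI, since port 8080 is exposed as 80.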

Update 1:

vpavlin commented 3 months ago

Would it make more sense to replace ; with && in args, to make the entrypoint bail when a step fails?
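
For illustration, the first few steps of the args block above would then look something like this (abbreviated sketch; the full command list stays the same):

    args:
      - >-
        apt-get update &&
        apt-get upgrade -y &&
        apt-get install pip wget git -y &&
        pip install dm_haiku==0.0.12 &&
        git clone https://github.com/xai-org/grok-1 &&
        cd /grok-1 && gotty -w python3 ./run.py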

baktun14 commented 3 months ago

Could you also add it to the root readme under AI - GPU like this, please? That will make it automatically importable in Cloudmos/Akash Console.

andy108369 commented 3 months ago

Would it make more sense to replace ; with && in args, to make the entrypoint bail when a step fails?

Thanks! This is just a PoC, so the code is going to look very dirty; the goal is to make it run first ;) As I'm testing it right now, it has failed multiple times, either due to a networking/DNS issue or because Hugging Face was unable to serve the checkpoints for some period of time.

andy108369 commented 3 months ago

Looks like it's a no-go without a proper /dev/shm mounted as tmpfs (i.e., mounting a persistent volume won't do) :/


E0318 16:47:55.478626    4452 pjrt_stream_executor_client.cc:2804] Execution of replica 0 failed: INTERNAL: external/xla/xla/service/gpu/nccl_api.cc:501: NCCL operation ncclGroupEnd() failed: unhandled system error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Error while creating shared memory segment /dev/shm/nccl-qp7mSW (size 9637888)'.
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/grok-1/./run.py", line 72, in <module>
    main()
  File "/grok-1/./run.py", line 67, in main
    print(f"Output for prompt: {inp}", sample_from_model(gen, inp, max_len=100, temperature=0.01))
  File "/grok-1/runners.py", line 597, in sample_from_model
    next(server)
  File "/grok-1/runners.py", line 481, in run
    rngs, last_output, memory, settings = self.prefill_memory(
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: external/xla/xla/service/gpu/nccl_api.cc:501: NCCL operation ncclGroupEnd() failed: unhandled system error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Error while creating shared memory segment /dev/shm/nccl-qp7mSW (size 9637888)'.: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
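
As the error message itself suggests, NCCL_DEBUG=INFO would surface more detail on the next run; in the SDL that is just one extra entry in the service's env list, e.g.:

    env:
      - NCCL_DEBUG=INFO
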
andy108369 commented 3 months ago

Update 3

It seems to get past the /dev/shm-related error when I update the deployment on the host to support a larger /dev/shm.

Not sure if that's the final phase, but it "hangs" at this point (it uses 800% CPU, i.e. 8 busy threads): [screenshot]

FWIW: it needed about 50+ GiB of /dev/shm, and /dev/shm had to be actual shared memory (tmpfs), not just a filesystem mounted over /dev/shm. [screenshot]

Explanation of /dev/shm

In Kubernetes, the default size of /dev/shm is 64MiB, and there is currently no direct support for changing this. As a workaround, Kubernetes users can set Memory as the medium for mounting /dev/shm from the host. Additionally, they should adjust the sizeLimit to 50% of the Pod's requested memory limit (the memory directive in Akash's SDL, or resources.limits.memory in Kubernetes).

However, this workaround isn't applicable in Akash (yet!), as Akash doesn't yet support custom storage settings (emptyDir -> medium: Memory). In this case, only Akash providers have the ability to adjust these settings manually, which they can do if they have access to specific deployment details such as the correct Deployment Sequence (DSEQ) and the owner's address.

There are also certain drawbacks to using the emptyDir / medium: Memory workaround; see "Disadvantages of using empty dir" in https://www.sobyte.net/post/2022-04/k8s-pod-shared-memory/ for more details.

Refs.

Example

    volumeMounts:
    - mountPath: /dev/shm
      name: shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 512Mi  # 50% of the Pod's requested memory limit (1024Mi in this example)

In Grok's case, I used a 640Gi sizeLimit because my deployment requested 1280Gi of RAM in the SDL.
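
Concretely, for this deployment the stanza the provider would have to apply follows the same pattern as the example above, with the values swapped in:

  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 640Gi  # 50% of the 1280Gi requested in the SDL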

andy108369 commented 3 months ago

Update 4 - python processes exited eventually


Not much in the logs: [screenshot]

I can quickly restart the process now as I am in the pod:

pkill gotty
cd /grok-1
gotty -w python3 ./run.py
andy108369 commented 3 months ago

Opened a thread there seeking help: https://github.com/xai-org/grok-1/issues/164

Upd1: so it segfaults: https://github.com/xai-org/grok-1/issues/164#issuecomment-2004922821

andy108369 commented 3 months ago

The PyTorch-based version (~590 GiB) is working on Akash. No need to tweak /dev/shm at all.

Details here: https://github.com/xai-org/grok-1/issues/164#issuecomment-2015507877

How-to here: https://github.com/xai-org/grok-1/issues/274#issuecomment-2015415953

SDL I've used:

Make sure to change the SSH public key to yours. You can probably also reduce the CPU down to 8 and the RAM down to 32 or 64 GiB; I've seen the PyTorch version spike only up to about 26 GiB. (A sketch of the reduced profile follows the SDL below.)

---
version: "2.0"

services:
  grok-1:
    image: ubuntu:22.04
    env:
      - 'SSH_PUBKEY=ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCyJDv8e1KMytZI+tQTxvEqrAm5TTvNx8E3VM499Yh1vU13F11z5FabgDiYb4n6hIY2tfTf1Wi+6wwd7/xO0cmIaQ9lRXftbR8Bx9sw+tc9oomRulZZx8pxKYFp7m7ETwtPlR4GY7dHboxKu+6yxaBsTyXu4GkSAW/Q9fN3BLZnZavQMQiUPtJ2w65dIScx/OrxY2Ua203wYzTqy2tKGnz9iGK2RZusb/1/JmSoqVRKMuynAp9iB99TL2uqUbQzTqsqRtoplA6DyiFGRkv1cUKNHZFucnmFEEqgwg56tCg+6KC84e3RTOaKh+hWcms3ossJCG1N4n4D6MKLx2zcnjakLDUKwCXH4FsTzv/CMygH2YEEdGlgSQMLkABqyl6J3j0yEOa+F7y+Tqq9wllipGw/SlPf2wLnpN2V6vR/ZVVRXLuWKZ1Crg7y/pYLID5GOwr8Qg/PhOQyfjJCQE0HK/9aKsqPZ4wze0Hp66P3q1LL1d7S221DodYE6PJfnVcogp8= andrey@stealth'
    command:
      - "sh"
      - "-c"
    args:
      - 'apt-get update;
      apt-get install -y --no-install-recommends -- ssh speedtest-cli netcat-openbsd curl wget ca-certificates jq less iproute2 iputils-ping vim bind9-dnsutils nginx;
      mkdir -p -m0755 /run/sshd;
      mkdir -m700 ~/.ssh;
      echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys;
      chmod 0600 ~/.ssh/authorized_keys;
      ls -lad ~ ~/.ssh ~/.ssh/authorized_keys;
      md5sum ~/.ssh/authorized_keys;
      exec /usr/sbin/sshd -D'
    expose:
      - port: 8080
        as: 80
        to:
          - global: true
      - port: 22
        as: 22
        to:
          - global: true

profiles:
  compute:
    grok-1:
      resources:
        cpu:
          units: 128
        memory:
          size: 1280Gi
        storage:
          size: 2048Gi
        gpu:
          units: 8
          attributes:
            vendor:
              nvidia:
                - model: h100
  placement:
    akash:
      #attributes:
      #  host: akash
      #  organization: overclock
      pricing:
        grok-1:
          denom: uakt
          amount: 1000000

deployment:
  grok-1:
    akash:
      profile: grok-1 
      count: 1
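
For reference, a sketch of the reduced compute profile suggested above (untested; it assumes 8 CPU / 64Gi really is enough for the PyTorch version):

profiles:
  compute:
    grok-1:
      resources:
        cpu:
          units: 8
        memory:
          size: 64Gi
        storage:
          size: 2048Gi
        gpu:
          units: 8
          attributes:
            vendor:
              nvidia:
                - model: h100

Once the lease is up, connect over the forwarded port 22 with the private key matching SSH_PUBKEY, e.g. `ssh root@<provider-host> -p <external-port>`.
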
cvpfus commented 3 months ago

That's good. I will also try using this: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/grok-1

cvpfus commented 3 months ago

I have updated the SDL but haven't tested it yet, because I can't get any bids when trying to deploy it; I will try later. I have also uploaded the Dockerfile.

Here is the SDL:

---
version: "2.0"
services:
  app:
    image: cvpfus/grok-akash:0.6
    expose:
      - port: 8080
        as: 80
        proto: tcp
        to:
          - global: true
profiles:
  compute:
    app:
      resources:
        cpu:
          units: 64
        memory:
          size: 640Gi
        storage:
          size: 2048Gi
        gpu:
          units: 8
          attributes:
            vendor:
              nvidia:
                - model: h100
  placement:
    akash:
      attributes:
        host: akash
      pricing:
        app:
          denom: uakt
          amount: 1000000
deployment:
  app:
    akash:
      profile: app
      count: 1
cvpfus commented 3 months ago

I'm testing this out

Hi, just to confirm that my Twitter handle is cvpfus_id. This is my Akash wallet. Thanks!

akash19qhrxhz275t9trslwsp95nz33ry6tlgt8lpgwk

cvpfus commented 3 months ago

Update

Here is the most recent SDL I used; sometimes it works and sometimes it doesn't (it gets stuck when loading the model). According to this, loading the model takes twice the size of the model in RAM. The model size is about 590GB, so increasing the RAM to about 1536Gi might solve it (not tried yet, because I'm not getting any bids when deploying on Akash; I will try when it's back to normal).

---
version: "2.0"
services:
  app:
    image: cvpfus/grok-akash:0.19
    env:
      - MAX_NEW_TOKENS=100
    expose:
      - port: 8080
        as: 80
        proto: tcp
        to:
          - global: true
profiles:
  compute:
    app:
      resources:
        cpu:
          units: 64
        memory:
          size: 640Gi
        storage:
          size: 2048Gi
        gpu:
          units: 8
          attributes:
            vendor:
              nvidia:
                - model: h100
  placement:
    akash:
      attributes:
        host: akash
      pricing:
        app:
          denom: uakt
          amount: 1000000
deployment:
  app:
    akash:
      profile: app
      count: 1
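
If the loading-takes-2x-the-model-size rule of thumb above holds, the only change needed would be the memory size (untested, as noted):

        memory:
          size: 1536Gi  # up from 640Gi, to cover ~2x the ~590GB model while loading
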
cvpfus commented 3 months ago

I recorded it when it worked:

https://github.com/akash-network/awesome-akash/assets/47532266/faae1fcf-389d-4275-978e-900dad3200af

andy108369 commented 3 months ago

Please do not use this image (or any xai-org grok-1 image) on H100s! It still locks up the latest NVIDIA driver (550.54.15), which then forces us to reboot these nodes.

Details https://github.com/xai-org/grok-1/issues/164#issuecomment-2022572399

gosuri commented 3 months ago

I'm testing this out

Hi, just to confirm that my Twitter handle is cvpfus_id. This is my Akash wallet. Thanks!

akash19qhrxhz275t9trslwsp95nz33ry6tlgt8lpgwk

Thank you @yusufpraditya, and congrats! Here's your 1,000 AKT.