akash-network / awesome-akash

Awesome List of Akash Deployment Examples
Apache License 2.0
308 stars 223 forks

Fully Dockerized container of Grok for Akash #509

Open Zblocker64 opened 6 months ago

andy108369 commented 6 months ago

Thank you for the PR!

This has been tested only up to the SHM-related error.

It is waiting on https://github.com/akash-network/support/issues/179 first.

Anyone with access to the provider can run it by setting up /dev/shm as a memory-backed (Memory) K8s path, as explained here: https://github.com/akash-network/awesome-akash/pull/507#issuecomment-2004601755
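For readers unfamiliar with the approach: a memory-backed /dev/shm in Kubernetes is commonly an emptyDir volume with medium: Memory mounted at /dev/shm. The Python sketch below only builds that pod-spec fragment as a dictionary. The volume name `shm`, the container name `grok-1`, and the size limit are assumptions for illustration; the linked comment remains the authoritative description of what was actually done on the provider.

```python
import json


def shm_pod_patch(container_name: str = "grok-1", size_limit: str = "16Gi") -> dict:
    """Build a pod-spec fragment that mounts a memory-backed emptyDir at /dev/shm.

    container_name and size_limit are hypothetical defaults, not values from the PR.
    """
    return {
        "spec": {
            "volumes": [
                {
                    "name": "shm",
                    # medium: Memory makes the emptyDir a tmpfs (RAM-backed) volume
                    "emptyDir": {"medium": "Memory", "sizeLimit": size_limit},
                }
            ],
            "containers": [
                {
                    "name": container_name,
                    "volumeMounts": [{"name": "shm", "mountPath": "/dev/shm"}],
                }
            ],
        }
    }


if __name__ == "__main__":
    # Print the fragment so it can be compared against the deployment's pod spec.
    print(json.dumps(shm_pod_patch(), indent=2))
```

Memory used by such a volume counts against the container's memory limit, so the size limit would need to fit within the resources requested for the deployment.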

andy108369 commented 6 months ago

@Zblocker64 it appears you are using the /dev/shm => /root/shm workaround; please remove it:

root@grok-1-596d68d5c7-5cq9f:/app# ps auxwwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          32  0.0  0.0   4608  2016 pts/0    Ss   20:14   0:00 bash
root         206  0.0  0.0   8480  2016 pts/0    R+   20:15   0:00  \_ ps auxwwf
root           1  0.0  0.0   2576     0 ?        Ss   20:12   0:00 /bin/sh -c pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html --user ; huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir /app/checkpoints --local-dir-use-symlinks False ;  mv /app/checkpoints/ckpt /app/checkpoints/ckpt-0 ; mkdir /root/shm ; sed -i "s;/dev/shm/;/root/shm/;g" /app/checkpoint.py ; pip install -r requirements.txt ; python run.py
root          22  284  0.0 715020 323064 ?       Sl   20:13   6:51 /usr/local/bin/python /usr/local/bin/huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir /app/checkpoints --local-dir-use-symlinks False
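The sed step visible in the PID 1 command line rewrites every /dev/shm/ path in /app/checkpoint.py to /root/shm/. As a rough Python equivalent (the function names are hypothetical and not part of grok-1), the following shows what that substitution does and how one might check whether a copy of the file has already been patched:

```python
def apply_shm_workaround(source: str) -> str:
    """Mirror `sed -i "s;/dev/shm/;/root/shm/;g"`: replace every occurrence."""
    return source.replace("/dev/shm/", "/root/shm/")


def has_shm_workaround(source: str) -> bool:
    """Return True if the text already points at the /root/shm/ replacement path."""
    return "/root/shm/" in source


if __name__ == "__main__":
    # Illustrative line only; it is not copied from the real checkpoint.py.
    sample = 'tmp_dir = "/dev/shm/"\n'
    patched = apply_shm_workaround(sample)
    print(patched, has_shm_workaround(patched))
```

Dropping the `mkdir /root/shm` and `sed` steps from the startup command leaves checkpoint.py using /dev/shm/ as upstream intends.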

Additionally, it is suggested to use pip install -r requirements.txt instead of pip install <one-by-one-manually>

refs.

  1. Readme https://github.com/xai-org/grok-1
  2. https://github.com/xai-org/grok-1/issues/164#issuecomment-2004750281

Zblocker64 commented 6 months ago

Just pushed an update to Docker Hub. You can use latest or 1.0 as the tag.

andy108369 commented 6 months ago

I've tested your image with /dev/shm enabled for the pod (set up from the K8s host), and it eventually segfaults.

Upstream issue https://github.com/xai-org/grok-1/issues/164#issuecomment-2004922821
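When reproducing this, it can help to confirm from inside the pod that /dev/shm is really present and mounted before starting run.py. A minimal pre-flight sketch in Python, using only the standard library (the function name `shm_report` is hypothetical):

```python
import os
import shutil


def shm_report(path: str = "/dev/shm") -> dict:
    """Summarize whether `path` exists, is a mount point, and how much space it has."""
    exists = os.path.isdir(path)
    usage = shutil.disk_usage(path) if exists else None
    return {
        "path": path,
        "exists": exists,
        # A mount point here suggests a dedicated tmpfs was attached to the pod.
        "is_mount": os.path.ismount(path) if exists else False,
        "total_bytes": usage.total if usage else 0,
        "free_bytes": usage.free if usage else 0,
    }


if __name__ == "__main__":
    print(shm_report())
```

This only reports on the shared-memory path; it says nothing about the segfault itself, which is tracked in the upstream issue.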

Refs.

  1. https://github.com/xai-org/grok-1/issues/164#issuecomment-2004922821
  2. https://github.com/akash-network/awesome-akash/pull/507#issuecomment-2004629622
  3. https://github.com/xai-org/grok-1/issues/152#issuecomment-2004925207

andy108369 commented 6 months ago

Please do not use this image (or any xai-org grok-1 image) on H100s! It still locks up the latest NVIDIA driver (550.54.15), which then forces us to reboot the affected nodes.

Details: https://github.com/xai-org/grok-1/issues/164#issuecomment-2022572399
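A guard like the following could be run before launching the image to avoid starting it on an H100 node by accident. It is a sketch only: it relies on the `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader` query, the helper names are hypothetical, and the substring check for "H100" is an assumption about how the GPU name is reported.

```python
import subprocess


def parse_gpu_lines(output: str) -> list[tuple[str, str]]:
    """Parse `name, driver_version` CSV lines into (name, driver) tuples."""
    gpus = []
    for line in output.strip().splitlines():
        name, _, driver = line.partition(",")
        gpus.append((name.strip(), driver.strip()))
    return gpus


def is_risky(gpus: list[tuple[str, str]]) -> bool:
    """True if any GPU looks like an H100, the hardware the warning above covers."""
    return any("H100" in name for name, _ in gpus)


def query_gpus() -> list[tuple[str, str]]:
    """Ask nvidia-smi for GPU names and driver versions."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_lines(result.stdout)


if __name__ == "__main__":
    try:
        found = query_gpus()
    except (FileNotFoundError, subprocess.CalledProcessError):
        found = []  # nvidia-smi is missing or failed; nothing to report
    for name, driver in found:
        print(f"{name} (driver {driver})")
    if is_risky(found):
        print("H100 detected: do not start grok-1 on this node.")
```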