Open Zblocker64 opened 6 months ago
@Zblocker64 it appears you are using the /dev/shm => /root/shm workaround; please remove it:
root@grok-1-596d68d5c7-5cq9f:/app# ps auxwwf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 32 0.0 0.0 4608 2016 pts/0 Ss 20:14 0:00 bash
root 206 0.0 0.0 8480 2016 pts/0 R+ 20:15 0:00 \_ ps auxwwf
root 1 0.0 0.0 2576 0 ? Ss 20:12 0:00 /bin/sh -c pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html --user ; huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir /app/checkpoints --local-dir-use-symlinks False ; mv /app/checkpoints/ckpt /app/checkpoints/ckpt-0 ; mkdir /root/shm ; sed -i "s;/dev/shm/;/root/shm/;g" /app/checkpoint.py ; pip install -r requirements.txt ; python run.py
root 22 284 0.0 715020 323064 ? Sl 20:13 6:51 /usr/local/bin/python /usr/local/bin/huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir /app/checkpoints --local-dir-use-symlinks False
Additionally, it is suggested to use pip install -r requirements.txt
instead of pip install <one-by-one-manually>
refs.
Just pushed an update to Docker Hub. You can use latest or 1.0 as the tag.
I've tested your image with /dev/shm enabled for the pod (done from the K8s host), and it eventually segfaults.
Upstream issue https://github.com/xai-org/grok-1/issues/164#issuecomment-2004922821
Refs.
https://github.com/xai-org/grok-1/issues/164#issuecomment-2004922821
https://github.com/akash-network/awesome-akash/pull/507#issuecomment-2004629622
https://github.com/xai-org/grok-1/issues/152#issuecomment-2004925207
Please do not use this image (or any xai-org grok-1 image) on H100s!
It still locks up the latest NVIDIA driver (550.54.15),
which then forces us to reboot these nodes.
Details https://github.com/xai-org/grok-1/issues/164#issuecomment-2022572399
Thank you for PR!
This has been tested only up to the SHM-related error.
It awaits https://github.com/akash-network/support/issues/179 first.
With access to the provider, one can run it by setting up /dev/shm as a Memory-kind K8s path, as explained here: https://github.com/akash-network/awesome-akash/pull/507#issuecomment-2004601755
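For anyone without the linked comment at hand: a memory-backed /dev/shm is commonly expressed in Kubernetes as an emptyDir volume with medium: Memory, mounted at /dev/shm. The fragment below is only a sketch of that general pattern, not necessarily what the linked comment does. The pod name, container name, image and volume name are hypothetical placeholders.

```yaml
# Sketch only: names and image below are placeholders, not taken from this PR.
apiVersion: v1
kind: Pod
metadata:
  name: grok-1-example
spec:
  containers:
    - name: grok-1
      image: example/grok-1:latest   # placeholder image reference
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm        # replaces the container's default /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory               # tmpfs-backed volume; usage counts against pod memory
```

With a mount like this in place, the /dev/shm => /root/shm sed workaround in the startup command should no longer be needed. Note that the segfault and the H100 driver lockup reported above are separate issues that this does not address.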