danielgross / LlamaAcademy

A school for camelids
MIT License

1 RTX A6000 out of memory #5

Closed: SolbiatiAlessandro closed this issue 1 year ago

SolbiatiAlessandro commented 1 year ago

As suggested in README.md, I am trying to run this on "1 RTX A6000" with the docker image pytorch:latest (f5540ef1a1398b8499546edb53dae704) from https://cloud.vast.ai/

root@C.6176102:~/llamaacademy$ conda env create --file=environment.yaml

The command fails with "no space left on device" (errno 28) errors:

CondaError: Failed to write to /opt/conda/pkgs/nccl-2.15.5.1-h0800d71_0.conda
  errno: 28
CondaError: Failed to write to /opt/conda/pkgs/pytorch-2.0.0-cuda112py310he33e0d6_200.conda
  errno: 28
CondaError: Failed to write to /opt/conda/pkgs/libcusparse-12.0.0.76-hcb278e6_1.conda
  errno: 28
CondaError: Failed to write to /opt/conda/pkgs/cuda-sanitizer-api-12.1.105-0.tar.bz2
  errno: 28
[Errno 28] No space left on device: '/opt/conda/pkgs/libzlib-1.2.13-h166bdaf_4.tar.bz2'

Debugging disk usage

root@C.6176102:~/llamaacademy$ df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          10G  2.4G  7.7G  24% /
tmpfs            64M     0   64M   0% /dev
shm              12G     0   12G   0% /dev/shm
/dev/sdb1       100G  100G   32K 100% /etc/hosts
/dev/sda2        49G   16G   31G  34% /usr/bin/nvidia-smi
tmpfs            25G     0   25G   0% /sys/fs/cgroup
tmpfs            25G   12K   25G   1% /proc/driver/nvidia
tmpfs            25G  4.0K   25G   1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs           4.9G  1.9M  4.9G   1% /run/nvidia-persistenced/socket
udev             25G     0   25G   0% /dev/nvidia1
tmpfs            25G     0   25G   0% /proc/asound
tmpfs            25G     0   25G   0% /proc/acpi
tmpfs            25G     0   25G   0% /proc/scsi
tmpfs            25G     0   25G   0% /sys/firmware
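
Since every error above is errno 28 (no space left on device) rather than GPU memory, and the overlay root that holds /opt/conda is only 10 GB, one thing I would try before changing hardware is clearing conda's package cache and retrying. This is only a sketch of my own, not something verified in this thread:

```bash
# Drop partially downloaded packages left in /opt/conda/pkgs by the failed install
conda clean --all --yes

# Confirm how much space that freed on the overlay root
df -h /

# Retry the environment build
conda env create --file=environment.yaml
```

If the 10 GB root is simply too small for the CUDA and PyTorch packages this environment pulls in, allocating more disk when creating the vast.ai instance is probably the cleaner fix.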
SolbiatiAlessandro commented 1 year ago

One possible solution that worked for me is to use 4 RTX A6000s ($3 an hour). The setup succeeded there and I could start fine-tuning.

It might also be possible to use "1 RTX A6000" without running out of disk space by using one of these other docker images (see the pre-flight check sketch after this list):

- pytorch:latest (f5540ef1a1398b8499546edb53dae704): PyTorch is a deep learning framework that puts Python first.

- nvidia-glx-desktop:latest (f10187106abdbf2eeef2c4d8347aa56f): Ubuntu X desktop streaming using WebRTC and NVENC GPU video compression. Supports Vulkan/OpenGL for GPU rendering. Default username: user, password: mypasswd.

- stable-diffusion:web-automatic-2.1.16 (b41a1cd115aeaa64f26ac806ab654d01): Stable Diffusion with Automatic1111 WebUI, Jupyter (for file browser & transfer), and SSH.

- Whisper ASR Webservice (e795f6239ba0236393d61d892c3f4152): GPU version of the Whisper ASR webservice for podcast and video transcription.

- Bittensor 3.7.0 with cubit (325f5bb932cd700e11d7913fe32fad51): Uses the Bittensor 3.7.0 docker image and installs cubit in the onstart script. Once complete, the instance will be ready to run Bittensor on the finney network.

- tensorflow:latest-gpu (79a8d3bee306ada066bb42cb3bdef852): Official docker images for the deep learning framework TensorFlow (http://www.tensorflow.org).

- cuda:12.0.1-runtime-ubuntu20.04 (e64e8c759efb02fb5e156600354f4c96): CUDA and cuDNN images from gitlab.com/nvidia/cuda.
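
Whichever image is chosen, a quick free-space check before building the environment would catch this failure mode early. This is a hypothetical pre-flight snippet of mine; the ~20 GB threshold is a rough guess, not a measured requirement:

```bash
# Warn if the filesystem holding /opt/conda has less than ~20 GB free (threshold is a guess)
avail_kb=$(df --output=avail -k / | tail -1)
if [ "$avail_kb" -lt $((20 * 1024 * 1024)) ]; then
  echo "Warning: less than ~20 GB free on /; conda env create may fail with errno 28" >&2
fi
```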
huyphan168 commented 1 year ago

Hi, I haven't tested with any images other than the simple default CUDA image in the current setup. Using the PyTorch image might be why the disk ran out of space, and it isn't necessary anyway, since the environment already installs PyTorch. I think I tested with cuda:12.0.1-runtime-ubuntu20.04; a rough bootstrap sketch for that image follows.
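
For reference, here is what bootstrapping on cuda:12.0.1-runtime-ubuntu20.04 might look like, since that image ships without conda. This is my own outline, not a verified recipe from the repo; the /opt/conda install path is an arbitrary choice and the repo URL is taken from the project header.

```bash
# Inside a container started from cuda:12.0.1-runtime-ubuntu20.04
apt-get update && apt-get install -y wget git

# Install Miniconda (the /opt/conda prefix is an arbitrary choice)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p /opt/conda
export PATH=/opt/conda/bin:$PATH

# Build the LlamaAcademy environment; PyTorch comes from environment.yaml, not the base image
git clone https://github.com/danielgross/LlamaAcademy.git
cd LlamaAcademy
conda env create --file=environment.yaml
```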