LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0
36.94k stars, 3.22k forks

Model training developer setup #3563

Closed (theophilegervet closed 1 year ago)

theophilegervet commented 1 year ago

I'm trying to set up the developer environment to run supervised fine-tuning.

When I run pip install -e .. as described in the model training README (https://github.com/LAION-AI/Open-Assistant/blob/main/model/model_training/README.md) with CUDA_HOME=/usr/local/cuda-11.4, the DeepSpeed build step fails with:

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [16 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-n8yrafio/deepspeed_fe91fb641327472a9e0c07a08a62b4f2/setup.py", line 82, in <module>
          cuda_major_ver, cuda_minor_ver = installed_cuda_version()
        File "/tmp/pip-install-n8yrafio/deepspeed_fe91fb641327472a9e0c07a08a62b4f2/op_builder/builder.py", line 43, in installed_cuda_version
          output = subprocess.check_output([cuda_home + "/bin/nvcc", "-V"], universal_newlines=True)
        File "/private/home/theop123/miniconda3/envs/open-assistant/lib/python3.10/subprocess.py", line 421, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "/private/home/theop123/miniconda3/envs/open-assistant/lib/python3.10/subprocess.py", line 503, in run
          with Popen(*popenargs, **kwargs) as process:
        File "/private/home/theop123/miniconda3/envs/open-assistant/lib/python3.10/subprocess.py", line 971, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "/private/home/theop123/miniconda3/envs/open-assistant/lib/python3.10/subprocess.py", line 1863, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda-11.4/bin/nvcc'
      [end of output]

Indeed, /usr/local/cuda-11.4 does not contain nvcc.

I'm working in a conda environment with Python 3.10, with PyTorch installed via:

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

nvidia-smi gives

Driver Version: 470.141.03   CUDA Version: 11.4

nvcc --version gives

zsh: command not found: nvcc

theophilegervet commented 1 year ago

Running export CUDA_HOME=$CONDA_PREFIX before the install fixes it.
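
For anyone hitting the same error, here is a minimal sketch of a check to run before pip install. It assumes, going by the traceback above, that the DeepSpeed setup script looks for the compiler at $CUDA_HOME/bin/nvcc. The helper name find_cuda_home and the list of candidate directories are made up for illustration; they are not part of Open-Assistant or DeepSpeed.

```python
import os
from pathlib import Path


def find_cuda_home(candidates):
    """Return the first candidate directory that contains bin/nvcc, else None.

    Hypothetical helper: it mirrors the path the traceback shows being
    built (cuda_home + "/bin/nvcc"), nothing more.
    """
    for candidate in candidates:
        # Skip unset variables (None) and empty strings.
        if candidate and (Path(candidate) / "bin" / "nvcc").is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Candidates from this thread: the current CUDA_HOME, then the conda env prefix.
    candidates = [os.environ.get("CUDA_HOME"), os.environ.get("CONDA_PREFIX")]
    cuda_home = find_cuda_home(candidates)
    if cuda_home is None:
        print("No bin/nvcc found under any of:", candidates)
    else:
        print(f"export CUDA_HOME={cuda_home}")
```

If the script prints an export line, that directory has an nvcc binary where the setup script expects one. If it prints nothing usable, no candidate contains nvcc and the install will probably fail the same way.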