Hi, apologies for not being able to read the whole report. Did you try running `nvidia-smi` in the base image to test your nvidia docker setup (independently of the DreamerV3 Dockerfile)?
Yup! Running `nvidia-smi` in the base image works just fine (see the very first step under "Steps to Reproduce", as well as the `nvidia-smi` output under the "System Information" heading).
I've made some progress and mostly resolved the Docker build and run issues. Below are the changes made to the Dockerfile along with explanations.
**CUDA Image Version:**

```diff
-FROM nvidia/cuda:11.4.2-cudnn8-devel-ubuntu20.04
+FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
```

The image `cuda:11.4.2-cudnn8-runtime-ubuntu20.04` no longer exists, so it's updated to a later version which is compatible with the latest version of TensorFlow (2.13).

**COPY Command:**

```diff
-COPY scripts scripts
+COPY dreamerv3/embodied/scripts scripts
```

**TensorFlow and cuDNN Setup:**

Added setup of the `LD_LIBRARY_PATH` environment variable for cuDNN.

**Environment Variables:**

```diff
-ENV MUJOCO_GL egl
+ENV MUJOCO_GL=osmesa
```

Changed the `MUJOCO_GL` environment variable. (Please check whether this is acceptable.)

**Agent Dependencies:**

```diff
-RUN pip3 install jax[cuda11_cudnn82] -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
+RUN pip3 install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```

Updated the `jax` package installation to be compatible with the new CUDA version.

Everything seems to work now, except there are still errors when running `install-atari.sh` and `install-minecraft.sh`. These issues have been discussed elsewhere (#79), but as I don't personally need either of those to run, I'm going to leave them commented out for now. (A quick GPU sanity check for the rebuilt image is sketched below.)
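A minimal sketch for verifying that the rebuilt image actually exposes the GPU to JAX (assuming the image is tagged `img` as in the header instructions):

```sh
# should print "gpu" and a list of GPU devices rather than falling back to CPU
docker run -it --rm --gpus all img \
  python3 -c 'import jax; print(jax.default_backend(), jax.devices())'
```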
```dockerfile
# Prerequisites: Ensure you have installed NVIDIA Container Toolkit as per https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
#
# 1. Test setup:
# docker run -it --rm --gpus all nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04 nvidia-smi
#
# If the above does not work, try adding the --privileged flag
# and changing the command to `sh -c 'ldconfig -v && nvidia-smi'`.
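#    For example (illustrative; adjust the image tag to your setup):
#    docker run -it --rm --gpus all --privileged nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04 \
#        sh -c 'ldconfig -v && nvidia-smi'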
#
# 2. Start training:
# docker build -f dreamerv3/Dockerfile -t img . && \
# docker run -it --rm --gpus all -v ~/logdir:/logdir img \
# sh scripts/xvfb_run.sh python3 dreamerv3/train.py \
# --logdir "/logdir/$(date +%Y%m%d-%H%M%S)" \
# --configs dmc_vision --task dmc_walker_walk
#
# 3. See results:
# tensorboard --logdir ~/logdir
# System
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=America/Los_Angeles
ENV PYTHONUNBUFFERED=1
ENV PIP_DISABLE_PIP_VERSION_CHECK=1
ENV PIP_NO_CACHE_DIR=1
RUN apt-get update && apt-get install -y \
ffmpeg git python3-pip vim libglew-dev \
x11-xserver-utils xvfb curl libegl1-mesa \
&& apt-get clean
# TensorFlow Install
RUN curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
RUN bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda
ENV PATH=/opt/conda/bin:$PATH
RUN conda update -n base -c defaults conda
RUN conda install -c conda-forge cudatoolkit=11.8.0
RUN python -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.13.*
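# Point LD_LIBRARY_PATH at the pip-installed cuDNN via a conda activation hook
# (following the TensorFlow pip installation instructions)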
RUN mkdir -p $CONDA_PREFIX/etc/conda/activate.d
RUN echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
RUN echo 'export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
RUN bash -c "source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh"
RUN pip3 install --upgrade pip
# Envs
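# Use software rendering (OSMesa) for MuJoCo's off-screen rendering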
ENV MUJOCO_GL=osmesa
COPY dreamerv3/embodied/scripts scripts
RUN sh scripts/install-dmlab.sh
# RUN sh scripts/install-atari.sh
# RUN sh scripts/install-minecraft.sh
ENV NUMBA_CACHE_DIR=/tmp
RUN pip3 install crafter
RUN pip3 install dm_control
RUN pip3 install robodesk
RUN pip3 install bsuite
# Agent
RUN pip3 install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
RUN pip3 install jaxlib
RUN pip3 install tensorflow_probability
RUN pip3 install optax
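# Preallocate 80% of GPU memory for JAX/XLA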
ENV XLA_PYTHON_CLIENT_MEM_FRACTION=0.8
# Google Cloud DNS cache (optional)
ENV GCS_RESOLVE_REFRESH_SECS=60
ENV GCS_REQUEST_CONNECTION_TIMEOUT_SECS=300
ENV GCS_METADATA_REQUEST_TIMEOUT_SECS=300
ENV GCS_READ_REQUEST_TIMEOUT_SECS=300
ENV GCS_WRITE_REQUEST_TIMEOUT_SECS=600
# Embodied
RUN pip3 install numpy cloudpickle ruamel.yaml rich zmq msgpack
COPY . /embodied
RUN chown -R 1000:root /embodied && chmod -R 775 /embodied
WORKDIR embodied
```
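When debugging issues like the ones reported below, it can also help to open an interactive shell in the built image (assuming the `img` tag from the header):

```sh
docker build -f dreamerv3/Dockerfile -t img .
# then poke around inside: nvidia-smi, env | grep -i cuda, pip list | grep -i jax, etc.
docker run -it --rm --gpus all img bash
```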
I'd recommend using a JAX base container from JAX Toolbox, which is validated with a nightly CI on NVIDIA GPUs. The Dockerfiles are open for modification as well.
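For example, the base image line could be swapped for a JAX Toolbox container (a sketch; the exact tag is my assumption, so check the JAX Toolbox repository for current image names):

```dockerfile
# Hypothetical: CUDA-enabled JAX base image from JAX Toolbox, replacing the
# raw CUDA image plus the manual jax[cuda11_pip] install
FROM ghcr.io/nvidia/jax:latest
```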
Hi all, is this still an issue with the updated code? It's working well for me.
Hello,
I've faced multiple issues when attempting to set up and run a Docker container using the Dockerfile provided in `/dreamerv3/dreamerv3/`.

**Steps to Reproduce**
1. Following the Dockerfile header's instructions, I tried running:

   ```sh
   docker run -it --rm --gpus all nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04 nvidia-smi
   ```

2. Realized the image `cuda:11.4.2-cudnn8-runtime-ubuntu20.04` no longer exists on Docker Hub, so I switched to:

   ```sh
   docker run -it --rm --gpus all nvidia/cuda:11.4.3-cudnn8-runtime-ubuntu20.04 nvidia-smi
   ```

   and it ran as expected.

3. Updated the base image in the Dockerfile accordingly.
4. Built the Docker image using:

   ```sh
   docker build -f dreamerv3/Dockerfile -t img .
   ```

5. Faced the error:

   ```
   ERROR [ 4/20] COPY scripts scripts
   ```
6. To fix the `COPY` error, I changed line 33 of the Dockerfile to `COPY dreamerv3/embodied/scripts scripts`, as #55 does. This is not the most elegant solution to this error and maybe there is a better one.

7. Used the Docker command from the original header:

   ```sh
   docker run -it --rm --gpus all -v ~/logdir:/logdir img sh scripts/xvfb_run.sh python3 dreamerv3/train.py --logdir "/logdir/$(date +%Y%m%d-%H%M%S)" --configs dmc_vision --task dmc_walker_walk
   ```

   but due to the Dockerfile changes, had to modify it to:

   ```sh
   docker run -it --rm --gpus all -v ~/logdir:/logdir img sh dreamerv3/embodied/scripts/xvfb_run.sh python3 dreamerv3/train.py --logdir "/logdir/$(date +%Y%m%d-%H%M%S)" --configs dmc_vision --task dmc_walker_walk
   ```

   Note: It may be important to run `chmod +x /dreamerv3/dreamerv3/embodied/scripts/` to avoid problems in the next step.
8. After running the updated Docker command, I get this error output:

   ```
   RuntimeError: Unknown backend: 'gpu' requested, but no platforms that are instances of gpu are present. Platforms present are: cpu
   ```
**Expected Result**

The `COPY` command in the Dockerfile should correctly locate the scripts directory.

**Actual Result**

The `COPY` command references an incorrect path, causing the Docker build to fail.

**Suggestions**

Fix the path in the `COPY` command.

**System Information**
`nvidia-smi` output:

**Environment Inside the Docker Container**

`nvcc --version` output: CUDA version 11.4, as expected.

`cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR` output: cuDNN 8, as expected.

`pip list` output:

**Additional Debugging Attempts**
**JAX Related**

JAX Configuration / GPU Detection: I tried running JAX's device detection inside the container; it yields only CPU devices, with the log levels set as indicated. (A minimal equivalent check is sketched below.)
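This isn't necessarily the exact snippet I ran, but a minimal equivalent check looks like:

```sh
# ask JAX which backend and devices it sees (prints "cpu" here instead of "gpu")
python3 -c 'import jax; print(jax.default_backend()); print(jax.devices())'
```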
**TensorFlow Related**

GPU Test: Asking TensorFlow for its physical GPU devices results in an empty list (`[]`), so this GPU recognition problem is not isolated to JAX. (A minimal version of the check is sketched below.)
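Again a sketch rather than the exact snippet I ran:

```sh
# list the physical GPUs TensorFlow can see; prints [] in this container
python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
```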
**Installation in Dockerfile**

I went through a whole arc where I thought the problem might be that the Dockerfile installs only `tensorflow_probability` and `tensorflow-cpu`, so I changed the Dockerfile to install `tensorflow==2.13.*` instead, but the error remained unchanged.

**Environment Variables**
These may or may not be relevant:

- `echo $CUDA_VISIBLE_DEVICES` returns nothing.
- `echo $LD_LIBRARY_PATH` returns `/usr/local/nvidia/lib:/usr/local/nvidia/lib64`, yet `/usr/local/nvidia` doesn't even exist in the container.
- `echo $CUDNN_INCLUDE_DIR` returns nothing, even though `/usr/include` is where the cuDNN files are.
- `echo $CUDA_HOME` returns nothing.
- `echo $PATH` returns `/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin`.
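For reference, all of these can be dumped in one go from outside the container (assuming the image tag `img` from the build step):

```sh
# print the CUDA-related environment variables visible inside the container
docker run --rm --gpus all img sh -c 'env | grep -Ei "cuda|cudnn|nvidia|ld_library"'
```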
**Base Images**

I tried various base images, including:

- `11.8.0-cudnn8-devel-ubuntu20.04`
- `12.0.0-cudnn8-devel-ubuntu20.04`
- `12.0.0-cudnn8-devel-ubuntu22.04`

(A quick way to check whether a given tag still exists on the registry is sketched below.)
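Since tags can disappear from Docker Hub (as `11.4.2-cudnn8-runtime-ubuntu20.04` did), a registry-side existence check that avoids pulling the image (may require a reasonably recent Docker CLI):

```sh
# exits successfully only if the tag exists on the registry
docker manifest inspect nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 > /dev/null && echo "tag exists"
```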