danijar / dreamerv3

Mastering Diverse Domains through World Models
https://danijar.com/dreamerv3
MIT License

Dockerfile CUDA Image Version, COPY Path, and Execution Instructions Need Updates #84

Closed masonhargrave closed 4 months ago

masonhargrave commented 1 year ago

Hello,

I've faced multiple issues when attempting to set up and run a Docker container using the Dockerfile provided in /dreamerv3/dreamerv3/.

Steps to Reproduce

Following the Dockerfile header's instructions, I tried running

docker run -it --rm --gpus all nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04 nvidia-smi

That image, nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04, no longer exists on Docker Hub, so I switched to:

docker run -it --rm --gpus all nvidia/cuda:11.4.3-cudnn8-runtime-ubuntu20.04 nvidia-smi

which ran as expected.

Updated the base image in Dockerfile:

- Original: FROM nvidia/cuda:11.4.2-cudnn8-devel-ubuntu20.04
- Updated to: FROM nvidia/cuda:11.4.3-cudnn8-devel-ubuntu20.04

Built the docker image using: docker build -f dreamerv3/Dockerfile -t img .

Faced the error: ERROR [ 4/20] COPY scripts scripts

To fix the COPY error, I changed line 33 of the Dockerfile to: COPY dreamerv3/embodied/scripts scripts, as #55 does. This is not the most elegant solution, and there may be a better one.

The Docker command from the original header is:

docker run -it --rm --gpus all -v ~/logdir:/logdir img sh scripts/xvfb_run.sh python3 dreamerv3/train.py --logdir "/logdir/$(date +%Y%m%d-%H%M%S)" --configs dmc_vision --task dmc_walker_walk

Because of the Dockerfile changes, I had to modify the command to:

docker run -it --rm --gpus all -v ~/logdir:/logdir img sh dreamerv3/embodied/scripts/xvfb_run.sh python3 dreamerv3/train.py --logdir "/logdir/$(date +%Y%m%d-%H%M%S)" --configs dmc_vision --task dmc_walker_walk

Note: It may be important to run `chmod +x /dreamerv3/dreamerv3/embodied/scripts/` to avoid problems in the next step.

After running the updated Docker command, I encountered this error: RuntimeError: Unknown backend: 'gpu' requested, but no platforms that are instances of gpu are present. Platforms present are: cpu

I get this output:

==========
== CUDA ==
==========

CUDA Version 11.4.3

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Config:
seed:                       0                        (int)
method:                     name                     (str)
task:                       dmc_walker_walk          (str)
logdir:                     /logdir/20230824-194020  (str)
replay:                     uniform                  (str)
replay_size:                1000000.0                (float)
replay_online:              False                    (bool)
eval_dir:                                            (str)
filter:                     .*                       (str)
jax.platform:               gpu                      (str)
jax.jit:                    True                     (bool)
jax.precision:              float16                  (str)
jax.prealloc:               True                     (bool)
jax.debug_nans:             False                    (bool)
jax.logical_cpus:           0                        (int)
jax.debug:                  False                    (bool)
jax.policy_devices:         [0]                      (ints)
jax.train_devices:          [0]                      (ints)
jax.metrics_every:          10                       (int)
run.script:                 train                    (str)
run.steps:                  10000000000.0            (float)
run.expl_until:             0                        (int)
run.log_every:              300                      (int)
run.save_every:             900                      (int)
run.eval_every:             1000000.0                (float)
run.eval_initial:           True                     (bool)
run.eval_eps:               1                        (int)
run.eval_samples:           1                        (int)
run.train_ratio:            512.0                    (float)
run.train_fill:             0                        (int)
run.eval_fill:              0                        (int)
run.log_zeros:              False                    (bool)
run.log_keys_video:         [image]                  (strs)
run.log_keys_sum:           ^$                       (str)
run.log_keys_mean:          (log_entropy)            (str)
run.log_keys_max:           ^$                       (str)
run.from_checkpoint:                                 (str)
run.sync_every:             10                       (int)
run.actor_addr:             ipc:///tmp/5551          (str)
run.actor_batch:            32                       (int)
envs.amount:                4                        (int)
envs.parallel:              process                  (str)
envs.length:                0                        (int)
envs.reset:                 True                     (bool)
envs.restart:               True                     (bool)
envs.discretize:            0                        (int)
envs.checks:                False                    (bool)
wrapper.length:             0                        (int)
wrapper.reset:              True                     (bool)
wrapper.discretize:         0                        (int)
wrapper.checks:             False                    (bool)
env.atari.size:             [64, 64]                 (ints)
env.atari.repeat:           4                        (int)
env.atari.sticky:           True                     (bool)
env.atari.gray:             False                    (bool)
env.atari.actions:          all                      (str)
env.atari.lives:            unused                   (str)
env.atari.noops:            0                        (int)
env.atari.resize:           opencv                   (str)
env.dmlab.size:             [64, 64]                 (ints)
env.dmlab.repeat:           4                        (int)
env.dmlab.episodic:         True                     (bool)
env.minecraft.size:         [64, 64]                 (ints)
env.minecraft.break_speed:  100.0                    (float)
env.dmc.size:               [64, 64]                 (ints)
env.dmc.repeat:             2                        (int)
env.dmc.camera:             -1                       (int)
env.loconav.size:           [64, 64]                 (ints)
env.loconav.repeat:         2                        (int)
env.loconav.camera:         -1                       (int)
task_behavior:              Greedy                   (str)
expl_behavior:              None                     (str)
batch_size:                 16                       (int)
batch_length:               64                       (int)
data_loaders:               8                        (int)
grad_heads:                 [decoder, reward, cont]  (strs)
rssm.deter:                 512                      (int)
rssm.units:                 512                      (int)
rssm.stoch:                 32                       (int)
rssm.classes:               32                       (int)
rssm.act:                   silu                     (str)
rssm.norm:                  layer                    (str)
rssm.initial:               learned                  (str)
rssm.unimix:                0.01                     (float)
rssm.unroll:                False                    (bool)
rssm.action_clip:           1.0                      (float)
rssm.winit:                 normal                   (str)
rssm.fan:                   avg                      (str)
encoder.mlp_keys:           $^                       (str)
encoder.cnn_keys:           image                    (str)
encoder.act:                silu                     (str)
encoder.norm:               layer                    (str)
encoder.mlp_layers:         5                        (int)
encoder.mlp_units:          1024                     (int)
encoder.cnn:                resnet                   (str)
encoder.cnn_depth:          32                       (int)
encoder.cnn_blocks:         0                        (int)
encoder.resize:             stride                   (str)
encoder.winit:              normal                   (str)
encoder.fan:                avg                      (str)
encoder.symlog_inputs:      True                     (bool)
encoder.minres:             4                        (int)
decoder.mlp_keys:           $^                       (str)
decoder.cnn_keys:           image                    (str)
decoder.act:                silu                     (str)
decoder.norm:               layer                    (str)
decoder.mlp_layers:         5                        (int)
decoder.mlp_units:          1024                     (int)
decoder.cnn:                resnet                   (str)
decoder.cnn_depth:          32                       (int)
decoder.cnn_blocks:         0                        (int)
decoder.image_dist:         mse                      (str)
decoder.vector_dist:        symlog_mse               (str)
decoder.inputs:             [deter, stoch]           (strs)
decoder.resize:             stride                   (str)
decoder.winit:              normal                   (str)
decoder.fan:                avg                      (str)
decoder.outscale:           1.0                      (float)
decoder.minres:             4                        (int)
decoder.cnn_sigmoid:        False                    (bool)
reward_head.layers:         2                        (int)
reward_head.units:          512                      (int)
reward_head.act:            silu                     (str)
reward_head.norm:           layer                    (str)
reward_head.dist:           symlog_disc              (str)
reward_head.outscale:       0.0                      (float)
reward_head.outnorm:        False                    (bool)
reward_head.inputs:         [deter, stoch]           (strs)
reward_head.winit:          normal                   (str)
reward_head.fan:            avg                      (str)
reward_head.bins:           255                      (int)
cont_head.layers:           2                        (int)
cont_head.units:            512                      (int)
cont_head.act:              silu                     (str)
cont_head.norm:             layer                    (str)
cont_head.dist:             binary                   (str)
cont_head.outscale:         1.0                      (float)
cont_head.outnorm:          False                    (bool)
cont_head.inputs:           [deter, stoch]           (strs)
cont_head.winit:            normal                   (str)
cont_head.fan:              avg                      (str)
loss_scales.image:          1.0                      (float)
loss_scales.vector:         1.0                      (float)
loss_scales.reward:         1.0                      (float)
loss_scales.cont:           1.0                      (float)
loss_scales.dyn:            0.5                      (float)
loss_scales.rep:            0.1                      (float)
loss_scales.actor:          1.0                      (float)
loss_scales.critic:         1.0                      (float)
loss_scales.slowreg:        1.0                      (float)
dyn_loss.impl:              kl                       (str)
dyn_loss.free:              1.0                      (float)
rep_loss.impl:              kl                       (str)
rep_loss.free:              1.0                      (float)
model_opt.opt:              adam                     (str)
model_opt.lr:               0.0001                   (float)
model_opt.eps:              1e-08                    (float)
model_opt.clip:             1000.0                   (float)
model_opt.wd:               0.0                      (float)
model_opt.warmup:           0                        (int)
model_opt.lateclip:         0.0                      (float)
actor.layers:               2                        (int)
actor.units:                512                      (int)
actor.act:                  silu                     (str)
actor.norm:                 layer                    (str)
actor.minstd:               0.1                      (float)
actor.maxstd:               1.0                      (float)
actor.outscale:             1.0                      (float)
actor.outnorm:              False                    (bool)
actor.unimix:               0.01                     (float)
actor.inputs:               [deter, stoch]           (strs)
actor.winit:                normal                   (str)
actor.fan:                  avg                      (str)
actor.symlog_inputs:        False                    (bool)
critic.layers:              2                        (int)
critic.units:               512                      (int)
critic.act:                 silu                     (str)
critic.norm:                layer                    (str)
critic.dist:                symlog_disc              (str)
critic.outscale:            0.0                      (float)
critic.outnorm:             False                    (bool)
critic.inputs:              [deter, stoch]           (strs)
critic.winit:               normal                   (str)
critic.fan:                 avg                      (str)
critic.bins:                255                      (int)
critic.symlog_inputs:       False                    (bool)
actor_opt.opt:              adam                     (str)
actor_opt.lr:               3e-05                    (float)
actor_opt.eps:              1e-05                    (float)
actor_opt.clip:             100.0                    (float)
actor_opt.wd:               0.0                      (float)
actor_opt.warmup:           0                        (int)
actor_opt.lateclip:         0.0                      (float)
critic_opt.opt:             adam                     (str)
critic_opt.lr:              3e-05                    (float)
critic_opt.eps:             1e-05                    (float)
critic_opt.clip:            100.0                    (float)
critic_opt.wd:              0.0                      (float)
critic_opt.warmup:          0                        (int)
critic_opt.lateclip:        0.0                      (float)
actor_dist_disc:            onehot                   (str)
actor_dist_cont:            normal                   (str)
actor_grad_disc:            reinforce                (str)
actor_grad_cont:            backprop                 (str)
critic_type:                vfunction                (str)
imag_horizon:               15                       (int)
imag_unroll:                False                    (bool)
horizon:                    333                      (int)
return_lambda:              0.95                     (float)
critic_slowreg:             logprob                  (str)
slow_critic_update:         1                        (int)
slow_critic_fraction:       0.02                     (float)
retnorm.impl:               perc_ema                 (str)
retnorm.decay:              0.99                     (float)
retnorm.max:                1.0                      (float)
retnorm.perclo:             5.0                      (float)
retnorm.perchi:             95.0                     (float)
actent:                     0.0003                   (float)
expl_rewards.extr:          1.0                      (float)
expl_rewards.disag:         0.1                      (float)
expl_opt.opt:               adam                     (str)
expl_opt.lr:                0.0001                   (float)
expl_opt.eps:               1e-05                    (float)
expl_opt.clip:              100.0                    (float)
expl_opt.wd:                0.0                      (float)
expl_opt.warmup:            0                        (int)
disag_head.layers:          2                        (int)
disag_head.units:           512                      (int)
disag_head.act:             silu                     (str)
disag_head.norm:            layer                    (str)
disag_head.dist:            mse                      (str)
disag_head.outscale:        1.0                      (float)
disag_head.inputs:          [deter, stoch, action]   (strs)
disag_head.winit:           normal                   (str)
disag_head.fan:             avg                      (str)
disag_target:               [stoch]                  (strs)
disag_models:               8                        (int)
Encoder CNN shapes: {'image': (64, 64, 3)}
Encoder MLP shapes: {}
Decoder CNN shapes: {'image': (64, 64, 3)}
Decoder MLP shapes: {}
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /embodied/dreamerv3/train.py:206 in <module>                                                     │
│                                                                                                  │
│   203                                                                                            │
│   204                                                                                            │
│   205 if __name__ == '__main__':                                                                 │
│ ❱ 206   main()                                                                                   │
│   207                                                                                            │
│                                                                                                  │
│ /embodied/dreamerv3/train.py:49 in main                                                          │
│                                                                                                  │
│    46 │     replay = make_replay(config, logdir / 'replay')                                      │
│    47 │     env = make_envs(config)                                                              │
│    48 │     cleanup.append(env)                                                                  │
│ ❱  49 │     agent = agt.Agent(env.obs_space, env.act_space, step, config)                        │
│    50 │     embodied.run.train(agent, env, replay, logger, args)                                 │
│    51 │                                                                                          │
│    52 │   elif args.script == 'train_save':                                                      │
│                                                                                                  │
│ /embodied/dreamerv3/jaxagent.py:20 in __init__                                                   │
│                                                                                                  │
│    17 │   configs = agent_cls.configs                                                            │
│    18 │   inner = agent_cls                                                                      │
│    19 │   def __init__(self, *args, **kwargs):                                                   │
│ ❱  20 │     super().__init__(agent_cls, *args, **kwargs)                                         │
│    21   return Agent                                                                             │
│    22                                                                                            │
│    23                                                                                            │
│                                                                                                  │
│ /embodied/dreamerv3/jaxagent.py:35 in __init__                                                   │
│                                                                                                  │
│    32 │   self.agent = agent_cls(obs_space, act_space, step, config, name='agent')               │
│    33 │   self.rng = np.random.default_rng(config.seed)                                          │
│    34 │                                                                                          │
│ ❱  35 │   available = jax.devices(self.config.platform)                                          │
│    36 │   self.policy_devices = [available[i] for i in self.config.policy_devices]               │
│    37 │   self.train_devices = [available[i] for i in self.config.train_devices]                 │
│    38 │   self.single_device = (self.policy_devices == self.train_devices) and (                 │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/jax/_src/xla_bridge.py:758 in devices                     │
│                                                                                                  │
│   755   Returns:                                                                                 │
│   756 │   List of Device subclasses.                                                             │
│   757   """                                                                                      │
│ ❱ 758   return get_backend(backend).devices()                                                    │
│   759                                                                                            │
│   760                                                                                            │
│   761 def default_backend() -> str:                                                              │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/jax/_src/xla_bridge.py:692 in get_backend                 │
│                                                                                                  │
│   689 def get_backend(                                                                           │
│   690 │   platform: Union[None, str, xla_client.Client] = None                                   │
│   691 ) -> xla_client.Client:                                                                    │
│ ❱ 692   return _get_backend_uncached(platform)                                                   │
│   693                                                                                            │
│   694                                                                                            │
│   695 def get_device_backend(                                                                    │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/jax/_src/xla_bridge.py:675 in _get_backend_uncached       │
│                                                                                                  │
│   672                                                                                            │
│   673   bs = backends()                                                                          │
│   674   if platform is not None:                                                                 │
│ ❱ 675 │   platform = canonicalize_platform(platform)                                             │
│   676 │   backend = bs.get(platform, None)                                                       │
│   677 │   if backend is None:                                                                    │
│   678 │     if platform in _backends_errors:                                                     │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/jax/_src/xla_bridge.py:548 in canonicalize_platform       │
│                                                                                                  │
│   545   for p in platforms:                                                                      │
│   546 │   if p in b.keys():                                                                      │
│   547 │     return p                                                                             │
│ ❱ 548   raise RuntimeError(f"Unknown backend: '{platform}' requested, but no "                   │
│   549 │   │   │   │   │    f"platforms that are instances of {platform} are present. "           │
│   550 │   │   │   │   │    "Platforms present are: " + ",".join(b.keys()))                       │
│   551                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Unknown backend: 'gpu' requested, but no platforms that are instances of gpu are 
present. Platforms present are: cpu
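The traceback shows `jax.devices(self.config.platform)` raising `RuntimeError` when the requested platform is missing. A pre-flight check that reports this before the agent is built might look like the sketch below. `select_platform` is a hypothetical helper and is not part of dreamerv3. It takes the device-listing function as a parameter, on the assumption that it behaves like `jax.devices(backend)` does in the traceback.

```python
from typing import Callable, List, Tuple


def select_platform(
    list_devices: Callable[[str], List[object]],
    preferred: str = "gpu",
    fallback: str = "cpu",
) -> Tuple[str, List[object]]:
    """Return (platform, devices), falling back if `preferred` is missing.

    `list_devices` is assumed to behave like `jax.devices(backend)`, which
    raises RuntimeError for an unavailable backend (as in the traceback).
    """
    try:
        return preferred, list_devices(preferred)
    except RuntimeError as err:
        # Report the problem once, then continue on the fallback platform.
        print(f"{preferred!r} backend unavailable ({err}); using {fallback!r}")
        return fallback, list_devices(fallback)
```

Inside the container this could be called as `select_platform(jax.devices)` after `import jax`. It only makes the failure easier to spot and does not make the GPU visible.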

Expected Result:

Suggestions:

System Information

nvidia-smi output:

Thu Aug 24 20:12:33 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   49C    P8    25W / 350W |    428MiB / 12288MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2132      G   /usr/lib/xorg/Xorg                125MiB |
|    0   N/A  N/A      2298      G   /usr/bin/gnome-shell               31MiB |
|    0   N/A  N/A      4245      G   ...9/usr/lib/firefox/firefox      269MiB |
+-----------------------------------------------------------------------------+

Environment Inside the Docker Container

`nvcc --version` output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Oct_11_21:27:02_PDT_2021
Cuda compilation tools, release 11.4, V11.4.152
Build cuda_11.4.r11.4/compiler.30521435_0

So the CUDA version is 11.4, as expected.
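The "CUDA Version" shown in the nvidia-smi header is the highest CUDA version the host driver supports, so it is worth confirming that the toolkit inside the container does not exceed it. A minimal sketch of that comparison, using the two lines quoted in this issue as input, is below. The function names are hypothetical and the regular expressions assume the output formats shown here.

```python
import re
from typing import Tuple


def _find_version(pattern: str, text: str) -> Tuple[int, int]:
    """Return the (major, minor) pair captured by `pattern` in `text`."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"no match for {pattern!r}")
    return int(match.group(1)), int(match.group(2))


def toolkit_version(nvcc_output: str) -> Tuple[int, int]:
    # Matches e.g. "Cuda compilation tools, release 11.4, V11.4.152".
    return _find_version(r"release (\d+)\.(\d+)", nvcc_output)


def driver_cuda_version(smi_output: str) -> Tuple[int, int]:
    # Matches the "CUDA Version: 12.0" field of the nvidia-smi header.
    return _find_version(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)


nvcc_line = "Cuda compilation tools, release 11.4, V11.4.152"
smi_line = "| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |"
toolkit = toolkit_version(nvcc_line)
driver = driver_cuda_version(smi_line)
print(toolkit, driver, toolkit <= driver)  # (11, 4) (12, 0) True
```

With these values the driver supports the container's toolkit, which suggests the missing GPU backend is not caused by a driver/toolkit version mismatch.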

cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR output:

#define CUDNN_MAJOR 8
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

So cuDNN is version 8, as expected.
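The grep above only surfaces the major version. A small sketch that pulls whichever of the `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` defines are present out of the header text is shown below. `parse_cudnn_version` is a hypothetical helper, and the sample input is just the two lines quoted in this issue.

```python
import re
from typing import Dict


def parse_cudnn_version(header_text: str) -> Dict[str, int]:
    """Extract integer CUDNN_MAJOR/MINOR/PATCHLEVEL defines from header text."""
    pattern = re.compile(
        r"^#define[ \t]+CUDNN_(MAJOR|MINOR|PATCHLEVEL)[ \t]+(\d+)[ \t]*$",
        re.MULTILINE,
    )
    return {name.lower(): int(value) for name, value in pattern.findall(header_text)}


sample = (
    "#define CUDNN_MAJOR 8\n"
    "#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)\n"
)
print(parse_cudnn_version(sample))  # {'major': 8}
```

Inside the container, the contents of /usr/include/cudnn_version.h could be passed in instead of `sample` to see the minor and patch levels as well.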

pip list output:

---------------------------- --------------------
absl-py                      1.4.0
asttokens                    2.2.1
astunparse                   1.6.3
atari-py                     0.2.9
backcall                     0.2.0
backports.zoneinfo           0.2.1
bsuite                       0.3.5
bullet                       2.2.0
cachetools                   5.3.1
certifi                      2019.11.28
chardet                      3.0.4
chex                         0.1.7
cloudpickle                  1.6.0
coloredlogs                  15.0.1
contourpy                    1.1.0
crafter                      1.8.1
cycler                       0.11.0
daemoniker                   0.2.3
dbus-python                  1.2.16
decorator                    5.1.1
deepmind-lab                 1.0
dill                         0.3.7
distro-info                  0.23+ubuntu1.1
dm-control                   1.0.14
dm-env                       1.6
dm-tree                      0.1.8
exceptiongroup               1.1.3
executing                    1.2.0
filelock                     3.12.2
flatbuffers                  23.5.26
fonttools                    4.42.1
frozendict                   2.3.8
gast                         0.4.0
getch                        1.0
glfw                         2.6.2
google-auth                  2.22.0
google-auth-oauthlib         1.0.0
google-pasta                 0.2.0
grpcio                       1.57.0
gym                          0.19.0
h5py                         3.9.0
humanfriendly                10.0
idna                         2.8
imageio                      2.31.1
importlib-metadata           6.8.0
importlib-resources          6.0.1
inflection                   0.5.1
iniconfig                    2.0.0
ipython                      8.12.2
jax                          0.4.13
jaxlib                       0.4.13
jedi                         0.19.0
Jinja2                       3.1.2
keras                        2.13.1
kiwisolver                   1.4.5
labmaze                      1.0.6
lazy_loader                  0.3
libclang                     16.0.6
lxml                         4.9.3
Markdown                     3.4.4
markdown-it-py               3.0.0
MarkupSafe                   2.1.3
matplotlib                   3.7.2
matplotlib-inline            0.1.6
mdurl                        0.1.2
minerl                       0.4.4
mizani                       0.9.2
ml-dtypes                    0.2.0
msgpack                      1.0.5
mujoco                       2.3.7
networkx                     3.1
numpy                        1.24.3
oauthlib                     3.2.2
opencv-python                4.8.0.76
opensimplex                  0.4.5
opt-einsum                   3.3.0
optax                        0.1.7
packaging                    23.1
pandas                       2.0.3
parso                        0.8.3
patsy                        0.5.3
pexpect                      4.8.0
pickleshare                  0.7.5
Pillow                       10.0.0
pip                          23.2.1
plotnine                     0.12.2
pluggy                       1.2.0
prompt-toolkit               3.0.39
protobuf                     4.24.1
psutil                       5.9.5
ptyprocess                   0.7.0
pure-eval                    0.2.2
pyasn1                       0.5.0
pyasn1-modules               0.3.0
Pygments                     2.16.1
PyGObject                    3.36.0
PyOpenGL                     3.1.7
pyparsing                    3.0.9
Pyro4                        4.82
pytest                       7.4.0
python-apt                   2.0.1+ubuntu0.20.4.1
python-dateutil              2.8.2
pytz                         2023.3
PyWavelets                   1.4.1
pyzmq                        25.1.1
requests                     2.22.0
requests-oauthlib            1.3.1
requests-unixsocket          0.2.0
rich                         13.5.2
robodesk                     1.0.0
rsa                          4.9
ruamel.yaml                  0.17.32
ruamel.yaml.clib             0.2.7
scikit-image                 0.21.0
scipy                        1.10.1
serpent                      1.41
setuptools                   68.1.2
simple-term-menu             1.6.1
six                          1.16.0
stack-data                   0.6.2
statsmodels                  0.14.0
tensorboard                  2.13.0
tensorboard-data-server      0.7.1
tensorflow-cpu               2.13.0
tensorflow-estimator         2.13.0
tensorflow-io-gcs-filesystem 0.33.0
tensorflow-probability       0.21.0
termcolor                    2.3.0
tifffile                     2023.7.10
tomli                        2.0.1
toolz                        0.12.0
tqdm                         4.66.1
traitlets                    5.9.0
typing_extensions            4.5.0
tzdata                       2023.3
unattended-upgrades          0.1
urllib3                      1.25.8
wcwidth                      0.2.6
Werkzeug                     2.3.7
wheel                        0.34.2
wrapt                        1.15.0
xmltodict                    0.12.0
zipp                         3.16.2
zmq                          0.0.0

Additional Debugging Attempts

Jax Related

JAX Configuration

I tried running

import jax
jax.config.update("jax_platform_name", "gpu")

JAX GPU Detection

import jax

print(jax.devices())

yields

2023-08-25 00:34:00.086311: I external/xla/xla/pjrt/tfrt_cpu_pjrt_client.cc:458] TfrtCpuClient created.
2023-08-25 00:34:00.087434: I external/xla/xla/stream_executor/tpu/tpu_initializer_helper.cc:269] Libtpu path is: libtpu.so
2023-08-25 00:34:00.087837: I external/xla/xla/stream_executor/tpu/tpu_initializer_helper.cc:277] Failed to open libtpu: libtpu.so: cannot open shared object file: No such file or directory
2023-08-25 00:34:00.087959: I external/xla/xla/stream_executor/tpu/tpu_platform_interface.cc:73] No TPU platform found.

No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[CpuDevice(id=0)]

with the log levels set as indicated.
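For anyone reproducing this, a minimal sketch of a per-backend probe may help separate "JAX has no GPU build" from "the GPU backend failed to initialize". The helper name `probe_jax_backends` is hypothetical; it relies only on `jax.default_backend()` and `jax.devices(backend)`, and it assumes that a backend which cannot be initialized raises `RuntimeError`.

```python
def probe_jax_backends(names=("cpu", "gpu", "tpu")):
    """Return a dict mapping each backend name to its devices or an error string."""
    try:
        import jax
    except ImportError as exc:
        # JAX itself is missing, so there is nothing to probe.
        return {"import_error": str(exc)}

    report = {"default": jax.default_backend()}
    for name in names:
        try:
            report[name] = [str(device) for device in jax.devices(name)]
        except RuntimeError as exc:
            # Assumption: an unavailable backend surfaces as RuntimeError.
            report[name] = f"unavailable: {exc}"
    return report


if __name__ == "__main__":
    for key, value in probe_jax_backends().items():
        print(f"{key}: {value}")
```

Run inside the container, the message recorded for `gpu` is usually more specific than the generic "falling back to CPU" warning.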

TensorFlow Related

GPU Test

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))

results in an empty list [], so this GPU recognition problem is not isolated to JAX.

Installation in Dockerfile

For a while I suspected that the problem was that the Dockerfile installs only tensorflow_probability and tensorflow-cpu. I changed the Dockerfile to install tensorflow==2.13.* instead, but the error remained unchanged.
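A quick way to check which builds actually ended up in the image is to inspect the installed distributions rather than the Dockerfile. The sketch below is only a heuristic: the function names are hypothetical, and it assumes that CUDA-enabled jaxlib wheels from the jax releases index carry a local version suffix containing "cuda" (for example `+cuda11.cudnn86`), which may not hold for other JAX releases.

```python
from importlib import metadata


def gpu_build_warnings(packages):
    """Return warnings for a mapping of distribution name -> version string."""
    normalized = {
        name.lower().replace("_", "-"): version
        for name, version in packages.items()
    }
    warnings = []
    if "tensorflow-cpu" in normalized:
        warnings.append("tensorflow-cpu is installed; this build has no CUDA support")
    jaxlib_version = normalized.get("jaxlib")
    if jaxlib_version is None:
        warnings.append("jaxlib is not installed")
    elif "cuda" not in jaxlib_version:
        # Assumption: CUDA wheels are tagged with a '+cuda...' local version.
        warnings.append(
            f"jaxlib {jaxlib_version} has no 'cuda' suffix; it may be a CPU-only wheel"
        )
    return warnings


def installed_packages():
    """Collect name -> version for every distribution visible to this interpreter."""
    return {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
        if dist.metadata["Name"]
    }


if __name__ == "__main__":
    for warning in gpu_build_warnings(installed_packages()):
        print("WARNING:", warning)
```

The version strings used to exercise this are made up for illustration; run it with `python3` inside the container to see what the image really contains.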

Environment Variables

These may or may not be relevant

Base Images

I tried various base images including

danijar commented 1 year ago

Hi, apologies for not being able to read the whole report. Did you try running nvidia-smi in the base image to test your nvidia docker setup (independently of the DreamerV3 Dockerfile)?

masonhargrave commented 1 year ago

> Hi, apologies for not being able to read the whole report. Did you try running nvidia-smi in the base image to test your nvidia docker setup (independently of the DreamerV3 Dockerfile)?

Yup! Running nvidia-smi in the base image works just fine (see the very first section of "Steps to Reproduce" as well as the nvidia-smi output under the System Information heading).

masonhargrave commented 1 year ago

Update on Dockerfile Issue and Progress

I've made some progress and mostly resolved the Docker build and run issues. Below are the changes made to the Dockerfile along with explanations.

Changes and Explanations:

  1. CUDA Image Version:

    • Original: FROM nvidia/cuda:11.4.2-cudnn8-devel-ubuntu20.04
    • Updated: FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
      The original base image cuda:11.4.2-cudnn8-runtime-ubuntu20.04 no longer exists, so I updated it to a later version that is compatible with the latest version of TensorFlow (2.13).
  2. COPY Command:

    • Original: COPY scripts scripts
    • Updated: COPY dreamerv3/embodied/scripts scripts
      The Docker build was failing because of the wrong path, so I updated it. (Please check whether this is the best way to fix it: the file system inside the Docker container now has some strange nested directories, but hey, it works!)
  3. TensorFlow and cuDNN Setup:

    • The original Dockerfile installed tensorflow-cpu, which led to the GPU not being visible in the container.
    • Used Miniconda to install TensorFlow and cuDNN compatible with the CUDA version.
    • Set the LD_LIBRARY_PATH environment variable for cuDNN.
  4. Environment Variables:

    • Original: ENV MUJOCO_GL egl
    • Updated: ENV MUJOCO_GL=osmesa
      Changed the MUJOCO_GL environment variable. (Please check if this is acceptable or not)
  5. Agent Dependencies:

    • Original: RUN pip3 install jax[cuda11_cudnn82] -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
    • Updated: RUN pip3 install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
      Updated the jax package installation to be compatible with the new CUDA version.
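Since step 3 depends on LD_LIBRARY_PATH pointing at cuDNN, it can be worth confirming from inside the running container that the variable is set and that the directories it lists contain the library. Variables exported by a script that is sourced in a RUN step do not persist into later layers or into the container, so this is easy to get wrong. The following is a minimal sketch with a hypothetical helper name, assuming the libraries follow the usual `libcudnn*.so*` naming.

```python
import os
from pathlib import Path


def find_shared_libs(search_path, prefix="libcudnn"):
    """Map each directory in a path-separated string to its matching .so files.

    A value of None means the directory is listed but does not exist.
    """
    found = {}
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue  # skip empty segments such as a trailing separator
        directory = Path(entry)
        if directory.is_dir():
            found[entry] = sorted(p.name for p in directory.glob(f"{prefix}*.so*"))
        else:
            found[entry] = None
    return found


if __name__ == "__main__":
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    if not search_path:
        print("LD_LIBRARY_PATH is empty or unset")
    for directory, libs in find_shared_libs(search_path).items():
        print(directory, "->", "missing directory" if libs is None else libs)
```

If every directory reports an empty list, the dynamic loader is probably finding cuDNN through its default search path (for example from the base image) rather than through LD_LIBRARY_PATH.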

Remaining Issue:

Everything seems to work now, except that running install-atari.sh and install-minecraft.sh still produces errors. These issues have been discussed elsewhere (#79), but as I don't personally need either of them, I'm leaving them commented out for now.

Dockerfile


# Prerequisites: Ensure you have installed the NVIDIA Container Toolkit as per https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
#
# 1. Test setup:
# docker run -it --rm --gpus all nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04 nvidia-smi
#
# If the above does not work, try adding the --privileged flag
# and changing the command to `sh -c 'ldconfig -v && nvidia-smi'`.
#
# 2. Start training:
# docker build -f dreamerv3/Dockerfile -t img . && \
# docker run -it --rm --gpus all -v ~/logdir:/logdir img \
#   sh dreamerv3/embodied/scripts/xvfb_run.sh python3 dreamerv3/train.py \
#   --logdir "/logdir/$(date +%Y%m%d-%H%M%S)" \
#   --configs dmc_vision --task dmc_walker_walk
#
# 3. See results:
# tensorboard --logdir ~/logdir

# System
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=America/San_Francisco
ENV PYTHONUNBUFFERED 1
ENV PIP_DISABLE_PIP_VERSION_CHECK 1
ENV PIP_NO_CACHE_DIR 1
RUN apt-get update && apt-get install -y \
  ffmpeg git python3-pip vim libglew-dev \
  x11-xserver-utils xvfb curl libegl1-mesa \
  && apt-get clean

# TensorFlow Install 
RUN curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
RUN bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda
ENV PATH /opt/conda/bin:$PATH
RUN conda update -n base -c defaults conda
RUN conda install -c conda-forge cudatoolkit=11.8.0
RUN python -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.13.*
RUN mkdir -p $CONDA_PREFIX/etc/conda/activate.d
RUN echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
RUN echo 'export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
RUN bash -c "source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh"

RUN pip3 install --upgrade pip

# Envs
ENV MUJOCO_GL=osmesa
COPY dreamerv3/embodied/scripts scripts
RUN sh scripts/install-dmlab.sh
# RUN sh scripts/install-atari.sh
# RUN sh scripts/install-minecraft.sh
ENV NUMBA_CACHE_DIR=/tmp
RUN pip3 install crafter
RUN pip3 install dm_control
RUN pip3 install robodesk
RUN pip3 install bsuite

# Agent
RUN pip3 install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
RUN pip3 install jaxlib
RUN pip3 install tensorflow_probability
RUN pip3 install optax
ENV XLA_PYTHON_CLIENT_MEM_FRACTION 0.8

# Google Cloud DNS cache (optional)
ENV GCS_RESOLVE_REFRESH_SECS=60
ENV GCS_REQUEST_CONNECTION_TIMEOUT_SECS=300
ENV GCS_METADATA_REQUEST_TIMEOUT_SECS=300
ENV GCS_READ_REQUEST_TIMEOUT_SECS=300
ENV GCS_WRITE_REQUEST_TIMEOUT_SECS=600

# Embodied
RUN pip3 install numpy cloudpickle ruamel.yaml rich zmq msgpack
COPY . /embodied
RUN chown -R 1000:root /embodied && chmod -R 775 /embodied

WORKDIR embodied

sbhavani commented 12 months ago

I'd recommend using a JAX base container from JAX Toolbox which is validated with a nightly CI on NVIDIA GPUs. The Dockerfiles are open to modify as well.

danijar commented 5 months ago

Hi all, is this still an issue with the updated code? It's working well for me.