danijar / director

Deep Hierarchical Planning from Pixels
https://danijar.com/director/

"multi_gpu" and "multi_worker" configurations not working #5

Closed jdubkim closed 1 year ago

jdubkim commented 1 year ago

Hi, first of all, thank you so much for sharing such amazing work & code. I really loved the idea and the results of this paper, and I am trying to build some ideas on top of it. However, I have run into some problems. I trained the model on the dmc_vision dmc_walker_walk task using GPUs with 16GB and 24GB of VRAM, but received an out-of-memory error; changing the batch size to 1 did not fix it. Also, when I ran it on GPUs with smaller VRAM (8GB or 12GB), the training process got stuck after 8008 steps (about 3-5 minutes after training starts). The paper says training can be done in one day on a V100 GPU, which has 32GB of VRAM, so I was wondering whether I need a GPU with more VRAM to train this model. That seems likely, because running dmc_proprio worked without any problem; I think using a model with a CNN is what causes the issue. Is there a way to run training on a GPU with less VRAM?
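
In case it is useful, the only generic TF-side mitigation I know of is enabling memory growth so TF allocates VRAM on demand instead of grabbing it all at startup (just a sketch, and it only changes allocator behaviour, so it may well not be enough by itself):

```python
import tensorflow as tf

# Sketch: ask TF to grow GPU memory on demand instead of pre-allocating all
# VRAM at startup. This does not make the model smaller, but it can make OOM
# errors more informative and avoids failures caused by pre-allocation alone.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```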

Assuming that lack of VRAM is the problem, I also tried using multiple GPUs via the "multi_gpu" and "multi_worker" configurations in tfagent.py, but now I am getting a new error:

metrics.update(self.model_opt(model_tape, model_loss, modules))
    File "/vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/tfutils.py", line 246, in __call__  *
        self._opt.apply_gradients(
    File "/vol/bitbucket/xmbhrl/lib/python3.10/site-packages/keras/optimizer_v2/optimizer_v2.py", line 671, in apply_gradients
        return tf.__internal__.distribute.interim.maybe_merge_call(
RuntimeError: `merge_call` called while defining a new graph or a
tf.function. This can often happen if the function `fn` passed to
`strategy.run()` contains a nested `@tf.function`, and the nested `@tf.function`
contains a synchronization point, such as aggregating gradients (e.g,
optimizer.apply_gradients), or if the function `fn` uses a control flow
statement which contains a synchronization point in the body. Such behaviors are
not yet supported. Instead, please avoid nested `tf.function`s or control flow
statements that may potentially cross a synchronization boundary, for example,
wrap the `fn` passed to `strategy.run` or the entire `strategy.run` inside a
`tf.function` or move the control flow out of `fn`. If you are subclassing a
`tf.keras.Model`, please avoid decorating overridden methods `test_step` and
`train_step` in `tf.function`.
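
If I read the error correctly, it is complaining about the pattern sketched below: the fn passed to strategy.run contains a nested @tf.function that hits a synchronization point (optimizer.apply_gradients). The workaround TF suggests is to decorate only the outer call. This is just a toy illustration of that pattern, not the actual training loop in this repo:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  opt = tf.keras.optimizers.Adam(1e-3)

def step(x, y):
  # Plain Python function: no nested @tf.function around the sync point.
  with tf.GradientTape() as tape:
    loss = tf.reduce_mean((model(x) - y) ** 2)
  grads = tape.gradient(loss, model.trainable_variables)
  opt.apply_gradients(zip(grads, model.trainable_variables))  # Sync point.
  return loss

@tf.function  # Wrap the whole distributed call in a single tf.function.
def distributed_step(x, y):
  per_replica = strategy.run(step, args=(x, y))
  return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica, axis=None)

x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))
print(distributed_step(x, y).numpy())
```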

There's a high chance that I am using the wrong TensorFlow version, so please bear with me if my dependencies are wrong. I checked the Dockerfile and saw that it uses TensorFlow 2.8 or 2.9, but with 2.9, JIT compilation failed. It would be amazing if someone could share whether they are also facing similar issues or know a solution to this problem. Thank you so much.

I am using

danijar commented 1 year ago

Hi, that's a bug in TF/XLA/GPU and one of the reasons I've made the switch to JAX. Multi-GPU isn't supported in the code base, although you should be able to get it to work, especially when XLA is disabled.

jdubkim commented 1 year ago

Thank you so much for the clear answer! Just to make sure I understood correctly: is the training code for dmc_vision dmc_walker_walk supposed to work on a GPU with 16GB of VRAM without the bug (or maybe with XLA disabled)? Thank you so much again :)
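
In case it matters, this is how I would try switching XLA off globally, assuming there is no dedicated flag for it in the repo's configs (just a sketch):

```python
import os

# Sketch: set before importing TensorFlow so the flag is picked up.
# The repo's own config may already expose a cleaner jit/XLA switch.
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=0'

import tensorflow as tf

tf.config.optimizer.set_jit(False)  # Also disable auto-clustering at runtime.
```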

danijar commented 1 year ago

Yes, training on a single GPU (with or without XLA) is supposed to work, just not the multi_gpu config.

jdubkim commented 1 year ago

Thank you so much for the reply! I initially tried to run dmc_vision dmc_walker_walk on a single GPU (a T4 and another GPU with 24GB of VRAM), but received an OOM error on all of them (dmc_proprio was working fine, as it uses vector inputs). I am currently running it on a GPU cluster, and I also tested it on local GPUs with 8GB and 12GB of VRAM, but training gets stuck after 8008 steps, which is where the agent starts to learn. That is why I was trying to run with multiple GPUs.

I apologise for bothering you with this problem, but I was wondering if there is a way to run this with limited resources, for example by reducing the number of environments or the batch size. Or should I run it on a GPU with 32GB of VRAM, as in the paper? I am planning to explore Director as a route towards explainable reinforcement learning for my final-year project, so it would be amazing if you could provide some insight into fixing this issue. Thank you so much for your help and advice :)

This is the log & error message when running on a single GPU (Tesla T4): https://gist.github.com/jdubkim/7b84e201dea348f1c04e50d81b1f7239

Cmeo97 commented 1 year ago

Hi Danijar,

Thank you so much for releasing this code, it's super helpful! I had the same problem. I tried to run it using the multi_gpu config because I noticed that only roughly 500k steps get computed in one day for the loconav, pinpad, and dmc_vision environments. If I am not wrong, the paper mentions that training takes one day on a V100, and since I am using an RTX 8000, I was wondering why there is such a big difference. Should I change something in the config file? Thank you so much!

jdubkim commented 1 year ago

@Cmeo97 So I found out (I'm not 100% sure, but I could not find any other bottlenecks) that the fps printed by the logger is about 3-5, which is very low, and that is why training is so slow. I have not found a way to increase the fps, but it seems to be a synchronisation issue. Can you also check your log and see if the fps is low there as well?
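
One quick thing worth ruling out on both of our setups is TF silently falling back to the CPU, which would also explain single-digit fps (just a minimal check):

```python
import tensorflow as tf

# An empty list here means TF does not see the GPU and is running on the CPU.
print(tf.config.list_physical_devices('GPU'))
```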

danijar commented 1 year ago

Is this with the multi_gpu or multi_worker configs? The multi_gpu and multi_worker configs in this repo aren't fully implemented and weren't used for any of the paper experiments. Happy to help in a new ticket if running on a single V100 doesn't work.