Closed SergioArnaud closed 2 years ago
I managed to make EfficientZero train. The balance between GPUs, CPUs, GPU actors, and CPU actors matters more than I would have thought: with 28 GPU actors, the GPUs filled up and nothing else could be scheduled, causing an infinite loop in one of the while statements. I changed `gpu_actor` to 24 and now EfficientZero is training without a problem.
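For what it's worth, this failure mode looks like a resource-scheduling deadlock: if the GPU actors alone consume the whole GPU budget, the trainer and other GPU tasks can never be placed, so the launch loop spins forever. A minimal sanity check along those lines (the function name and all the numbers below are hypothetical, not from the EfficientZero config or my cluster):

```python
def actors_fit(total_gpus, gpu_per_actor, num_actors, reserved_for_trainer=0.0):
    """Check whether the GPU actors' combined demand still leaves enough
    GPU capacity for the rest of the pipeline (trainer, reanalyze, etc.).

    All quantities are fractional GPUs, the way Ray schedules them.
    """
    demand = num_actors * gpu_per_actor
    return demand <= total_gpus - reserved_for_trainer

# Hypothetical budget: 4 GPUs, 0.125 GPU per actor, 1 full GPU for the trainer.
print(actors_fit(4, 0.125, 24, reserved_for_trainer=1.0))  # True  -> schedulable
print(actors_fit(4, 0.125, 28, reserved_for_trainer=1.0))  # False -> actors starve the trainer
```

Doing this arithmetic up front is cheaper than discovering the deadlock from a silent hang.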
I also upgraded to ray==1.9, which helped a lot with debugging.
In general, it seems I can get roughly 1 out of several experiments to train; the distributed nature of the agent makes for a really unstable training experience. Do you have any recommendations for replicating the runs from the paper?
Hi, first of all congratulations on the great work!
I haven't managed to train an agent yet using the EfficientZero framework. The command I'm using to train is the following:
In a cluster with the following architecture:
The problem I'm facing is that even after a while of training there's only the following log:
Also, the results folder of the experiment is mostly empty; I only have a `train.log` with the initial parameters.
I'm not sure whether this is just a matter of waiting for a long time or whether something in the inner workings is stuck (it looks like the `batch_storage` from the main train loop is always empty, since we haven't entered the train phase yet). Something I find really weird is that time passes but the GPU Memory-Usage stays exactly the same, which makes me think something is off.
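That diagnosis fits the usual shape of the main loop: a training thread spins on a shared batch queue while worker actors fill it, so if the producers never get scheduled, the queue stays empty and GPU memory usage never changes. A minimal sketch of that producer/consumer shape (the names and structure here are illustrative, not EfficientZero's actual code):

```python
import queue
import threading

batch_storage = queue.Queue(maxsize=20)  # bounded buffer of training batches

def producer(n_batches):
    """Stand-in for the batch workers: push fake batches into storage."""
    for i in range(n_batches):
        batch_storage.put({"batch_id": i})

def train_loop(n_batches, poll_interval=0.01):
    """Stand-in for the trainer: wait while storage is empty, then consume.

    If the producers never start (e.g. they cannot be scheduled), this loop
    spins forever -- exactly the 'stuck before the train phase' symptom.
    """
    consumed = 0
    while consumed < n_batches:
        try:
            batch_storage.get(timeout=poll_interval)
        except queue.Empty:
            continue  # nothing to train on yet; keep waiting
        consumed += 1
    return consumed

t = threading.Thread(target=producer, args=(5,))
t.start()
print(train_loop(5))  # 5, once the producer has filled the queue
t.join()
```

With this shape, an empty `batch_storage` plus flat GPU memory usually means the producer side is blocked, not that training is merely slow.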
I would appreciate any advice on how to make this work. Thanks in advance!