Digital-Humans-23 / a2


Training time bob #10

Closed by steinraf 1 year ago

steinraf commented 1 year ago

I have been running all the runs locally on my laptop, where I get a speedup of around 4x compared to Euler (about 1.5 hours for the dog, compared to around 6.5 hours according to people running it on Euler).

For the humanoid standing task it took me 11 hours, and for people on Euler it also seems to be taking significantly longer than estimated (projected duration of 35-40 hours), leading to timeouts when only asking for 18 hours of compute time.

Has something changed in the environment, or is this just how it is?

Thanks!

MiguelZamoraM commented 1 year ago

Thanks for bringing this up!

Running experiments on your own machine is definitely a possibility, and a good idea if you have a relatively good machine. Usually, the CPUs on a server don't run at their maximum frequency (which helps to increase the lifespan of the CPUs), so it is reasonable that you get some speedup when running experiments locally.

As not everybody has access to the same type of resources, we decided to set things up so that everybody could run experiments on the server. In our internal tests, we were able to run all the experiments in less than 18 hours. As a heuristic, it is recommended not to run jobs that last more than 24 hours, so there is still some margin beyond those 18 hours.

IMPORTANT: If the job times out after 24 hours without completing the training, it is OK to include the Weights & Biases link of that experiment, even if the training was not complete.
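
As a rough illustration of the numbers in this thread, the sketch below computes the reported laptop speedup for the dog and checks the projected humanoid durations against the 18-hour and 24-hour limits. The function names `speedup` and `fits_time_limit` are hypothetical helpers made up for this example and are not part of the assignment code.

```python
# Minimal sketch using only the figures quoted in this thread.
# `speedup` and `fits_time_limit` are hypothetical helper names.


def speedup(server_hours: float, local_hours: float) -> float:
    """Return how many times faster the local run is than the server run."""
    return server_hours / local_hours


def fits_time_limit(projected_hours: float, limit_hours: float) -> bool:
    """Return True if a job with the projected duration finishes within the limit."""
    return projected_hours <= limit_hours


# Dog task: about 6.5 hours on Euler versus about 1.5 hours on the laptop.
print(f"dog speedup: {speedup(6.5, 1.5):.1f}x")

# Humanoid task: projected 35-40 hours on Euler, checked against both limits.
for limit in (18, 24):
    for projected in (35, 40):
        status = "fits" if fits_time_limit(projected, limit) else "times out"
        print(f"projected {projected} h with a {limit} h limit: {status}")
```

With these figures, neither the 18-hour request nor the 24-hour heuristic covers a 35-40 hour projection, which is the case the note above about partial runs addresses.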

MiguelZamoraM commented 1 year ago

I made an announcement on a separate issue regarding training times.

Milkiananas commented 1 year ago

Indeed, this demo is better suited to training on a PC than on servers. The dog takes ~30 min on my gaming PC...