facebookresearch / torchbeast

A PyTorch Platform for Distributed RL
Apache License 2.0

polybeast is slower than monobeast #10

Closed Da-Capo closed 4 years ago

Da-Capo commented 4 years ago

I built the CUDA Docker container like this, and tested monobeast and polybeast with almost the same parameters, shown below:

python -m torchbeast.monobeast \
     --env PongNoFrameskip-v4 \
     --num_actors 64 \
     --total_steps 30000000 \
     --learning_rate 0.0004 \
     --epsilon 0.01 \
     --entropy_cost 0.01 \
     --batch_size 4 \
     --unroll_length 80 \
     --num_buffers 60 \
     --num_threads 4 \
     --xpid example

python -m torchbeast.polybeast \
     --env PongNoFrameskip-v4 \
     --num_actors 64 \
     --total_steps 30000000 \
     --learning_rate 0.0004 \
     --epsilon 0.01 \
     --entropy_cost 0.01 \
     --batch_size 4 \
     --unroll_length 80 \
     --xpid example

The result is that polybeast is slower than monobeast: monobeast runs at about 10000 SPS, while polybeast runs at about 3000 SPS. I have checked the GPU and it works fine. monobeast used 100% of every CPU core, but polybeast used only 50% of every CPU core. How can I speed up polybeast?
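(For anyone reproducing the utilization check above, a minimal sketch using standard Linux/NVIDIA tools, nothing torchbeast-specific:)

# Watch per-GPU utilization, refreshed every second, while a run is going.
nvidia-smi -l 1

# Watch CPU usage; press `1` inside top to show each core separately.
top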

heiner commented 4 years ago

Hey Da-Capo, thanks for your report.

To start with, a batch size of 4 isn't very large. The reason your CPUs are less busy with polybeast is that the actor forward passes ("inference") happen on the GPU in that case. Options include increasing the batch size and experimenting with the number of learner/inference threads.
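For illustration, here is the polybeast command from above with only the batch size increased; 32 is an arbitrary example value to tune, not a recommendation from this thread:

python -m torchbeast.polybeast \
     --env PongNoFrameskip-v4 \
     --num_actors 64 \
     --total_steps 30000000 \
     --learning_rate 0.0004 \
     --epsilon 0.01 \
     --entropy_cost 0.01 \
     --batch_size 32 \
     --unroll_length 80 \
     --xpid example_batch32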

MachengShen commented 3 years ago

I'm having a similar issue on an Ubuntu machine with 32 CPU cores and 4 V100 GPUs. With monobeast, it only uses 1 GPU and full CPU power, and the frame rate is ~5000 SPS; with polybeast, I set batch_size=16 and num_inference/learner_threads=8, but the frame rate is only ~300 SPS, and only 2 GPUs are running. Were you able to speed up polybeast? Can you share some insight with me? Thanks!
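(A minimal way to see what polybeast is actually doing with the four GPUs, and to confirm which tuning flags this build exposes before changing them; the thread-count flag names above are taken from this comment, and --help is the ground truth for what actually exists:)

# Show which process is using which GPU while polybeast is running.
nvidia-smi

# List the flags this polybeast build actually exposes (e.g. the
# num_inference_threads / num_learner_threads settings mentioned above).
python -m torchbeast.polybeast --help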