facebookresearch / torchbeast

A PyTorch Platform for Distributed RL
Apache License 2.0

polybeast is slower than monobeast #10

Closed Da-Capo closed 4 years ago

Da-Capo commented 4 years ago

I built the CUDA Docker container like this, and tested monobeast and polybeast with almost the same parameters, shown below:

python -m torchbeast.monobeast \
     --env PongNoFrameskip-v4 \
     --num_actors 64 \
     --total_steps 30000000 \
     --learning_rate 0.0004 \
     --epsilon 0.01 \
     --entropy_cost 0.01 \
     --batch_size 4 \
     --unroll_length 80 \
     --num_buffers 60 \
     --num_threads 4 \
     --xpid example

python -m torchbeast.polybeast \
     --env PongNoFrameskip-v4 \
     --num_actors 64 \
     --total_steps 30000000 \
     --learning_rate 0.0004 \
     --epsilon 0.01 \
     --entropy_cost 0.01 \
     --batch_size 4 \
     --unroll_length 80 \
     --xpid example

The result is that polybeast is slower than monobeast: monobeast runs at about 10000 SPS, while polybeast runs at about 3000 SPS. I have checked the GPU and it works fine. monobeast used 100% of every CPU core, but polybeast used only 50% of every CPU core. How can I speed up polybeast?
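(For anyone reproducing the utilization check above, a minimal sketch using standard Linux/NVIDIA tools, nothing torchbeast-specific:)

# Watch per-GPU utilization, refreshed every second, while a run is going.
nvidia-smi -l 1

# Watch CPU usage; press `1` inside top to show each core separately.
top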

heiner commented 4 years ago

Hey Da-Capo, thanks for your report.

To start with, a batch size of 4 isn't very large. The reason your CPUs are less busy with polybeast is that the actor forward passes ("inference") happen on the GPU in that case. Options include increasing the batch size and experimenting with the number of learner/inference threads.
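For illustration, here is the polybeast command from above with only the batch size increased; 32 is an arbitrary example value to tune, not a recommendation from this thread:

python -m torchbeast.polybeast \
     --env PongNoFrameskip-v4 \
     --num_actors 64 \
     --total_steps 30000000 \
     --learning_rate 0.0004 \
     --epsilon 0.01 \
     --entropy_cost 0.01 \
     --batch_size 32 \
     --unroll_length 80 \
     --xpid example_batch32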

MachengShen commented 3 years ago

I'm having a similar issue on an Ubuntu machine with 32 CPU cores and 4 V100 GPUs. With monobeast, it only uses 1 GPU and full CPU power, and the frame rate is ~5000 SPS; with polybeast, I set batch_size=16 and num_inference/learner_threads=8, but the frame rate is only ~300 SPS, and only 2 GPUs are running. Were you able to speed up polybeast? Can you share some insight with me? Thanks!
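(A minimal way to see what polybeast is actually doing with the four GPUs, and to confirm which tuning flags this build exposes before changing them; the thread-count flag names above are taken from this comment, and --help is the ground truth for what actually exists:)

# Show which process is using which GPU while polybeast is running.
nvidia-smi

# List the flags this polybeast build actually exposes (e.g. the
# num_inference_threads / num_learner_threads settings mentioned above).
python -m torchbeast.polybeast --help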