facebookresearch / hanabi_SAD

Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning

Questions about agent types and performance #9

Open rocanaan opened 4 years ago

rocanaan commented 4 years ago

Hello, and thanks for making the code available!

After building torch from source, I started trying to train my own agent by executing, as per the readme:

sh tools/dev.sh

The agent seems to train correctly, but quite slowly. Here's the output at epoch 220:

beginning of epoch:  220
available: 321.613 MB, used: 61.891 GB, free: 618.785 MB
EPOCH: 220
Speed: train: 18.4, act: 175.3, buffer_add: 2.7, buffer_size: 65536
Total Time: 116H 05M 22S, 417922s
Total Sample: train: 5.658M, act: 72.941M
[220] Time spent = 2778.70 s
220:grad_norm[ 400]: avg:  25.1127, min:   9.7113[ 336], max:  70.6275[ 374]
220:loss     [ 400]: avg:   5.0668, min:   3.9048[ 311], max:   6.3866[ 209]
epoch 220, eval score: 18.1290, perfect: 0.10, model saved: True

With epoch_len = 400, this works out to about 4.7 seconds per batch update averaged across training (or 6.9 s for the latest epoch). That seems quite slow: the paper mentions training the agent in 72 h, but at this rate it would take an order of magnitude longer.
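For reference, here is how I computed those numbers (a quick Python sanity check using the values from the log above):

# Per-batch timing implied by the epoch-220 log above.
total_seconds = 417922          # "Total Time" from the log
epochs = 220
batches_per_epoch = 400         # epoch_len
last_epoch_seconds = 2778.70    # "[220] Time spent"

print(total_seconds / (epochs * batches_per_epoch))  # ~4.7 s/batch, averaged
print(last_epoch_seconds / batches_per_epoch)        # ~6.9 s/batch, latest epoch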

While of course there could be some issue on my end, I would also like to ask a few clarifying questions, as it is not clear from the paper which agent (VDN, IQL, SAD, SAD + AUX) the 72 h estimate refers to. I am also not sure about the settings for the experiments reported in the paper: Figure 3 shows training performance over 1.4M epochs, but the provided training scripts run for 5k epochs (is it possible that the axis is mislabeled?).

With this in mind, below are my questions:

a) Do the settings in dev.sh correspond to the VDN baseline from the paper?
b) Do the settings in sad_2player.sh correspond to the SAD or the SAD+AUX agent from the paper?
c) What is the length of each epoch in Figure 3 of the paper?
d) I haven't yet been able to run the experiment from sad_2player.sh (see #10), but is there any reason to expect significantly different training speed between dev.sh and sad_2player.sh once I get both to work?

Thank you, Rodrigo Canaan

hengyuan-hu commented 4 years ago

a) No, vdn_2player.sh is the file for the VDN baseline.
b) sad_2player.sh is for SAD; the AUX implementation is currently missing from the code.
c) In the paper, 1 epoch means 1 gradient step. In the code, one epoch means K gradient steps, between which evaluation & printing are run.
d) The only difference between dev.sh & sad_2player.sh that will impact speed is the number of GPUs used. dev.sh uses the default setting in main.py, which is 1 GPU for training and 1 GPU for acting, while sad_2player.sh uses 1 for training and 2 for acting.

However, both your training & act speeds are significantly slower than expected. We are mainly concerned with 2 numbers here, "Speed: train: 18.4" & "buffer_add: 2.7", which mean the program processes 18.4 episodes/s in training and adds 2.7 episodes/s to the replay buffer (1 episode means the trajectory of 1 game). In our experiments / on our hardware, the program runs at 1000 episodes/s in training and ~1000 episodes/s in buffer_add.

Looking at the log, it seems the program has exhausted the memory of your machine and swap is being used, which will make the training & buffer speeds significantly slower. Normally the training requires ~250 GB of RAM to run. The number of CPU cores is also crucial: if the CPU count is small, you may want to reduce --num_thread 80 to avoid too much context switching. Our experiments were run on a 40-core, 80-hyperthread CPU.
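To make the epoch convention in (c) concrete, here is a toy sketch (not the actual training loop in main.py; the model, data, and loss below are placeholders):

# Toy illustration: in this codebase, 1 "epoch" = K gradient steps
# (epoch_len), with evaluation & printing run between epochs.
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
num_epoch, epoch_len = 3, 400   # epoch_len = K gradient steps

for epoch in range(num_epoch):
    for step in range(epoch_len):
        x = torch.randn(32, 8)           # stand-in for a sampled batch
        loss = model(x).pow(2).mean()    # stand-in for the RL loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # eval & print between epochs, like the "eval score" lines in the log
    print(f"epoch {epoch}: loss {loss.item():.4f}")

Under this convention, 5k code epochs correspond to 5k * epoch_len gradient steps, which is roughly how the provided scripts line up with the per-gradient-step axis in Figure 3.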
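As a quick pre-flight check for the memory and CPU constraints above, something like the following can flag swapping before you launch a run (a small diagnostic sketch using psutil, which is not part of this repo):

# Pre-flight hardware check: ~250 GB RAM and a high core count are
# assumed by the default settings. Requires: pip install psutil
import os
import psutil

mem = psutil.virtual_memory()
swap = psutil.swap_memory()
threads = os.cpu_count()

print(f"available RAM: {mem.available / 2**30:.1f} GiB")
print(f"swap in use:   {swap.used / 2**30:.1f} GiB")
print(f"CPU threads:   {threads}")

if mem.available < 250 * 2**30:
    print("warning: < ~250 GB free RAM; training will likely hit swap")
if swap.used > 0:
    print("warning: swap already in use; expect slow train/buffer_add speed")
if threads is not None and threads < 80:
    print(f"consider reducing --num_thread 80 toward {threads}")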