Closed: fqjin closed this issue 5 years ago
Batch moving successfully implemented in cfc92db4f4fa7dcb713d3f44cb711dcaf91247c5. Speed comparisons coming...
Speed comparisons (in seconds) for various numbers of games: `play_fixed_batch` vs `move_batch` vs `merge_row_batch`. The result for 1 game is the average of 10 games; standard deviation in parentheses.

**CPU**
n | nobatch | batch |
---|---|---|
1 | 0.189 (0.073) | 0.468 (0.139) |
10 | 1.70 | 1.25 |
100 | 16.5 | 3.27 |
1000 | x | 17.6 |
**CUDA**
n | nobatch | batch |
---|---|---|
1 | 0.626 (0.244) | 1.14 (0.308) |
10 | 6.19 | 2.95 |
100 | 66.2 | 9.47 |
1000 | x | 63.9 |
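The "flipping torch tensors" in `move_batch` refers to the standard trick of re-orienting the whole batch of boards so that every move direction reduces to a single left-move kernel. A minimal sketch of that idea (the helper name and board layout are my assumptions, not the repo's API):

```python
import torch

def orient_left(boards, direction):
    """Re-orient a (n, 4, 4) batch of boards so any move becomes a left move.

    Hypothetical helper: after orienting, one batched left-merge kernel
    handles every direction; apply the same flips/transposes to undo.
    """
    if direction == 'right':
        return torch.flip(boards, dims=[2])    # reverse each row
    if direction == 'up':
        return boards.transpose(1, 2)          # columns become rows
    if direction == 'down':
        return torch.flip(boards.transpose(1, 2), dims=[2])
    return boards                              # 'left' needs no change

boards = torch.arange(32).reshape(2, 4, 4)
assert orient_left(boards, 'down').shape == (2, 4, 4)
```

The same transforms work for any batch size, so the merge logic only ever has to be written (and optimized) once.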
At 10000 games with 3 additional randomly generated tiles, `merge_row_batch` is about the same between GPU and CPU, and `move_batch` is 3 times slower on GPU. The code in `move_batch` involves iterating and appending lists, and flipping torch tensors. Functions are slower on GPU when the data size is small. For the function `nonzero()`, CPU is faster when the data has 10^4 elements, and GPU becomes faster when the data has more than 10^5 elements. I will implement a batch version of `generate_tiles()`, but even with 1000 per batch, it is still faster on CPU.
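The `nonzero()` crossover is easy to probe with a rough micro-benchmark. A sketch (function name and rep count are my own; exact timings depend on hardware):

```python
import time
import torch

def time_nonzero(n, device='cpu', reps=20):
    """Average wall time of nonzero() on a random 0/1 tensor of n elements.

    A rough micro-benchmark sketch, not a rigorous measurement.
    """
    x = (torch.rand(n, device=device) > 0.5).int()
    if device == 'cuda':
        torch.cuda.synchronize()   # finish pending kernels before timing
    start = time.perf_counter()
    for _ in range(reps):
        idx = x.nonzero()
    if device == 'cuda':
        torch.cuda.synchronize()   # wait for the async kernels to complete
    return (time.perf_counter() - start) / reps

small = time_nonzero(10_000)       # regime where CPU tends to win
if torch.cuda.is_available():
    large = time_nonzero(1_000_000, device='cuda')
```

The `synchronize()` calls matter: CUDA kernels launch asynchronously, so without them the timer only measures launch overhead.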
Timings for `mcts_nn` with `number=100`. Running mcts is about 3x slower using cuda tensors. GPU gives about 3x faster evaluation with the CNN; however, given the overhead of running the mcts, running on GPU is still slower. The ConvNet game is faster than TestNet because the mcts lines die earlier.
Network | CPU | CUDA |
---|---|---|
TestNet | 11.7 | 47.3 |
ConvNet | 9.77 | 26.4 |
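The per-call overhead that dominates here can be demonstrated even without a GPU: many tiny forward passes lose badly to one batched pass. A sketch using a stand-in network (not the repo's TestNet/ConvNet):

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-in model; like the repo's networks it takes 4x4 boards.
model = nn.Sequential(nn.Flatten(), nn.Linear(16, 4))
x = torch.rand(100, 4, 4)

with torch.no_grad():
    start = time.perf_counter()
    for i in range(100):
        model(x[i:i + 1])          # 100 calls, batch size 1
    looped = time.perf_counter() - start

    start = time.perf_counter()
    out = model(x)                 # 1 call, batch size 100
    batched = time.perf_counter() - start
```

On GPU the gap is even larger, since every call also pays a kernel-launch cost, which is why small-batch mcts evaluation does not benefit from CUDA.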
Timings for `play_nn`, which does not do mcts (it only plays 1 game). Even this is slower on GPU because of the slower `move` and `generate_tiles` functions. I should plan to optimize these functions in the future.
Network | CPU | CUDA |
---|---|---|
ConvNet | 1.13 | 1.27 |
Selfplay game generation is still too slow. However, timing tests suggest that the minimum batch size for a GPU to beat the CPU is around 200,000 games in parallel. It is very hard to reach these numbers when only searching 50 lines per move or 200 games per `mcts_nn`; I would need to run 1000 `mcts_nn` in parallel to get that benefit. For now, I am focusing on improving speed on the CPU.
I was able to use GPU-accelerated model prediction to speed up `mcts_nn`. See #14 and 64ebf88d303ec06a284a26a6d8bfb46d82b5b2e2 for details. GPU usage hovers around 5% for one process.
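The gist of batched prediction is to collect the boards from all live mcts lines and run one forward pass instead of calling the model per line. A sketch with a stand-in network (the repo's actual interface may differ):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(16, 4))  # hypothetical stand-in

def predict_lines(model, line_boards):
    """Score all live mcts lines with a single batched forward pass.

    line_boards: list of (4, 4) float tensors, one per live line.
    Returns (n, 4) move logits.
    """
    batch = torch.stack(line_boards)   # (n, 4, 4) batch from the live lines
    with torch.no_grad():
        return model(batch)

lines = [torch.rand(4, 4) for _ in range(50)]
logits = predict_lines(model, lines)
```

Only the forward pass moves to the GPU; the tree bookkeeping stays on CPU, consistent with the low (~5%) GPU utilization noted above.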
Using pytorch