GPU accelerated batch MCTS

fqjin / 2048NN

Train a neural network to play 2048

GNU General Public License v3.0

1 stars 1 forks source link

GPU accelerated batch MCTS #10

Closed fqjin closed 5 years ago

fqjin commented 5 years ago

Using pytorch

fqjin commented 5 years ago

Batch moving successfully implemented in cfc92db4f4fa7dcb713d3f44cb711dcaf91247c5. Speed comparisons coming...

fqjin commented 5 years ago

Speed comparisons (in seconds) for various number of games. Result for 1 game is the average of 10 games. Standard deviation in parens.

Batching is slower for a single game, equal at about 10, and significantly faster at 100 games in parallel.
No batch GPU is much slower than on CPU for all cases. In other testing, I found that tensors need to have on the order of 10^6 elements for GPU acceleration to start beating CPU. May be worth examining contribution of play_fixed_batch vs move_batch vs merge_row_batch.
CPU: Intel Xeon E5-1620 @ 3.60 GHz
GPU: GeForce RTX 2070 @ 645 MHz, CUDA 10.1

CPU

n	nobatch	batch
1	0.189 (0.073)	0.468 (0.139)
10	1.70	1.25
100	16.5	3.27
1000	x	17.6

CUDA

n	nobatch	batch
1	0.626 (0.244)	1.14 (0.308)
10	6.19	2.95
100	66.2	9.47
1000	x	63.9

fqjin commented 5 years ago

At 10000 games with 3 addition randomly generated tiles, merge_row_batch is about the same between GPU and CPU, and move_batch is 3 times slower for GPU. The code in move_batch involves iterating and appending lists, and flipping torch tensors.

fqjin commented 5 years ago

Functions are slower on GPU when data size is small. For the function nonzero(), CPU is faster when data has 10^4 elements, and GPU becomes faster when data has more than 10^5 elements. I will implement a batch version of generate_tiles(), but even with 1000 per batch, it is still faster on CPU.

fqjin commented 5 years ago

Timings for mcts_nn with number=100. Running mcts is about 3x slower using cuda tensors. GPU gives about 3x faster evaluation with the CNN. However, given the overhead of running the mcts, running on GPU is still slower. ConvNet game is faster than TestNet because the mcts lines die earlier.

Network	CPU	CUDA
TestNet	11.7	47.3
ConvNet	9.77	26.4

Timings for 'play_nn' which does not do mcts (it only plays 1 game). Even this is slower due to slower move and generate_tiles functions on GPU. I should plan to optimize these functions in the future.

Network	CPU	CUDA
ConvNet	1.13	1.27

fqjin commented 5 years ago

Selfplay game generation is still too slow. However, timing tests suggest that the minimum size for a GPU to be better than CPU is 200,000 batches in parallel. It is very hard for me to reach these numbers when only searching 50 lines per move or 200 games per mcts_nn. I would need to run 1000 mcts_nn in parallel to get that benefit. For now, I am focusing on improving speed on the CPU.

fqjin commented 5 years ago

I was able to use GPU-accelerated model prediction to speed up mcts_nn. See #14 and 64ebf88d303ec06a284a26a6d8bfb46d82b5b2e2 for details. GPU usage hovers around 5% for one process.