I have added OpenCL batching support for LZGo in here: https://github.com/Ttl/leela-zero/tree/batching_test
It's missing batching of the innerproduct layer, which is not used in LZGo, but otherwise it's basically a complete implementation. It shouldn't be too hard to port this to lc0.
> I have added OpenCL batching support for LZGo in here: https://github.com/Ttl/leela-zero/tree/batching_test
> It's missing batching of the innerproduct layer, which is not used in LZGo, but otherwise it's basically a complete implementation. It shouldn't be too hard to port this to lc0.
Thank you for pointing this out to me.
Were you able to join the inner sgemm operations along the batch dimension in the conv3x3 Winograd algorithm? This is what yields the 2x performance gain. See "First step batched gemm" in https://ai.intel.com/winograd-2/ .
Batching the BLAS layer was basically a matter of reordering the dimensions of the M/V temporary matrices so that all the inner sgemms of the same tile, but for different batch entries, are contiguous in memory. Then you can join all the sgemm operations of a tile into one.
Did you do that? I am unsure from reading the code.
What speed-up do you achieve?
What do you refer to as "Proper batching support"?
> `//TODO: Proper batching support`
Why such a small batch size? The BLAS default batch size is 256; cuDNN's is 512.
> `#define MAX_BATCH 8`
I don't think the memory layout is like that in my code. I wasn't aware of that approach.
I don't have a computer with a proper GPU to test on right now. On my laptop the gain is zero because even a single batch can fully utilize the Intel integrated GPU (it only gets 20 nps). The smaller 64x5 network had a 30% performance increase. I would expect a computer with a proper GPU to show a larger gain.
Proper batching support means launching only one kernel instead of looping over the batch. It's not that critical since convolve1 is very fast compared to convolve3.
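A rough host-side sketch of the difference (a sketch only, using the OpenCL C++ wrapper with hypothetical kernel, queue and size names, not the actual LZGo code):

```cpp
// Hypothetical names throughout (m_queue, m_convolve1, board_squares, local_x/local_y).
// Looping version: one kernel launch per batch entry.
for (int b = 0; b < batch_size; b++) {
    // ... set per-entry buffer offsets as kernel arguments ...
    m_queue.enqueueNDRangeKernel(m_convolve1, cl::NullRange,
                                 cl::NDRange(channels, board_squares),
                                 cl::NDRange(local_x, local_y));
}

// "Proper" batching: a single launch whose global size covers the whole batch.
// The local (workgroup) size is unchanged; the kernel takes the batch index
// from get_global_id(2) and offsets into the input/output buffers itself.
m_queue.enqueueNDRangeKernel(m_convolve1, cl::NullRange,
                             cl::NDRange(channels, board_squares, batch_size),
                             cl::NDRange(local_x, local_y, 1));
```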
Is such a large batch size stronger at equal time compared with a smaller one? MCTS is most efficient with a batch size of one, and anything above that forces potentially worse moves into the batch. For the strongest play you want the smallest batch size that can still fully utilize the GPU. It's also good to keep in mind that a Go board has about 5.6 times more squares than a chess board, so every convolution is much bigger and a smaller batch size suffices.
I updated the code to increase the sgemm dimensions instead of adding batches of smaller multiplications, and also added a proper convolve1 batching implementation.
There is no performance difference from the previous code on this laptop. I should be able to test on a better GPU this weekend; I'll update the results here.
You did some weird column-major reordering during scatter. I just increased the `tiles` dimension to `tiles * batch_size` and kept the old order, since the OpenCL sgemm doesn't support anything else. Is that memory layout faster on CPU? The old order has the benefit that one of the matrices is transposed, so the sgemm can read both matrices in row order. Without the transpose one of the matrices has to be read in column order, which isn't as good for memory locality.
Profiling shows that the only operation that matters is the convolution3x3. This operation takes more than 95% of all the computation time.
It's not clear to me whether you joined the inner gemms or not. If you joined them, you are definitely entitled to a 2x speed-up. I will have a look at your code tomorrow or this weekend.
If you increased the `tiles` dimension, this means you did join the batch_size gemms into one sgemm. Actually, it's very simple: how many gemms do you have in the conv3x3? If you still have batch_size * tiles gemms, you can do better.
Here is what I did, in detail.
The initial, non-batched algorithm looks like this:
```
foreach batch in batches
    V := scatter(input, batch)
    foreach tile in tiles
        M[tile] := W[tile] x V[tile]
    output := gather(M, batch)
```
You can make M and V `batch_size` times larger and batch the `scatter` and `gather`:
```
V := scatter(input)
foreach batch in batches
    foreach tile in tiles
        M[batch][tile] := W[tile] x V[batch][tile]
output := gather(M)
```
With the proper reordering of M and V, the inner multiplications become contiguous along the columns of M and V. Note that W is fixed and does not change along the batch, so for each tile the batch_size inner matrix multiplications can be joined into one multiplication whose input/output columns are batch_size times larger. We reuse W[tile] across the whole batch, and this is what gives us the speed-up (see the sketch after the pseudocode below).
```
V := scatter(input)
foreach tile in tiles
    M[tile] := W[tile] x V[tile]
output := gather(M)
```
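To make the last step concrete, here is a minimal CPU/BLAS sketch of the join; the buffer names, layouts and dimensions are hypothetical (not the actual leela-zero code), with `W[t]` an outputs x channels matrix and everything treated as column-major:

```cpp
// Hypothetical dimensions: WTILES Winograd tile elements (16 for F(2x2, 3x3)),
// `tiles` spatial tiles per position, `outputs`/`channels` filter counts.

// Before the join: batch_size * WTILES small multiplications, with V laid out
// as [batch][wtile][channels][tiles]; W[t] is reloaded for every batch entry.
for (int b = 0; b < batch_size; b++) {
    for (int t = 0; t < WTILES; t++) {
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    outputs, tiles, channels, 1.0f,
                    &W[t * outputs * channels], outputs,
                    &V[(b * WTILES + t) * channels * tiles], channels, 0.0f,
                    &M[(b * WTILES + t) * outputs * tiles], outputs);
    }
}

// After the join: V is reordered so that, for a given wtile, the columns of all
// batch entries are contiguous. Each of the WTILES multiplications is now
// batch_size times wider and W[t] is reused for the whole batch -- that reuse
// is where the speed-up comes from.
for (int t = 0; t < WTILES; t++) {
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                outputs, batch_size * tiles, channels, 1.0f,
                &W[t * outputs * channels], outputs,
                &V[t * channels * batch_size * tiles], channels, 0.0f,
                &M[t * outputs * batch_size * tiles], outputs);
}
```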
I also left a comment about the method here: https://github.com/LeelaChessZero/lc0/pull/87/#issuecomment-397791984
> Profiling shows that the only operation that matters is the convolution3x3. This operation takes more than 95% of all the computation time.
I'm talking about the SGEMM performance. You changed `CblasRowMajor, CblasTrans, CblasNoTrans` to `CblasColMajor, CblasNoTrans, CblasNoTrans`. Is that faster on CPU? On GPU it would be slower due to the worse memory access pattern. I kept the original memory layout.
I first had 16 * batch_size SGEMMs of outputs x tiles x channels, which I have now changed to 16 SGEMMs of outputs x (tiles * batch_size) x channels, as you did. The total work is theoretically the same, but larger SGEMM dimensions should be more efficient in practice. I'm not sure where you get the 2x performance increase from. The approach with more batches doesn't need a larger W; its offset is taken mod 16 to reuse it.
A larger SGEMM should be especially faster for chess with its 8x8 board, which only has 16 tiles, and tuning the SGEMM kernel for larger dimensions should give an additional performance gain. Go has 100 tiles, but for practical reasons on the GPU the matrix has to be zero-padded to either 104 or 128 columns depending on the GPU. With a larger batch size the relative amount of padding is smaller, which should give a further performance gain.
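A small self-contained illustration of that padding effect; the 32-column SGEMM tile width below is just an assumed tuning parameter, since as noted the real value depends on the GPU:

```cpp
#include <cstdio>

// Pad the SGEMM column count up to a multiple of the kernel's tile width.
static int ceil_multiple(int n, int mult) { return ((n + mult - 1) / mult) * mult; }

int main() {
    const int tiles = 100;       // 19x19 board, one column per 2x2 output tile
    const int tile_width = 32;   // hypothetical SGEMM tuning parameter
    const int batches[] = {1, 8};
    for (int batch : batches) {
        const int n = tiles * batch;
        const int padded = ceil_multiple(n, tile_width);
        std::printf("batch %d: %d -> %d columns (%.1f%% padding)\n",
                    batch, n, padded, 100.0 * (padded - n) / padded);
    }
    // Prints roughly: batch 1: 100 -> 128 columns (21.9% padding)
    //                 batch 8: 800 -> 800 columns (0.0% padding)
}
```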
I changed both `CblasRowMajor` to `CblasColMajor` and `CblasTrans` to `CblasNoTrans` on the W matrix. Since these flags have opposite effects, the memory access layout is the same with respect to W. I did not test transposing W; I don't know whether that would be faster or slower on CPU.
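For reference, a sketch of the two calls being compared, with hypothetical pointers and dimensions; the point is that W is read from exactly the same memory locations in both, while V and M are interpreted transposed, which is what the dimension reordering described in the next paragraph provides:

```cpp
// Row-major call: W stored as [channels][outputs] and used transposed,
// so op(W)(o, c) reads W[c * outputs + o].
cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
            outputs, tiles, channels, 1.0f,
            W, outputs,
            V, tiles, 0.0f,
            M, tiles);

// Column-major call: the same W buffer viewed as [outputs][channels], not
// transposed, so W(o, c) reads W[c * outputs + o] -- the same locations.
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
            outputs, tiles, channels, 1.0f,
            W, outputs,
            V, channels, 0.0f,
            M, outputs);
```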
The initial dimensions of M and V are [batches][wtiles][channels][tiles]. I inverted dimensions 3 and 4, which causes their transposition in the multiplication. Then I inverted dimensions 1 and 2, giving the layout [wtiles][batches][tiles][channels], which enables merging the inner sgemms ([wtiles][batches x tiles][channels]).
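Spelled out as index arithmetic (an illustration with hypothetical helper names; B = batches, WT = wtiles, C = channels, T = tiles):

```cpp
#include <cstddef>

// Original layout [batches][wtiles][channels][tiles].
size_t offset_old(size_t b, size_t wt, size_t c, size_t t,
                  size_t WT, size_t C, size_t T) {
    return ((b * WT + wt) * C + c) * T + t;
}

// Reordered layout [wtiles][batches][tiles][channels]: for a fixed wtile, the
// (batch, tile) rows of all batch entries are contiguous, so the per-batch
// sgemms collapse into one multiplication over B * T rows of C channels each.
size_t offset_new(size_t b, size_t wt, size_t c, size_t t,
                  size_t B, size_t C, size_t T) {
    return ((wt * B + b) * T + t) * C + c;
}
```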
> I'm talking about the SGEMM performance.
You mentioned a "proper convolve1 batching implementation". Whilst intellectually satisfying, as I pointed out, this has no bearing on the overall performance.
> I first had 16 * batch_size SGEMMs of outputs x tiles x channels, which I have now changed to 16 SGEMMs of outputs x (tiles * batch_size) x channels, as you did. The total work is theoretically the same but larger SGEMM dimensions should be more efficient in practice. I'm not sure where you get the 2x performance increase from.
The total work is the same, but the gemm implementation can possibly reuse the W matrix across a whole batch, hence a possible speed-up. That said, it's perfectly possible that the GPU gets no speed-up; it really depends on the gemm implementation and on whether the limiting factor is memory bandwidth or FPU throughput. Even with BLAS, not all vendors will benefit equally from larger operations.
Some performance tests on a GTX 1080 Ti using a 19x19 board. Gains should be larger than this on lc0 with its smaller 8x8 board.
192x15 network:
Batch size | n/s |
---|---|
1 | 923 |
4 | 1060 |
8 | 1128 |
64x5 network:
Batch size | n/s |
---|---|
1 | 3883 |
4 | 9992 |
8 | 10160 |
20x128 network for 9x9 board. These results should be closer to chess:
Batch size | n/s |
---|---|
1 | 1403 |
4 | 4168 |
8 | 4688 |
16 | 4992 |
128 | 4992 |
I also tried tuning the SGEMM for a larger batch size. It didn't change the results for the 192x15 network, but with the SGEMM tuned for a larger batch the 64x5 network got 11920 n/s at a batch size of 8, and the 20x128 network for the 9x9 board got 5840 n/s at a batch size of 16 and 6520 n/s at a batch size of 128.
I would expect this to be about 4 times faster on lc0.
Very nice! I cannot wait to see that ported to lc0.
I am just worried about the tuning: does it change anything? Shouldn't you check the maximum batch_size against the device capabilities?
I am not sure the tuning code is all that robust, as there are some devices that are not supported by the OpenCL backend, possibly due to too large a local workgroup size. I am working on this, but it's not clear what to do about it.
batch_size only affects the global size, which is not limited by the device; the local size is constant. The maximum batch size is only limited by the device memory. Since we don't have to do backpropagation there is no need to store gradients, so memory use is very small and large batch sizes can be used even on devices with a small amount of memory.
Tuning for larger batch sizes mainly allows using a larger tile size in the SGEMM. There isn't really any point in using a 32-wide tile when calculating a 16-wide SGEMM, since half of the computation would be wasted. With a larger batch size it's possible to use a larger tile, which can give some performance gains.
I'm not aware of any issues with insufficient local size. The tuner checks that the kernel can be compiled and that it gives the correct output, so it should find the best parameters even if the device has some limitations on the local size.
That's very valuable information, thank you. It looks like I am destined to port this to lc0, unless someone else volunteers. But it will take me a while, that's for sure. I will keep you informed and will probably get back to you if I get stuck.
I implemented F(4x4, 3x3) Winograd transformations for LZGo (https://github.com/gcp/leela-zero/pull/1643). You might want to check them out since they seem to give a very nice performance increase.
Sorry for the delay and many thanks for the hints! Be sure that your outstanding work on LZGo is followed closely by the lc0 community! I will definitely look into your development and probably try to port it to our chess implementation of Leela. @borg323 was also very interested in examining your patch more closely.
@frpays should this be closed?
Implemented by https://github.com/LeelaChessZero/lc0/pull/191
Use BLAS Batch computation as a model.