LeelaChessZero / lc0

The rewritten engine, originally for TensorFlow. Now all other backends have been ported here.
GNU General Public License v3.0

Make lc0 work efficiently on multi-core CPU (for NN inference) #35

Closed. mooskagh closed this issue 4 years ago.

mooskagh commented 6 years ago

That's planned to hopefully be done before TCEC. This is a tracking issue so that I don't forget it.

frpays commented 6 years ago

What are your ideas on this subject? Have you started anything?

mooskagh commented 6 years ago

My idea is the opposite of network_mux (namely, network_demux). It runs multiple worker threads, and when a batch compute request comes in, it splits the large batch into N smaller batches to compute in parallel. I haven't done anything for that yet.
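
For illustration, a minimal C++ sketch of that demux idea (all names hypothetical, not the actual lc0 classes): one large batch is sliced into N chunks, and each chunk is computed by its own worker in parallel.

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Assumes each backend exposes Compute(offset, count) over a shared input
// buffer; the Backend type and that method are illustrative only.
template <typename Backend>
void DemuxCompute(std::vector<Backend>& backends, int batch_size) {
  const int n = static_cast<int>(backends.size());
  const int chunk = (batch_size + n - 1) / n;  // ceil(batch_size / n)
  std::vector<std::future<void>> jobs;
  for (int i = 0; i < n; ++i) {
    const int offset = i * chunk;
    const int count = std::min(chunk, batch_size - offset);
    if (count <= 0) break;
    // Compute each sub-batch on its own thread.
    jobs.push_back(std::async(std::launch::async, [&backends, i, offset, count] {
      backends[i].Compute(offset, count);
    }));
  }
  for (auto& job : jobs) job.get();  // block until every slice is done
}
```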

In the future I'm thinking about a smart network_dispatcher which takes multiple backends (CPU, opencl, cudnn), collects stats about them dynamically, and tries to distribute work between them to maximize total throughput, but I doubt that will happen before TCEC.
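
A rough sketch of how such a dispatcher might split work (purely hypothetical; none of these names exist in lc0): keep a smoothed throughput estimate per backend and size each backend's share of the next batch proportionally to it.

```cpp
#include <vector>

// Moving-average throughput tracker for one backend (illustrative).
struct BackendStats {
  double throughput = 1.0;  // positions per second

  void Update(double positions, double seconds) {
    throughput = 0.9 * throughput + 0.1 * (positions / seconds);
  }
};

// Split batch_size into per-backend shares proportional to throughput.
std::vector<int> ShareBatch(const std::vector<BackendStats>& stats,
                            int batch_size) {
  if (stats.empty()) return {};
  double total = 0;
  for (const auto& s : stats) total += s.throughput;
  std::vector<int> shares;
  int assigned = 0;
  for (const auto& s : stats) {
    const int share = static_cast<int>(batch_size * s.throughput / total);
    shares.push_back(share);
    assigned += share;
  }
  shares.back() += batch_size - assigned;  // hand the remainder to the last
  return shares;
}
```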

frpays commented 6 years ago

Addressing only the TCEC deadline, what about a multi-threaded CPU-only backend? Like network_blas, but computing the batch in parallel with N predefined/pre-started threads. With 44 cores, maybe use the mux over 4-6 backends of 10 threads each. It's not clear to me how to maximize CPU utilization in such a setup.

mooskagh commented 6 years ago

It's cleaner to have a separate demuxing layer. Also, the current muxer is not very efficient (it's often better to wait a few ms to gather a larger batch), so it will have to be removed. Also, any solution with more than 2-3 MCTS threads working on the same game is bad, as they usually cannot distribute work properly, which degrades performance.
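
The "wait a few ms to gather a larger batch" point could be implemented roughly like this (an illustrative sketch, not the actual muxer): a collector that releases a batch either when it is full or when a short timeout expires.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <vector>

// Gathers single evaluations into batches; items are ints for simplicity.
class BatchCollector {
 public:
  BatchCollector(size_t max_batch, std::chrono::milliseconds timeout)
      : max_batch_(max_batch), timeout_(timeout) {}

  // Search threads enqueue one position at a time.
  void Add(int item) {
    std::lock_guard<std::mutex> lock(mu_);
    pending_.push_back(item);
    if (pending_.size() >= max_batch_) cv_.notify_one();
  }

  // The compute thread waits up to timeout_ for a fuller batch, then
  // takes whatever has accumulated so far.
  std::vector<int> Take() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait_for(lock, timeout_,
                 [this] { return pending_.size() >= max_batch_; });
    std::vector<int> batch;
    batch.swap(pending_);
    return batch;
  }

 private:
  size_t max_batch_;
  std::chrono::milliseconds timeout_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<int> pending_;
};
```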

frpays commented 6 years ago

So, addressing the TCEC deadline again (when, July?): could a short-term solution be a sort of network demux sitting over a large thread pool, using another network (blas) to run computations as the inputs come in with batch size=1?

mooskagh commented 6 years ago

I don't know when TCEC will happen. Yes, the solution would be to have a demux network with a fixed pool of worker threads that computes each request by splitting it into smaller batches which are computed in parallel. For example, CPU+blas backends with batch size=1; it can also be used in the multi-GPU case, where a huge batch of size, for example, 512 could be split into 8 smaller batches of 64.
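
In terms of the DemuxCompute sketch above (types still hypothetical), that multi-GPU case would look like:

```cpp
// 8 worker backends; a 512-position batch is split into slices of 64 each.
std::vector<SomeBackend> backends(8);
DemuxCompute(backends, /*batch_size=*/512);
```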

frpays commented 6 years ago

I made some tests. Apparently it's pointless to parallelise at batch size=1. It looks like the only way ahead is implementing batch computations (on blas); then, I suppose, we can take advantage of multiple cores.
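
To illustrate why batching is the way ahead for BLAS (a sketch, not lc0's actual code): instead of issuing one matrix-vector product (GEMV) per position, stack the batch into a matrix and issue a single GEMM, which libraries like OpenBLAS and MKL block and thread far more effectively.

```cpp
#include <cblas.h>
#include <vector>

// One fully connected layer over a whole batch:
// output[batch x out_dim] = input[batch x in_dim] * weights^T.
void ForwardBatch(const std::vector<float>& input,    // batch * in_dim
                  const std::vector<float>& weights,  // out_dim * in_dim
                  std::vector<float>& output,         // batch * out_dim
                  int batch, int in_dim, int out_dim) {
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
              batch, out_dim, in_dim,
              1.0f, input.data(), in_dim,
              weights.data(), in_dim,
              0.0f, output.data(), out_dim);
}
```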

frpays commented 6 years ago

I was able to improve network blas (MKL) by only 40% using batching, which is quite disappointing. Apparently, for batched BLAS, the optimal parameters are around 4 cores and batch_size 32 (that gets only 140 nps on my Linux box). Pretty grim.

frpays commented 6 years ago

I am happy to report a breakthrough in batching the computations for BLAS.

I now get:

  • x2.7 speed-up, from 55 nps to 155 nps with OpenBLAS (Linux, 6 cores, X5675 @ 3.07GHz)
  • x2.4 speed-up, from 76 nps to 185 nps with Apple VecLib/Accelerate on my MacBook Pro.
  • x1.4 speed-up, from 114 nps to 161 nps with MKL (Linux, 6 cores, X5675 @ 3.07GHz)

Give me a few days to get back to you with a full benchmark and code.

jjoshua2 commented 6 years ago

Is that compared to old lc0 or lczero? Both would be useful numbers for reference.

frpays commented 6 years ago

That's compared to current lc0. To be honest, it's been a long time since I last ran lczero.

frpays commented 6 years ago

Development is complete. Please test and review batch computation for BLAS: https://github.com/LeelaChessZero/lc0/pull/87

mooskagh commented 5 years ago

I think demux + blas can already be good for that. Does anyone have a multi-core CPU to test?

frpays commented 5 years ago

I have 12 real cores. What's the test procedure?

mooskagh commented 5 years ago

Something like --backend=demux --backend-opts=backend=blas,a,b,c,d,e,f,g,h. It will create 8 blas backends and split batches between them, so the default batch size of 256 should be fine (each backend then gets sub-batches of 32).

frpays commented 5 years ago

With network ID11248 and go nodes 20000:

1 blas backend, for comparison:

./lc0 --backend=blas
Creating backend [blas]...
BLAS, maximum batch size set to 256
BLAS vendor: MKL.
(...)
info depth 6 seldepth 13 time 46756 nodes 958 score cp 23 hashfull 5 nps 20 tbhits 0 pv g1f3 d7d5 d2d4 g8f6 c1f4 c8f5 c2c4 e7e6 b1c3 f8b4 e2e3 e8g8 f1e2
(...)
info depth 7 seldepth 26 time 159295 nodes 5097 score cp 23 hashfull 20 nps 31 tbhits 0 pv g1f3 d7d5 d2d4 g8f6 c1f4 c8f5 c2c4 e7e6 b1c3 f8b4 e2e3 e8g8 f1e2 f6e4 d1b3 b8c6 c4d5 e6d5 h2h3 b4c3
(...)
info depth 8 seldepth 33 time 517611 nodes 20007 score cp 24 hashfull 70 nps 38 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 b1d2 d7d5 f1g2 e8g8 g1f3 d5c4 a2a3 b4d2 c1d2 b8c6 e2e3 a8b8 d1c2

With --backend=demux --backend-opts=backend=blas,a,b,c,d,e,f,g,h:

info depth 6 seldepth 15 time 11675 nodes 983 score cp 24 hashfull 5 nps 84 tbhits 0 pv g1f3 d7d5 d2d4 g8f6 c1f4 c8f5 c2c4 e7e6 b1c3 f8b4 e2e3 e8g8 f1e2
(...)
info depth 7 seldepth 25 time 41152 nodes 4940 score cp 24 hashfull 19 nps 120 tbhits 0 pv g1f3 d7d5 d2d4 g8f6 c1f4 c8f5 c2c4 e7e6 b1c3 f8b4 e2e3 e8g8 f1e2 f6e4 d1b3 b8c6 c4d5 e6d5 h2h3 b4c3
(...)
info depth 8 seldepth 33 time 136825 nodes 20005 score cp 24 hashfull 69 nps 146 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 b1d2 d7d5 f1g2 e8g8 g1f3 d5c4 a2a3 b4d2 c1d2 b8c6 e2e3 b7b5 b2b3 c4b3

That's x3.8 for 8 cores (from 38 nps to 146 nps).

oscardssmith commented 5 years ago

Should we close this now?