LeelaChessZero / lc0

The rewritten engine, originally for tensorflow. Now all other backends have been ported here.
GNU General Public License v3.0

Simplify setup for multi-GPU systems #1038

Open borg323 opened 4 years ago

borg323 commented 4 years ago

Now that the cudnn-auto PR (https://github.com/LeelaChessZero/lc0/pull/1026) is merged, we can start thinking about simplifying the use of multi-GPU configurations, as mentioned in https://github.com/LeelaChessZero/lc0/issues/1014#issuecomment-559570441.

My suggestion is to add a --gpu option that works as follows:

Comments?

mooskagh commented 4 years ago

What will happen if neither --gpu nor --backend (nor --backend-opts) is specified? Will it be able to automatically detect the number of GPUs?

borg323 commented 4 years ago

I suppose we can add GetNumberOfGPUs() to Network, but how are we going to use it? Maybe with something like --gpu=all? I would prefer to keep the current behavior when --gpu is not specified.

mooskagh commented 4 years ago

Yeah, how that would look in the code is what I wanted to ask next. For now it's just a question of whether it would be a useful feature, or whether people with multiple GPUs should know what they are doing. Especially as it is very common for laptops to have a discrete GPU plus a built-in GPU, and in that case it's a question of whether we should use the built-in one.

From an implementation perspective, having the user specify --gpu does look like the best option indeed.

borg323 commented 4 years ago

For CUDA GPUs, I suppose it would be OK to detect them all. For OpenCL I would make GetNumberOfGPUs() return 1 if at least one GPU is detected or 0 if it is CPU only.

mooskagh commented 4 years ago

It's not that easy to have GetNumberOfGPUs(), because in order to call it you need to have a backend object created. (Also, I'm not entirely convinced that it's something the backend class should have; e.g. for random/multiplexing/CPU/network/TPU/whatever it doesn't have any meaning.) Maybe adding it to the GetNetworkCapabilities() struct would be fine, but then again, we need an object.

So it seems to me that it should be something external that does that.

Or maybe a static function which would return a vector of recommended [(priority; backend name, backend-opts); ...]. Then the auto backend would call all of them, pick the backends of highest priority (several if they are equal), and create them through multiplexing... Meh, sounds too complicated.
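To make the "pick backends of highest priority" idea concrete, here is a rough sketch of the selection step only. The BackendRecommendation struct and PickTopPriority() are hypothetical names invented for illustration; nothing like them is confirmed to exist in lc0, and how each backend would produce its recommendations is left open.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical record (not an lc0 type): one suggestion that a backend's
// static function might return.
struct BackendRecommendation {
  int priority;              // Higher means more preferred.
  std::string backend;       // Backend name.
  std::string backend_opts;  // Options for that backend, e.g. "gpu=0".
};

// Hypothetical helper: keeps only the recommendations that share the highest
// priority, preserving their original order. Returns an empty vector when
// there are no recommendations at all.
std::vector<BackendRecommendation> PickTopPriority(
    const std::vector<BackendRecommendation>& all) {
  std::vector<BackendRecommendation> best;
  if (all.empty()) return best;
  const int top =
      std::max_element(all.begin(), all.end(),
                       [](const BackendRecommendation& a,
                          const BackendRecommendation& b) {
                         return a.priority < b.priority;
                       })
          ->priority;
  std::copy_if(all.begin(), all.end(), std::back_inserter(best),
               [top](const BackendRecommendation& r) {
                 return r.priority == top;
               });
  return best;
}
```

If the result holds more than one entry, the entries would then be combined through multiplexing, as described above; a single entry could be created directly.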

borg323 commented 4 years ago

In my proposal, the backend to use is already known before creating the network, so I think this can be simplified considerably. But better get it working with manual GPU selection first. I think ./lc0 --gpu=0,1 is a big improvement over ./lc0 --backend=multiplexing --backend-opts="(gpu=0),(gpu=1)" --threads=3.
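As a rough illustration of the translation between the two command lines above, the sketch below turns a --gpu value such as "0,1" into the equivalent multiplexing --backend-opts string. ParseGpuList() and BuildMultiplexingOpts() are hypothetical helpers, not lc0 functions, and the sketch deliberately says nothing about how --threads or the single-GPU case would be handled.

```cpp
#include <cstddef>
#include <stdexcept>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of lc0): parses the value of the proposed
// --gpu option, e.g. "0,1", into a list of GPU indices.
// Throws std::invalid_argument on empty, negative or non-numeric entries.
std::vector<int> ParseGpuList(const std::string& value) {
  std::vector<int> gpus;
  std::istringstream stream(value);
  std::string item;
  while (std::getline(stream, item, ',')) {
    std::size_t pos = 0;
    const int id = std::stoi(item, &pos);  // Throws if item is not a number.
    if (pos != item.size() || id < 0) {
      throw std::invalid_argument("Bad GPU id: " + item);
    }
    gpus.push_back(id);
  }
  if (gpus.empty()) throw std::invalid_argument("No GPU ids given");
  return gpus;
}

// Hypothetical helper: builds the options string for the multiplexing
// backend, e.g. {0, 1} becomes "(gpu=0),(gpu=1)".
std::string BuildMultiplexingOpts(const std::vector<int>& gpus) {
  std::string opts;
  for (const int id : gpus) {
    if (!opts.empty()) opts += ',';
    opts += "(gpu=" + std::to_string(id) + ")";
  }
  return opts;
}
```

With these helpers, BuildMultiplexingOpts(ParseGpuList("0,1")) yields "(gpu=0),(gpu=1)", the same string as in the long command line.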

mooskagh commented 4 years ago

So, no automatic detection of the number of GPUs. Sounds good.

Maybe go even further and have an integer parameter "number of GPUs to use"? That way it's clear in a GUI what that means. Yeah, that's a bit controversial.

Naphthalin commented 4 years ago

> Maybe go even further and have an integer parameter "number of GPUs to use"? That way it's clear in a GUI what that means

I would say that this is the ideal solution and should be picked over trying to use all GPUs. If the user is supposed to specify the number of GPUs, there is no need for an easy-to-use diagnostic showing how many GPUs are actually in use without running lc0 in a terminal, which makes for an easier user experience (and I assume that anyone with multiple GPUs will know that they have multiple they want to use).