ggerganov / ggml

Tensor library for machine learning
MIT License

Is there a reason why backend couldn't be selected at runtime? #891

Open Please-just-dont opened 1 month ago

Please-just-dont commented 1 month ago

We select the backend at build time by choosing CUDA, Vulkan, SYCL, etc. Wouldn't it be better if you built with all the backends you want to support and then selected the backend at runtime? It's literally just one runtime if statement, and it would make it much easier to compare the performance of the different backends.

slaren commented 1 month ago

Backends often need to link to a shared library that may not be available on systems without the supported hardware drivers installed, e.g. you can't run the CUDA backend on systems without the CUDA driver. In the future I would like to move the backends to dynamic libraries that can be loaded at runtime, but that's a more complex change than an if statement.
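
For illustration only, a rough sketch of that dynamic-library idea using POSIX dlopen. The shared-library name and the assumption that the backend's init function is exported this way are hypothetical; this is not the current ggml API:

```c
#include <dlfcn.h>
#include <stdio.h>
#include "ggml-backend.h"

// Signature matching ggml_backend_cuda_init(int device).
typedef ggml_backend_t (*backend_init_fn)(int device);

static ggml_backend_t try_load_cuda_backend(void) {
    // Load the backend only if its shared library (and therefore the CUDA
    // runtime it links against) is actually present on this machine.
    void * handle = dlopen("libggml-cuda.so", RTLD_NOW | RTLD_LOCAL); // hypothetical library name
    if (handle == NULL) {
        fprintf(stderr, "CUDA backend unavailable: %s\n", dlerror());
        return NULL; // caller falls back to the CPU backend
    }
    backend_init_fn init = (backend_init_fn) dlsym(handle, "ggml_backend_cuda_init");
    if (init == NULL) {
        return NULL;
    }
    return init(/* device */ 0);
}
```

The point being that the main binary never links against the CUDA libraries directly, so it still runs on machines where the driver is missing.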

Please-just-dont commented 1 month ago

You can easily have the host-side CPU inference path behind an if statement, right? It would be really convenient to switch between them and see the performance difference. For example, I found my Vulkan implementation performs about the same as my CPU with 4 threads.
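
As a sketch of that if statement, assuming a build that links both the CPU and CUDA backends and that ggml_backend_cpu_init/ggml_backend_cuda_init are declared as in the headers at the time of writing (the environment variable name is made up):

```c
#include <stdlib.h>
#include <string.h>
#include "ggml-backend.h"
#include "ggml-cuda.h"

// Pick a backend at runtime instead of at build time.
static ggml_backend_t pick_backend(void) {
    const char * choice = getenv("GGML_BACKEND"); // hypothetical env var
    if (choice != NULL && strcmp(choice, "cuda") == 0) {
        ggml_backend_t gpu = ggml_backend_cuda_init(0 /* device */);
        if (gpu != NULL) {
            return gpu;
        }
    }
    return ggml_backend_cpu_init(); // default: CPU
}
```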

ngxson commented 1 month ago

Switching backends at runtime requires building all backends in the first place, which is complicated to set up, takes a lot of time, and produces a large binary. For the same reason, PyTorch offers different packages for CUDA/CPU/ROCm.

Out of the box, ggml comes with CPU + a backend of your choice. The ggml_backend_sched interface can be used to run the CPU and the other backend together at the same time. Furthermore, the RPC backend allows you to build one ggml "client" for each backend and use sched to mix and match them. IMO that's already a lot of flexibility.
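
A minimal sketch of that sched setup, assuming a build with the CUDA backend and the ggml_backend_sched_new signature from ggml-backend.h at the time of writing; graph construction is omitted:

```c
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

int main(void) {
    // Order matters: sched prefers earlier backends, the CPU backend is the fallback.
    ggml_backend_t backends[2];
    int n_backends = 0;

    ggml_backend_t cuda = ggml_backend_cuda_init(0 /* device */);
    if (cuda != NULL) {
        backends[n_backends++] = cuda;
    }
    backends[n_backends++] = ggml_backend_cpu_init();

    // NULL buffer types -> each backend uses its default buffer type.
    ggml_backend_sched_t sched = ggml_backend_sched_new(
        backends, NULL, n_backends, /* graph_size */ 2048, /* parallel */ false);

    // ... build a ggml_cgraph as usual, then:
    // ggml_backend_sched_graph_compute(sched, graph);

    ggml_backend_sched_free(sched);
    for (int i = 0; i < n_backends; i++) {
        ggml_backend_free(backends[i]);
    }
    return 0;
}
```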

slaren commented 1 month ago

There is nothing stopping you from building ggml with multiple backends and using all of them with ggml_backend_sched (other than maybe a broken build script). It's just not practical at the moment because, for some backends, the resulting binary will fail to run on computers without the corresponding drivers installed.

WilliamTambellini commented 1 month ago

+1 for that feature, at least an easy way to choose at runtime between the CUDA and CPU backends. It still doesn't seem to be doable as of today with llama-cli. Could you point to the API to call in order to use the CPU backend when building with the ggml-cuda lib? Best