Open · leejet opened this issue 1 year ago
I am not sure if you are already doing this, but the CUDA backend currently requires a lot of manual changes to move the tensors to VRAM. The only example of how to do this currently, AFAIK, is in llama.cpp. Also keep in mind that these operations are in some cases asynchronous, so you cannot really measure their timings this way. You can use a tool such as Nsight Systems instead.
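To illustrate the two points above, here is a minimal sketch of uploading a tensor's data to VRAM once and timing a kernel with CUDA events rather than host-side clocks (which would only measure the asynchronous launch, not the GPU work). This uses only the standard CUDA runtime API; `host_data`, `n_bytes`, and the commented-out `launch_kernel` are placeholders, not ggml or llama.cpp APIs:

```cpp
// Sketch only: one-time upload to VRAM plus event-based GPU timing.
// Requires the CUDA toolkit; placeholders are marked in comments.
#include <cuda_runtime.h>
#include <cstdio>

void profile_upload_and_kernel(const void * host_data, size_t n_bytes) {
    void * dev_data = nullptr;
    cudaMalloc(&dev_data, n_bytes);                  // allocate VRAM
    cudaMemcpy(dev_data, host_data, n_bytes,
               cudaMemcpyHostToDevice);              // one-time upload

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // launch_kernel<<<grid, block>>>(dev_data);     // async launch (placeholder)
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                      // wait for the GPU to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);          // timed on the GPU, not the host
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev_data);
}
```

Wrapping a kernel launch in host-side timers without a synchronization point would report near-zero time, which is why events (or Nsight Systems) are needed here.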
Yes, I have copied the necessary tensors to VRAM. I did overlook that some CUDA operations are asynchronous; I will reprofile using Nsight Systems.
I performed a simple profile of ggml_cuda_op and found that the time spent on memory copies is several times the computation time. Because not all operators have CUDA implementations, data is frequently copied back and forth between the GPU and CPU during computation, which consumes a lot of time. Here's the data: