getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Help needed in understanding numpy <=> GPU communication #273

Closed: gabrielfougeron closed this issue 1 year ago

gabrielfougeron commented 1 year ago

Hi,

I am struggling to understand the example code available at https://www.kernel-operations.io/keops/_auto_tutorials/backends/plot_scipy.html.

On the one hand, I see only numpy arrays being defined. Torch is not even imported. On the other hand, I can tell that the computation is being performed on the GPU (as nvidia-smi attests).

How is it possible? When are the memory transfers happening? How can I get more fine-grained control over this?

bcharlier commented 1 year ago

Hi @gabrielfougeron ,

I am struggling to understand the example code available at https://www.kernel-operations.io/keops/_auto_tutorials/backends/plot_scipy.html.

On the one hand, I see only numpy arrays being defined. Torch is not even imported.

KeOps can indeed be used with NumPy arrays alone (i.e. we do not rely on PyTorch to perform computations on the GPU).

On the other hand, I can tell that the computation is being performed on the GPU (as nvidia-smi attests).

That's good news :)

How is it possible?

No black magic. A LazyTensor is just a wrapper around a tensor-like object: it may wrap a NumPy array or a torch tensor interchangeably. It could even wrap a TensorFlow tensor structure (but this is not implemented).
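To make the "wrapper that defers evaluation" idea concrete, here is a toy sketch in plain NumPy. This is an illustration of the general lazy-evaluation pattern, not KeOps internals: the `ToyLazy` class and its methods are invented for this example.

```python
import numpy as np

# Toy illustration of the lazy-wrapper idea (NOT KeOps internals):
# operations are recorded symbolically and nothing is computed until
# a reduction such as .sum() is requested, which is also the point
# where KeOps would move data to the GPU.
class ToyLazy:
    def __init__(self, thunk):
        # `thunk` is a zero-argument callable producing the real array.
        self._thunk = thunk

    @classmethod
    def wrap(cls, array):
        a = np.asarray(array)  # accepts any array-like backend object
        return cls(lambda: a)

    def __mul__(self, other):
        # Build a new symbolic node; no arithmetic happens here.
        return ToyLazy(lambda: self._thunk() * other._thunk())

    def sum(self):
        # The reduction triggers the actual evaluation.
        return float(self._thunk().sum())

a = ToyLazy.wrap(np.arange(4.0))   # [0., 1., 2., 3.]
b = ToyLazy.wrap(np.full(4, 2.0))  # [2., 2., 2., 2.]
prod = a * b                       # still symbolic, no arithmetic done
result = prod.sum()                # evaluation happens only here
```

Because `wrap` only needs something NumPy-convertible (or, in KeOps, something exposing a data pointer), the same symbolic machinery works regardless of which backend holds the data.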

When are the memory transfers happening?

At the very last moment: the memory transfers between "CPU memory" and "GPU memory" (i.e. cudaMemcpy) are triggered when a large reduction is performed on a LazyTensor. For instance, in this command

D = K @ np.ones(N, dtype=dtype)  # Sum along the lines of the adjacency matrix

or when eigsh calls the aslinearoperator wrapper of K.
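For reference, both reductions mentioned above can be reproduced with a small dense, CPU-only sketch. The Gaussian kernel formula and the sizes here are illustrative choices; the tutorial builds K as a symbolic KeOps LazyTensor rather than a dense array, but the mathematics of the two reductions is the same.

```python
import numpy as np
from scipy.sparse.linalg import aslinearoperator, eigsh

# Build a small dense Gaussian adjacency matrix K from random 2D points.
rng = np.random.default_rng(0)
N, sigma = 50, 0.5
x = rng.random((N, 2))
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))  # symmetric, positive entries

# Reduction 1: the row-sum "degree" vector from the snippet above.
D = K @ np.ones(N)

# Reduction 2: eigsh only ever calls matvec on the LinearOperator,
# so with KeOps each matvec is where the GPU kernel is launched.
K_op = aslinearoperator(K)
vals, _ = eigsh(K_op, k=1, which="LA")  # largest eigenvalue of K
```

The key point is that eigsh never materializes the operator: it repeatedly applies matvec, and each application is one LazyTensor reduction (hence one GPU launch, with the CPU-to-GPU transfers it implies).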

How can I get more fine-grained control over this?

We worked hard so that the user does not have to care about this... Maybe this point could be improved, though...

For instance, when using a torch.Tensor already stored in GPU memory, no copy is needed (we just pass the pointer to the tensor data). In the case of a NumPy array, the data lives in CPU memory, so I think a copy is made each time a reduction is performed. @joanglaunes, can you confirm that?
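The pointer-vs-copy distinction above can be illustrated with plain NumPy (an analogy, not KeOps code): passing a buffer along unchanged reuses the same memory, while a conversion allocates and copies, which is roughly what happens when a CPU-resident array must be shipped to the GPU on each reduction.

```python
import numpy as np

a = np.ones(1024, dtype=np.float32)

# Zero-copy: np.asarray on a matching array returns the same buffer,
# analogous to KeOps reusing the data pointer of a GPU torch tensor.
view = np.asarray(a)

# Conversion: a dtype change forces a fresh allocation and a copy,
# analogous to the cudaMemcpy needed for a CPU-resident NumPy array.
copy = a.astype(np.float64)
```

Checking `np.shares_memory` on the two results confirms that only the conversion paid for a copy.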

gabrielfougeron commented 1 year ago

Thank you very much @bcharlier for your detailed answer. It challenged my preconceived idea of what happens under the hood when pykeops is running.