abergeron opened this issue 7 years ago
For the CUDA device, it seems wrapping the Thrust library is desirable (PyTorch does it). However, Thrust is a templated, header-only C++ library, and I'm not sure how to integrate that into a pure C codebase.
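For what it's worth, the usual way to hide a templated C++ library behind a pure C codebase (and roughly what PyTorch's THC backend does) is a thin C++ translation unit that instantiates the templates and exposes an extern "C" entry point. A minimal sketch of the pattern, with std::sort standing in for thrust::sort since the sort call itself is incidental (the function name is hypothetical):

```cpp
// bridge.cpp -- a thin C++ translation unit that instantiates the
// templates and exposes a plain C ABI.  std::sort stands in for
// thrust::sort here; with Thrust this file would be a .cu compiled
// by nvcc, but the extern "C" pattern is identical.
#include <algorithm>
#include <cassert>
#include <cstddef>

extern "C" void sort_float_array(float *data, size_t n) {
    // All template instantiation happens on the C++ side; the C
    // caller only ever sees the unmangled symbol below.
    std::sort(data, data + n);
}
```

The C side then only needs a matching prototype, `void sort_float_array(float *, size_t);`, and never includes any C++ headers.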
Good question.
Is the C++ template only used inside the kernel, or also outside?
If it is only inside, since it is the driver that compiles the kernel code, maybe we will be fine.
Did some research; short answer: in-kernel sort support is very limited and not suitable for most cases.
Inside a kernel, Thrust only supports sorting sequentially within a single CUDA thread (the sequential execution policy), or launching another kernel via dynamic parallelism.
As we ideally want to support reduction on tensors of arbitrary ndim and along arbitrary axes, I don't think we can use Thrust for it.
Did you look into CUB (http://nvlabs.github.io/cub/)? It would allow us to write the kernel ourselves and have it do the work inside the kernel.
I think CUB is a big dependency. Is it C or C++?
Related thread: https://devtalk.nvidia.com/default/topic/920842/calling-thrust-sort_by_key-from-kernel/
SO thread: http://stackoverflow.com/questions/5510715/thrust-inside-user-written-kernels
PyTorch implementation: https://github.com/pytorch/pytorch/blob/master/torch/lib/THC/generic/THCTensorSort.cu
There are also hemi and hemi2, which could be interesting:
https://devblogs.nvidia.com/parallelforall/developing-portable-cuda-cc-code-hemi/
https://devblogs.nvidia.com/parallelforall/simple-portable-parallel-c-hemi-2/
Took a close look. Hemi is more a wrapper around the CUDA language than a library of pre-built algorithms.
CUB looks more promising. Although it is also templated C++, it has an in-kernel block radix sort. To reduce the dependency, we could try to rewrite that ourselves.
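For reference, CUB's in-kernel primitive looks roughly like this (a sketch of the cub::BlockRadixSort API from its documentation; block and tile sizes are illustrative and the code is untested):

```cuda
#include <cub/cub.cuh>

// 128 threads each hold 4 keys in registers; BlockRadixSort
// cooperatively radix-sorts all 512 keys across the thread block.
__global__ void block_sort_kernel(float *d_keys) {
    typedef cub::BlockRadixSort<float, 128, 4> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Load 4 keys per thread (blocked arrangement assumed).
    float thread_keys[4];
    for (int i = 0; i < 4; ++i)
        thread_keys[i] = d_keys[threadIdx.x * 4 + i];

    // Collective sort across the block.
    BlockRadixSort(temp_storage).Sort(thread_keys);

    // Store the sorted keys back (still blocked arrangement).
    for (int i = 0; i < 4; ++i)
        d_keys[threadIdx.x * 4 + i] = thread_keys[i];
}
```

Since the templates here are confined to kernel code, this would fit the "templates only inside the kernel" constraint discussed above, but it is CUDA-only.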
If the templates can be limited to the kernel code, that should be fine for us. However, what I don't like is that we would have to push sort down to the backends, since CUB doesn't support OpenCL.
It would also mean that OpenCL will yet again lag behind.
It would be nice to have a couple of functions to sort things. In the interface I have in mind, the `axes` parameter would be a list of axes to sort over, in the order they are to be considered for sorting priority, and the `arg` parameter would be an optional array which will be filled with the argsort if not NULL.
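The code snippet originally attached to this comment did not survive; a hypothetical C prototype consistent with the description above (all names are illustrative, not the actual libgpuarray API) could look like:

```c
/* Hypothetical sketch only -- not the actual libgpuarray API. */
#include <stddef.h>

typedef struct GpuArray GpuArray;  /* opaque, stands in for the real type */

/*
 * Sort `src` into `dst`.
 *   axes:   list of `n_axes` axes to sort over, in order of
 *           sorting priority.
 *   arg:    optional output array; filled with the argsort
 *           if not NULL.
 * Returns 0 on success, an error code otherwise.
 */
int GpuArray_sort(GpuArray *dst, const GpuArray *src,
                  unsigned int n_axes, const unsigned int *axes,
                  GpuArray *arg);
```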