Theano / libgpuarray

Library to manipulate tensors on the GPU.

Implement sort/argsort #268

Open abergeron opened 7 years ago

abergeron commented 7 years ago

It would be nice to have a couple of functions to sort things. The interface I have in mind would be something like:

int GpuArray_sort_inplace(GpuArray *a, unsigned int numaxes, unsigned int *axes, GpuArray *arg);
int GpuArray_sort(GpuArray *r, GpuArray *a, unsigned int numaxes, unsigned int *axes, GpuArray *arg);

int GpuArray_argsort(GpuArray *r, GpuArray *a, unsigned int numaxes, unsigned int *axes);

The axes parameter would be a list of axes to sort over, in the order they are to be considered for sorting priority.

The arg parameter is an optional array that will be filled with the argsort result if it is not NULL.

khaotik commented 7 years ago

For CUDA devices, wrapping the Thrust library seems desirable (PyTorch does this). However, Thrust is a templated, header-only C++ library, and I am not sure how to integrate that into a pure C codebase.

nouiz commented 7 years ago

Good question.

Are the C++ templates only used inside the kernel code, or also on the host side?

If they are only used inside the kernels, since the driver compiles those, maybe we will be fine.

khaotik commented 7 years ago

Did some research; short answer: in-kernel sort support is very limited and not suitable for most cases.

Calling Thrust from device code only supports sorting within a single CUDA thread in sequential mode, or launching another kernel via dynamic parallelism.

Related thread: https://devtalk.nvidia.com/default/topic/920842/calling-thrust-sort_by_key-from-kernel/ SO thread: http://stackoverflow.com/questions/5510715/thrust-inside-user-written-kernels PyTorch implementation: https://github.com/pytorch/pytorch/blob/master/torch/lib/THC/generic/THCTensorSort.cu

nouiz commented 7 years ago

As we ideally want to support sorting on tensors of arbitrary ndim and along arbitrary axes, I don't think we can use Thrust for it.

Did you look into CUB (http://nvlabs.github.io/cub/), which would allow us to write the kernel ourselves and have CUB do the work inside the kernel?

I think CUB is a big dependency. Is it C or C++?


nouiz commented 7 years ago

There is also hemi and hemi2, which could be interesting:

https://devblogs.nvidia.com/parallelforall/developing-portable-cuda-cc-code-hemi/ https://devblogs.nvidia.com/parallelforall/simple-portable-parallel-c-hemi-2/


khaotik commented 7 years ago

Took a close look. Hemi is more of a wrapper around the CUDA language than a library of pre-built algorithms.

CUB looks more promising. Although it is also templated C++, it has an in-kernel block radix sort. To reduce the dependency, we could try to rewrite that ourselves.

abergeron commented 7 years ago

If the templates can be limited to the kernel code, that should be fine for us. However, what I don't like is that we will have to push sort down into the backends, since CUB doesn't support OpenCL.

It will also mean that OpenCL will yet again lag behind.