NVIDIA / gvdb-voxels

Sparse volume compute and rendering on NVIDIA GPUs
Other
672 stars 144 forks source link

RadixSortByByte issues in 1.1.1 #96

Open icoderaven opened 4 years ago

icoderaven commented 4 years ago

Hi!

So I was testing out the library with CUDA 11, and my application starts failing within ActivateBricksGPU when calling the updated RadixSortByByte function at this line.

I get this error

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  radix_sort: failed on 2nd step: cudaErrorInvalidConfiguration: invalid configuration argument

Searching for this error on the internet seems to suggest issues with aligned kernel calls within thrust library. Switching back to 10.1 gives no issues. https://github.com/thrust/thrust/issues/936

While I was at it, I realised that the new 1.1.1 implementation seems to use thrust and simply casts the indices to a 64 bit int to perform a (radix) sort on it? Is that correct/inefficient? The paper and the code clearly seems to suggest that the entries in the AUX_BRICK_LEVXYZ should be within very small bounds (specifically, level, which should be 0-5, and the X,Y, and Z values which should only be upto range_res[level]).

NBickford-NV commented 4 years ago

Hi icoderaven,

If it's at all possible, would it be possible for you to put together a minimal repro case of the radix_sort issue you're running into and send that to the Thrust team? I haven't been able to get a clean issue of an issue with radix_sort yet in internal testing (aside from one issue that seemed to be due to the build system, of all things)!

The 1.1.1 implementation of RadixSortByByte should follow what version 1.1 does, I think - the entries of AUX_BRICK_LEVXYZ consist of the level, X, Y, and Z coordinates (as unsigned shorts) of a series of points. So each (level, X, Y, Z) 4-tuple takes up 64 bits of memory, and RadixSortByByte sorts these 4-tuples by interpreting them as 64-bit ints. (The function's a bit confusingly named, but unfortunately I can't change that in this version without introducing an API break.)

It might be possible to save some bits in sizeof(lev) + sizeof(X) + sizeof(Y) + sizeof(Z), but it might add some additional complexity. lev usually ranges from 0-5, but you could hypothetically call Configure with more than 5 levels (GVDB defines MAXLEV to be 10, so 4 bits there). Similarly, we've heard of users working with grid sizes of 2048^3 or larger, so each of these would have to be at least 11 bits. So the radix sort would have to handle elements of at least 37 bits, which is larger than an unsigned int. So, hypothetically, one could implement a method to pack and unpack AUX_BRICK_LEVXYZ 4-tuples into 40-bit elements and implement a custom radix sort routine to sort these 40-bit elements, but it might not be worth it for the time investment (though this is worth reconsidering if radix sorting turns out to be an application bottleneck). There's also the chance that implementing this byte-level modification could increase the constant time factor by enough that it might be slower than using the 64-bit-per-element 4-tuple radix sorting currently used - but there is also the chance that it could be faster (by up to 37.5%, with some unknown factor as a result of not using Thrust for radix sorting) as well.

Thanks!

icoderaven commented 4 years ago

Gotcha! Thanks for the detailed explanation! Sorry for the tardy response, but since I'm a little hard pressed on time, I'm going to stick with CUDA 10.2 where everything works swimmingly for now and investigate this sometime later!