🚀 Feature
Currently, the most expensive part of GPU-based sampling is running to_block(), specifically its hash table insertions. The current implementation, cuda_hashtable.cuh, does not make good use of the hardware; we should replace it with the implementation in cuCollections.
See https://developer.nvidia.com/blog/maximizing-performance-with-massively-parallel-hash-maps-on-gpus/ for a more in-depth explanation.