Some possible optimisations

So, I was reading the CUDA best practices guide and came across these possible optimisations:

Use pinned (page-locked) host memory for faster host-to-device transfers (see this), though we'll need to look into its limitations first, e.g. pinning too much memory can degrade overall system performance.
Use cudaMemcpyAsync instead of cudaMemcpy, which blocks the host. Since we're only transferring to the device, and a kernel launched on the same stream will wait for the transfers to finish, this should be an easy change. (The call also accepts pageable memory, but the copy is only truly asynchronous when the host buffer is pinned, so this depends on the pinned-memory point.)
Occupancy can also be a big factor, and can be tuned by changing the number of threads per block. The Occupancy Calculator can be quite helpful for this.
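To make the three points concrete, here's a rough sketch of how they could fit together. It's not taken from our code: `scaleKernel`, `runDemo` and the buffer names are made up for illustration, and the copy back to the host is only there so the result can be checked. It uses the runtime's `cudaOccupancyMaxPotentialBlockSize` helper as a programmatic alternative to the Occupancy Calculator, which suggests a block size but doesn't guarantee it's the fastest one.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in kernel (assumption): doubles every element in place.
__global__ void scaleKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Runs the sketch and returns the number of mismatching elements (0 on success).
int runDemo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);

    // 1. Pinned (page-locked) host buffer instead of malloc/new.
    int *hostBuf = nullptr;
    CUDA_CHECK(cudaMallocHost(&hostBuf, bytes));
    for (int i = 0; i < n; ++i) hostBuf[i] = i;

    int *devBuf = nullptr;
    CUDA_CHECK(cudaMalloc(&devBuf, bytes));

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    // 2. Asynchronous copy: returns to the host immediately because the
    //    source buffer is pinned.
    CUDA_CHECK(cudaMemcpyAsync(devBuf, hostBuf, bytes,
                               cudaMemcpyHostToDevice, stream));

    // 3. Ask the runtime for a block size that maximises theoretical occupancy.
    int minGridSize = 0, blockSize = 0;
    CUDA_CHECK(cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                                  scaleKernel, 0, 0));
    int gridSize = (n + blockSize - 1) / blockSize;

    // Same stream as the copy, so the kernel starts only after the copy is done.
    scaleKernel<<<gridSize, blockSize, 0, stream>>>(devBuf, n);
    CUDA_CHECK(cudaGetLastError());

    // Copy back only to verify the sketch; the real code may not need this.
    CUDA_CHECK(cudaMemcpyAsync(hostBuf, devBuf, bytes,
                               cudaMemcpyDeviceToHost, stream));
    CUDA_CHECK(cudaStreamSynchronize(stream));

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hostBuf[i] != 2 * i) ++mismatches;

    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(devBuf));
    CUDA_CHECK(cudaFreeHost(hostBuf));  // pinned memory is freed with cudaFreeHost
    return mismatches;
}

int main() {
    int mismatches = runDemo();
    std::printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

If we already have host buffers allocated elsewhere, `cudaHostRegister` can pin them in place instead of switching the allocation to `cudaMallocHost`, though I haven't checked how well that fits our allocation pattern.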

arpit-saxena / hashTableCuda