[cudamapper] Use pinned memory and single array in IndexHostCopy

When copying data to or from IndexHostCopy, host memory is now pinned. This is currently slightly slower than using pageable memory because of the high cost of calling cudaHostRegister() and cudaHostUnregister(), but it should bring big performance improvements once we start overlapping those copies with computation. Pageable memory would prevent overlapping of communication and computation. In the future we probably won't be using cudaHostRegister() and cudaHostUnregister() at all, which means that IndexHostMemoryPinner is likely to change significantly.
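For illustration, here is a minimal sketch of the register / copy / unregister pattern described above. It is not the actual IndexHostMemoryPinner code: `ScopedHostPin` and `check` are hypothetical names, and the buffer contents are made up. Only the CUDA runtime calls themselves (`cudaHostRegister`, `cudaHostUnregister`, `cudaMemcpyAsync`, etc.) are real API. Building it requires the CUDA toolkit and running it requires a GPU.

```cpp
#include <cuda_runtime.h>

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t status, const char* what)
{
    if (status != cudaSuccess)
    {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical RAII guard (an assumption, not the PR's class):
// page-locks an existing host buffer for the lifetime of the object.
class ScopedHostPin
{
public:
    ScopedHostPin(void* ptr, std::size_t bytes)
        : ptr_(ptr)
    {
        check(cudaHostRegister(ptr_, bytes, cudaHostRegisterDefault), "cudaHostRegister");
    }

    ~ScopedHostPin()
    {
        cudaHostUnregister(ptr_); // errors are ignored in the destructor
    }

    ScopedHostPin(const ScopedHostPin&) = delete;
    ScopedHostPin& operator=(const ScopedHostPin&) = delete;

private:
    void* ptr_;
};

int main()
{
    std::vector<int> host(1 << 20, 42); // pageable host memory
    const std::size_t bytes = host.size() * sizeof(int);

    void* device = nullptr;
    cudaStream_t stream;
    check(cudaMalloc(&device, bytes), "cudaMalloc");
    check(cudaStreamCreate(&stream), "cudaStreamCreate");

    {
        ScopedHostPin pin(host.data(), bytes); // the expensive step
        check(cudaMemcpyAsync(device, host.data(), bytes, cudaMemcpyHostToDevice, stream),
              "cudaMemcpyAsync");
        // With pinned memory the copy is asynchronous with respect to the host,
        // so independent work on other streams could be issued here.
        check(cudaStreamSynchronize(stream), "cudaStreamSynchronize");
    } // buffer is unregistered only after the copy has completed

    check(cudaStreamDestroy(stream), "cudaStreamDestroy");
    check(cudaFree(device), "cudaFree");
    return 0;
}
```

The guard must outlive the asynchronous copy, which is why the stream is synchronized before leaving the scope.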

Merged all arrays in IndexHostCopy into a single array. As we'll likely use a pool allocator, this should reduce fragmentation.
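A rough sketch of the single-array idea, assuming just two arrays with made-up element types: `MergedHostBuffer`, `Segment`, `read_ids` and `representations` are illustrative names here and do not reflect the real members or layout of IndexHostCopy. Each logical array becomes an (offset, count) segment inside one allocation, so there is one buffer to allocate, pin and free.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Round offset up to the next multiple of alignment (alignment must be a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Location of one logical array inside the merged buffer.
struct Segment
{
    std::size_t offset_bytes;
    std::size_t count;
};

// Hypothetical container: two arrays packed back to back in one allocation.
class MergedHostBuffer
{
public:
    MergedHostBuffer(const std::vector<std::uint32_t>& read_ids,
                     const std::vector<std::uint64_t>& representations)
    {
        const std::size_t ids_bytes  = read_ids.size() * sizeof(std::uint32_t);
        const std::size_t repr_bytes = representations.size() * sizeof(std::uint64_t);

        read_ids_ = {0, read_ids.size()};
        // Keep the second segment aligned for its element type.
        representations_ = {align_up(ids_bytes, alignof(std::uint64_t)), representations.size()};

        storage_.resize(representations_.offset_bytes + repr_bytes);
        if (ids_bytes > 0)
            std::memcpy(storage_.data() + read_ids_.offset_bytes, read_ids.data(), ids_bytes);
        if (repr_bytes > 0)
            std::memcpy(storage_.data() + representations_.offset_bytes, representations.data(), repr_bytes);
    }

    std::size_t total_bytes() const { return storage_.size(); }
    Segment read_ids_segment() const { return read_ids_; }
    Segment representations_segment() const { return representations_; }

    // Element accessors copy out with memcpy to avoid aliasing problems.
    std::uint32_t read_id(std::size_t i) const
    {
        std::uint32_t value;
        std::memcpy(&value, storage_.data() + read_ids_.offset_bytes + i * sizeof(value), sizeof(value));
        return value;
    }

    std::uint64_t representation(std::size_t i) const
    {
        std::uint64_t value;
        std::memcpy(&value, storage_.data() + representations_.offset_bytes + i * sizeof(value), sizeof(value));
        return value;
    }

private:
    std::vector<unsigned char> storage_; // the single underlying array
    Segment read_ids_{0, 0};
    Segment representations_{0, 0};
};
```

With a pool allocator, one request per index copy in place of one request per array means fewer, larger blocks, which is where the reduced fragmentation is expected to come from.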

Also made a few smaller changes to the way streams and allocators are handled in IndexGPU.

Part of #318

NVIDIA-Genomics-Research / GenomeWorks

[cudamapper] Use pinned memory and single array in IndexHostCopy #481