DevicePreallocatedAllocator currently rounds allocation sizes up to the next multiple of 256 bytes. This comes from the property of cudaMalloc() that all its allocations are 256-byte aligned (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses).
As a consequence, if the last block in memory has e.g. 300 free bytes and 300 B are requested, the request is rounded up to 512 B and the allocation fails due to insufficient memory (for details see PR #598).
The property of aligning allocations to 256B should be kept, but their sizes should not be rounded up. The allocator should internally be aware that the remaining ((requested_size - 1) / 256 + 1) * 256 - requested_size bytes are "junk".
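A minimal C++ sketch of the idea for the 300 B example above (hypothetical helper names, not the actual DevicePreallocatedAllocator code): the current behaviour charges the rounded-up size against the block, while the proposed behaviour charges only the requested size and records the padding up to the next 256 B boundary as junk.

#include <cstddef>
#include <cstdio>

constexpr std::size_t kAlignment = 256;

// Current behaviour: the requested size itself is rounded up to 256 B.
std::size_t rounded_size(std::size_t requested_size) {
    return ((requested_size - 1) / kAlignment + 1) * kAlignment;
}

// Proposed behaviour: only requested_size bytes count against the block;
// the bytes up to the next 256 B boundary are tracked internally as "junk"
// so that the next allocation still starts at a 256 B-aligned address.
std::size_t junk_bytes(std::size_t requested_size) {
    return rounded_size(requested_size) - requested_size;
}

int main() {
    const std::size_t free_bytes = 300;  // free space in the last block
    const std::size_t requested  = 300;  // request that currently fails

    std::printf("current: needs %zu B, only %zu B free -> fails\n",
                rounded_size(requested), free_bytes);
    std::printf("proposed: uses %zu B, marks %zu B as junk -> succeeds\n",
                requested, junk_bytes(requested));
}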