NVIDIA-Genomics-Research / GenomeWorks

SDK for GPU accelerated genome assembly and analysis
https://clara-parabricks.github.io/GenomeWorks/
Apache License 2.0
284 stars, 76 forks

[common] DevicePreallocatedAllocator should allocate exactly the amount of memory requested, not more #600

Closed (mimaric closed 3 years ago)

mimaric commented 3 years ago

DevicePreallocatedAllocator currently rounds every allocation size up to the next multiple of 256. This mirrors the property of cudaMalloc() that all its allocations are aligned to at least 256B (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses). As a consequence, if the last block in memory has e.g. 300 free bytes and 300B are requested, the request is rounded up to 512B and the allocation fails due to insufficient memory (for details see PR #598). Allocations should stay 256B-aligned, but their sizes should not be rounded up. Instead, the allocator should internally be aware that the ((requested_size - 1) / 256 + 1) * 256 - requested_size bytes following each allocation are "junk" (212 bytes in the 300B example, using integer division).
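To make the arithmetic concrete, here is a minimal C++ sketch of the two fit checks. It is not the actual DevicePreallocatedAllocator code: `FreeBlock`, `fits_rounded`, `fits_exact` and the other names are hypothetical and only illustrate the idea, assuming free blocks are tracked by offset and size.

```cpp
#include <cassert>
#include <cstddef>

// Alignment guaranteed by cudaMalloc(), as cited in this issue.
constexpr std::size_t kAlignment = 256;

// Smallest multiple of kAlignment that is >= requested_size.
inline std::size_t round_up_to_alignment(std::size_t requested_size)
{
    assert(requested_size > 0); // the formula assumes a non-empty request
    return ((requested_size - 1) / kAlignment + 1) * kAlignment;
}

// "Junk" bytes between the end of the request and the next aligned address.
inline std::size_t junk_bytes(std::size_t requested_size)
{
    return round_up_to_alignment(requested_size) - requested_size;
}

// Hypothetical description of a free region inside the preallocated buffer.
struct FreeBlock
{
    std::size_t offset; // start of the block, relative to the buffer start
    std::size_t size;   // number of free bytes in the block
};

// Behavior described in this issue: the rounded-up size must fit.
inline bool fits_rounded(const FreeBlock& block, std::size_t requested_size)
{
    return round_up_to_alignment(requested_size) <= block.size;
}

// Proposed behavior: only the requested bytes must fit,
// as long as the block itself starts at an aligned offset.
inline bool fits_exact(const FreeBlock& block, std::size_t requested_size)
{
    return block.offset % kAlignment == 0 && requested_size <= block.size;
}
```

With a last block of 300 free bytes at an aligned offset, `fits_rounded` rejects a 300B request (it needs 512B), while `fits_exact` accepts it. The junk bytes only matter for where the next allocation may start, so they do not need to exist when nothing follows the allocation.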

mimaric commented 3 years ago

@tijyojwad FYI