Currently, the lifetime of a device array is determined by the VM's garbage collector (GC): the underlying off-heap memory is freed in a finalizer. The GC is only aware of the on-heap stub object, whose size is independent of the array size and typically much smaller. Hence, device arrays that are eligible for garbage collection are not freed promptly. This in turn prevents CUDA-managed memory from being reclaimed and can cause superfluous device-to-host paging traffic for memory pages that will not be accessed again.
Proposed Workaround (barring changes in how the GC manages objects that use off-heap resources):
Add a free() method to DeviceArray that lets users explicitly free the CUDA-managed memory of the device array. This decouples the lifetime of the device array object from that of its off-heap buffer. Once freed, the device array remains in a defunct state, and any access to it fails with an exception.
This requires a check on every access. However, given that grCUDA is currently a single-threaded Truffle language, this check should be cheap, and the Graal compiler should be able to hoist it out of critical loops.
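The proposed behavior could look roughly like the following minimal Java sketch. It is not the grCUDA implementation: only the names DeviceArray and free() come from the proposal. The class DeviceArraySketch, the methods get, set, isFreed, and checkNotFreed, the choice of IllegalStateException, and making a second free() call a no-op are all assumptions made for illustration. A plain float[] stands in for the CUDA-managed buffer.

```java
// Illustrative sketch only; not the actual grCUDA code.
public class DeviceArraySketch {

    /** Stand-in for a device array whose buffer can be released explicitly. */
    static final class DeviceArray {
        // Placeholder for the off-heap, CUDA-managed buffer (assumption).
        private float[] buffer;
        // No synchronization needed under the single-threaded assumption.
        private boolean freed;

        DeviceArray(int numElements) {
            this.buffer = new float[numElements];
        }

        // The per-access check described above: a single boolean test.
        private void checkNotFreed() {
            if (freed) {
                throw new IllegalStateException("device array was freed");
            }
        }

        float get(int index) {
            checkNotFreed();
            return buffer[index];
        }

        void set(int index, float value) {
            checkNotFreed();
            buffer[index] = value;
        }

        // Releases the buffer; the object itself stays alive but defunct.
        // Treating a repeated call as a no-op is an assumption of this sketch.
        void free() {
            if (!freed) {
                buffer = null; // a real implementation would release the CUDA-managed memory here
                freed = true;
            }
        }

        boolean isFreed() {
            return freed;
        }
    }

    public static void main(String[] args) {
        DeviceArray array = new DeviceArray(4);
        array.set(0, 42.0f);
        System.out.println("before free: " + array.get(0));

        array.free();
        try {
            array.get(0);
        } catch (IllegalStateException e) {
            System.out.println("access after free failed: " + e.getMessage());
        }
    }
}
```

Because the freed flag does not change inside a loop that only reads or writes elements, the check is loop-invariant, which is what would allow a compiler to move it out of the loop body.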