beehive-lab / TornadoVM

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages
https://www.tornadovm.org
Apache License 2.0
1.15k stars 109 forks source link

Enabling release memory (device memory deallocations) mode after each run from the Execution Plan #444

Closed jjfumero closed 3 weeks ago

jjfumero commented 1 month ago

Description

TornadoVM fully manages device memory, and the way it works is similar to the Java memory management. TornadoVM has a hard limit for the maximum amount of device memory to use. Then, the TornadoVM runtime can allocate as many buffers in that region. Thus, the memory used expands until the maximum limit is reach.

Besides, TornadoVM maintains a list of free and used buffers. Thus, when an execution plan finishes, device buffers are marked as free, but never released (e.g., clMemFree in OpenCL), but rather declare as free for other task-graphs to use the already allocated areas. In the case compaction is needed, TornadoVM deallocs and allocs a new consecutive region. This whole process is fully transparent for the programmer.

However, it might be cases in which programmers would like the TornadoVM runtime to free all resources after an execution plan has finished. This PR adds support for this feature.

If the flag -Dtornado.reuse.device.buffers=False is set, then TornadoVM allocs and deallocs device buffers every time an execution plan is launched. By default, it is set to true (to reuse buffers as much as possible).

Behaviour

To check all JNI calls, including allocations and deallocations, we need to enable the LOG_JNI macro:

diff --git a/tornado-drivers/opencl-jni/src/main/cpp/source/ocl_log.h b/tornado-drivers/opencl-jni/src/main/cpp/source/ocl_log.h
index 94e46bf8d..9079d6c78 100644
--- a/tornado-drivers/opencl-jni/src/main/cpp/source/ocl_log.h
+++ b/tornado-drivers/opencl-jni/src/main/cpp/source/ocl_log.h
@@ -31,7 +31,7 @@
 #define PRINT_DATA_TIMES 0
 #define PRINT_DATA_SIZES 0

-#define LOG_JNI 0
+#define LOG_JNI 1

 #define LOG_OCL_AND_VALIDATE(name, result)                     \
     if (LOG_JNI == 1)  {                                       \
diff --git a/tornado-drivers/ptx-jni/src/main/cpp/source/ptx_log.h b/tornado-drivers/ptx-jni/src/main/cpp/source/ptx_log.h
index 5e0dd6eec..d32fd804e 100644
--- a/tornado-drivers/ptx-jni/src/main/cpp/source/ptx_log.h
+++ b/tornado-drivers/ptx-jni/src/main/cpp/source/ptx_log.h
@@ -26,7 +26,7 @@
 #define TORNADO_PTX_LOG_H

 #include <cuda.h>
-#define LOG_PTX 0
+#define LOG_PTX 1

 #define LOG_PTX_AND_VALIDATE(name, result)                      \
     if (LOG_PTX == 1)  {  
$ tornado-test --printKernel --jvm="-Dtornado.reuse.device.buffers=false" -V uk.ac.manchester.tornado.unittests.foundation.TestFloats#testVectorFloatAdd 

// OpenCL

[TornadoVM-OCL-JNI] Calling : clEnqueueNDRangeKernel -> Status: 0
[TornadoVM-OCL-JNI] Calling : clEnqueueReadBuffer -> Status: 0
[TornadoVM-OCL-JNI] Calling : clFlush -> Status: 0
[TornadoVM-OCL-JNI] Calling : clReleaseMemObject -> Status: 0
[TornadoVM-OCL-JNI] Calling : clReleaseMemObject -> Status: 0
[TornadoVM-OCL-JNI] Calling : clReleaseMemObject -> Status: 0
[TornadoVM-OCL-JNI] Calling : clFlush -> Status: 0

Level Zero:

[TornadoVM-SPIRV-JNI]  Calling : zeCommandListAppendMemoryCopy-[INTEGER] -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeCommandListAppendBarrier -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeCommandListClose -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeCommandQueueExecuteCommandLists -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeCommandQueueSynchronize -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeCommandListReset -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeMemFree -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeMemFree -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeMemFree -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeCommandListClose -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeCommandQueueExecuteCommandLists -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeCommandQueueSynchronize -> Status: 0
[TornadoVM-SPIRV-JNI]  Calling : zeCommandListReset -> Status: 0

PTX:

[TornadoVM-PTX-JNI] Calling : cuLaunchKernel -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuEventRecord -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuEventCreate (beforeEvent) -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuEventCreate (afterEvent) -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuEventRecord -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuMemcpyDtoHMemSeg -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuEventRecord -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuCtxSetCurrent -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuMemFree -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuCtxSetCurrent -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuMemFree -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuCtxSetCurrent -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuMemFree -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuStreamSynchronize -> Status: 0
[TornadoVM-PTX-JNI] Calling : cuStreamSynchronize -> Status: 0
Test: class uk.ac.manchester.tornado.unittests.foundation.TestFloats#testVectorFloatAdd
    Running test: testVectorFloatAdd         ................  [PASS] 

Problem description

n/ a.

Backend/s tested

Mark the backends affected by this PR.

OS tested

Mark the OS where this PR is tested.

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

How to test the new patch?

Any test with the flag -Dtornado.reuse.device.buffers=false:

$ tornado-test --printKernel --jvm="-Dtornado.reuse.device.buffers=false" -V uk.ac.manchester.tornado.unittests.foundation.TestFloats#testVectorFloatAdd 

## all unit-test also are passing
make tests