Summary of Changes
`dlopen`
Motivation
Libraries that are opened with `dlopen` should be closed when they are no longer needed.
Implementation
Previously, I wasn't saving the handles anywhere after calling `dlopen`. Now, instead of opening the library object once for every worker that's loaded, each worker's library is opened once and its handle is saved. Any other worker threads reuse the same handle to create new instances. Then, when the workers are unloaded, the handle is closed at the end.
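As a rough sketch of this pattern (not the actual code in this change), a reference-counted cache can hand out one shared handle per library and `dlclose` it on the last release. The `SharedLibraryCache` name and its interface here are hypothetical:

```cpp
#include <dlfcn.h>

#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical cache: each library is dlopen'ed once and its handle is
// shared by all workers; the last release dlclose's it.
class SharedLibraryCache {
 public:
  // Returns the cached handle, opening the library on first use.
  void* acquire(const std::string& path) {
    const std::lock_guard<std::mutex> lock{mutex_};
    auto& entry = handles_[path];
    if (entry.handle == nullptr) {
      entry.handle = dlopen(path.c_str(), RTLD_NOW | RTLD_GLOBAL);
      if (entry.handle == nullptr) {
        throw std::runtime_error{dlerror()};
      }
    }
    ++entry.refcount;
    return entry.handle;
  }

  // Drops one reference; closes the handle when no worker uses it anymore.
  void release(const std::string& path) {
    const std::lock_guard<std::mutex> lock{mutex_};
    auto it = handles_.find(path);
    if (it == handles_.end()) {
      return;
    }
    if (--it->second.refcount == 0) {
      dlclose(it->second.handle);
      handles_.erase(it);
    }
  }

 private:
  struct Entry {
    void* handle = nullptr;
    int refcount = 0;
  };
  std::mutex mutex_;
  std::unordered_map<std::string, Entry> handles_;
};
```

In this sketch, a worker would call `acquire` during load and `release` during unload, so the handle stays open exactly as long as some worker is using the library.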
The Tfzendnn backend has extra complications because it also needs to open TF internally. Now, it opens the library first during its own construction, before its other member variables are created, and closes it during destruction.
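One way to get this ordering in C++ (a sketch under the assumption that the backend is a class holding the handle as a member; `TfZenDnnWorker` and the library path are made up for illustration) is to rely on declaration order: members are constructed in declaration order and destroyed in reverse, so declaring the handle first keeps the library open for the whole lifetime of the members that depend on it:

```cpp
#include <dlfcn.h>

#include <stdexcept>

// RAII wrapper: dlopen at construction, dlclose at destruction.
class LibraryHandle {
 public:
  explicit LibraryHandle(const char* path)
      : handle_{dlopen(path, RTLD_NOW | RTLD_GLOBAL)} {
    if (handle_ == nullptr) {
      throw std::runtime_error{dlerror()};
    }
  }
  LibraryHandle(const LibraryHandle&) = delete;
  LibraryHandle& operator=(const LibraryHandle&) = delete;
  ~LibraryHandle() { dlclose(handle_); }

 private:
  void* handle_;
};

// Hypothetical worker: because tf_library_ is declared first, the
// library is open before any later member is constructed and is only
// closed after all of them have been destroyed.
class TfZenDnnWorker {
 public:
  TfZenDnnWorker() : tf_library_{"libtensorflow_cc.so"} {}

 private:
  LibraryHandle tf_library_;  // must be declared before dependent members
  // ... members that need the library to be open, e.g. a TF session
};
```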
Notes
This investigation was motivated by the excessive memory usage observed during the resnet50 benchmark from #184, where I ran all the backends at once. Unfortunately, the high memory usage persists even after these fixes. Using the resnet50 benchmark (batch sizes 1, 4, 64; requests 4, 16, 64; workers 1, 2), I observed peak memory usage of ~5Gi, ~4Gi, ~24Gi, and ~45Gi for migraphx, ptzendnn, tfzendnn, and xmodel, respectively, while running each backend individually. In all cases, memory usage slowly increases as the test runs.
Using `massif` on a short run doesn't show the high memory usage, and a full test takes too long to run under it. I also confirmed that the memory pool is not allocating all the extra memory by logging the total allocated memory. `memcheck` showed a possible leak in `rt-engine` and other potential issues in TF/ZenDNN, but it's unclear if those are enough to explain the memory usage.
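For context, logging the pool's total allocated memory can be done with an atomic counter around the allocation paths; this is a hedged sketch using `malloc` as a stand-in, not the server's actual memory pool:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical instrumentation: track the pool's outstanding bytes with
// an atomic counter and log the total on every allocation.
class TrackedPool {
 public:
  void* allocate(std::size_t size) {
    void* ptr = std::malloc(size);  // stand-in for the real pool logic
    if (ptr != nullptr) {
      total_bytes_ += size;
      std::printf("pool total: %zu bytes\n", total_bytes_.load());
    }
    return ptr;
  }

  void deallocate(void* ptr, std::size_t size) {
    std::free(ptr);
    total_bytes_ -= size;
  }

 private:
  std::atomic<std::size_t> total_bytes_{0};
};
```

If the logged total stays flat while the process's resident memory grows, the extra usage is coming from outside the pool, which is how the pool was ruled out here.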