Thanks @m-birke for reporting the memory leak and @corepointer for identifying that string scalars are the reason.
DAPHNE does automatic garbage collection to ensure that all data objects (matrices and frames) are freed exactly once after their last use. Scalars are not garbage collected, because they are typically small numerical values and passed by-value. However, string scalars are an exception; we currently represent them as char * in C++, which means we pass the pointer by-value, but the string is backed by an array somewhere in memory. String scalars can be created by DaphneDSL string literals and by operations/kernels that return a string scalar as their result, e.g., concatenation of two strings or casting a number to a string. Currently, these string scalars are never freed, which becomes especially problematic when many new string scalars are created in a loop.
Thus, we need to make sure that all string scalars are freed exactly once, just like matrices and frames. There can be multiple solutions to this problem, but I propose extending the existing garbage collection mechanism to string scalars.
Roughly, we need to:
Adapt the compiler pass for managing object references (src/compiler/lowering/ManageObjRefsPass.cpp). This pass already knows when which SSA value in the IR is used for the last time and needs to be freed. However, currently it only considers data objects (matrices and frames). There are check for these types at a few points in the pass; these checks should also allow mlir::daphne::StringType. With that, the IncRefOps/DecRefOps for string scalars should already get inserted in the right places.
Adapt the IncRefOp/DecRefOp DaphneIR operations, such that they also allow StringScalar as their argument type.
Adapt the incRef/decRef kernels.
The main challenge is that we cannot simply call arg->increaseReferenceCounter() or DataObjectFactory::destroy(arg) on const char *, since plain character arrays obviously don't have a reference counter. I think the simplest way to go is a global mapping from the pointer itself (char *) to its reference counter. This mapping could be stored in the DaphneContext, thereby accessible to the incRef/decRef kernels. Whenever a concrete pointer is not in the mapping, we can assume its reference counter is 1 (with that, we don't need to insert the pointers of all string scalars into the mapping, thereby keeping it smaller and avoiding synchronization overhead). Any modifying access to this mapping must be synchronized as it could happen in multiple threads concurrently (vectorized engine).
These kernels need to be generalized by making the argument data type a template parameter. There should be two specializations: one for Structure (or matrix/frame) with the currently existing implementation, and one for const char * with the implementation sketched above.
src/runtime/local/kernels/kernels.json must be adjusted accordingly.
I think that should solve the problem for now. In the future, we can think of
a more generic solution that generally allows garbage collecting scalars if required by the value type (not the case of any of the current value types but string, but could become relevant with extensibility for value types)
not using char * for string scalars (but this would require quite some refactoring in many places).
Thanks @m-birke for reporting the memory leak and @corepointer for identifying that string scalars are the reason.
DAPHNE does automatic garbage collection to ensure that all data objects (matrices and frames) are freed exactly once after their last use. Scalars are not garbage collected, because they are typically small numerical values and passed by-value. However, string scalars are an exception; we currently represent them as
char *
in C++, which means we pass the pointer by-value, but the string is backed by an array somewhere in memory. String scalars can be created by DaphneDSL string literals and by operations/kernels that return a string scalar as their result, e.g., concatenation of two strings or casting a number to a string. Currently, these string scalars are never freed, which becomes especially problematic when many new string scalars are created in a loop.Thus, we need to make sure that all string scalars are freed exactly once, just like matrices and frames. There can be multiple solutions to this problem, but I propose extending the existing garbage collection mechanism to string scalars.
Roughly, we need to:
src/compiler/lowering/ManageObjRefsPass.cpp
). This pass already knows when which SSA value in the IR is used for the last time and needs to be freed. However, currently it only considers data objects (matrices and frames). There are check for these types at a few points in the pass; these checks should also allowmlir::daphne::StringType
. With that, theIncRefOp
s/DecRefOp
s for string scalars should already get inserted in the right places.IncRefOp
/DecRefOp
DaphneIR operations, such that they also allowStringScalar
as their argument type.incRef
/decRef
kernels.arg->increaseReferenceCounter()
orDataObjectFactory::destroy(arg)
onconst char *
, since plain character arrays obviously don't have a reference counter. I think the simplest way to go is a global mapping from the pointer itself (char *
) to its reference counter. This mapping could be stored in theDaphneContext
, thereby accessible to theincRef
/decRef
kernels. Whenever a concrete pointer is not in the mapping, we can assume its reference counter is 1 (with that, we don't need to insert the pointers of all string scalars into the mapping, thereby keeping it smaller and avoiding synchronization overhead). Any modifying access to this mapping must be synchronized as it could happen in multiple threads concurrently (vectorized engine).Structure
(or matrix/frame) with the currently existing implementation, and one forconst char *
with the implementation sketched above.src/runtime/local/kernels/kernels.json
must be adjusted accordingly.I think that should solve the problem for now. In the future, we can think of
char *
for string scalars (but this would require quite some refactoring in many places).