daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0
68 stars 62 forks source link

Memory leak due to missing garbage collection of string scalars #788

Closed pdamme closed 3 months ago

pdamme commented 4 months ago

Thanks @m-birke for reporting the memory leak and @corepointer for identifying that string scalars are the reason.

DAPHNE does automatic garbage collection to ensure that all data objects (matrices and frames) are freed exactly once after their last use. Scalars are not garbage collected, because they are typically small numerical values and passed by-value. However, string scalars are an exception; we currently represent them as char * in C++, which means we pass the pointer by-value, but the string is backed by an array somewhere in memory. String scalars can be created by DaphneDSL string literals and by operations/kernels that return a string scalar as their result, e.g., concatenation of two strings or casting a number to a string. Currently, these string scalars are never freed, which becomes especially problematic when many new string scalars are created in a loop.

Thus, we need to make sure that all string scalars are freed exactly once, just like matrices and frames. There can be multiple solutions to this problem, but I propose extending the existing garbage collection mechanism to string scalars.

Roughly, we need to:

  1. Adapt the compiler pass for managing object references (src/compiler/lowering/ManageObjRefsPass.cpp). This pass already knows when which SSA value in the IR is used for the last time and needs to be freed. However, currently it only considers data objects (matrices and frames). There are check for these types at a few points in the pass; these checks should also allow mlir::daphne::StringType. With that, the IncRefOps/DecRefOps for string scalars should already get inserted in the right places.
  2. Adapt the IncRefOp/DecRefOp DaphneIR operations, such that they also allow StringScalar as their argument type.
  3. Adapt the incRef/decRef kernels.
    • The main challenge is that we cannot simply call arg->increaseReferenceCounter() or DataObjectFactory::destroy(arg) on const char *, since plain character arrays obviously don't have a reference counter. I think the simplest way to go is a global mapping from the pointer itself (char *) to its reference counter. This mapping could be stored in the DaphneContext, thereby accessible to the incRef/decRef kernels. Whenever a concrete pointer is not in the mapping, we can assume its reference counter is 1 (with that, we don't need to insert the pointers of all string scalars into the mapping, thereby keeping it smaller and avoiding synchronization overhead). Any modifying access to this mapping must be synchronized as it could happen in multiple threads concurrently (vectorized engine).
    • These kernels need to be generalized by making the argument data type a template parameter. There should be two specializations: one for Structure (or matrix/frame) with the currently existing implementation, and one for const char * with the implementation sketched above.
    • src/runtime/local/kernels/kernels.json must be adjusted accordingly.

I think that should solve the problem for now. In the future, we can think of