apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Invoke static initialization of arrow C++ library #42138

Open mrboojum opened 3 months ago

mrboojum commented 3 months ago

TLDR: How can I trigger static initialization of the arrow C++ library?

Context: We use Apache Arrow in a C++ application running on large multi-NUMA machines. The application has its own heap framework (delegating the actual allocation to je_malloc/OS heaps) to prevent or minimize cross-NUMA access. We would like to be able to trigger all allocations needed for static variables on a specific heap.

Example: Is there a general way to trigger all allocations needed for static variables like the ones below?

#define TYPE_FACTORY(NAME, KLASS)                                        \
  const std::shared_ptr<DataType>& NAME() {                              \
    static std::shared_ptr<DataType> result = std::make_shared<KLASS>(); \
    return result;                                                       \
  }

TYPE_FACTORY(null, NullType)
TYPE_FACTORY(boolean, BooleanType)
TYPE_FACTORY(int8, Int8Type)
TYPE_FACTORY(uint8, UInt8Type)
TYPE_FACTORY(int16, Int16Type)
TYPE_FACTORY(uint16, UInt16Type)
TYPE_FACTORY(int32, Int32Type)
TYPE_FACTORY(uint32, UInt32Type)
TYPE_FACTORY(int64, Int64Type)
TYPE_FACTORY(uint64, UInt64Type)
TYPE_FACTORY(float16, HalfFloatType)
TYPE_FACTORY(float32, FloatType)
TYPE_FACTORY(float64, DoubleType)
TYPE_FACTORY(utf8, StringType)
TYPE_FACTORY(large_utf8, LargeStringType)
TYPE_FACTORY(binary, BinaryType)
TYPE_FACTORY(large_binary, LargeBinaryType)
TYPE_FACTORY(date64, Date64Type)
TYPE_FACTORY(date32, Date32Type)

Component(s)

C++

pitrou commented 3 months ago

Well, a slightly kludgy solution is to invoke all those functions one by one :-) (they are extremely cheap)
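As a rough illustration of that suggestion, the sketch below reproduces the TYPE_FACTORY pattern with stand-in types and calls each factory once from a single warm-up function. Everything in the `demo` namespace, including `WarmUpTypeSingletons` and `g_constructed`, is hypothetical and only mimics the shape of the Arrow code quoted above; in a real application the calls would go to the corresponding Arrow factory functions while the desired heap is active.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

namespace demo {

// Counts constructed stand-in types so the effect of the warm-up is observable.
int g_constructed = 0;

// Stand-ins for the Arrow type classes (hypothetical, for illustration only).
struct DataType {
  DataType() { ++g_constructed; }
  virtual ~DataType() = default;
};
struct NullType : DataType {};
struct BooleanType : DataType {};
struct Int8Type : DataType {};

// Same shape as the macro quoted in the issue: a function-local static that is
// constructed (and allocated) the first time the function is called.
#define TYPE_FACTORY(NAME, KLASS)                                        \
  const std::shared_ptr<DataType>& NAME() {                              \
    static std::shared_ptr<DataType> result = std::make_shared<KLASS>(); \
    return result;                                                       \
  }

TYPE_FACTORY(null, NullType)
TYPE_FACTORY(boolean, BooleanType)
TYPE_FACTORY(int8, Int8Type)

#undef TYPE_FACTORY

// Calls every listed factory once so that each function-local static is
// constructed now rather than at first use. Returns the number of factories called.
std::size_t WarmUpTypeSingletons() {
  using Factory = const std::shared_ptr<DataType>& (*)();
  constexpr Factory kFactories[] = {&null, &boolean, &int8};
  std::size_t touched = 0;
  for (Factory factory : kFactories) {
    const std::shared_ptr<DataType>& type = factory();
    assert(type != nullptr);
    (void)type;
    ++touched;
  }
  return touched;
}

}  // namespace demo
```

Calling `demo::WarmUpTypeSingletons()` a second time constructs nothing new, since each static is initialized only once. The list of factories still has to be maintained by hand.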

But what I don't understand is why you would want to statically initialize all these, if your memory allocator is NUMA-aware anyway.

mrboojum commented 3 months ago

Thanks for the quick response. Do I understand correctly that arrow doesn't have a function/method for initializing the static variables?

Doing this on the application side is indeed not desirable due to not being able to check completeness of all allocations for static variables and maintainability. Regarding completeness, we now have the list below; can you indicate whether it's complete (we don't detect any issues anymore)? [image]

Regarding the memory allocation in general I have some more questions: Is it possible/common practice to provide your own subclass of MemoryPool to pass to the API methods (assuming it's passed on)?

pitrou commented 3 months ago

"Doing this on the application side is indeed not desirable due to not being able to check completeness of all allocations for static variables and maintainability."

What do you mean with "check completeness" exactly?

"Is it possible/common practice to provide your own subclass of MemoryPool to pass to the API methods (assuming it's passed on)?"

Not very common, but metadata allocations (such as data types) go directly to the standard C++ allocator anyway.

mrboojum commented 3 months ago

What do you mean with "check completeness" exactly? -> By completeness I mean that we don't know whether the above list of function calls triggers all allocations for static variables in the Arrow library. If Arrow functionality somewhere in the application later triggers an allocation for a static variable that is not yet in the list, the result is heap leaks or cross-NUMA memory access.
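One way to probe for such late allocations, given that the app already replaces operator new, is to count calls to it and check that a code path allocates nothing once the warm-up is done. The following is a minimal sketch under that assumption; the `heapcheck` namespace, `CountAllocations`, `AssertNoAllocations` and `LazySingleton` are hypothetical names, and the counter is process-wide, so allocations made concurrently by other threads would be counted as well.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

namespace heapcheck {

// Number of calls to the replaced operator new. Constant-initialized, so it is
// usable even for allocations made during static initialization.
std::atomic<std::size_t> g_new_calls{0};

// Runs `fn` and returns how many operator new calls happened meanwhile.
template <typename Fn>
std::size_t CountAllocations(Fn&& fn) {
  const std::size_t before = g_new_calls.load();
  fn();
  return g_new_calls.load() - before;
}

// Debug-build guard: fails an assertion if `fn` allocates through operator new.
template <typename Fn>
void AssertNoAllocations(Fn&& fn) {
  const std::size_t n = CountAllocations(fn);
  assert(n == 0 && "unexpected allocation after warm-up");
  (void)n;
}

}  // namespace heapcheck

// Replacement allocation functions. A real application would route these to
// its own heap framework; malloc/free are used here as a stand-in.
void* operator new(std::size_t size) {
  heapcheck::g_new_calls.fetch_add(1, std::memory_order_relaxed);
  if (void* p = std::malloc(size != 0 ? size : 1)) return p;
  throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Stand-in for one lazily initialized singleton (hypothetical).
const std::shared_ptr<int>& LazySingleton() {
  static std::shared_ptr<int> result = std::make_shared<int>(42);
  return result;
}
```

The first call to `LazySingleton()` allocates and every later call does not, so wrapping representative workloads in `AssertNoAllocations` after the warm-up would flag statics missing from the list. This only covers code paths that are actually exercised; it cannot prove that the list is complete.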

"but metadata allocations (such as data types) go directly to the standard C++ allocator anyway." Standard c++ allocators will invoke new which is overloaded by the app. The reason I ask about providing an implementation of MemoryPool is twofold:

  1. The app depends on a specific old version/configuration of jemalloc and uses the default vcpkg build of Arrow, version 14 (that is, not compiled specifically for jemalloc as mentioned in the doc). I expect that building Arrow for jemalloc won't work nicely with the app's custom setup.
  2. The app's heap allocation framework is "blind" to the allocations Arrow makes through the MemoryPool, so they might end up on the wrong NUMA node. Perhaps this can be resolved by providing an implementation of MemoryPool.
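To make the second point concrete, here is a minimal sketch of a pool that forwards every request to application-provided heap hooks and tracks the bytes in use. `PoolInterface` is a simplified stand-in written for this example, not `arrow::MemoryPool` itself: the real class declares more virtual methods, uses Arrow's status type, and takes alignment into account, so the exact set of overrides should be checked against `arrow/memory_pool.h` for the Arrow version in use. `HeapHooks`, `DelegatingPool` and `MallocBackedHooks` are likewise hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

namespace sketch {

// Hypothetical entry points into the application's heap framework.
struct HeapHooks {
  void* (*alloc)(std::size_t size, void* ctx);
  void (*free)(void* ptr, void* ctx);
  void* ctx;
};

// Simplified stand-in for a memory pool interface (NOT Arrow's actual class).
class PoolInterface {
 public:
  virtual ~PoolInterface() = default;
  virtual bool Allocate(std::int64_t size, std::uint8_t** out) = 0;
  virtual bool Reallocate(std::int64_t old_size, std::int64_t new_size,
                          std::uint8_t** ptr) = 0;
  virtual void Free(std::uint8_t* buffer, std::int64_t size) = 0;
  virtual std::int64_t bytes_allocated() const = 0;
};

// Forwards all requests to the application's heap. Alignment handling is omitted.
class DelegatingPool final : public PoolInterface {
 public:
  explicit DelegatingPool(HeapHooks hooks) : hooks_(hooks) {}

  bool Allocate(std::int64_t size, std::uint8_t** out) override {
    if (size < 0) return false;
    void* p = hooks_.alloc(static_cast<std::size_t>(size > 0 ? size : 1), hooks_.ctx);
    if (p == nullptr) return false;
    *out = static_cast<std::uint8_t*>(p);
    bytes_allocated_ += size;
    return true;
  }

  // Allocate-copy-free; a real pool could use an in-place realloc if available.
  bool Reallocate(std::int64_t old_size, std::int64_t new_size,
                  std::uint8_t** ptr) override {
    std::uint8_t* fresh = nullptr;
    if (!Allocate(new_size, &fresh)) return false;
    const std::int64_t to_copy = old_size < new_size ? old_size : new_size;
    if (to_copy > 0) std::memcpy(fresh, *ptr, static_cast<std::size_t>(to_copy));
    Free(*ptr, old_size);
    *ptr = fresh;
    return true;
  }

  void Free(std::uint8_t* buffer, std::int64_t size) override {
    assert(size >= 0 && size <= bytes_allocated_);
    hooks_.free(buffer, hooks_.ctx);
    bytes_allocated_ -= size;
  }

  std::int64_t bytes_allocated() const override { return bytes_allocated_; }

 private:
  HeapHooks hooks_;
  std::int64_t bytes_allocated_ = 0;  // not thread-safe in this sketch
};

// Stand-in for the application's heap: malloc/free plus a live-block counter.
inline HeapHooks MallocBackedHooks(int* live_blocks) {
  return HeapHooks{
      [](std::size_t size, void* ctx) -> void* {
        void* p = std::malloc(size);
        if (p != nullptr) ++*static_cast<int*>(ctx);
        return p;
      },
      [](void* ptr, void* ctx) {
        if (ptr != nullptr) --*static_cast<int*>(ctx);
        std::free(ptr);
      },
      live_blocks};
}

}  // namespace sketch
```

With a pool of this shape, buffer allocations become visible to the application's heap framework. As noted earlier in the thread, metadata allocations such as data types would still go through the standard C++ allocator rather than the pool.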