apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

Arrow Flight memory management #13837

Open Jokser opened 2 years ago

Jokser commented 2 years ago

Hi, everybody,

Recently, I noticed that my Arrow Flight server retains a lot of resident memory (hundreds of gigabytes) after ingesting record batches via DoPut calls. After debugging, I found that this memory is held by the Arrow memory pools.

I tried to get memory stats from default and system pools, but it seems that stats are broken:

    spdlog::info("[Default] Memory usage: peak = {}, allocated = {}",
                 arrow::default_memory_pool()->max_memory(),
                 arrow::default_memory_pool()->bytes_allocated());
    spdlog::info("[System] Memory usage: peak = {}, allocated = {}",
                 arrow::system_memory_pool()->max_memory(),
                 arrow::system_memory_pool()->bytes_allocated());

The output of these commands always shows 0 for peak and allocated.

However, when I manually call arrow::default_memory_pool()->ReleaseUnused(), the resident memory of the process is freed.

So I have two questions:

  1. Is there any way to control the memory usage of memory pools other than explicitly releasing the pools periodically after some idle time?
  2. Is there a known issue with broken memory pool stats? I didn't find anything related to it in Jira.

Thank you in advance.

Arrow version: 6.0.1

rok commented 2 years ago

Thanks for reporting this @Jokser! Is it possible you're hitting ARROW-16697?

Could you try the proposed:

export GLIBC_TUNABLES=glibc.malloc.trim_threshold=524288

Jokser commented 2 years ago

@rok I tried it but without success.

The memory usage goes down only after an explicit system pool release call:

spdlog::info("RSS before default pool release: {} bytes", getCurrentRSS());
arrow::default_memory_pool()->ReleaseUnused();
spdlog::info("RSS after default pool release: {} bytes", getCurrentRSS());
arrow::system_memory_pool()->ReleaseUnused();
spdlog::info("RSS after system pool release: {} bytes", getCurrentRSS());

[2022-08-10 13:01:47.281] [se_logger] [info] RSS before default pool release: 29080805376 bytes
[2022-08-10 13:01:47.284] [se_logger] [info] RSS after default pool release: 29082968064 bytes
[2022-08-10 13:01:48.627] [se_logger] [info] RSS after system pool release: 6827900928 bytes

The system pool release cut RSS by ~22 GB, but ~7 GB still remains in RSS.
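
The getCurrentRSS() helper used in the snippet above is not shown in this thread. A minimal Linux-only sketch of such a helper, assuming it reads the resident page count from /proc/self/statm, could look like the following. ParseResidentPages is a hypothetical name introduced here for illustration, and the real helper may be implemented differently.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <unistd.h>  // sysconf, _SC_PAGESIZE

// Hypothetical helper: parses the contents of /proc/self/statm and returns
// the resident page count (the second field), or 0 if parsing fails.
inline std::size_t ParseResidentPages(const std::string& statm) {
  std::istringstream in(statm);
  std::size_t total_pages = 0;
  std::size_t resident_pages = 0;
  if (!(in >> total_pages >> resident_pages)) return 0;
  return resident_pages;
}

// Assumed stand-in for the getCurrentRSS() helper seen in the logs.
// Linux-only: returns the resident set size in bytes, or 0 on failure.
inline std::size_t getCurrentRSS() {
  std::ifstream file("/proc/self/statm");
  std::string line;
  if (!file || !std::getline(file, line)) return 0;
  const long page_size = sysconf(_SC_PAGESIZE);
  if (page_size <= 0) return 0;
  return ParseResidentPages(line) * static_cast<std::size_t>(page_size);
}
```

Logging this value before and after each ReleaseUnused() call, as done above, shows which call actually returns memory to the OS.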

rok commented 2 years ago

@Jokser could you say more about the system you're on and how your Arrow was installed/built?

Paging @lidavidm if he has any insights.

Jokser commented 2 years ago

Okay, it seems that the remaining data in RSS is not directly related to Arrow.

Let me describe my data flow:

I have a generated TPC-DS dataset in Parquet format (with a 10/100/1000 GB scale factor). A tool writes the Parquet files into my Arrow Flight-backed server, opening up to 128 connections (concurrent DoPut calls). Each arrow::RecordBatch consumed from a DoPut stream goes through an async transformation to an internal format. The lifetime of each arrow::RecordBatch object is short.

Here is what I see with the 1000 GB dataset: the peak memory usage is ~128 GB. After the tool has finished and all connections are closed, I still see ~128 GB in RSS. When I manually call arrow::system_memory_pool()->ReleaseUnused(), almost all of the memory is freed:

[2022-08-10 13:36:22.402] [se_logger] [info] RSS before default pool release: 133422698496 bytes
[2022-08-10 13:36:22.405] [se_logger] [info] RSS after default pool release: 133424791552 bytes
[2022-08-10 13:36:28.965] [se_logger] [info] RSS after system pool release: 5903564800 bytes

I build gRPC/Protobuf and Arrow from source. Here is a snippet of how we build it: https://gist.github.com/Jokser/268a82428ceb00144519825029a469d7 and here is a snippet of the environment: https://gist.github.com/Jokser/cac5fe5be9a40bebd88a8f222be745bc

lidavidm commented 2 years ago

The pools only track memory allocated by Arrow for the purpose of data buffers, so it would make sense that they mostly show 0. (For Flight, we try to reuse gRPC's allocations, so the memory isn't tracked by pools.) But ReleaseUnused just calls malloc_trim for you, so it works regardless of who actually allocated the memory. If releasing memory in the system pool is freeing up memory, it's quite likely gRPC's allocations (as rok has already mentioned).

From what I see in https://man7.org/linux/man-pages/man3/mallopt.3.html, the trim_threshold may not have an effect if allocations are large enough that malloc uses mmap instead of sbrk. But a manual malloc_trim would still have an effect. That may explain what's going on here.
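
For reference, the same trim threshold can also be set from inside the process with glibc's mallopt instead of the GLIBC_TUNABLES environment variable. This is only a sketch under the assumption that the server runs on glibc; SetTrimThreshold is a hypothetical wrapper name, and as noted above the setting may not help when the relevant allocations are served by mmap.

```cpp
#include <cstdlib>
#if defined(__GLIBC__)
#include <malloc.h>  // mallopt, M_TRIM_THRESHOLD (glibc-specific)
#endif

// Hypothetical wrapper: sets glibc's trim threshold at runtime, the
// programmatic counterpart of
//   GLIBC_TUNABLES=glibc.malloc.trim_threshold=<bytes>
// Returns true on success, false on failure or on non-glibc platforms.
inline bool SetTrimThreshold(int bytes) {
#if defined(__GLIBC__)
  return mallopt(M_TRIM_THRESHOLD, bytes) == 1;
#else
  (void)bytes;  // no equivalent knob assumed on other C libraries
  return false;
#endif
}
```

Calling SetTrimThreshold(524288) early in main(), before the Flight server starts, would mirror the value proposed earlier in the thread.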

If memory usage is a concern, it may be worth keeping a manual malloc_trim at the end of an RPC call/end of the transformation?
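
One way to keep such a manual trim at the end of a call is a small scope guard, sketched below. This assumes glibc; TrimHeap, TrimOnExit, and HandlePut are hypothetical names, and HandlePut only stands in for a real DoPut handler. Calling arrow::system_memory_pool()->ReleaseUnused() in the destructor instead would be the Arrow-level equivalent described above.

```cpp
#include <cstdlib>
#if defined(__GLIBC__)
#include <malloc.h>  // malloc_trim (glibc-specific)
#endif

// Asks the allocator to return free heap memory to the OS.
// Returns true if some memory was released; always false off glibc.
inline bool TrimHeap() {
#if defined(__GLIBC__)
  return malloc_trim(0) == 1;
#else
  return false;
#endif
}

// Hypothetical RAII helper: trims the heap when the enclosing scope ends,
// including on early returns and exceptions.
class TrimOnExit {
 public:
  TrimOnExit() = default;
  TrimOnExit(const TrimOnExit&) = delete;
  TrimOnExit& operator=(const TrimOnExit&) = delete;
  ~TrimOnExit() { TrimHeap(); }
};

// Hypothetical handler body standing in for a DoPut implementation.
inline int HandlePut() {
  TrimOnExit trim_guard;  // trims when HandlePut returns
  // ... read record batches and hand them to the transformation ...
  return 0;
}
```

With up to 128 concurrent calls, trimming after every RPC may be more frequent than needed, so the cost of the trim itself is worth measuring.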

Jokser commented 2 years ago

@lidavidm Thank you for your explanation. Just one question regarding allocations: what if I call malloc_trim only after some period of Arrow Flight inactivity? If I keep the memory unreleased for as long as possible, should that help with subsequent arrow::RecordBatch allocations, e.g. by reducing page faults?

lidavidm commented 2 years ago

It's going to depend on the allocator, but going with the assumption that this is gRPC-allocated memory, it should help (I would assume glibc tries to make use of previous allocations and the docs mention free lists).
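
The idle-time approach discussed here can be sketched as a small policy object that only decides when to trim. IdleTrimPolicy and its methods are hypothetical names, the idle threshold is an arbitrary choice, and the class is not thread-safe as written, so a server with concurrent RPCs would need to guard it with a mutex.

```cpp
#include <chrono>

// Hypothetical policy object: decides when the server has been idle long
// enough to justify trimming. It only makes the decision; the caller performs
// the actual malloc_trim(0) or ReleaseUnused() call.
class IdleTrimPolicy {
 public:
  using Clock = std::chrono::steady_clock;

  explicit IdleTrimPolicy(Clock::duration idle_threshold)
      : idle_threshold_(idle_threshold) {}

  // Call at the start or end of every RPC.
  void RecordActivity(Clock::time_point now) {
    last_activity_ = now;
    trimmed_since_activity_ = false;
  }

  // Call periodically from a timer. Returns true at most once per idle
  // period, so repeated timer ticks do not trigger repeated trims.
  bool ShouldTrim(Clock::time_point now) {
    if (trimmed_since_activity_) return false;
    if (now - last_activity_ < idle_threshold_) return false;
    trimmed_since_activity_ = true;
    return true;
  }

 private:
  Clock::duration idle_threshold_;
  Clock::time_point last_activity_{};
  bool trimmed_since_activity_ = true;  // nothing to trim before the first RPC
};
```

A timer thread would call ShouldTrim(IdleTrimPolicy::Clock::now()) every few seconds and trim only when it returns true, leaving freed memory available for reuse while ingestion is still active.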