apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Parquet] >2GiB Memory Leak on reading single parquet metadata file #44599

Open jonded94 opened 3 weeks ago

jonded94 commented 3 weeks ago

I have a ~1.5TiB Parquet dataset consisting of ~1.7k files, plus an additional _metadata.parquet file containing the metadata of all row groups. The _metadata file was written with the mechanism described in the documentation.

The _metadata file is ~390MiB; the ~1.7k Parquet files are around 900MiB each.
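
For reference, the documented mechanism looks roughly like this (a sketch with illustrative paths and data, per the pattern in the PyArrow docs, not the exact code used to write this dataset):

# Sketch of the _metadata mechanism from the PyArrow docs; the paths and
# table contents are illustrative only.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
metadata_collector = []

# Write the data files, collecting each file's row-group metadata on the way.
pq.write_to_dataset(table, "repartition/", metadata_collector=metadata_collector)

# Write a single metadata-only file aggregating the row-group metadata
# of all data files in the dataset.
pq.write_metadata(
    table.schema, "repartition/_metadata", metadata_collector=metadata_collector
)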

I have a script, read_metadata.py, which iterates through all files in the dataset, reads their metadata and simultaneously measures memory usage (RSS):

import gc
import time
from contextlib import contextmanager
from pathlib import Path

import psutil
import pyarrow
import pyarrow.parquet

process = psutil.Process()

@contextmanager
def profiling(name: str):
    start = time.monotonic()
    start_mem = process.memory_info().rss / 1024**2
    yield
    end = time.monotonic()
    end_mem = process.memory_info().rss / 1024**2
    duration = end - start
    if end_mem - start_mem == 0:
        return
    print(
        f"{name}\n"
        f" took {duration:.5f} s, "
        f"mem diff {end_mem - start_mem:.3f}MiB [start: {start_mem:.3f}MiB, end: {end_mem:.3f}MiB]"
    )

def read_metadata(path: Path) -> None:
    pyarrow.parquet.read_metadata(path)
    return

if __name__ == "__main__":
    import argparse
    import random

    parser = argparse.ArgumentParser()
    parser.add_argument("files", nargs="+")
    parser.add_argument("repeats", type=int)
    args = parser.parse_args()

    paths = args.files
    repeats = args.repeats

    paths = paths * repeats
    random.shuffle(paths)

    for path in paths:
        with profiling(f"load: {path}"):
            read_metadata(path)

        with profiling(f"gc:   {path}"):
            gc.collect()

Doing that gives these results (note that only steps where there is a change in memory load are printed):

$ python scripts/read_metadata.py repartition/* 3
load: repartition/part-52.parquet
 took 0.00132 s, mem diff 1.500MiB [start: 194.281MiB, end: 195.781MiB]
load: repartition/_metadata.parquet
 took 2.38347 s, mem diff 2082.062MiB [start: 195.781MiB, end: 2277.844MiB]
load: repartition/_metadata.parquet
 took 2.15518 s, mem diff 16.094MiB [start: 2277.844MiB, end: 2293.938MiB]
load: repartition/part-1587.parquet
 took 0.00099 s, mem diff 1.500MiB [start: 2293.938MiB, end: 2295.438MiB]
gc:   repartition/part-1299.parquet
 took 0.00217 s, mem diff -1.832MiB [start: 2295.438MiB, end: 2293.605MiB]
load: repartition/_metadata.parquet
 took 2.15186 s, mem diff 0.562MiB [start: 2293.605MiB, end: 2294.168MiB]
gc:   repartition/part-1112.parquet
 took 0.00220 s, mem diff -1.543MiB [start: 2294.168MiB, end: 2292.625MiB]
load: repartition/part-751.parquet
 took 0.00111 s, mem diff 1.500MiB [start: 2292.625MiB, end: 2294.125MiB]
load: repartition/part-321.parquet
 took 0.00113 s, mem diff 1.500MiB [start: 2294.125MiB, end: 2295.625MiB]

(plot omitted: memory usage over the course of the run)

Memory usage stays mostly constant, but as soon as the _metadata file is read, a huge memory leak of over 2GiB appears. Reading that particular file multiple times does not lead to multiple leaks.

There seems to be no way to reduce memory usage back to normal levels again; neither pool = pyarrow.default_memory_pool(); pool.release_unused() nor gc.collect() helps.
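
For completeness, the cleanup I tried looks roughly like this (a sketch; these are existing pyarrow.MemoryPool methods, and whether release_unused() actually gives memory back depends on the allocator backend):

import gc

import pyarrow

pool = pyarrow.default_memory_pool()
print(pool.backend_name)       # which allocator backs the default pool
print(pool.bytes_allocated())  # bytes currently allocated through this pool
print(pool.max_memory())       # high-water mark of the pool

pool.release_unused()          # ask the allocator to return free memory to the OS
gc.collect()                   # also drop any lingering Python-level references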

Component(s)

Parquet, Python

pitrou commented 2 weeks ago

Thanks for the report @jonded94. Two questions: 1) which platform are you running this on? 2) did you try changing the default memory pool? See https://arrow.apache.org/docs/cpp/env_vars.html#envvar-ARROW_DEFAULT_MEMORY_POOL
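
For reference, besides the environment variable the default pool can also be switched programmatically at startup; a rough sketch (set_memory_pool() and the *_memory_pool() helpers are existing pyarrow APIs):

import pyarrow

# Roughly equivalent to running with ARROW_DEFAULT_MEMORY_POOL=system,
# as long as it happens before any allocations are made.
pyarrow.set_memory_pool(pyarrow.system_memory_pool())

# jemalloc/mimalloc variants exist as well, if pyarrow was built with them:
# pyarrow.set_memory_pool(pyarrow.jemalloc_memory_pool())
# pyarrow.set_memory_pool(pyarrow.mimalloc_memory_pool())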

jonded94 commented 1 week ago

which platform are you running this on?

$ uname -a
 6.6.30-flatcar #1 SMP PREEMPT_DYNAMIC Sun May 19 16:12:26 -00 2024 x86_64 GNU/Linux
$ python -c "import pyarrow; print(pyarrow.__version__)"
17.0.0

did you try changing the default memory pool?

"system" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=system python scripts/read_metadata.py repartition/* 3
load: repartition/part-332.parquet
 took 0.00124 s, mem diff 1.500MiB [start: 194.336MiB, end: 195.836MiB]
load: repartition/_metadata
 took 2.41739 s, mem diff 2077.484MiB [start: 195.836MiB, end: 2273.320MiB]
load: repartition/_metadata
 took 2.25331 s, mem diff 400.500MiB [start: 2273.320MiB, end: 2673.820MiB]
load: repartition/part-759.parquet
 took 0.00094 s, mem diff 1.500MiB [start: 2673.820MiB, end: 2675.320MiB]

=> Also a ~2.3GiB leak at the first read, but a second read increases it further to ~2.7GiB.

"jemalloc" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=jemalloc python scripts/read_metadata.py repartition/* 3
load: repartition/part-58.parquet
 took 0.00122 s, mem diff 1.500MiB [start: 198.316MiB, end: 199.816MiB]
load: repartition/_metadata
 took 2.40700 s, mem diff 2081.992MiB [start: 199.816MiB, end: 2281.809MiB]
load: repartition/part-40.parquet
 took 0.00101 s, mem diff 1.500MiB [start: 2281.809MiB, end: 2283.309MiB]
load: repartition/_metadata
 took 2.19111 s, mem diff 17.043MiB [start: 2283.309MiB, end: 2300.352MiB]
gc:   repartition/part-912.parquet
 took 0.00223 s, mem diff -2.000MiB [start: 2300.352MiB, end: 2298.352MiB]
load: repartition/_metadata
 took 2.16370 s, mem diff 0.629MiB [start: 2298.352MiB, end: 2298.980MiB]

"mimalloc" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=mimalloc python scripts/read_metadata.py repartition/* 3
load: repartition/part-887.parquet
 took 0.00150 s, mem diff 5.285MiB [start: 189.285MiB, end: 194.570MiB]
load: repartition/_metadata
 took 2.40356 s, mem diff 2078.855MiB [start: 194.570MiB, end: 2273.426MiB]
load: repartition/_metadata
 took 2.25088 s, mem diff 15.887MiB [start: 2273.426MiB, end: 2289.312MiB]
load: repartition/_metadata
 took 2.23642 s, mem diff -0.391MiB [start: 2289.312MiB, end: 2288.922MiB]

Then I adjusted the statement where I previously just called gc.collect() and added an explicit memory pool cleanup:

        with profiling(f"gc:   {path}"):
            pyarrow.default_memory_pool().release_unused()
            gc.collect()

The following data points were obtained with this adjusted version of the script.

"system" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=system python scripts/read_metadata.py repartition/* 3
load: repartition/part-419.parquet
 took 0.00142 s, mem diff 3.000MiB [start: 192.777MiB, end: 195.777MiB]
gc:   repartition/part-419.parquet
 took 0.00343 s, mem diff -1.695MiB [start: 195.777MiB, end: 194.082MiB]
load: repartition/_metadata
 took 2.40818 s, mem diff 2080.812MiB [start: 194.082MiB, end: 2274.895MiB]
gc:   repartition/_metadata
 took 0.17570 s, mem diff -2079.277MiB [start: 2274.895MiB, end: 195.617MiB]
load: repartition/_metadata
 took 3.15845 s, mem diff 2476.500MiB [start: 195.617MiB, end: 2672.117MiB]
gc:   repartition/_metadata
 took 0.20289 s, mem diff -2481.230MiB [start: 2672.117MiB, end: 190.887MiB]
load: repartition/_metadata
 took 3.10904 s, mem diff 2479.500MiB [start: 190.887MiB, end: 2670.387MiB]
gc:   repartition/_metadata
 took 0.18726 s, mem diff -2479.953MiB [start: 2670.387MiB, end: 190.434MiB]
load: repartition/part-1549.parquet
 took 0.00179 s, mem diff 1.500MiB [start: 190.434MiB, end: 191.934MiB]

=> Here, the memory actually seems to be released to the system again!

"jemalloc" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=jemalloc python scripts/read_metadata.py repartition/* 3
load: repartition/part-426.parquet
 took 0.00138 s, mem diff 3.000MiB [start: 193.805MiB, end: 196.805MiB]
gc:   repartition/part-426.parquet
 took 0.00635 s, mem diff 26.812MiB [start: 196.805MiB, end: 223.617MiB]
load: repartition/_metadata
 took 2.48826 s, mem diff 2082.836MiB [start: 223.617MiB, end: 2306.453MiB]
load: repartition/_metadata
 took 2.23224 s, mem diff 16.172MiB [start: 2306.453MiB, end: 2322.625MiB]
gc:   repartition/part-267.parquet
 took 0.00250 s, mem diff -1.762MiB [start: 2322.625MiB, end: 2320.863MiB]
gc:   repartition/part-839.parquet
 took 0.00231 s, mem diff -1.988MiB [start: 2320.863MiB, end: 2318.875MiB]
load: repartition/_metadata
 took 2.18774 s, mem diff 0.621MiB [start: 2318.875MiB, end: 2319.496MiB]

=> The jemalloc memory pool apparently ignores the explicit cleanup call.

"mimalloc" memory pool (this was increasing & releasing memory on every iteration and spamming stdout, I had to limit output to cases where memory changed by more than 5MiB):

$ ARROW_DEFAULT_MEMORY_POOL=mimalloc python scripts/read_metadata.py repartition/* 3
load: repartition/_metadata
 took 2.37767 s, mem diff 2079.824MiB [start: 193.172MiB, end: 2272.996MiB]
load: repartition/_metadata
 took 2.14551 s, mem diff 17.660MiB [start: 2271.785MiB, end: 2289.445MiB]

=> The memory leak still appears here as well.

pitrou commented 1 week ago

Ok, so what this tells us is that there is no actual memory leak in Arrow (at least in this use case :-)), but the memory allocator may decide to hold onto some memory to make further allocations faster.

Note that the memory is not necessarily unavailable to other applications. For example, the memory allocator might have set these memory areas as MADV_FREE, which lets the kernel free the pages when it needs to, but not necessarily immediately.
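
For example, on Linux one can check how much of a process' memory is in such a lazily-freed state via the LazyFree counter; a sketch, assuming a kernel that exposes /proc/<pid>/smaps_rollup (whether the allocator actually uses MADV_FREE depends on its configuration):

# Sketch: report how much of this process' memory is marked LazyFree,
# i.e. freed with MADV_FREE but still counted until the kernel reclaims it.
def lazyfree_mib(pid: str = "self") -> float:
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            if line.startswith("LazyFree:"):
                return int(line.split()[1]) / 1024  # value is reported in kB
    return 0.0

print(f"LazyFree: {lazyfree_mib():.1f} MiB")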

This all shows how complicated it is to get accurate memory footprint measurements on modern OSes...

jonded94 commented 1 week ago

Our use case requires that the workloads use as little memory as possible and hand it back to the system as soon as possible. If reading a big file suddenly keeps every process at ~2GiB more memory than it ideally needs, that is somewhat of a dealbreaker.

That, by default, none of the memory pools hand memory which is no longer required back to the system could certainly be considered unexpected by some. That, for some reason, only the system memory pool seems to react to .release_unused() calls could even be considered a bug? As a user, it's unclear to me how I would explicitly free that unused memory.

pitrou commented 1 week ago

As I said, RSS can be misleading as it shows some memory to be in use even though the kernel might be able to reuse it if desired (the MADV_FREE case I linked to).

I'm not saying this is exactly what happens here, but it's quite likely given that the system allocator returns the memory more eagerly.

One way to clear this up would be: run this reproducer and keep the Python process running (for example using time.sleep(3600) at the end). Then start another process that tries to use all memory, and see if the Python process' RSS finally shrinks.
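
A crude sketch of such a memory-pressure probe, to be run as a separate process (the chunk size and count are arbitrary and should be adjusted to the host's RAM):

# Allocate and touch memory in 256 MiB chunks so the kernel has to reclaim
# pages from other processes (e.g. MADV_FREE'd ones).
chunks = []
for i in range(64):                         # ~16 GiB in total; adjust to the host
    buf = bytearray(256 * 1024 * 1024)      # 256 MiB
    for offset in range(0, len(buf), 4096):
        buf[offset] = 1                     # touch every page so it is really resident
    chunks.append(buf)
    print(f"allocated {(i + 1) * 256} MiB")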

You may also want to open a question on the https://github.com/microsoft/mimalloc issue tracker; they'll probably be able to give more precise insight. Also, you could experiment with some of the environment variables here: https://microsoft.github.io/mimalloc/environment.html

jonded94 commented 1 week ago

We have a script that launches quite a lot of Python processes that use something very similar to the test script I've shown. Unfortunately, even with 250GiB of RAM available on the specific host, we see OOM errors after a handful of iterations, so it appears that the memory is not entirely free to use, or at least the kernel is getting uneasy.

We then replaced the very few pyarrow methods we used with a custom, in-house, very shallow PyO3 wrapper around the parquet Rust crate, which offers similar functionality. With that, we see constant memory usage regardless of how long the script runs, and we're using only a few dozen GiB in total; that's far less than with the pyarrow implementation.

I know that our use case is pretty specific, but I wanted to share our experience regardless. If my experience is too vague to be of any actual debugging value to you, we can close the issue.

pitrou commented 1 week ago

Ok, I think we're talking about two separate issues here:

  1. the memory consumption when a Parquet metadata file is loaded (the 2 GB in your reproducer)
  2. the fact that said memory consumption seems to remain persistent even when memory is explicitly released

As I said, I think the second issue is unrelated to PyArrow, and you can probably mitigate it either by changing the default memory pool (e.g. switch to "system" if it doesn't reduce performance for you) or try to tune memory allocator options (for example by experimenting with the mimalloc environment variables I linked to).

However, the first issue may perhaps deserve improving in PyArrow. For that, it would be nice if you could compare the memory consumption with PyArrow and with parquet-rs. Also, hopefully you can upload a reproducer file for us to look at?

jonded94 commented 1 week ago

As all of this is unfortunately an internal project, I'm unable to share any specifics, especially files. I'm very sorry! Maybe I can find some time to create synthetic files which show a similar behaviour.

it would be nice if you could compare the memory consumption with PyArrow and with parquet-rs.

I can do that with a similar workload that reads not metadata but actual rows. Unfortunately, this is not strictly connected to the original issue, I'm sorry! But it could nevertheless be interesting to share our results. Sorry that this issue is so meandering.

This workload just consumed RecordBatch instances of size 2250 produced by pyarrow.dataset.Fragment.to_batches (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Fragment.html#pyarrow.dataset.Fragment.to_batches). Nothing was done with them; they were immediately discarded, very similar to the original script I shared.
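
In essence, the workload looked roughly like this (a sketch; the dataset path is illustrative, but Dataset.get_fragments() and Fragment.to_batches(batch_size=...) are existing pyarrow.dataset APIs):

import pyarrow.dataset as ds

dataset = ds.dataset("repartition/", format="parquet")
for fragment in dataset.get_fragments():
    # Batches of 2250 rows are produced and immediately discarded.
    for batch in fragment.to_batches(batch_size=2250):
        pass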

(plot omitted: memory usage over the iterations)

As the plot showed, memory usage just keeps steadily increasing. Running this script in parallel very easily leads to OOM errors (which we were also able to reproduce).

Then we tried creating an entirely new memory pool for every data iteration as a workaround (plot omitted). This helped, but the memory usage is still quite high.

When using the Rust reader for the same workload, memory usage behaved much more predictably and was quite a bit lower (plot omitted).

pitrou commented 1 week ago

As all of this is unfortunately an internal project, I'm unable to share any specifics, especially files. I'm very sorry! Maybe I can find some time to create synthetic files which show a similar behaviour.

Or perhaps you can use the "parquet-dump-footer" executable here; it has an option to scrub any sensitive information from the metadata.