asfimport opened 2 years ago
Will Jones / @wjones127: Hi Gianluca,
There are two conversions happening when reading: first, Parquet data is deserialized into Arrow data; second, Arrow data is converted into Pandas / numpy data. Are you able to narrow down during which conversion memory is increasing?
Gianluca Ficarelli: Hi @wjones127 , sure, I added two additional prints in this function:
def pyarrow_load(filename):
    table = pyarrow.parquet.read_table(filename)
    print_mem(41)
    df = table.to_pandas()
    print_mem(42)
    return df
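The print_mem helper is not shown in the thread; a minimal stand-in (the actual script has psutil available per the pip freeze output and likely reports current RSS, whereas this stdlib-only sketch reports peak RSS) could be:

```python
import resource
import sys
import time

_START = time.monotonic()

def print_mem(label):
    # Peak RSS via the stdlib resource module; note the thread's numbers are
    # likely *current* RSS (psutil-style), while ru_maxrss is the *peak*.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in KB, macOS in bytes.
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    print(f"{label} time: {time.monotonic() - _START:.1f} rss: {peak / divisor:.1f}")
```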
And the result is:
mprof: Sampling memory every 0.1s
running new process
0 time: 0.0 rss: 139.4
1 time: 4.5 rss: 1256.2
2 time: 6.7 rss: 1269.0
3 time: 7.0 rss: 763.0
4 time: 10.3 rss: 763.0
41 time: 11.1 rss: 1674.6
42 time: 18.9 rss: 16751.4
5 time: 18.9 rss: 16751.4
6 time: 21.9 rss: 16341.4
7 time: 22.1 rss: 15838.9
8 time: 25.1 rss: 960.9
So it seems that the big memory increase is happening when calling table.to_pandas()
Will Jones / @wjones127:
That helps narrow it down. Are you able to identify and share the specific data types (table.schema) that seem to be problematic?
Gianluca Ficarelli: Hi @wjones127 , the data in the example are generated by the script itself, to be as simple and reproducible as possible, and the dataframe contains only one column.
In particular I get this:
>>> table.schema
a: list<item: int64>
  child 0, item: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 370
and
print(json.dumps(json.loads(table.schema.metadata[b'pandas']), indent=2))
{
  "index_columns": [
    {
      "kind": "range",
      "name": null,
      "start": 0,
      "stop": 5000000,
      "step": 1
    }
  ],
  "column_indexes": [
    {
      "name": null,
      "field_name": null,
      "pandas_type": "unicode",
      "numpy_type": "object",
      "metadata": {
        "encoding": "UTF-8"
      }
    }
  ],
  "columns": [
    {
      "name": "a",
      "field_name": "a",
      "pandas_type": "list[int64]",
      "numpy_type": "object",
      "metadata": null
    }
  ],
  "creator": {
    "library": "pyarrow",
    "version": "9.0.0"
  },
  "pandas_version": "1.4.3"
}
and
>>> df
a
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
3 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
... ...
4999995 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4999996 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4999997 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4999998 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
4999999 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[5000000 rows x 1 columns]
Does it help?
Will Jones / @wjones127: Sorry, you are right you had a single column there already.
I tried your repro on my M1 Macbook and didn't see the memory usage you are seeing. (This is with mimalloc allocator, but I got similar results with jemalloc and the system allocator.)
Are you able to reproduce this on the latest versions of Pandas and Numpy? And could you confirm your package version numbers?
❯ python test_pyarrow.py
0 time: 0.0 rss: 79.5
1 time: 2.0 rss: 617.1
2 time: 3.4 rss: 1090.6
3 time: 3.7 rss: 633.6
4 time: 6.7 rss: 633.6
5 time: 10.1 rss: 1942.9
6 time: 13.1 rss: 1942.9
7 time: 13.6 rss: 664.8
8 time: 16.6 rss: 664.8
Gianluca Ficarelli: I tried with a fresh virtualenv, on both Linux and Mac (Intel):
Linux (Ubuntu 20.04, 32 GB):
$ python -V
Python 3.9.9
$ pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
$ python test_pyarrow.py
0 time: 0.0 rss: 90.8
1 time: 3.0 rss: 1205.7
2 time: 4.6 rss: 1212.6
3 time: 4.8 rss: 710.0
4 time: 8.0 rss: 708.2
5 time: 14.6 rss: 16652.9
6 time: 17.6 rss: 16242.9
7 time: 17.7 rss: 15743.5
8 time: 20.7 rss: 866.2
Mac (Monterey 12.5, 16 GB):
$ python -V
Python 3.9.9
$ pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
$ python test_pyarrow.py
0 time: 0.0 rss: 64.0
1 time: 4.0 rss: 1075.0
2 time: 6.2 rss: 1136.6
3 time: 6.8 rss: 671.8
4 time: 9.8 rss: 671.8
5 time: 22.9 rss: 2477.4
6 time: 25.9 rss: 2423.4
7 time: 27.1 rss: 180.6
8 time: 30.1 rss: 180.6
but when the same script is retried there is some variability on Mac in lines 5 and 6 (I observed from 1261 to 4140 MB), while on Linux it is always the same (around 16 GB in lines 5, 6, 7).
So it seems that the RSS memory usage is high on Linux only.
Gianluca Ficarelli: I tested on Linux the previous versions of pyarrow, here are the results:
pyarrow==9.0.0
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 90.6
1 time: 2.8 rss: 1205.4
2 time: 4.4 rss: 1212.3
3 time: 4.7 rss: 709.7
4 time: 7.8 rss: 707.9
5 time: 14.4 rss: 16656.0
6 time: 17.4 rss: 16246.0
7 time: 17.5 rss: 15743.6
8 time: 20.5 rss: 866.3
pyarrow==8.0.0
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==8.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 86.2
1 time: 2.8 rss: 1200.9
2 time: 4.3 rss: 2266.2
3 time: 4.6 rss: 1443.6
4 time: 7.7 rss: 703.5
5 time: 14.3 rss: 16648.0
6 time: 17.3 rss: 16238.0
7 time: 17.4 rss: 15738.6
8 time: 20.4 rss: 861.3
pyarrow==7.0.0
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==7.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 84.3
1 time: 2.8 rss: 1199.1
2 time: 4.4 rss: 2263.7
3 time: 4.6 rss: 1441.8
4 time: 7.7 rss: 701.6
5 time: 9.8 rss: 3679.9
6 time: 12.8 rss: 3268.3
7 time: 12.9 rss: 2766.6
8 time: 15.9 rss: 859.2
pyarrow==6.0.1
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.1
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 81.9
1 time: 2.9 rss: 1196.8
2 time: 4.5 rss: 2261.4
3 time: 4.7 rss: 1439.0
4 time: 7.8 rss: 698.9
5 time: 9.2 rss: 2224.0
6 time: 12.2 rss: 1740.4
7 time: 12.3 rss: 1238.1
8 time: 15.3 rss: 856.6
pyarrow==6.0.0
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 81.7
1 time: 2.9 rss: 1196.6
2 time: 4.5 rss: 2261.1
3 time: 4.7 rss: 1438.5
4 time: 7.8 rss: 698.4
5 time: 9.2 rss: 2224.9
6 time: 12.2 rss: 1740.1
7 time: 12.3 rss: 1237.7
8 time: 15.3 rss: 856.2
pyarrow==5.0.0
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==5.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 79.2
1 time: 2.8 rss: 1194.0
2 time: 4.3 rss: 2258.3
3 time: 4.5 rss: 1436.2
4 time: 7.7 rss: 696.1
5 time: 9.1 rss: 2221.1
6 time: 12.1 rss: 1736.3
7 time: 12.2 rss: 1235.0
8 time: 15.3 rss: 853.5
So:
I tried the original script again with pyarrow 9.0.0 and 12.0.0. pyarrow 12 is still slower than pyarrow <= 7 (18 seconds instead of 15-16), but it seems that the memory issue was solved in one of versions 10, 11, or 12.
pip freeze
numpy==1.24.3
pandas==2.0.1
psutil==5.9.5
pyarrow==12.0.0
python-dateutil==2.8.2
pytz==2023.3
six==1.16.0
tzdata==2023.3
python try_parquet.py
0 time: 0.0 rss: 122.0
1 time: 3.8 rss: 1230.9
2 time: 6.0 rss: 1251.1
3 time: 6.4 rss: 752.1
4 time: 9.6 rss: 750.9
5 time: 11.8 rss: 2208.7
6 time: 14.8 rss: 1798.7
7 time: 15.0 rss: 1303.3
8 time: 18.0 rss: 921.3
- pyarrow 9.0.0 (for reference)
pip freeze
numpy==1.24.3
pandas==2.0.1
psutil==5.9.5
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2023.3
six==1.16.0
tzdata==2023.3
python try_parquet.py
0 time: 0.0 rss: 86.7
1 time: 3.8 rss: 1195.6
2 time: 5.9 rss: 1214.5
3 time: 6.3 rss: 714.2
4 time: 9.5 rss: 714.2
5 time: 17.5 rss: 16674.7
6 time: 20.5 rss: 16264.7
7 time: 20.7 rss: 15769.2
8 time: 23.7 rss: 891.2
@mapleFU: Would you mind providing a flamegraph for the benchmark, so that I can see whether the time cost of each part is OK?
Gianluca Ficarelli: Hi @mapleFU , I re-executed the initial script with pyarrow 12.0.0 and 6.0.1. I used py-spy and memory-profiler to produce the attached graphs.
pip freeze # pyarrow==12.0.0
memory-profiler==0.61.0
numpy==1.24.3
pandas==2.0.1
psutil==5.9.5
py-spy==0.3.14
pyarrow==12.0.0
python-dateutil==2.8.2
pytz==2023.3
six==1.16.0
tzdata==2023.3
pip freeze # pyarrow==6.0.1
memory-profiler==0.61.0
numpy==1.24.3
pandas==2.0.1
psutil==5.9.5
py-spy==0.3.14
pyarrow==6.0.1
python-dateutil==2.8.2
pytz==2023.3
six==1.16.0
tzdata==2023.3
python try_arrow.py # pyarrow==12.0.0
0 time: 0.0 rss: 112.0
1 time: 3.8 rss: 1220.9
2 time: 6.0 rss: 1241.1
3 time: 6.4 rss: 742.2
4 time: 9.7 rss: 740.9
5 time: 11.8 rss: 2194.6
6 time: 14.8 rss: 1784.6
7 time: 15.0 rss: 1289.3
8 time: 18.0 rss: 907.3
python try_arrow.py # pyarrow==6.0.1
0 time: 0.0 rss: 78.7
1 time: 3.8 rss: 1187.1
2 time: 5.9 rss: 1949.4
3 time: 6.3 rss: 732.2
4 time: 9.6 rss: 710.2
5 time: 11.8 rss: 2171.3
6 time: 14.8 rss: 1757.7
7 time: 15.0 rss: 1261.5
8 time: 18.0 rss: 879.5
py-spy record --native --subprocesses python test_pyarrow.py
mprof run --multiprocess python test_pyarrow.py
When a pandas dataframe is loaded from a parquet file using pyarrow.parquet.read_table, the memory usage may grow a lot more than what should be needed to load the dataframe, and it is not freed until the dataframe is deleted. The problem is evident when the dataframe has a column containing lists or numpy arrays, while it seems absent (or not noticeable) if the column contains only integers or floats.
I'm attaching a simple script to reproduce the issue, and a graph created with memory-profiler showing the memory usage.
In this example, the dataframe created with pandas needs around 1.2 GB, but the memory usage after loading it from parquet is around 16 GB.
The items of the column are created as numpy arrays and not lists, to be consistent with the types loaded from parquet (pyarrow produces numpy arrays and not lists).
Run with memory-profiler:
Output:
Environment: linux
Reporter: Gianluca Ficarelli
Watchers: Rok Mihevc / @rok
Original Issue Attachments:
Note: This issue was originally created as ARROW-17399. Please see the migration documentation for further details.