geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

Add param arrow_to_pandas_kwargs to read_dataframe + decrease memory usage #273

Closed · theroggy closed this 1 year ago

theroggy commented 1 year ago

Related to #262 and resolves #241

jorisvandenbossche commented 1 year ago

There are a lot of options in the arrow -> pandas to_pandas conversion that users might want to tweak. So maybe a more general solution is to provide a way to pass through keywords to to_pandas.

For the split_blocks keyword specifically, I would prefer to follow pyarrow's default (though maybe pyarrow should consider changing that default).

It should indeed be possible to always pass self_destruct. Although I wonder how much difference that makes: the table variable goes out of scope anyway when leaving the function, so I would expect it to get cleaned up that way.
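For context, split_blocks and self_destruct are keyword arguments of pyarrow's Table.to_pandas. A minimal sketch of what they do (the table below is a stand-in, not pyogrio's internal code):

import pyarrow as pa

# Stand-in for the Arrow table that pyogrio gets back from OGR.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# split_blocks=True keeps one pandas block per column instead of consolidating
# same-dtype columns into one large 2D block; self_destruct=True lets pyarrow
# free the Arrow buffers while the conversion runs, lowering peak memory.
df = table.to_pandas(split_blocks=True, self_destruct=True)
del table  # after self_destruct the table must not be used again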

theroggy commented 1 year ago

> For the split_blocks keyword specifically, I would prefer to follow pyarrow's default (though maybe pyarrow should consider changing that default).
>
> It should indeed be possible to always pass self_destruct. Although I wonder how much difference that makes: the table variable goes out of scope anyway when leaving the function, so I would expect it to get cleaned up that way.

I did a quick test to see the impact of this change on the peak memory usage of read_dataframe. I restarted the Python process between each test, to avoid any influence from the order in which the tests were run.

Obviously it is just one specific case, so I'm not sure whether it is representative for other files, but it is better to have one test than no test :-).

I used the following script/file, as it was the motivation for this change: the script crashed with memory errors on my laptop, so I ran the test on a more capable desktop machine. The file read is an OpenStreetMap (.pbf) file of 540 MB with 5,732,130 rows and 26 columns.

import psutil
import pyogrio

# 540 MB OpenStreetMap extract; the remote file is read directly by GDAL/OGR.
url = "https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
pgdf = pyogrio.read_dataframe(url, use_arrow=True, sql="SELECT * FROM multipolygons")
# On Windows, memory_info() also reports peak_wset (the peak working set of the process).
print(psutil.Process().memory_info())

Results:

# split_blocks=True,  self_destruct=True:  peak_wset=9,296,629,760  -> 9.3 GB
# split_blocks=False, self_destruct=True:  peak_wset=10,430,320,640 -> 10.4 GB
# split_blocks=False, self_destruct=False: peak_wset=12,530,679,808 -> 12.5 GB

@jorisvandenbossche do you see any disadvantages to using split_blocks=True? It does seem to make a measurable difference in peak memory usage.

theroggy commented 1 year ago

> There are a lot of options in the arrow -> pandas to_pandas conversion that users might want to tweak. So maybe a more general solution is to provide a way to pass through keywords to to_pandas.

Something like an extra parameter for read_dataframe, e.g. arrow_to_pandas_kwargs: Dict[str, Any], that we pass on to the to_pandas call?

I gave it a try like that in the latest commit.
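With the new parameter, a call would look roughly like this (a sketch; the file path is a placeholder and the keyword values are just the ones tested above):

import pyogrio

# Any keyword accepted by pyarrow's Table.to_pandas can be passed through.
gdf = pyogrio.read_dataframe(
    "example.gpkg",
    use_arrow=True,
    arrow_to_pandas_kwargs={"split_blocks": True, "self_destruct": True},
)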

jorisvandenbossche commented 1 year ago

Sorry for the late reply, and thanks for the update!

> do you see any disadvantages to using split_blocks=True? It does seem to make a measurable difference in peak memory usage.

For typical usage, I don't expect much difference, but with many columns there can be a benefit to having consolidated columns in pandas (so for actually benchmarking the impact, you would also need to consider potential follow-up operations on the pandas DataFrame). Personally I think we should consider switching the default, but I would prefer to follow pyarrow on this for consistency, and see if we want to change it on the pyarrow side.
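To illustrate the block-layout difference being discussed (not from this thread; _mgr.nblocks is a private pandas attribute, used here only for inspection):

import pyarrow as pa

# Five float64 columns: by default they are consolidated into one 2D block,
# with split_blocks=True each column keeps its own block.
table = pa.table({f"col{i}": pa.array(range(1_000), type=pa.float64()) for i in range(5)})

consolidated = table.to_pandas()
split = table.to_pandas(split_blocks=True)

print(consolidated._mgr.nblocks)  # typically 1
print(split._mgr.nblocks)         # typically 5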

theroggy commented 1 year ago

OK, I removed the override of the split_blocks param.