Closed — theroggy closed this 1 year ago
There are a lot of options in the arrow -> pandas `to_pandas` conversion that users might want to tweak. So maybe a more general solution is to provide a way to pass through keywords for `to_pandas`.
For this `split_blocks` keyword explicitly, I would prefer to follow the default of pyarrow (maybe pyarrow should consider changing that default, though).

The `self_destruct` should indeed be possible to always call. Although I wonder how much difference that makes, since the `table` variable goes out of scope anyway when leaving the function, so I would expect that it should get cleaned up that way.
I did a quick test to see the impact of this change on the peak memory usage of `read_dataframe`. I restarted the Python process between each test, to avoid any influence from the order in which I ran them. Obviously it is just one specific case, so I'm not sure how comparable it is for other files, but it is better to have one test than no test :-).

I used the following script/file, as it was the motive for the change: the script crashed with memory errors on my laptop, so I ran the test on a real computer. The file read is an OpenStreetMap (.pbf) file of 540 MB with 5,732,130 rows and 26 columns.
```python
import psutil
import pyogrio

url = "https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
pgdf = pyogrio.read_dataframe(url, use_arrow=True, sql="SELECT * FROM multipolygons")
print(psutil.Process().memory_info())
```
Results:

```
# split_blocks=True,  self_destruct=True:  peak_wset=9296629760  -> 9.3 GB
# split_blocks=False, self_destruct=True:  peak_wset=10430320640 -> 10.4 GB
# split_blocks=False, self_destruct=False: peak_wset=12530679808 -> 12.5 GB
```
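Note that `peak_wset` is a Windows-specific field of `psutil`'s `memory_info()`. For anyone repeating this test on Linux or macOS, a sketch of the equivalent peak-memory readout using only the standard library (Unix-only; the unit of `ru_maxrss` differs per platform):

```python
import sys
import resource  # stdlib, Unix-only; on Windows use psutil's peak_wset instead

# ru_maxrss is the process's peak resident set size since startup:
# reported in kilobytes on Linux, but in bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
bytes_peak = peak if sys.platform == "darwin" else peak * 1024
print(f"peak RSS: {bytes_peak / 1e9:.2f} GB")
```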
@jorisvandenbossche do you see any disadvantages to using `split_blocks=True`? It does seem to make a measurable difference in peak memory usage.
> There are a lot of options in the arrow -> pandas `to_pandas` conversion that users might want to tweak. So maybe a more general solution is to provide a way to pass through keywords for `to_pandas`.
Something like an extra parameter for `read_dataframe`, called e.g. `arrow_to_pandas_kwargs: Dict[str, Any]`, that we can pass to the `to_pandas` call?
I gave it a try like that in the latest commit.
Sorry for the late reply, and thanks for the update!
> do you see any disadvantages to using `split_blocks=True`? It does seem to make a measurable difference in peak memory usage.
For typical usage I don't expect much difference, but with many columns there can be a benefit to having consolidated columns in pandas (so for actually benchmarking the impact, you would also need to consider potential follow-up operations on the pandas DataFrame). Now, I personally think we should consider switching the default, but I would prefer to follow pyarrow on this for consistency, and see if we want to change it on the pyarrow side.
OK, I removed the override of the `split_blocks` param.
Adds a way to pass keywords through to `to_pandas`, to influence e.g. the way data is returned. According to the arrow documentation, passing `split_blocks=True` and `self_destruct=True` should decrease the peak memory usage of `to_pandas` in some cases. In `read_dataframe` it seems more logical to use those defaults: https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas

Related to #262 and resolves #241