geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
258 stars 22 forks source link

DO NOT MERGE: Refactor dataframe where tests, reduce IF conditional compilation blocks #386

Closed brendan-ward closed 2 days ago

brendan-ward commented 2 months ago

Split up dataframe read with where test to try and probe at #384

Also tries to remove IF blocks marked as deprecated by Cython, except for those still needed for GDAL version specific functions.

brendan-ward commented 2 months ago

After splicing in changes from #349 to correctly release the Arrow stream, I'm no longer able to easily reproduce the segfaults on repeated reruns of CI. Given that these were intermittent, it's hard to say for certain, but per the GDAL docs, we should have been releasing the stream:

On successful return, and when the stream interfaces is no longer needed, it must must be freed...

brendan-ward commented 2 months ago

Clarification, it seems that the stream.release is not being called (is NULL) - presumably because the stream was used in the pyarrow reader? But free(stream) is getting called. Not sure why either of these would avoid an unpredictable segfault, it seems more likely that initializing the .release method to NULL on the ArrowArrayStream prevents calling a random pointer (per the comment), even though the .release method was not defined prior to these changes (maybe something expecting that interface was expecting a .release anyway).

jorisvandenbossche commented 2 months ago

I merged https://github.com/geopandas/pyogrio/pull/349, so let's see if the failures now disappear on main and other PRs.

Although to be honest I also don't fully understand how the changes would fix the crashes ..

Clarification, it seems that the stream.release is not being called (is NULL) - presumably because the stream was used in the pyarrow reader?

Indeed, that's the idea. When we pass the struct to pyarrow, they will "move" it and take ownership of it (i.e. being responsible of releasing it when done reading the data) by setting the release callback to NULL, to ensure we don't call that anymore. (https://arrow.apache.org/docs/dev/format/CDataInterface.html#moving-an-array)

jorisvandenbossche commented 2 months ago

Hmm, still a crash on the merge commit on main .. https://github.com/geopandas/pyogrio/actions/runs/8718790747/job/23916758253