evetion / GeoDataFrames.jl

Simple geographical vector interaction built on top of ArchGDAL
https://www.evetion.nl/GeoDataFrames.jl/dev/
MIT License
67 stars 6 forks source link

Add transaction support to `write`. #72

Closed evetion closed 2 months ago

evetion commented 2 months ago

Supersedes #69, but made @maxfreu a co-author on this PR.

Apart from formatting, the relevant bits are here: feat/use-transactions?expand=1#diff-db104fa7ac

From inspection of the GDAL code, and the code in #69, it seems copy is not necessary at all. Direct creation of a layer is possible, but only using addfeature (adding it), as createfeature actually tries to overwrite an existing one (and can't find one) when the layer is linked to a dataset. 🤷🏻

Also adds use of ogr_f_setgeomfielddirectly, which prevents making a copy of the geometry. Overall this code should be faster and use less memory. @maxfreu Can you confirm a 30x speed improvement with this code as well?

evetion commented 2 months ago

Also note that I used GDALDatasetStartTransaction on the dataset level (https://gdal.org/api/raster_c_api.html#gdal_8h_1a57e354633ef531d521b674b4e5321369) instead of the OGR_L_StartTransaction at the layer level, as recommended by GDAL.

maxfreu commented 2 months ago

Supersedes https://github.com/evetion/GeoDataFrames.jl/pull/69, but made @maxfreu a co-author on this PR.

I'm fine with that :)

Overall, this seems to be a solid improvement. When I wrote the initial code, I just copied the older stuff from ArchGDAL, with all its flaws. The main problem for me back then was lacking documentation and the cumbersome API of (Arch)GDAL, so the outcome was sub-optimal.

Also adds use of ogr_f_setgeomfielddirectly, which prevents making a copy of the geometry. Overall this code should be faster and use less memory.

Very good, I think these changes must be upstreamed to ArchGDAL.

I used GDALDatasetStartTransaction on the dataset level

Yeah, I read the documentation just yesterday :sweat_smile:

Can you confirm a 30x speed improvement with this code as well?

I'll benchmark it and let you know.

The only nitpick: If we want really solid code, then we should also include transaction rollback.

maxfreu commented 2 months ago

I tested both implementations with a dataframe containing ~38k polygons with ids and took several timings. I'm surprised, but your code is consistently slower than the old version and also allocates more:

# old
julia> @time GeoDataFrames.write("/tmp/deleteme.sqlite", df; use_gdal_copy=false)
  0.286841 seconds (655.66 k allocations: 13.266 MiB)
"/tmp/deleteme.sqlite"

# new
julia> @time GeoDataFrames.write("/tmp/deleteme.sqlite", df)
  0.439611 seconds (3.39 M allocations: 99.919 MiB, 3.43% gc time)
"/tmp/deleteme.sqlite"
evetion commented 2 months ago

I tested both implementations with a dataframe containing ~38k polygons with ids and took several timings. I'm surprised, but your code is consistently slower than the old version and also allocates more:

Always good to profile. So I forgot to shortcut the _convert method, otherwise we always convert ArchGDAL geometries. Now it requires just one unsafe_copy. I now have less allocations, and twice as fast code for writing a 7k polygon geopackage.

old: 0.152208 seconds (271.71 k allocations: 4.585 MiB) this PR: 0.062026 seconds (174.90 k allocations: 3.108 MiB)

maxfreu commented 2 months ago

I can confirm the improvement; for me it went down from 0.44s to 0.2s, which is about 1.8 times faster than my code and 18 times faster than master. Looking forward to the merge :)

maxfreu commented 2 months ago

Whoop whoop :tada: Thanks! :)