geofileops / geobenchmark

Benchmarks for some spatial python libraries
BSD 3-Clause "New" or "Revised" License

Environment details #5

Open martinfleis opened 2 years ago

martinfleis commented 2 years ago

Hi, can you share a bit about the environment you used to test geopandas? I am especially interested in pygeos - is it installed? If not, I would recommend trying that as you will likely get a massive speedup.

Another suggestion would be to use dask-geopandas instead of vanilla geopandas, but I guess you already thought about that :). Since the timing also includes IO, you can also try replacing geopandas.read_file with pyogrio.read_dataframe.
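For the environment question above, a minimal sketch of how the relevant package versions could be collected with only the standard library. The package list and the function name `installed_versions` are illustrative assumptions, not part of the benchmark code.

```python
from importlib import metadata

# Packages mentioned in this thread; the list is an assumption, extend as needed.
PACKAGES = ("geopandas", "pygeos", "shapely", "pyogrio", "dask-geopandas")


def installed_versions(packages=PACKAGES):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:>15}: {version or 'not installed'}")
```

geopandas also ships `geopandas.show_versions()`, which prints a fuller report including the underlying C libraries.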

theroggy commented 2 years ago

Hi, can you share a bit about the environment you used to test geopandas? I am especially interested in pygeos - is it installed? If not, I would recommend trying that as you will likely get a massive speedup.

Sure. The tests were run with pygeos installed...

Another suggestion would be to use dask-geopandas instead of vanilla geopandas, but I guess you already thought about that :).

Yes! I saw a few hours ago that dask-geopandas 0.1.0 was released, so I already started on a dask-geopandas version :-).

Since the timing also includes IO, you can also try replacing geopandas.read_file with pyogrio.read_dataframe.

I did some tests using pyogrio and it indeed makes a huge difference, especially for the writing part. I didn't use it in the benchmark because, in the tests I did, it didn't feel production-ready yet, and the integration in geopandas isn't ready yet either. However, because the difference is so big and this is really interesting to know, I'll add a version of the geopandas benchmark that uses pyogrio for IO...

theroggy commented 2 years ago

I added a version of geopandas using pyogrio for I/O... and as expected the time spent on the IO part was reduced significantly. Especially for the buffer operation this gives a huge difference, as the operation itself takes very little time...

For dissolve, for example, the impact is obviously smaller, as that operation needs a lot more processing time.
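To make the split between IO time and processing time visible, a benchmark can time each stage separately. The following is only a sketch under assumptions: `timed_benchmark` and its parameters are hypothetical names, not the harness actually used in geobenchmark.

```python
import time


def timed_benchmark(read, operation, write, source, destination):
    """Run read -> operation -> write and return the seconds spent in each stage."""
    timings = {}

    start = time.perf_counter()
    data = read(source)
    timings["read"] = time.perf_counter() - start

    start = time.perf_counter()
    result = operation(data)
    timings["operation"] = time.perf_counter() - start

    start = time.perf_counter()
    write(result, destination)
    timings["write"] = time.perf_counter() - start

    timings["total"] = timings["read"] + timings["operation"] + timings["write"]
    return timings


# Possible usage with geopandas and pyogrio (not executed here, paths are placeholders):
#   timed_benchmark(
#       read=pyogrio.read_dataframe,          # or geopandas.read_file
#       operation=lambda gdf: gdf.assign(geometry=gdf.geometry.buffer(1)),
#       write=pyogrio.write_dataframe,        # or lambda gdf, path: gdf.to_file(path)
#       source="input.gpkg",
#       destination="output.gpkg",
#   )
```

With per-stage timings, a cheap operation such as buffer shows IO dominating the total, while a heavier one such as dissolve shows the operation stage dominating.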

I also added a benchmark for dask-geopandas, but at the moment only for buffer. I haven't ever used dask, so I'll need to figure out how best to use it for the more interesting cases... I saw in the manual of dask-geopandas that there is a specific section about dissolve... so I'll have to have a look at that...

Some remarks regarding the dask-geopandas buffer benchmark:

theroggy commented 2 years ago

I added dissolve benchmarks for dask-geopandas now as well, but the results aren't great. Based on the documentation on how it works under the hood it is probably also "normal" that it isn't faster than vanilla geopandas as the unary_union operation is applied twice on all geometries, which is quite costly. Or... I did something stupid in the implementation, that's obviously also possible ;-).

Am I right that overlay operations aren't supported yet? In that case I can't implement the intersect benchmark yet...

martinfleis commented 2 years ago

Yes, dissolve is not always faster; that is why the documentation includes an extra snippet offering a faster option for in-memory data: https://dask-geopandas.readthedocs.io/en/stable/guide/dissolve.html#alternative-solution.

The dissolve implementation in dask-geopandas is designed to be scalable and distributable, so it can work out-of-core but at the cost of performance in some situations.
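The "union applied twice" cost mentioned earlier in the thread can be illustrated with a toy two-stage dissolve. This is a conceptual sketch only, not the dask-geopandas implementation: 1-D intervals stand in for geometries, `union_intervals` stands in for unary_union, and all function names are hypothetical.

```python
from collections import defaultdict


def union_intervals(intervals):
    """Merge overlapping (start, end) intervals; a 1-D stand-in for unary_union."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def dissolve(rows):
    """Group (key, interval) rows by key and union each group."""
    groups = defaultdict(list)
    for key, interval in rows:
        groups[key].append(interval)
    return {key: union_intervals(parts) for key, parts in groups.items()}


def two_stage_dissolve(partitions):
    """Dissolve every partition on its own, then dissolve the partial results."""
    partial_rows = []
    for rows in partitions:
        # Stage 1: union within a single partition (can run out-of-core or in parallel).
        for key, merged in dissolve(rows).items():
            partial_rows.extend((key, interval) for interval in merged)
    # Stage 2: union the partial results across all partitions.
    return dissolve(partial_rows)


partitions = [
    [("a", (0, 2)), ("a", (1, 3)), ("b", (10, 11))],
    [("a", (2, 5)), ("b", (20, 21))],
]
result = two_stage_dissolve(partitions)
```

The result matches a single in-memory dissolve over all rows, but every group passes through the union step twice, which is the price paid for never needing all the data in one place.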

Am I right that overlay operations aren't supported yet?

Correct: overlay itself is not implemented, but the predicates and the operations themselves are.

theroggy commented 2 years ago

Dissolve is indeed a bitch to get fast + scalable. Took me a while in geofileops as well to get it (a bit) right.

For overlays, I saw that clip does exist already... but when I try it on my test case it always crashes due to lack of memory?