geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Use case: load a large shapefile / geographical filter / handle with geopandas #39

Open gansanay opened 3 years ago

gansanay commented 3 years ago

Hi,

My use case is to study the population density distribution in the neighborhood of a given point. The source data is the nationwide shapefile for the 200x200m grid of France: https://www.insee.fr/fr/statistiques/4176290?sommaire=4176305

What I was doing with a smaller shapefile (1000x1000m grid) was:

- load the entire shapefile
- project to a planar CRS
- get the centroids for each polygon
- reproject to a geodetic CRS
- keep polygons with the nearest centroids (maybe 12-16 of them)

With the more precise grid, it doesn't fit in memory and I can't finish the first step. I would like to do those first loading / reducing steps in dask-geopandas before continuing with geopandas.
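
For reference, a minimal sketch of that plain-geopandas workflow for the smaller grid could look like the following; the file name, CRS codes and the final filtering step are illustrative, not taken from the original post:

import geopandas as gpd

grid = gpd.read_file("carreaux_1000m_metropole.shp")  # hypothetical file name
grid_planar = grid.to_crs(epsg=2154)                  # Lambert-93, a planar CRS for France
centroids = grid_planar.geometry.centroid.to_crs(epsg=4326)  # back to a geodetic CRS
# ... then keep the 12-16 grid cells whose centroid is nearest to the target point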

Let me know if you already see a smart solution or how I can help!

Guillaume

martinfleis commented 3 years ago

Hi, I think that dask-geopandas may help you with it.

load the entire shapefile

File IO still needs to be implemented. There's a proof of concept in #11. You can either write custom code to do that for your case or try to implement fiona-based IO in dask-geopandas (following @jorisvandenbossche's example). The latter is of course preferable :).
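
A minimal sketch of such custom code, assuming the fiona-based engine, the rows=slice(...) argument of geopandas.read_file, and that importing dask_geopandas registers GeoDataFrame with dask's dataframe dispatch (file name and chunk size are hypothetical):

import dask
import dask.dataframe as dd
import dask_geopandas  # noqa: F401 - should register geopandas types with dask's dispatch
import fiona
import geopandas as gpd

path = "carreaux_200m_metropole.shp"  # hypothetical file name
chunksize = 100_000

with fiona.open(path) as src:
    n_rows = len(src)

# one delayed geopandas.read_file call per chunk of rows
parts = [
    dask.delayed(gpd.read_file)(path, rows=slice(start, start + chunksize))
    for start in range(0, n_rows, chunksize)
]
meta = gpd.read_file(path, rows=1)  # dask only needs the schema
ddf = dd.from_delayed(parts, meta=meta)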

project to a planar CRS
get the centroids for each polygon
reproject to a geodetic CRS

Both to_crs and centroid are implemented and should work as you know them from geopandas.
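
For example, a minimal sketch assuming ddf is a dask-geopandas GeoDataFrame of the grid in EPSG:4326 (the CRS codes are illustrative):

ddf_planar = ddf.to_crs(epsg=2154)           # Lambert-93, a planar CRS for France
centroids = ddf_planar.geometry.centroid     # lazy, computed per partition
centroids_geo = centroids.to_crs(epsg=4326)  # back to a geodetic CRS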

keep polygons with the nearest centroids (maybe 12-16 of them)

This may be tricky. How do you do that for the smaller data? Ideally, this would require a spatial index on top of the dask GeoDataFrame, which is not yet implemented (a bit of the discussion is in #8). Also, you will need spatially partitioned data to make this operation efficient enough (again #8).
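
One possible brute-force workaround, sketched here only for illustration, is to reduce per partition and then globally: keep the k nearest rows in each partition, compute, and take the k nearest of those candidates in plain pandas. This assumes ddf is already in a planar CRS and target is a shapely point in the same CRS (names and coordinates are hypothetical):

from shapely.geometry import Point

def k_nearest_in_partition(df, target, k=16):
    # distance from each cell's centroid to the target point, within one partition
    dist = df.geometry.centroid.distance(target)
    return df.assign(dist=dist).nsmallest(k, "dist")

target = Point(651000, 6860000)  # hypothetical coordinates in the planar CRS
candidates = ddf.map_partitions(k_nearest_in_partition, target, k=16).compute()
nearest = candidates.nsmallest(16, "dist")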

Just out of curiosity - since you are working with a grid, wouldn't it actually be easier to convert it to a raster and use something like xarray-spatial?

gansanay commented 3 years ago

Thank you, I will look into these issues.

Right now I am filtering the nearest centroids using sklearn.neighbors.BallTree with geopy.distance.geodesic as the metric, which is horribly slow considering I am just looking for points closer than a certain radius to my target point.
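
A minimal sketch of that radius filter, swapping the geopy.distance.geodesic callable for scikit-learn's built-in haversine metric (which works on radians and avoids a Python callback per pair); centroids and grid are assumed names for the centroid GeoSeries in EPSG:4326 and the grid GeoDataFrame, and the target point and radius are illustrative:

import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000  # haversine distances come back in radians

coords = np.radians(np.column_stack([centroids.y, centroids.x]))  # (lat, lon)
tree = BallTree(coords, metric="haversine")

target = np.radians([[48.8566, 2.3522]])  # hypothetical target point (lat, lon)
radius_m = 400
idx = tree.query_radius(target, r=radius_m / EARTH_RADIUS_M)[0]
nearby = grid.iloc[idx]  # grid cells whose centroid lies within the radius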

Concerning your last question... you probably know much more than I do and it's a great learning opportunity for me!

Zeroto521 commented 3 years ago

I'm also looking for ways to improve geopandas IO performance.

Geopandas can take far too long when reading or writing millions of features.

Dipendra2024 commented 1 week ago

Hi, is there any update on this issue? I've also found the reading and writing times with GeoPandas to be quite slow and time-consuming. I reviewed the documentation but couldn't find any ways to speed up the process of reading large shapefiles.

theroggy commented 1 week ago

Hi, is there any update on this issue? I've also found the reading and writing times with GeoPandas to be quite slow and time-consuming. I reviewed the documentation but couldn't find any ways to speed up the process of reading large shapefiles.

If I understand you correctly, you are just talking about the speed of reading and writing, not about being able to process files larger than memory? If that is the case, there is news: since geopandas 1.0, the default engine used to read and write data is pyogrio, which should already be a lot faster. There is an even faster mode in pyogrio that is still "experimental" and can be activated by passing use_arrow=True to read_file and to_file.

Something like this:

import geopandas as gpd

gdf = gpd.read_file(path, use_arrow=True)
gdf.to_file(path, use_arrow=True)

Dipendra2024 commented 1 week ago

Sure, I will give it a try. Thank you a lot! One more thing: I have an issue with dissolve. I have around 400K polygons with holes in them. I am trying to fill the holes, but it is taking ages, more than 48 hours. :)
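
For the hole filling specifically, a minimal sketch with plain shapely/geopandas (assuming gdf holds the ~400K polygons and the geometry column has the default name "geometry"):

from shapely.geometry import MultiPolygon, Polygon

def fill_holes(geom):
    """Rebuild the geometry from its exterior rings only, dropping interior rings."""
    if isinstance(geom, Polygon):
        return Polygon(geom.exterior)
    if isinstance(geom, MultiPolygon):
        return MultiPolygon([Polygon(p.exterior) for p in geom.geoms])
    return geom

gdf["geometry"] = gdf.geometry.apply(fill_holes)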

theroggy commented 1 week ago

Sure, I will give it a try. Thank you a lot! One more thing: I have an issue with dissolve. I have around 400K polygons with holes in them. I am trying to fill the holes, but it is taking ages, more than 48 hours. :)

If you don't mind trying another package... you can check out geofileops. It is a package I wrote to speed up processing (dissolve, overlays, ...) of large geo files. geofileops.dissolve specifically uses geopandas/shapely under the hood but applies the dissolve in a "tiled" way... and is often (depending on your data) orders of magnitude faster.
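
A minimal sketch of such a tiled dissolve (the file names are hypothetical, and the parameter names should be checked against the geofileops documentation for your version):

import geofileops as gfo

gfo.dissolve(
    input_path="polygons.gpkg",    # hypothetical input file
    output_path="dissolved.gpkg",  # hypothetical output file
    explodecollections=True,       # write single polygons instead of one big multipolygon
)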

Dipendra2024 commented 1 week ago

Is it compatible with dask?

theroggy commented 1 week ago

Is it compatible with dask?

Not sure what you mean by that, but it doesn't use dask... it uses plain multiprocessing.

Dipendra2024 commented 1 week ago

I am specifically looking for Dask-GeoPandas options. Anyway, thanks! I can give it a try when I have some spare time to experiment.

martinfleis commented 1 week ago

I have an issue with dissolve. I have around 400K polygons with holes in them. I am trying to fill the holes, but it is taking ages, more than 48 hours

That does not sound right. Can you open a new issue and outline the way you are trying to do it? I am sure there are far more efficient ways.

Dipendra2024 commented 1 week ago

I have now managed to speed up the dissolve and infill; however, compute() is now taking more than 12 hours. I did this: dd['geometry'] = dissolve_results and then gpd = dd.compute(). Is there any way to speed it up?

martinfleis commented 1 week ago

@Dipendra2024 Please open a new issue and paste your complete code there to discuss it.