gansanay opened this issue 3 years ago
Hi, I think that dask-geopandas may help you with it.
> load the entire shapefile
File IO still needs to be implemented. There's a proof of concept in #11. You can either try to write custom code to do that for your case, or try to implement fiona-based IO in dask-geopandas (following @jorisvandenbossche's example). The latter is of course preferable :).
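If you go the custom route, the general idea could look something like the rough sketch below. It is not an existing dask-geopandas API; the file name and chunk size are placeholders, and it simply combines geopandas.read_file's rows argument with dask.delayed:

```python
# Sketch only (not an existing dask-geopandas API): read a big shapefile in
# row chunks and turn each chunk into a lazy partition.
# `path` and `chunk_size` are placeholders for your own values.
import dask
import dask.dataframe as dd
import fiona
import geopandas as gpd

path = "grid_200m.shp"      # hypothetical file name
chunk_size = 100_000

with fiona.open(path) as src:
    n_rows = len(src)

# geopandas.read_file accepts a `rows` slice, so each delayed task reads one chunk
parts = [
    dask.delayed(gpd.read_file)(path, rows=slice(start, start + chunk_size))
    for start in range(0, n_rows, chunk_size)
]

# an empty GeoDataFrame with the right schema acts as metadata for dask
meta = gpd.read_file(path, rows=1).iloc[:0]
ddf = dd.from_delayed(parts, meta=meta)
```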
> project to a plane area
> get the centroids for each polygon
> reproject to a geodetic projection
Both to_crs and centroid are implemented and should work as you know them from geopandas.
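For example, a minimal sketch (here gdf stands for an in-memory GeoDataFrame of the grid, and the EPSG codes are placeholder choices):

```python
# Minimal sketch: both calls are lazy and applied per partition.
# `gdf` is assumed to be an in-memory geopandas.GeoDataFrame; the EPSG codes
# are placeholders (Lambert-93 for the planar step, WGS84 afterwards).
import dask_geopandas

ddf = dask_geopandas.from_geopandas(gdf, npartitions=8)

projected = ddf.to_crs(epsg=2154)          # project to a planar CRS
centroids = projected.geometry.centroid    # centroid of each polygon, lazily
centroids_geodetic = centroids.to_crs(epsg=4326)
```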
> keep polygons with the nearest centroids (maybe 12-16 of them)
This may be tricky. How do you do that for smaller data? Ideally, this would require a spatial index on top of dask.GeoDataFrame, which is not yet implemented (a bit of a discussion is in #8). Also, you will need spatially partitioned data to make this operation efficient enough (again #8).
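For comparison, one common in-memory geopandas pattern for this kind of filtering relies on the spatial index; in the sketch below the coordinates, radius, and variable names are all illustrative:

```python
# Illustrative in-memory pattern: keep polygons whose centroids fall within a
# radius of the target point, using the GeoSeries spatial index.
# `gdf` (already projected to a planar CRS), the coordinates, and the 2 km
# radius are placeholders.
from shapely.geometry import Point

x, y = 652_000, 6_862_000                  # placeholder coordinates in the planar CRS
target = Point(x, y)
search_area = target.buffer(2_000)         # 2 km radius, in planar units

centroids = gdf.geometry.centroid
hits = centroids.sindex.query(search_area, predicate="intersects")
nearby = gdf.iloc[hits]
```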
Just out of curiosity - since you are working with a grid, wouldn't it actually be easier to convert it to a raster and use something like xarray-spatial?
Thank you, I will look into these issues.
Right now I am filtering the nearest centroids using sklearn.neighbors.BallTree with geopy.distance.geodesic as the metric, which is horribly slow considering I am just looking for points closer than a certain radius to my target point.
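For reference, a rough sketch of this kind of radius query is below (array and variable names are illustrative); most of the cost comes from the fact that a Python callable metric such as geopy.distance.geodesic is evaluated in Python for every distance computation, while BallTree's built-in 'haversine' metric stays in compiled code:

```python
# Illustrative sketch of a radius query around one target point.
# `centroid_lat_lon` is an (n, 2) array of [lat, lon] in degrees,
# `target_lat_lon` a [lat, lon] pair, and `grid` the polygon GeoDataFrame
# (all placeholders). The built-in 'haversine' metric expects radians.
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0
radius_km = 2.0

tree = BallTree(np.radians(centroid_lat_lon), metric="haversine")
idx, = tree.query_radius(np.radians([target_lat_lon]), r=radius_km / EARTH_RADIUS_KM)
nearby = grid.iloc[idx]   # polygons whose centroids lie within the radius
```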
Concerning your last question... you probably know much more than I do and it's a great learning opportunity for me!
I'm also looking for ways to improve GeoPandas IO performance. GeoPandas can take too much time when reading or writing datasets with millions of features.
Hi, is there any update on this issue? I've also found reading and writing with GeoPandas to be quite slow. I reviewed the documentation but couldn't find any way to speed up reading large shapefiles.
If I understand you correctly, you are just talking about the speed of reading and writing, not about being able to process files larger than memory? If that is the case, there is good news: since geopandas 1.0, the default engine used to read and write data is pyogrio, which should already be a lot faster. There is an even faster, still "experimental" mode in pyogrio that can be activated by passing use_arrow=True to read_file and to_file.
Something like this:
```python
import geopandas as gpd

gdf = gpd.read_file(path, use_arrow=True)
gdf.to_file(path, use_arrow=True)
```
Sure, I will give it a try. Thank you a lot. One more thing: I have an issue with dissolve. I have around 400K polygons with holes in them, and trying to fill the holes is taking ages, more than 48 hours. :)
If you don't mind trying another package... you can check out geofileops. It is a package I wrote to speed up processing (dissolve, overlays, ...) of large geo files. geofileops.dissolve specifically uses geopandas/shapely under the hood but applies the dissolve in a "tiled" way... and is often (depending on your data) orders of magnitude faster.
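A minimal sketch of how that might look; the file paths are placeholders and the exact parameter set should be checked against the geofileops documentation:

```python
# Sketch only: paths are placeholders; see the geofileops docs for the
# full set of dissolve options (groupby columns, tiling, workers, ...).
import geofileops as gfo

gfo.dissolve(
    input_path="grid_200m.gpkg",
    output_path="grid_200m_dissolved.gpkg",
    explodecollections=True,   # split resulting multipolygons into single polygons
)
```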
Is it compatible with dask?
Not sure what you mean by that, but it doesn't use dask... it uses plain multiprocessing.
I am specifically looking for Dask-GeoPandas options. Anyway, thanks! I can give it a try when I have some spare time to experiment.
> I have an issue with dissolve. I have around 400K polygons with holes in them, and trying to fill the holes is taking ages, more than 48 hours.
That does not sound right. Can you open a new issue and outline the way you are trying to do it? I am sure there are far more efficient ways.
I have now managed to speed up the dissolve and infill; however, the compute() is now taking more than 12 hours. I did this: dd['geometry'] = dissolve_results and then gpd = dd.compute(). Is there any way to speed it up?
@Dipendra2024 Please open a new issue and paste your complete code there to discuss it.
Hi,
My use case is to study only the population density distribution in the neighborhood of a given point. The source data is the nationwide shapefile for a 200x200m grid of France: https://www.insee.fr/fr/statistiques/4176290?sommaire=4176305
What I was doing with a smaller shapefile (1000x1000m grid) was:

- load the entire shapefile
- project to a plane area
- get the centroids for each polygon
- reproject to a geodetic projection
- keep polygons with the nearest centroids (maybe 12-16 of them)
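In plain geopandas that roughly corresponds to the following sketch (file name and EPSG codes are placeholders, not taken from the original data):

```python
# Rough sketch of the smaller-grid workflow described above.
import geopandas as gpd

gdf = gpd.read_file("grid_1000m.shp")               # load the entire shapefile
gdf_proj = gdf.to_crs(epsg=2154)                    # project to a planar CRS (Lambert-93)
centroids = gdf_proj.geometry.centroid              # centroid of each polygon
centroids_geodetic = centroids.to_crs(epsg=4326)    # back to a geodetic CRS
# ...then keep the polygons whose centroids are nearest to the target point
```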
With the finer 200x200m grid, the data doesn't fit in memory and I can't finish the first step. I would like to do those first loading / reducing steps in dask-geopandas before continuing with geopandas.
Let me know if you already see a smart solution or how I can help!
Guillaume