Open Casyfill opened 6 years ago
as I've mentioned, one dataframe is proprietary, so i've replaced it with zipcodes for the reproducibility. another one is census tracts Weirdly enough, basic overlay works ok on the zipcodes to census tracts (will look into that), but the alternative one fails
@Casyfill can you explain what weird means? What fails? I just tried the overlay
and it worked perfectly fine as far as I can tell. Just need to ensure both dataframes are in the same crs
. Here's what I did
import geopandas as gp
cts = gp.read_file('./2010 Census Tracts/geo_export_976bb9fc-0698-4e76-9249-a55c08fb3569.shp')
zps = gp.read_file('./ZIP_CODE_040114/ZIP_CODE_040114.shp')
cts.crs
zps.crs
cts.to_crs(crs=zps.crs, inplace=True)
splits = gp.overlay(zps, cts, how='intersection')
splits.shape
zps.shape
cts.shape
cts.plot()
zps.plot()
splits.plot()
@ozak did you try it with your function from here https://github.com/geopandas/geopandas/pull/338#issuecomment-290303715 ? As @Casyfill seems to mention with this data it works with geopandas.overlay
, but not with the one from that issue.
@jorisvandenbossche I have included that implementation in my local geopandas installation since I use this function a lot and it is much faster. But you are right, my example code was using the old function. Here's the correct way to call it based on my current geopandas
splits = gp.tools.overlay(zps, cts, how='intersection')
Here's the output
As far as I can tell it is the same. Another advantage of my implementation is that you do not need to change crs
it does it for you automatically
cts = gp.read_file('./2010 Census Tracts/geo_export_976bb9fc-0698-4e76-9249-a55c08fb3569.shp')
splits = gp.tools.overlay(zps, cts, how='intersection')
Data has different projections.
Converted data to projection of first GeoPandas DatFrame
Obviously the output will look different due to the difference in crs
, but the shape of the dataframe is the same. Not sure why it failed for @Casyfill
By "weird" I meant squares in the lower-left corner and along the lower 1/3 horizontal.
Huh, I will retry it thoroughly on my side - maybe something wrong with the dataset, or my installation. Here I published the whole code for this new function:
https://gist.github.com/Casyfill/4973a7efb440692b848d4bb462436ea6
@Casyfill I think I know what may be going on, you are subsetting but not resetting the index in cell 2. For some reason this causes issues. Can you try running
cts = gp.read_file('ct2010nyc.geojson')[['id', 'geometry']]
cts = cts[cts.intersects(SE.spatial.get_borough_borders().unary_union)]
cts.reset_index(inplace=True)
It seems not having same crs
and having an incomplete index causes issues.
it seems that wasn't an issue.
I've used another version of census tracts now, and still no luck - current version of overlays return same weird squares, while this new function returns most of the polygons - but somehow looses some others (check upper west side):
Are you getting any error messages in the terminal (not shown in the notebook or console)? Have you tried uninstalling and reinstalling geopandas
? Not sure why it would fail on your side and work in mine? Do you use conda
? Can you try creating a simple environment where you can test this?
@Casyfill Can you create a reproducible example? Like the notebook you showed, but with the data you linked to above and @ozak used in this example code (we don't have "ct2010nyc.geojson", unless this is downloaded from https://data.cityofnewyork.us/City-Government/2010-Census-Tracts/fxpq-c8ku but was renamed), and without using the SE
library we don't have.
@jorisvandenbossche sure thing! @ozak, yes, I am working on conda virtual environment. as per your request, I tried to re-install geopandas, and now have an installation issue :-/ (one described here,
Referenced from: /Users/philippk/anaconda3/envs/py36/lib/libgdal.20.dylib
Reason: Incompatible library version: libgdal.20.dylib requires version 9.0.0 or later, but libiconv.2.dylib provides version 8.0.0
Perhaps, this might have had an effect on the overlay computations as well. I will try to resolve the issue and provide reproducible example
I just tried it myself, see this notebook: http://nbviewer.ipython.org/5a78805c56209519b5b0eefe1ae86ba8
Visually I get the same results for both geopandas.overlay as method from @ozak, but there is an actual difference (the shape of the results are different) (and the timing is also hugely different!)
thanks,
well, I will dig into the gdal
issues then, I guess
@jorisvandenbossche the difference in the number of geometries in this case must reflect the different way in which both functions generate the intersection (see discussion #338, #666). The last pic suggests that they do create the same intersections. Indeed the timing is orders of magnitude faster in my implementation (was large in my tests, but this one really makes it very clear).
Just want to confirm that the same notebook indeed, worked out perfectly, which means that the issue is with my geometry. Moreover, it worked on my initial data once I replaced census tracts geojson file with a shapefile - which probably means there was an issue with my geometry.
thank you both for your time, and separate thanks for a blazing fast spatial_overlay
!
PS: it would be great to somehow analyze geometry quality of a dataset, similar to QGIS geometry check
well, I will dig into the gdal issues then, I guess
You could try to create a new clean environment? (that's what also solved it in the issue you linked to). I suppose the problem is that you updates some libraries with conda-forge, and there is some conflict between other libraries from the default channel.
@jorisvandenbossche yeah, that is what I did
Moreover, it worked on my initial data once I replaced census tracts geojson file with a shapefile - which probably means there was an issue with my geometry
Is the geojson file public? As in any case, this should not happen.
PS: it would be great to somehow analyze geometry quality of a dataset, similar to QGIS geometry check
What does this exactly do? There is the GeoSeries.is_valid attribute that you could check.
@Casyfill I think that is what the geodataframe
function is_valid
does. Usually something like df.geometry = df.geometry.buffer(0)
solves simple issues that cause geometries to not be valid.
Two geodataframes with boundaries (census tracts and proprietary boundaries):
after
splits = gp.overlay(nhds, cts, how='intersection')
result is weired: