Issues with overlay - oversimplifies in some cases, not sure what's wrong

Casyfill commented 6 years ago

Two geodataframes with boundaries (census tracts and proprietary boundaries):

after splits = gp.overlay(nhds, cts, how='intersection') result is weired:

Casyfill commented 6 years ago

as I've mentioned, one dataframe is proprietary, so i've replaced it with zipcodes for the reproducibility. another one is census tracts Weirdly enough, basic overlay works ok on the zipcodes to census tracts (will look into that), but the alternative one fails

Casyfill commented 6 years ago

ozak commented 6 years ago

@Casyfill can you explain what weird means? What fails? I just tried the overlay and it worked perfectly fine as far as I can tell. Just need to ensure both dataframes are in the same crs. Here's what I did

import geopandas as gp
cts = gp.read_file('./2010 Census Tracts/geo_export_976bb9fc-0698-4e76-9249-a55c08fb3569.shp')
zps = gp.read_file('./ZIP_CODE_040114/ZIP_CODE_040114.shp')
cts.crs
zps.crs
cts.to_crs(crs=zps.crs, inplace=True)
splits = gp.overlay(zps, cts, how='intersection')
splits.shape
zps.shape
cts.shape
cts.plot()
zps.plot()
splits.plot()

jorisvandenbossche commented 6 years ago

@ozak did you try it with your function from here https://github.com/geopandas/geopandas/pull/338#issuecomment-290303715 ? As @Casyfill seems to mention with this data it works with geopandas.overlay, but not with the one from that issue.

ozak commented 6 years ago

@jorisvandenbossche I have included that implementation in my local geopandas installation since I use this function a lot and it is much faster. But you are right, my example code was using the old function. Here's the correct way to call it based on my current geopandas

splits = gp.tools.overlay(zps, cts, how='intersection')

Here's the output

As far as I can tell it is the same. Another advantage of my implementation is that you do not need to change crs it does it for you automatically

cts = gp.read_file('./2010 Census Tracts/geo_export_976bb9fc-0698-4e76-9249-a55c08fb3569.shp')
splits = gp.tools.overlay(zps, cts, how='intersection')
Data has different projections.
Converted data to projection of first GeoPandas DatFrame

Obviously the output will look different due to the difference in crs, but the shape of the dataframe is the same. Not sure why it failed for @Casyfill

Casyfill commented 6 years ago

By "weird" I meant squares in the lower-left corner and along the lower 1/3 horizontal.

Huh, I will retry it thoroughly on my side - maybe something wrong with the dataset, or my installation. Here I published the whole code for this new function:

https://gist.github.com/Casyfill/4973a7efb440692b848d4bb462436ea6

ozak commented 6 years ago

@Casyfill I think I know what may be going on, you are subsetting but not resetting the index in cell 2. For some reason this causes issues. Can you try running

cts = gp.read_file('ct2010nyc.geojson')[['id', 'geometry']]
cts = cts[cts.intersects(SE.spatial.get_borough_borders().unary_union)]
cts.reset_index(inplace=True)

It seems not having same crs and having an incomplete index causes issues.

Casyfill commented 6 years ago

it seems that wasn't an issue.

I've used another version of census tracts now, and still no luck - current version of overlays return same weird squares, while this new function returns most of the polygons - but somehow looses some others (check upper west side):

ozak commented 6 years ago

Are you getting any error messages in the terminal (not shown in the notebook or console)? Have you tried uninstalling and reinstalling geopandas? Not sure why it would fail on your side and work in mine? Do you use conda? Can you try creating a simple environment where you can test this?

jorisvandenbossche commented 6 years ago

@Casyfill Can you create a reproducible example? Like the notebook you showed, but with the data you linked to above and @ozak used in this example code (we don't have "ct2010nyc.geojson", unless this is downloaded from https://data.cityofnewyork.us/City-Government/2010-Census-Tracts/fxpq-c8ku but was renamed), and without using the SE library we don't have.

Casyfill commented 6 years ago

@jorisvandenbossche sure thing! @ozak, yes, I am working on conda virtual environment. as per your request, I tried to re-install geopandas, and now have an installation issue :-/ (one described here,

Referenced from: /Users/philippk/anaconda3/envs/py36/lib/libgdal.20.dylib
  Reason: Incompatible library version: libgdal.20.dylib requires version 9.0.0 or later, but libiconv.2.dylib provides version 8.0.0

Perhaps, this might have had an effect on the overlay computations as well. I will try to resolve the issue and provide reproducible example

jorisvandenbossche commented 6 years ago

I just tried it myself, see this notebook: http://nbviewer.ipython.org/5a78805c56209519b5b0eefe1ae86ba8

Visually I get the same results for both geopandas.overlay as method from @ozak, but there is an actual difference (the shape of the results are different) (and the timing is also hugely different!)

Casyfill commented 6 years ago

thanks, well, I will dig into the gdal issues then, I guess

ozak commented 6 years ago

@jorisvandenbossche the difference in the number of geometries in this case must reflect the different way in which both functions generate the intersection (see discussion #338, #666). The last pic suggests that they do create the same intersections. Indeed the timing is orders of magnitude faster in my implementation (was large in my tests, but this one really makes it very clear).

Casyfill commented 6 years ago

Just want to confirm that the same notebook indeed, worked out perfectly, which means that the issue is with my geometry. Moreover, it worked on my initial data once I replaced census tracts geojson file with a shapefile - which probably means there was an issue with my geometry.

thank you both for your time, and separate thanks for a blazing fast spatial_overlay!

PS: it would be great to somehow analyze geometry quality of a dataset, similar to QGIS geometry check

jorisvandenbossche commented 6 years ago

well, I will dig into the gdal issues then, I guess

You could try to create a new clean environment? (that's what also solved it in the issue you linked to). I suppose the problem is that you updates some libraries with conda-forge, and there is some conflict between other libraries from the default channel.

Casyfill commented 6 years ago

@jorisvandenbossche yeah, that is what I did

jorisvandenbossche commented 6 years ago

Moreover, it worked on my initial data once I replaced census tracts geojson file with a shapefile - which probably means there was an issue with my geometry

Is the geojson file public? As in any case, this should not happen.

jorisvandenbossche commented 6 years ago

PS: it would be great to somehow analyze geometry quality of a dataset, similar to QGIS geometry check

What does this exactly do? There is the GeoSeries.is_valid attribute that you could check.

ozak commented 6 years ago

@Casyfill I think that is what the geodataframe function is_valid does. Usually something like df.geometry = df.geometry.buffer(0) solves simple issues that cause geometries to not be valid.

geopandas / geopandas

Issues with overlay - oversimplifies in some cases, not sure what's wrong #677