geopandas / geopandas

Python tools for geographic data
http://geopandas.org/
BSD 3-Clause "New" or "Revised" License

BUG: Loading in geojson through read_file misses certain entries in the output GDF #1973

Closed FDenker closed 3 years ago

FDenker commented 3 years ago

Code Sample, a copy-pastable example


import json  # json version 2.0.9
import urllib.request

import geopandas as gpd

# A file with 65 entries, uploaded for this report
url = "https://github.com/FDenker/GeoPandas-Geojson-Issue/raw/main/geopandas_not_found.geojson"

# This is the heart of the issue: read_file returns an empty GeoDataFrame.
# It is not caused by the download; reading a local copy gives the same result.
empty_gdf = gpd.read_file(url)

# However, if we load the file as plain JSON ...
with urllib.request.urlopen(url) as file:
    loaded_json = json.load(file)

# ... and pass its features to from_features, we get a GeoDataFrame
# with the right information.
correct_gdf = gpd.GeoDataFrame.from_features(loaded_json["features"])

Problem description

When reading a specific kind of GeoJSON (the output of an osmium-tool export, to be exact), the read_file function skips certain features. It does not raise an error; for the file linked above it simply returns an empty GeoDataFrame. This only affects a small number of entries, and the linked GeoJSON contains only entries that read_file fails to load. When I import GeoJSON files exported from osmium-tool, about 99% of the entries normally end up in the GeoDataFrame.

At the same time, if I load the GeoJSON as plain JSON and pass its 'features' to the from_features function, it returns a proper GeoDataFrame with all the data that is in the GeoJSON.
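To quantify the mismatch, one option is to count the features in the raw JSON and compare that number with the number of rows that read_file returns. The following is a minimal sketch using only the standard library; the helper name count_features and the tiny inline FeatureCollection are made up for illustration and are not part of geopandas:

```python
import json


def count_features(geojson_text):
    """Return the number of features in a GeoJSON FeatureCollection string."""
    collection = json.loads(geojson_text)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    return len(collection["features"])


# Hypothetical stand-in for the real file contents.
sample = json.dumps(
    {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {"name": "a"},
                "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
            },
            {
                "type": "Feature",
                "properties": {"name": "b"},
                "geometry": {"type": "Point", "coordinates": [1.0, 1.0]},
            },
        ],
    }
)

expected_rows = count_features(sample)
print(expected_rows)
```

With the real file, comparing this count against len(gpd.read_file(url)) would show how many features were dropped on the way.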

The error occurs both on my local Windows machine (running Python 3.8.3) and on an Ubuntu 18.04 machine (running Python 3.7.10 and the GitHub version of geopandas). I have therefore posted the system info for both below.

Expected Output

GeoDataFrame with 65 rows containing attributes and valid geometries.

Output of geopandas.show_versions()

Windows machine:

SYSTEM INFO
-----------
python     : 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
executable : C:\Users\$USER\anaconda3\python.exe
machine    : Windows-10-10.0.21390-SP0

GEOS, GDAL, PROJ INFO
---------------------
GEOS       : None
GEOS lib   : None
GDAL       : 3.3.0
GDAL data dir: None
PROJ       : 7.2.1
PROJ data dir: C:\Users\$USER\anaconda3\lib\site-packages\pyproj\proj_dir\share\proj

PYTHON DEPENDENCIES
-------------------
geopandas  : 0.9.0
pandas     : 1.0.5
fiona      : 1.8.20
numpy      : 1.18.5
shapely    : 1.7.1
rtree      : 0.9.4
pyproj     : 3.1.0
matplotlib : None
mapclassify: None
geopy      : None
psycopg2   : 2.8.6 (dt dec pq3 ext lo64)
geoalchemy2: None
pyarrow    : 2.0.0
pygeos     : 0.10

Linux machine:

SYSTEM INFO
-----------
python     : 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
executable : /opt/tljh/user/bin/python
machine    : Linux-4.15.0-143-generic-x86_64-with-debian-buster-sid

GEOS, GDAL, PROJ INFO
---------------------
GEOS       : 3.8.0
GEOS lib   : /usr/lib/x86_64-linux-gnu/libgeos_c.so
GDAL       : 2.4.4
GDAL data dir: /opt/tljh/user/lib/python3.7/site-packages/fiona/gdal_data
PROJ       : 7.0.1
PROJ data dir: /opt/tljh/user/lib/python3.7/site-packages/pyproj/proj_dir/share/proj

PYTHON DEPENDENCIES
-------------------
geopandas  : 0.9.0+36.gcb88dd4
pandas     : 0.25.3
fiona      : 1.8.17
numpy      : 1.19.1
shapely    : 1.7.1
rtree      : 0.9.7
pyproj     : 2.6.1.post1
matplotlib : 3.3.2
mapclassify: None
geopy      : 2.1.0
psycopg2   : 2.8.6 (dt dec pq3 ext lo64)
geoalchemy2: None
pyarrow    : 0.17.1
pygeos     : 0.10

martinfleis commented 3 years ago

That is strange. I am not able to reproduce the issue; both ways result in a GeoDataFrame with 65 rows. Can you try updating your environments? It may have been fixed along the way, and I see that you have some outdated dependencies.

nguyenlienviet commented 3 years ago

I have the same issue and can reproduce @FDenker's report. Package versions:

SYSTEM INFO

python     : 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
executable : /usr/bin/python3
machine    : Linux-5.4.0-80-generic-x86_64-with-Ubuntu-18.04-bionic

GEOS, GDAL, PROJ INFO

GEOS       : 3.8.0
GEOS lib   : /usr/lib/x86_64-linux-gnu/libgeos_c.so
GDAL       : 3.3.0
GDAL data dir: /home/nguyenlienviet/.local/lib/python3.6/site-packages/fiona/gdal_data
PROJ       : 7.2.1
PROJ data dir: /home/nguyenlienviet/.local/lib/python3.6/site-packages/pyproj/proj_dir/share/proj

PYTHON DEPENDENCIES

geopandas  : 0.9.0
pandas     : 1.1.5
fiona      : 1.8.20
numpy      : 1.19.5
shapely    : 1.7.1
rtree      : 0.9.7
pyproj     : 3.0.1
matplotlib : 3.2.2
mapclassify: 2.4.3
geopy      : 2.2.0
psycopg2   : 2.9.1 (dt dec pq3 ext lo64)
geoalchemy2: 0.9.3
pyarrow    : 5.0.0
pygeos     : 0.10.1

martinfleis commented 3 years ago

I tried to reproduce the issue on macOS and Ubuntu, to no avail. It works every time, no matter what I do or how I set up the environment... I'd love to help, but I'm not sure how.

@FDenker @nguyenlienviet can you export your environment to yml via conda env export -f environment.yml and share that?

jdmcbr commented 3 years ago

@FDenker @nguyenlienviet This issue arises because read_file uses GDAL's GeoJSON driver to read the file (unlike the from_features route, which you showed working as expected). There's an environment variable, OGR_GEOJSON_MAX_OBJ_SIZE, that sets the maximum size of individual features (https://gdal.org/drivers/vector/geojson.html). Some of the features in your dataset are complex enough that they exceed whatever that limit is set to on your system. I'm able to reproduce the behavior you describe by setting that environment variable to a lower value. For you, this should work:

import geopandas as gpd 
import fiona 
url = "https://github.com/FDenker/GeoPandas-Geojson-Issue/raw/main/geopandas_not_found.geojson" 

with fiona.Env(OGR_GEOJSON_MAX_OBJ_SIZE=2000):  
    no_longer_empty_gdf = gpd.read_file(url)

I won't tell you how long it took me to get to the bottom of this one. :sweat_smile:
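
To see which features are likely to hit the limit, it can help to estimate how large each feature is once serialized. The following is a rough sketch using only the standard library. The helper name feature_sizes_mb and the inline sample data are made up for illustration, and re-serializing with json.dumps only approximates the size on disk, so GDAL's own accounting may differ. GDAL's documentation describes OGR_GEOJSON_MAX_OBJ_SIZE as a size in megabytes, which is why the sizes are reported in that unit:

```python
import json


def feature_sizes_mb(geojson_text):
    """Return (index, approximate size in MB) per feature, largest first."""
    collection = json.loads(geojson_text)
    sizes = []
    for index, feature in enumerate(collection["features"]):
        # Re-serialize the feature to approximate its size in the file.
        n_bytes = len(json.dumps(feature).encode("utf-8"))
        sizes.append((index, n_bytes / (1024 * 1024)))
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample: one tiny feature and one with many coordinates.
small = {
    "type": "Feature",
    "properties": {"name": "small"},
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
}
large = {
    "type": "Feature",
    "properties": {"name": "large"},
    "geometry": {
        "type": "LineString",
        "coordinates": [[i * 0.001, i * 0.001] for i in range(50_000)],
    },
}
sample = json.dumps({"type": "FeatureCollection", "features": [small, large]})

sizes = feature_sizes_mb(sample)
for index, size_mb in sizes:
    print(f"feature {index}: {size_mb:.3f} MB")
```

Features near the top of this list are the ones most likely to be skipped when the limit is too low for them.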

jorisvandenbossche commented 3 years ago

@jdmcbr Thanks a lot for getting to the bottom of this!