ecmwf / earthkit-data

A format-agnostic Python interface for geospatial data
Apache License 2.0

How should we load georeferenced shapes (e.g. points, lines, polygons) from a file? #58

Closed samsammurphy closed 1 year ago

samsammurphy commented 1 year ago

How should we load georeferenced shapes (e.g. points, lines, polygons) from a file?

Motivating use case: Load points from a .csv file and from geospatial files (e.g. .geojson, .shp, .kml)

Standard practice: As a geospatial data scientist I would use geopandas to load shapes from a file into a geopandas dataframe.

import geopandas as gpd

# geopandas happily loads .csv, .geojson, .shp, .kml, etc.
fpath = 'path/to/file'

# geodataframe (a pandas dataframe with geospatial powers)
gdf = gpd.read_file(fpath)

GeoDataFrame makes it easy to, for example, filter by geographic region of interest, change the coordinate reference system (crs) and do other geospatial things like calculate distances and areas.
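As a rough illustration of those operations, here is a minimal sketch. The point names, coordinates, bounding box and the choice of EPSG:3035 as a projected CRS are all made up for the example; they are not from any earthkit-data sample.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Three illustrative points as (lon, lat) in WGS84; names and values are assumptions.
gdf = gpd.GeoDataFrame(
    {"name": ["reading", "bologna", "bonn"]},
    geometry=[Point(-0.97, 51.45), Point(11.34, 44.49), Point(7.10, 50.73)],
    crs="EPSG:4326",
)

# Filter by a region of interest: keep only the points inside a bounding box.
roi = box(5.0, 40.0, 15.0, 55.0)  # (minx, miny, maxx, maxy)
inside = gdf[gdf.within(roi)]

# Reproject to a CRS with metre units before measuring distances.
projected = gdf.to_crs(epsg=3035)
dist_m = projected.geometry.iloc[0].distance(projected.geometry.iloc[1])
```

Distances and areas computed directly in EPSG:4326 would be in degrees, which is why the sketch reprojects first.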

Known issue. A .csv file is not natively geospatial, so it requires extra handling. To load points we need to know which column(s) contain the point coordinates and how to parse them, as well as the crs (which is typically not stated in the file). Here is an example of reading a csv into a geodataframe:

# read .csv from file
gdf = gpd.read_file(fpath)

# build the geometry column from the lon and lat columns
# (lon_name, lat_name and epsg are placeholders for your own column names and EPSG code)
gdf['geometry'] = gpd.points_from_xy(gdf[lon_name], gdf[lat_name])

# set the coordinate reference system
gdf = gdf.set_crs(epsg=epsg)

Opinionated view. We should follow the geopandas convention. When we write to file, shapes must be stored in a column called geometry. Geospatial methods work automatically (and exclusively) on the shapes in the geometry column. They should be shapely objects. It's fine to have different types of shapes (e.g. points and lines) in the geometry column. We can read a non-geospatial .csv file without falling over, but we will complain when needed (e.g. the geometry column does not exist, the crs is not set, etc.).
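To make the "complain when needed" idea concrete, here is a minimal sketch of such a check. The helper name `check_features` and its messages are hypothetical; nothing like this is confirmed to exist in earthkit-data or geopandas.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import LineString, Point
from shapely.geometry.base import BaseGeometry


def check_features(df):
    """Hypothetical helper: list the problems that would block geospatial methods."""
    problems = []
    if "geometry" not in df.columns:
        # Without a geometry column nothing else can be checked.
        problems.append("geometry column does not exist")
        return problems
    if not all(isinstance(g, BaseGeometry) for g in df["geometry"].dropna()):
        problems.append("geometry column contains non-shapely objects")
    # A plain pandas DataFrame has no crs attribute at all, so fall back to None.
    if getattr(df, "crs", None) is None:
        problems.append("crs is not set")
    return problems


# A plain table loads fine but is flagged when geospatial work is requested.
plain = pd.DataFrame({"lat": [51.45], "lon": [-0.97]})

# Mixed shape types in the geometry column are acceptable.
mixed = gpd.GeoDataFrame(
    {"id": [1, 2]},
    geometry=[Point(0, 0), LineString([(0, 0), (1, 1)])],
)
```

Returning a list of problems rather than raising on the first one is only one possible design; it lets the caller decide whether a missing crs is fatal for the operation at hand.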

tlmquintino commented 1 year ago

@sandorkertesz please have a look at this and how pdbufr will also play into it. In the end, loading points from geojson or from bufr should feel and look the same, no?
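One way to get that uniform feel is a single entry point that dispatches on the file suffix. The sketch below is purely illustrative: `register_reader`, `load_features` and the stub readers are hypothetical names, not part of earthkit-data or pdbufr, and the stubs return placeholder tuples where real readers would return a GeoDataFrame.

```python
from pathlib import Path
from typing import Callable, Dict

# Maps a lowercase file suffix to the function that reads it (hypothetical registry).
READERS: Dict[str, Callable[[str], object]] = {}


def register_reader(*suffixes):
    """Decorator that registers a reader function for one or more file suffixes."""
    def decorator(func):
        for suffix in suffixes:
            READERS[suffix.lower()] = func
        return func
    return decorator


def load_features(path):
    """Single entry point: pick the reader from the file suffix and call it."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"no reader registered for {suffix!r}") from None
    return reader(path)


# Stub readers for illustration only; real ones would return a GeoDataFrame.
@register_reader(".geojson", ".shp", ".kml")
def read_vector(path):
    return ("vector", path)


@register_reader(".bufr")
def read_bufr(path):
    return ("bufr", path)
```

With this shape, `load_features("obs.bufr")` and `load_features("sites.geojson")` are called the same way, and only the registered reader differs.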

sandorkertesz commented 1 year ago

I think it is a good idea and would also work with BUFR data. At the moment, earthkit-data only offers the to_pandas() method to deal with BUFR, which extracts the specified data/columns into a pandas DataFrame using pdbufr. Since pdbufr already supports the "geometry" and "CRS" columns, this data would be fine for GeoPandas.

The main question for me is what API we want to offer for geospatial point data in earthkit-data on top of these methods:

Is that enough and can we do everything using GeoPandas?

samsammurphy commented 1 year ago

Thanks for the chat earlier today @sandorkertesz. Following up on that, and on the comments above:

  1. we may want to consider a to_geopandas() method
  2. we talked about nomenclature of these georeferenced shapes (with properties). imo we should probably call them features (like in the geojson spec).
  3. we probably should have a reader (a.k.a. plugin) for each file type we support that can store collections of features (e.g. geojson, shp, csv)
  4. attached are some sample data. they were mostly created using the excellent geojson.io web app. The exception being geospatial.csv (which was created by reading the geojson into a geopandas geodataframe then saving to .csv so that there is a geometry column with shapely objects represented as text)
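For a .csv like geospatial.csv, where the geometry column holds shapely objects written out as text (WKT), reading it back could look roughly like the sketch below. The inline CSV content is made up, and the CRS is an assumption (WGS84, as geojson uses), since a .csv does not record it.

```python
import io

import geopandas as gpd
import pandas as pd

# Illustrative stand-in for a CSV whose geometry column holds WKT text.
csv_text = """name,geometry
a,POINT (1 2)
b,"LINESTRING (0 0, 1 1)"
"""

# Read as an ordinary table first; at this point geometry is just strings.
df = pd.read_csv(io.StringIO(csv_text))

# Parse the WKT strings into shapely objects.
df["geometry"] = gpd.GeoSeries.from_wkt(df["geometry"])

# Assumed CRS: EPSG:4326, because the file itself does not state one.
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")
```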

Sample Data example_vector_files.zip

sandorkertesz commented 1 year ago

The first to_geopandas implementation will be BUFR (see #84).

samsammurphy commented 1 year ago

@sandorkertesz closing this issue but lmk if I should re-open