geopandas / benchmarks

Benchmark data and code
BSD 3-Clause "New" or "Revised" License

Possible datasets for benchmarks #1

Open brendan-ward opened 2 years ago

brendan-ward commented 2 years ago

A few data sources to consider for bigger benchmarks:

U.S. high resolution hydrography data

These are distributed per 4-digit hydrologic unit (HUC4) watershed (download a *_gdb.zip); each archive contains the data in an ESRI File Geodatabase.

We use some of these in pyogrio.

Useful for testing intersection of waterbodies and flowlines, clipping, etc.
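As a rough illustration of the kind of benchmark this data could feed, here is a minimal sketch that times an "intersects" spatial join between lines and polygons with geopandas. The helper name time_spatial_join and the tiny synthetic geometries are made up for illustration; the commented-out path is a placeholder, and the layer names NHDFlowline and NHDWaterbody are an assumption about how the geodatabase is organized, so check them against the actual download.

```python
import time

import geopandas as gpd
from shapely.geometry import LineString, box


def time_spatial_join(flowlines, waterbodies, repeats=3):
    """Return (best_seconds, n_pairs) for an intersects join of lines and polygons.

    Hypothetical helper, not part of geopandas or this repository.
    """
    best = float("inf")
    joined = None
    for _ in range(repeats):
        start = time.perf_counter()
        joined = gpd.sjoin(
            flowlines, waterbodies, how="inner", predicate="intersects"
        )
        best = min(best, time.perf_counter() - start)
    return best, len(joined)


# Tiny synthetic stand-ins so the sketch runs without downloading anything.
flowlines = gpd.GeoDataFrame(
    {"line_id": [1, 2]},
    geometry=[LineString([(0, 0), (2, 2)]), LineString([(5, 5), (6, 6)])],
)
waterbodies = gpd.GeoDataFrame({"wb_id": [10]}, geometry=[box(1, 1, 3, 3)])

seconds, n_pairs = time_spatial_join(flowlines, waterbodies)
print(f"{n_pairs} intersecting pair(s), best of 3 runs: {seconds:.6f} s")

# With real data (placeholder path, assumed layer names):
# flowlines = gpd.read_file("path/to/watershed.gdb", layer="NHDFlowline")
# waterbodies = gpd.read_file("path/to/watershed.gdb", layer="NHDWaterbody")
```

Taking the best of several runs is one simple way to reduce noise from caching and other processes; a real benchmark suite would probably use a dedicated harness instead.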

World Database on Protected Areas (see the download button)

3GB dataset that has terrestrial and marine protected areas

One of the "advantages" of doing benchmarks with some of these is that the geometries are not always clean, so they could be good for benchmarking things like making geometries valid, unioning them together, or intersecting them with admin boundaries like countries or EEZs (below).

Marine regions

For example, the World EEZ (Exclusive Economic Zones) dataset is a useful one to intersect with the marine protected areas above.
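The cleanup and overlay operations mentioned above (making geometries valid, unioning, intersecting with a boundary) can be sketched on toy data as follows. This is only an illustration under stated assumptions: the self-intersecting "bowtie" polygon stands in for a dirty protected-area geometry, the rectangle stands in for an EEZ or country boundary, and all variable names are made up. It uses make_valid from shapely.validation and unary_union from shapely.ops.

```python
import geopandas as gpd
from shapely.geometry import Polygon, box
from shapely.ops import unary_union
from shapely.validation import make_valid

# A self-intersecting "bowtie" ring: a stand-in for a dirty real-world polygon.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
protected = gpd.GeoSeries([bowtie, box(1.5, 0.5, 3, 1.5)])

# Step 1: find and repair invalid geometries.
invalid_before = int((~protected.is_valid).sum())
repaired = gpd.GeoSeries([make_valid(geom) for geom in protected])
invalid_after = int((~repaired.is_valid).sum())

# Step 2: dissolve everything into a single geometry.
merged = unary_union(list(repaired))

# Step 3: clip against a boundary (toy rectangle standing in for an EEZ).
boundary = box(0, 0, 1, 2)
clipped = merged.intersection(boundary)

print(f"invalid before: {invalid_before}, after: {invalid_after}")
print(f"merged area: {merged.area:.3f}, clipped area: {clipped.area:.3f}")
```

Each of the three steps is a candidate for separate timing, since repair, union and intersection tend to scale differently with vertex count and with how dirty the input is.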

TLouf commented 2 years ago

Another possibility is to leverage OpenStreetMap data, accessed using OSMnx for instance. OSM provides all kinds of geometries, and even mixes of them: building shapes, administrative regions, railways, points of interest, water bodies...

I'm mentioning OSMnx because it's the package I know of that makes it easiest to download and digest OSM data (see this example), but it could be anything else that does the trick. The downside of OSMnx for the purpose of this repo is that it requires networkx, which would be a useless dependency here.

martinfleis commented 2 years ago

OSM is a good source. For performance reasons, it may be better to get the larger data using pyrosm, but that is a minor detail.

British Ordnance Survey has a series of GB-wide open datasets with polygons, lines and points at https://osdatahub.os.uk/downloads/open.