coiled / benchmarks


Large Scale Geospatial Benchmarks #1543

Closed: jrbourbeau closed this issue 2 weeks ago

jrbourbeau commented 2 weeks ago

People love the Xarray/Dask/... software stack for geospatial workloads, but only up to about the terabyte scale. At that scale the stack can struggle, requiring expertise to work well and frustrating users and developers alike.

To address this, we want to build a large-scale geospatial benchmark suite of end-to-end examples to ensure that these tools operate smoothly up to the 100-TB scale.

We want your help to build a catalog of large-scale, end-to-end, representative benchmarks.

This is a big ask, we know, but we hope that if a few people can contribute something meaningful then we'll be able to contribute code changes that accelerate those workflows (and others) considerably.

We’d welcome contributions as comments on this issue.

bahaugen commented 2 weeks ago

I am not sure what would be considered in scope for geospatial workloads, but in my experience large-scale spatial joins are some of the most unruly computations. I see all sorts of blogs/benchmarks/demos using the NYC taxi dataset and the NYC neighborhoods as an example of scaling to large geospatial datasets. That example is interesting but not helpful in many situations: the problem gets much harder when both sides of the join are large, because you can no longer broadcast the smaller dataset to all of the workers and run several smaller joins on each worker.
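To make the broadcast pattern concrete, here is a minimal sketch of the small-right-side case (paths are hypothetical); it works because the full polygon table fits in memory on every worker, and it is exactly what stops working when both sides are large:

```python
import dask_geopandas
import geopandas

# Small polygon table (e.g. NYC neighborhoods): fits in memory, so it
# can be shipped whole to every worker.
neighborhoods = geopandas.read_file("nyc_neighborhoods.geojson")  # hypothetical path

# Large point table (e.g. taxi pickups): stays partitioned.
pickups = dask_geopandas.read_parquet("s3://bucket/taxi-points/")  # hypothetical path

# Each partition is joined against the full polygon table independently.
joined = pickups.map_partitions(geopandas.sjoin, neighborhoods, predicate="within")
counts = joined.groupby("index_right").size().compute()
```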

I would propose a spatial join with relatively large datasets on both sides of the join. Bonus points for doing polygons (and not just points) on each side. I would certainly suggest one of the open-source building footprint datasets, as these tend to be pretty large datasets of polygons. I am sure there are tons of point datasets you could join with this to make an interesting benchmark, but I would try to find another polygon-based dataset to join with.

Possible polygon-based datasets to join to a building footprint dataset (a sketch of such a join follows this list):

- Census blocks: this would be useful for assigning buildings to census blocks.
- Property parcel data: this would be useful for assigning buildings to parcels. The issue here is that I am not aware of any large-scale, open-source parcel datasets.
- Another building footprint dataset: albeit more contrived, this would be an interesting computation from a validation/comparison perspective if you were the developer of one of the building footprint segmentation models, or if you were doing comparisons as an end user.
- Satellite or aerial imagery: if you have a dataset with the outlines of all of the images in an EO dataset, you could join it with the building footprint dataset to figure out which images contain each of the houses.
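Here is a rough sketch of what the buildings-to-census-blocks variant might look like; the spatial shuffle is what replaces the broadcast when both sides are large. The dataset paths are hypothetical, and this assumes the spatial_shuffle and sjoin APIs in current dask-geopandas:

```python
import dask_geopandas

# Both sides are too large to broadcast, so spatially shuffle each one
# so that nearby geometries land in the same partitions, then join.
buildings = dask_geopandas.read_parquet("s3://bucket/building-footprints/")  # hypothetical
blocks = dask_geopandas.read_parquet("s3://bucket/census-blocks/")           # hypothetical

buildings = buildings.spatial_shuffle(by="hilbert")
blocks = blocks.spatial_shuffle(by="hilbert")

# Assign each building footprint to the census block it intersects.
assigned = dask_geopandas.sjoin(buildings, blocks, predicate="intersects")
result = assigned.compute()
```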

If you want to make it even more challenging, modifying this to be the closely related but much harder nearest-neighbor join on two large datasets would be a true stress test.
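For reference, the single-machine primitive for this already exists as geopandas.sjoin_nearest (sketch below, paths hypothetical); the hard part is distributing it, since the nearest neighbor found inside one partition is not necessarily the global nearest neighbor:

```python
import geopandas

buildings = geopandas.read_parquet("buildings.parquet")  # hypothetical path
parcels = geopandas.read_parquet("parcels.parquet")      # hypothetical path

# Nearest-neighbor join: match each building to its closest parcel within
# 50 meters. Distances are only meaningful in a projected CRS, so
# reproject first (EPSG:3857 here just for illustration).
nearest = geopandas.sjoin_nearest(
    buildings.to_crs(epsg=3857),
    parcels.to_crs(epsg=3857),
    max_distance=50,
    distance_col="distance_m",
)
```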

I am not sure how common these problems are in the "wild", but I have come across them multiple times in my career. I know this probably doesn't get to your multi-terabyte threshold, but I have found that the scalability of these types of joins gets painful even before you reach a terabyte. Also, if I am missing some trick that makes these computations "just work", I'd love to hear about it.

Feel free to reach out if you want to chat about the benchmark or about my experiences with these types of computations.

upbram commented 2 weeks ago

Based on my experience, every dataset has its own specific patterns, and attempting to use real-time data for benchmarking is often limited to the moment it is collected. Data evolves constantly, and there is no one-size-fits-all solution for benchmarking across different real-time datasets. Customers tend to have unique data characteristics that vary over time, making it challenging to rely on a single dataset for accurate long-term benchmarking. This dynamic nature of data calls for customized, adaptive approaches rather than static benchmarks built on a snapshot of real-time data.

mrocklin commented 2 weeks ago

Thanks @bahaugen! Large-scale spatial joins are a fun problem (I actually got them working in an early version of dask-geopandas; it was neat!)

I think I should ask the following two questions before we invest much time here:

  1. Is this a common problem? (My gut reaction is "no": it's fairly uncommon, but quite painful when it does occur.)
  2. Does it occur in a good end-to-end problem and dataset that we have access to? (We don't have the bandwidth or expertise to hunt these down, and are looking to the community for help here.)

I would be excited if the answer to both questions is "yes!" because I think this would be a fun thing to work on.

mrocklin commented 2 weeks ago

Thanks @upbram! I agree 100%. The devil is always in the details.

That being said, lots of problems do look pretty similar, and assuming that we continue to build general-purpose software that isn't tailor-made to very specific problems, I have moderately high confidence that work done to optimize a broad set of benchmarks will result in software with improved performance on novel problems. This has certainly been our experience with other similar efforts, like the TPC-H effort with Dask DataFrames.

Kirill888 commented 2 weeks ago

A while back I did an investigation of the low-level libraries often used in geospatial Python workflows. We were just starting to move to S3, and at the time there wasn't much understanding of load-time performance in the cloud. I have not tried to run it in a long while, but it might still be useful, if only for historical context:

https://github.com/opendatacube/benchmark-rio-s3/blob/master/report.md
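The core of that kind of measurement is small. Here is a minimal sketch of timing an open plus a single windowed read through rasterio against S3 (the object path and window size are hypothetical, and AWS credentials/GDAL S3 configuration are assumed to be in place):

```python
import time

import rasterio
from rasterio.windows import Window

url = "s3://bucket/scene.tif"  # hypothetical COG on S3

start = time.perf_counter()
with rasterio.open(url) as src:
    open_time = time.perf_counter() - start

    # Read one 512x512 window; for a well-formed COG this should turn
    # into a small number of ranged GET requests.
    start = time.perf_counter()
    block = src.read(1, window=Window(0, 0, 512, 512))
    read_time = time.perf_counter() - start

print(f"open: {open_time:.3f}s  read: {read_time:.3f}s  shape: {block.shape}")
```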

@mrocklin if there is interest from Coiled, I would love to chat about geospatial Dask and our experience using Dask in the cloud for large raster-based geospatial workloads.

jrbourbeau commented 2 weeks ago

Thanks to all who have engaged here so far. I'm going to convert this issue to a GitHub discussion so we can use threads to better handle the multiple conversations going on.