coiled / benchmarks

[WIP] Add satellite image processing benchmark #1550

Open jrbourbeau opened 1 week ago

jrbourbeau commented 1 week ago

xref https://github.com/coiled/benchmarks/issues/1548

TomAugspurger commented 2 days ago

Are these benchmarks running in a stable cloud / region? It'd be nice to find a dataset in the same region if possible (to cut down on egress costs and speed up the I/O portion of the benchmark, which probably isn't relevant for dask).

(edit: I see the cluster_kwargs answers that, nice).

On stackstac vs. odc-stac, the main things to be aware of are

  1. they build task graphs differently. Beyond just DataArray vs. Dataset, I think that odc-stac loading includes a groupby stage to ensure that all of the pixels from the same time end up in the same pixel plane (where "same time" is configurable, so that a scene captured a few seconds later can be considered the same if you want).
  2. odc-stac will automatically use overviews if you're requesting lower-resolution data (but not relevant here, since you don't pass resolution=)
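To make the groupby idea in point 1 concrete, here is a rough sketch in plain Python. It is not odc-stac's implementation; `group_by_time` and its `tolerance` parameter are hypothetical names used only to show how a configurable notion of "same time" changes the number of pixel planes.

```python
from datetime import datetime, timedelta


def group_by_time(timestamps, tolerance=timedelta(0)):
    """Group timestamps; each group would become one pixel plane.

    A timestamp joins the current group when it falls within `tolerance`
    of that group's first timestamp (hypothetical rule, for illustration).
    """
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][0] <= tolerance:
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups


# Made-up acquisition times: two scenes 7 seconds apart, one a day later.
scenes = [
    datetime(2024, 1, 1, 10, 0, 0),
    datetime(2024, 1, 1, 10, 0, 7),
    datetime(2024, 1, 2, 10, 0, 0),
]

exact = group_by_time(scenes)  # every distinct timestamp is its own plane
loose = group_by_time(scenes, tolerance=timedelta(seconds=30))
print(len(exact), len(loose))
```

With exact matching each scene gets its own plane; with a 30-second tolerance the first two scenes share one.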

I'll give this workload a shot today or tomorrow and will report back.

jrbourbeau commented 2 days ago

> Are these benchmarks running in a stable cloud / region?

Right now this is running in westeurope on Azure, which should be where the underlying data is stored, but we can run in any region on AWS, GCP, or Azure.

> I'll give this workload a shot today or tomorrow and will report back.

That'd be great. I'm happy to chat generally about this. Also, let me know if you need access to a Coiled workspace that's configured for Azure.

jrbourbeau commented 2 days ago

Okay, so here's a notebook (https://gist.github.com/jrbourbeau/900b602d19fe8087cafc0490b5c26f68) that runs the same computation using odc.stac. Here's the specific odc.stac.load call:

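import odc.stac
import planetary_computer

# Coarsen the base resolution by a factor of SHRINK (10 -> 40, in CRS units)
resolution = 10
SHRINK = 4
resolution = resolution * SHRINK

# `items` is the collection of STAC items defined earlier in the notebook
ds = odc.stac.load(
    items,
    chunks={},  # return lazy, Dask-backed arrays
    patch_url=planetary_computer.sign,  # sign asset URLs for Planetary Computer
    resolution=resolution,
    crs="EPSG:3857",
    groupby="solar_day",  # merge scenes from the same solar day into one time step
)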

where I use options like groupby="solar_day", which I saw in a couple of examples I found. This seems to produce a much smaller graph and to be more performant in general.
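For intuition on why grouping by solar day can shrink the graph, here is a minimal sketch. It assumes local solar time can be approximated by shifting the UTC timestamp by longitude / 15 hours; that is an assumption about the general idea, not a copy of odc-stac's code, and `solar_day` is a hypothetical helper. The acquisition times and longitude are made up.

```python
from datetime import datetime, timedelta


def solar_day(utc_time, longitude):
    # Assumption: approximate local solar time as UTC + longitude / 15 hours.
    return (utc_time + timedelta(hours=longitude / 15)).date()


# Two made-up scenes 15 seconds apart that straddle UTC midnight,
# captured at longitude 150 (roughly 10:00 local solar time).
scenes = [
    (datetime(2024, 1, 1, 23, 59, 50), 150.0),
    (datetime(2024, 1, 2, 0, 0, 5), 150.0),
]

utc_days = {t.date() for t, _ in scenes}
solar_days = {solar_day(t, lon) for t, lon in scenes}
print(len(utc_days), len(solar_days))  # prints: 2 1
```

Both scenes land on the same solar day even though their UTC dates differ, so they would end up in one time step instead of two, and fewer time steps means fewer tasks.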