appelmar / gdalcubes

Creating and analyzing Earth observation data cubes in R
https://gdalcubes.github.io

understanding performance in streaming -- when to download first? #93

Open cboettig opened 8 months ago

cboettig commented 8 months ago

Naively, I had assumed that for a remote proxy cube, taking advantage of gdalcubes' lazy evaluation to perform the computation in a single pass would be faster than forcing an intermediate write to disk, reading it back in, and continuing. That is, I expected that:

proxy_cube |> 
    gdalcubes::crop(extent) |> 
    reduce_time("sd(sst)") |>
    plot(col = viridisLite::viridis(10))

would be faster than:

proxy_cube |>
    gdalcubes::crop(extent) |>
    write_ncdf("sst.nc")

ncdf_cube("sst.nc") |>
    reduce_time("sd(sst)") |>
    plot(col = viridisLite::viridis(10))

However, in my current example I'm seeing over a 5x improvement in runtime: over an hour in the first case versus about 12 minutes in the second. Notably, monitoring my network download rates, in the first case the rate stays steadily around 100 Mb/s, whereas in the second case I see rates of 700-999 Mb/s (my max). This suggests that however gdalcubes computes the temporal reduce in streaming mode, it is limiting how fast the network can read data in? (But I'm not sure which part of the hardware is the bottleneck here -- CPU does not max out either.)

It kinda makes sense to me -- the crop is still worth doing, to avoid downloading bytes we never touch, but the temporal reduction requires all the data, so maybe it is faster to stream it to disk in big chunks and then process it? Do you think that's true in general, or just here?
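
One thing I haven't ruled out is chunk layout: `reduce_time` needs every time step for each pixel, so a cube chunked tall in time might stream very differently than one chunked large in space. A minimal sketch of how I'd test that, using the objects from the reprex below and assuming `stack_cube()` accepts a `chunking = c(t, y, x)` argument the way `raster_cube()` does (the chunk sizes here are arbitrary guesses):

# assumes stack_cube() accepts chunking = c(t, y, x), as raster_cube() does
tall_cube <- gdalcubes::stack_cube(vrt(urls),
                                   datetime_values = url_dates,
                                   band_names = "sst",
                                   chunking = c(366, 128, 128))

bench::bench_time({
  tall_cube |>
    gdalcubes::crop(extent) |>
    reduce_time("sd(sst)") |>
    plot(col = viridisLite::viridis(10))
})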

Below should be a reprex for the full data. (An aside, but note that we annoyingly need a custom URL wrapper function -- I'm thrilled gdalcubes supports that even for stac_image_collection, though here I read from NASA's custom API, which is more reliable than NASA's STAC API. Also note this example reads NetCDF, not COG, because that's what NASA gives us.)


library(earthdatalogin) # remotes::install_github("boettiger-lab/earthdatalogin")
library(gdalcubes)

urls <- edl_search(short_name = "MUR-JPL-L4-GLOB-v4.1",
                   temporal = c("2020-01-01", "2021-12-31"))

# wrap each NetCDF URL so GDAL reads the analysed_sst subdataset over
# HTTP (vsicurl), supplying the CRS and extent the files don't advertise
vrt <- function(url) {
  prefix <- "vrt://NETCDF:/vsicurl/"
  suffix <- ":analysed_sst?a_srs=OGC:CRS84&a_ullr=-180,90,180,-90"
  paste0(prefix, url, suffix)
}
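
# For illustration, the wrapper turns a granule URL (hypothetical path
# shown here) into a GDAL connection string like:
# vrt("https://archive.podaac.earthdata.nasa.gov/.../20200101090000.nc")
# #> "vrt://NETCDF:/vsicurl/https://.../20200101090000.nc:analysed_sst?a_srs=OGC:CRS84&a_ullr=-180,90,180,-90"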

gdalcubes_options(parallel = parallel::detectCores()*2)
with_gdalcubes()  # pass Earthdata credentials through to gdalcubes workers

# extract the 8-digit date stamp from each granule filename
url_dates <- as.Date(gsub(".*(\\d{8})\\d{6}.*", "\\1", urls), format = "%Y%m%d")

bench::bench_time({
  proxy_cube <- gdalcubes::stack_cube(vrt(urls),
                                      datetime_values = url_dates,
                                      band_names = "sst")
})

extent = list(left=-93, right=-76, bottom=41, top=49,
              t0="2020-01-01", t1="2021-12-31")

bench::bench_time({ # 12 min
  proxy_cube |> 
    gdalcubes::crop(extent) |> 
    write_ncdf("sst.nc")
})

bench::bench_time({
  ncdf_cube("sst.nc") |>
    reduce_time("sd(sst)") |>
    plot(col = viridisLite::viridis(10))
})
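
# and for comparison, the single-pass streaming version from above,
# which takes over an hour on my machine:
bench::bench_time({ # > 1 hr
  proxy_cube |>
    gdalcubes::crop(extent) |>
    reduce_time("sd(sst)") |>
    plot(col = viridisLite::viridis(10))
})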
seamusrobertmurphy commented 7 months ago

Super useful! I suspected this might be the case. Thanks for confirming.