AustralianAntarcticDivision / raadtools

Tools for Synoptic Environmental Spatial Data
http://australianantarcticdivision.github.io/raadtools/

creating COGs of GHRSST #139

Closed mdsumner closed 1 year ago

mdsumner commented 1 year ago

Here's the incantation

dsn <- dsn::vrtcon(dsn::sds(files$fullname[1], "analysed_sst", "netCDF"), 
                   a_scale = 0.001, a_offset = 25, 
                   a_ullr = "-180,90,180,-90", a_srs = "OGC:CRS84")

f <- vapour::gdal_raster_dsn(dsn, out_dsn = "ghrsst/out.tif", 
                             options = c("-nomd", "-overwrite", 
                                         "-co", "COMPRESS=DEFLATE", 
                                         "-of", "COG", 
                                         "-co", "BLOCKSIZE=1024", 
                                         "-co", "NUM_THREADS=ALL_CPUS", 
                                         "-co", "PREDICTOR=STANDARD", 
                                         "-co", "RESAMPLING=AVERAGE", 
                                         "-co", "SPARSE_OK=YES", 
                                         "-multi"))

I can't see how to make the overviews have sensible sizes, but it might be worth resampling to a sensible grid size, 36000x18000 rather than 36000x17999.
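For a quick sense of why 17999 is awkward: each overview level halves the row count, and 18000 halves cleanly for several levels while 17999 goes odd immediately (plain integer division sketched here; GDAL's own rounding may differ):

```shell
# halving chain of the row count at overview levels /2 /4 /8 /16
for n in 18000 17999; do
  chain="$n"
  d=$n
  for level in 2 4 8 16; do
    d=$(( d / 2 ))
    chain="$chain -> $d"
  done
  echo "$chain"
done
# → 18000 -> 9000 -> 4500 -> 2250 -> 1125
# → 17999 -> 8999 -> 4499 -> 2249 -> 1124
```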

Setting a_offset lets us keep the inbuilt scaling, and hence the extra compression of integer + DEFLATE, but then GDAL's unscaling gives Celsius rather than Kelvin.
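To make the Celsius/Kelvin point concrete (assuming the file's native packing is scale_factor = 0.001, add_offset = 298.15 in Kelvin, which is the usual MUR analysed_sst convention): 25 °C is exactly 298.15 K, so a_offset = 25 reuses the same packed integers but unscales to Celsius:

```shell
# unpack one raw packed value (e.g. 1850) both ways
awk 'BEGIN { printf "native offset: %.3f K\n", 1850 * 0.001 + 298.15 }'  # → 300.000 K
awk 'BEGIN { printf "a_offset 25:   %.3f C\n", 1850 * 0.001 + 25 }'      # → 26.850 C (= 300 K)
```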

The tif is about the same size as the netcdf, and it's much, much faster to read (in terra too, if you use the dev version with by_util = TRUE).

library(terra)
terra 1.7.41
>  system.time(project(rast("out.tif"), rast(res = 0.25), by_util = T) * 1)
   user  system elapsed
  0.221   0.132   0.353

@raymondben this will be a very nice addition to https:// available raad-files, and I think it can in practical terms even replace OISST because it's so fast to read/warp from overviews. I'll explore setting it up to be generated routinely.

mdsumner commented 1 year ago

it's insanely fast

library(terra)
plot(project(rast("out.tif"), rast(ext(c(-1, 1, -1 , 1)) * 1e7, res = 25000, crs = "+proj=laea +lon_0=180"), by_util = T) * 1)

plot(project(rast("out.tif"), rast(ext(c(-1, 1, -1 , 1)) * 5e5, res = 200, crs = "+proj=laea +lon_0=147 +lat_0=-46"), by_util = T) * 1)


mdsumner commented 1 year ago

Another option is to use a VRT of the original NetCDF, but create an auxiliary overview tif that sits alongside it. That's all a .vrt.ovr file is: a COG whose overviews 0, 1, 2, ... are subsamples of the source at /2, /4, /8, ..., e.g.

## create a VRT from the above dsn 
out <- "20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.vrt.ovr"
f <- vapour::gdal_raster_dsn(dsn, target_dim = c(18000, as.integer(17999/2)), 
                             out_dsn = out, 
                             options = c("-nomd", "-overwrite", 
                                         "-co", "COMPRESS=DEFLATE", 
                                         "-of", "COG", 
                                         "-co", "BLOCKSIZE=1024", 
                                         "-co", "NUM_THREADS=ALL_CPUS", 
                                         "-co", "PREDICTOR=STANDARD", 
                                         "-co", "RESAMPLING=AVERAGE", 
                                         "-co", "SPARSE_OK=YES", 
                                         "-multi"))

This way we can't also normalize to 36000x18000, but it means no extra storage and no commitment to replacing the files (I'm not sure about raw read performance from the actual netcdf at native resolution versus a tif).
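For comparison, gdaladdo can build that external .ovr sidecar directly in read-only mode (a sketch, not tested against these files; the overview compression is set via a config option):

```shell
# external overviews alongside the VRT; -ro never modifies the source
gdaladdo -ro -r average --config COMPRESS_OVERVIEW DEFLATE \
  20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.vrt 2 4 8 16
```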

mdsumner commented 1 year ago

A nearly-ready script; it takes about 210 seconds to do one file. I'm reluctant to parallelize this but will see how my patience holds out.

library(raadtools)
nmax <- 366
files <- ghrsstfiles()

# "/rdsi/PUBLIC/raad/data/podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
# files$fullname[1]
outpath <- function(x) {
  x <- gsub("podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1", 
       "idea.public/ghrsst", x)
  x <- gsub("\\.nc$", ".tif", x)
  ## note we use this for safe-ish write
  gsub("raad/data", "raad2/data", x)
}

#all(stringr::str_detect(files$fullname, "podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1"))
#outpath(files$fullname[1])
dsnit <- function(x) {
  dsn <- dsn::vrtcon(dsn::sds(x, "analysed_sst", "netCDF"), 
                     a_scale=0.001, a_offset = 25, 
                     a_ullr="-180,90,180,-90", a_srs="OGC:CRS84")
  dsn
}
warpit <- 
purrr::safely(  
  function(x, out_dsn) {

  vapour::gdal_raster_dsn(x,  out_dsn = out_dsn, 
                          target_dim = c(36000, 18000), 
                          target_ext = c(-180, 180, -90, 90),
                          options=c("-nomd", "-overwrite", 
                                    "-co", "COMPRESS=DEFLATE", 
                                    "-of", "COG", 
                                    "-co", "BLOCKSIZE=1024", 
                                    "-co", "NUM_THREADS=ALL_CPUS", 
                                    "-co", "PREDICTOR=STANDARD", 
                                    "-co", "RESAMPLING=AVERAGE", 
                                    "-co", "SPARSE_OK=YES", 
                                    "-multi"))

}
)
count <- 0
files <- files[nrow(files):1, ]
for (i in seq_along(files$fullname)) {
  count <- count + 1
  if (count > nmax) break;

  dsn <- dsnit( files$fullname[i])
  out_dsn <- outpath(files$fullname[i])

  if (file.exists(out_dsn)) next;
  if (!dir.exists(dirname(out_dsn))) dir.create(dirname(out_dsn), recursive = TRUE)
  warpit(dsn, out_dsn)
}
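For scale, at ~210 s/file the loop above (nmax = 366) is roughly a day of wall time, and the full archive of ~7500 files would be two to three weeks serially:

```shell
# rough serial wall-time at 210 seconds per file
echo "366 files:  ~$(( 366 * 210 / 3600 )) hours"   # → ~21 hours
echo "7500 files: ~$(( 7500 * 210 / 86400 )) days"  # → ~18 days
```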
mdsumner commented 1 year ago

is this a good time to point out that GHRSST hasn't been updated since April?

I guess it's because of https://podaac.jpl.nasa.gov/OPeNDAP-in-the-Cloud ? @raymondben

raymondben commented 1 year ago

Newer files are in archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/ (Jan 2023 - current). Suspect I've updated the source when they moved their remote server URL but I haven't merged all our files into a single directory tree (older files are in podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/).

mdsumner commented 1 year ago

oh awesome, ty that's all on me to update then, I'll reflect that in the tif path

how do you feel about me making a copy as tifs? I feel like it's OK for now; I'll stick with a recent year or two and experiment (7500 netcdfs is a bit under 350 GB, and slightly larger as tif)
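A quick sanity check on the per-file average implied by those numbers (decimal units assumed):

```shell
# ~350 GB spread across 7500 files
echo "~$(( 350000 / 7500 )) MB per netcdf on average"  # → ~46 MB
```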

mdsumner commented 1 year ago

I say copy but it only includes the sst var 🙏

mdsumner commented 1 year ago

Similar for IBCSO; I demonstrated the advantages to Patrick Schwarzbach, and will follow up with a doc to illustrate:

gdal_translate  "/rdsi/PUBLIC/raad/data/download.pangaea.de/dataset/937574/files/IBCSO_v2_ice-surface.tif" ibcso.tif -co COMPRESS=DEFLATE -co PREDICTOR=STANDARD -of COG -co SPARSE_OK=YES
mdsumner commented 1 year ago

Fixed: raadfiles was looking for the files correctly, but the file list was segfaulting for some reason ... could be disk or package versions.

raadsync is now updated for system, R, CRAN, and gh packages. There's ~500 MB of free disk on raadsync (I tried a resize but it doesn't work, so it needs a snapshot and relaunch; I'm not sure if anything hooks to the actual IP address of raadsync).

There's a job on ace-ecostats that checks for new GHRSST and writes COGs regularly; it should now be reliable enough to point raad at the COGs.

Having chats with Alex Leith about hosting on source.coop (Radiant Earth); it could also go in the AADC sandbox, or on NCI + Nirin.

mdsumner commented 1 year ago

As per Joshua Sixsmith's advice we should use ZSTD.
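e.g. the same COG recipe with DEFLATE swapped out (a sketch; ZSTD support in the COG driver depends on the GDAL/libtiff build):

```shell
gdal_translate in.tif out.tif -of COG \
  -co COMPRESS=ZSTD -co PREDICTOR=STANDARD -co SPARSE_OK=YES
```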

mdsumner commented 1 year ago

Here's an example: all days in 2020 collapsed to an OISST equivalent.

https://gist.github.com/mdsumner/c656a4ee88e3d82c55e305a2bcb96944