Closed: mdsumner closed this issue 1 year ago
it's insanely fast
library(terra)
plot(project(rast("out.tif"), rast(ext(c(-1, 1, -1 , 1)) * 1e7, res = 25000, crs = "+proj=laea +lon_0=180"), by_util = T) * 1)
plot(project(rast("out.tif"), rast(ext(c(-1, 1, -1 , 1)) * 5e5, res = 200, crs = "+proj=laea +lon_0=147 +lat_0=-46"), by_util = T) * 1)
another option is to use a VRT of the original NetCDF, but create an auxiliary overview-tif that sits alongside it as a .vrt.ovr file; that's all an overview file is, a COG with levels 0, 1, 2, ... (these are subsamples of the source at /2, /4, /8, ...), e.g.
## create a VRT from the above dsn
out <- "20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.vrt.ovr"
f <- vapour::gdal_raster_dsn(dsn, target_dim = c(18000, as.integer(17999/2)), out_dsn = out, options=c("-nomd", "-overwrite", "-co", "COMPRESS=DEFLATE", "-of", "COG", "-co", "BLOCKSIZE=1024", "-co", "NUM_THREADS=ALL_CPUS", "-co", "PREDICTOR=STANDARD", "-co", "RESAMPLING=AVERAGE", "-co", "SPARSE_OK=YES", "-multi"))
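The overview levels mentioned above halve the grid at each step; a quick self-contained sketch of the dimensions you'd get from a 36000 x 17999 source (pure arithmetic, no GDAL needed; the exact rounding convention for odd dimensions depends on the tool, ceiling is used here as an assumption):

```r
## dimensions of successive overview levels, each a /2, /4, /8, ... subsample
ov_dims <- function(ncol, nrow, levels = 4) {
  lapply(seq_len(levels), function(i) {
    ## round up when halving an odd dimension (convention varies by tool)
    c(ncol = ceiling(ncol / 2^i), nrow = ceiling(nrow / 2^i))
  })
}
ov_dims(36000, 17999)
## level 1 is 18000 x 9000, level 2 is 9000 x 4500, ...
```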
this way we can't also normalize to 36000x18000, but it means no extra storage or commitment to replacing the files (I'm not sure about raw performance with the actual NetCDF at native res versus in a tif)
a nearly-ready script; it takes about 210 seconds to do one file, I'm reluctant to parallelize this but will see how my patience holds out
library(raadtools)
nmax <- 366
files <- ghrsstfiles()
# "/rdsi/PUBLIC/raad/data/podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
# files$fullname[1]
outpath <- function(x) {
  x <- gsub("podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1",
            "idea.public/ghrsst", x)
  x <- gsub("\\.nc$", ".tif", x)
  ## note we use this for safe-ish write
  gsub("raad/data", "raad2/data", x)
}
#all(stringr::str_detect(files$fullname, "podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1"))
#outpath(files$fullname[1])
dsnit <- function(x) {
  dsn <- dsn::vrtcon(dsn::sds(x, "analysed_sst", "netCDF"),
                     a_scale = 0.001, a_offset = 25,
                     a_ullr = "-180,90,180,-90", a_srs = "OGC:CRS84")
  dsn
}
warpit <- purrr::safely(
  function(x, out_dsn) {
    vapour::gdal_raster_dsn(x, out_dsn = out_dsn,
                            target_dim = c(36000, 18000),
                            target_ext = c(-180, 180, -90, 90),
                            options = c("-nomd", "-overwrite",
                                        "-co", "COMPRESS=DEFLATE",
                                        "-of", "COG",
                                        "-co", "BLOCKSIZE=1024",
                                        "-co", "NUM_THREADS=ALL_CPUS",
                                        "-co", "PREDICTOR=STANDARD",
                                        "-co", "RESAMPLING=AVERAGE",
                                        "-co", "SPARSE_OK=YES",
                                        "-multi"))
  }
)
count <- 0
files <- files[nrow(files):1, ]
for (i in seq_along(files$fullname)) {
  count <- count + 1
  if (count > nmax) break
  dsn <- dsnit(files$fullname[i])
  out_dsn <- outpath(files$fullname[i])
  if (file.exists(out_dsn)) next
  if (!dir.exists(dirname(out_dsn))) dir.create(dirname(out_dsn), recursive = TRUE)
  warpit(dsn, out_dsn)
}
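A quick standalone check of the path mapping (this just repeats the outpath helper from the script above and applies it to the example path from the comment):

```r
## repeat of the outpath helper above, for a self-contained check
outpath <- function(x) {
  x <- gsub("podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1",
            "idea.public/ghrsst", x)
  x <- gsub("\\.nc$", ".tif", x)
  ## write to raad2 rather than raad, for safe-ish write
  gsub("raad/data", "raad2/data", x)
}
src <- "/rdsi/PUBLIC/raad/data/podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
outpath(src)
## "/rdsi/PUBLIC/raad2/data/idea.public/ghrsst/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.tif"
```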
is this a good time to point out that GHRSST hasn't been updated since April?
I guess it's because of https://podaac.jpl.nasa.gov/OPeNDAP-in-the-Cloud ? @raymondben
Newer files are in archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/ (Jan 2023 - current). Suspect I've updated the source when they moved their remote server URL but I haven't merged all our files into a single directory tree (older files are in podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/).
oh awesome, ty that's all on me to update then, I'll reflect that in the tif path
how do you feel about me making a copy as tifs? I feel like it's ok for now, I'll stick with a recent year or two and experiment (7500 NetCDFs is a bit under 350 GB, and slightly larger as tif)
I say copy but it only includes the sst var 🙏
similar for IBCSO, demonstrated the advantages to Patrick Schwarzbach, and will follow up with a doc to illustrate
gdal_translate "/rdsi/PUBLIC/raad/data/download.pangaea.de/dataset/937574/files/IBCSO_v2_ice-surface.tif" ibcso.tif -co COMPRESS=DEFLATE -co PREDICTOR=STANDARD -of COG -co SPARSE_OK=YES
fixed, raadfiles was looking correctly for the files, but the file list was segfaulting for some reason ... could be disk or package versions
raadsync is now updated for system, R, CRAN packages and gh packages; there's ~500 MB free disk on raadsync (I tried resizing but it doesn't work, so it needs a snapshot and relaunch; I'm not sure if there are any hooks to the actual IP address of raadsync)
there's a job on ace-ecostats that checks for new GHRSST and writes COGs regularly; it should now be reliable enough to point raad at the COGs
having chats with Alex Leith about hosting at source.coop (RadiantEarth), or it could go in the AADC sandbox, or on NCI+Nirin
as per Joshua Sixsmith's advice we should use ZSTD
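Following that advice, the switch would just be a change to the creation options used in the script above (a sketch only: ZSTD needs a GDAL build with libzstd, and predictor and compression-level choices are worth benchmarking first):

```r
## the COG creation options from the warp script, with COMPRESS swapped
## from DEFLATE to ZSTD
cog_opts <- c("-nomd", "-overwrite",
              "-of", "COG",
              "-co", "COMPRESS=ZSTD",
              "-co", "BLOCKSIZE=1024",
              "-co", "NUM_THREADS=ALL_CPUS",
              "-co", "PREDICTOR=STANDARD",
              "-co", "RESAMPLING=AVERAGE",
              "-co", "SPARSE_OK=YES",
              "-multi")
## then as before, e.g.
## vapour::gdal_raster_dsn(dsn, out_dsn = out_dsn,
##                         target_dim = c(36000, 18000), options = cog_opts)
```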
here's an example, all days in 2020 collapsed to an OISST equivalent
https://gist.github.com/mdsumner/c656a4ee88e3d82c55e305a2bcb96944
Here's the incantation
I can't see how to make the overviews have sensible sizes, but it might be worth resampling to a sensible grid of 36000x18000 rather than 36000x17999
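The appeal of 36000x18000 is that it gives an exact 0.01-degree global grid that halves cleanly for overviews, whereas 17999 rows don't divide 180 degrees evenly:

```r
## cell sizes over the global extent (-180..180, -90..90)
360 / 36000   # 0.01 degrees in x either way
180 / 17999   # native y resolution, not a round number
180 / 18000   # 0.01 degrees, and halves cleanly (9000, 4500, ...)
```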
Setting a_offset allows us to use the inbuilt scaling, and hence the extra compression of integer+DEFLATE, but then GDAL unscaling gives Celsius not Kelvin.
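The reason is that the dsnit() helper above overrides the offset: the source scaling is scale 0.001 with an offset in Kelvin, so forcing a_offset = 25 shifts the unscaled values down by 273.15 into Celsius. A toy check of the two unscalings (the 298.15 native offset is my reading of the MUR metadata, treat it as an assumption):

```r
## raw stored integer -> physical value: raw * scale + offset
unscale <- function(raw, scale, offset) raw * scale + offset
raw <- 2000                           # example stored integer
unscale(raw, 0.001, 298.15)           # 300.15, Kelvin with the (assumed) native offset
unscale(raw, 0.001, 25)               # 27, degrees Celsius via a_offset = 25
## the difference is 298.15 - 25 = 273.15, exactly the Kelvin/Celsius shift
```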
The tif is equivalent in size to the netcdf, and it's much much faster (terra too, if you use dev with 'by_util = TRUE').
@raymondben this will be a very nice addition to the https:// available raad-files, and I think it can in practical terms even replace OISST because it's so fast to read/warp from overview; I'll explore setting it up to be generated routinely