paleolimbot closed this issue 3 years ago.
More work suggests that this difference shrinks as files get bigger: a 700 KB Sprof file is only read about 2x faster with RNetCDF.
``` r
library(argodata)
argo_read_prof_levels2 <- argodata:::argo_read_prof_levels2

nc_tiny <- system.file("cache-test/dac/csio/2900313/profiles/D2900313_000.nc", package = "argodata")
nc_big <- system.file("cache-test/dac/csio/2902746/2902746_Sprof.nc", package = "argodata")

waldo::compare(
  argo_read_prof_levels(nc_tiny),
  argo_read_prof_levels2(nc_tiny)
)
#> `names(old)[1:5]`: "N_PROF" "N_LEVELS" "PRES" "PRES_QC" "PRES_ADJUSTED"
#> `names(new)[1:5]`: "N_LEVELS" "N_PROF" "PRES" "PRES_QC" "PRES_ADJUSTED"

waldo::compare(
  argo_read_prof_levels(nc_big),
  argo_read_prof_levels2(nc_big)
)
#> `names(old)[1:5]`: "N_PROF" "N_LEVELS" "PRES" "PRES_QC" "PRES_ADJUSTED"
#> `names(new)[1:5]`: "N_LEVELS" "N_PROF" "PRES" "PRES_QC" "PRES_ADJUSTED"

bench::mark(
  argo_read_prof_levels(nc_tiny),
  argo_read_prof_levels2(nc_tiny),
  check = F
)
#> # A tibble: 2 x 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 argo_read_prof_levels(nc_tiny)   20.75ms  21.74ms      45.6   222.9KB     28.1
#> 2 argo_read_prof_levels2(nc_tiny)   2.47ms   2.79ms     337.     12.9KB     17.9

bench::mark(
  argo_read_prof_levels(nc_big),
  argo_read_prof_levels2(nc_big),
  check = F
)
#> # A tibble: 2 x 6
#>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 argo_read_prof_levels(nc_big)    46.2ms   46.6ms      20.9    6.79MB    17.5
#> 2 argo_read_prof_levels2(nc_big)   25.8ms   27.2ms      35.7    3.76MB     2.10
```
Created on 2021-04-07 by the reprex package (v0.3.0)
Just OTOH: ncdf4 is blazing fast at opening a file and retrieving all the metadata in an object, but sometimes that metadata is costly to grab (the classic case is GMT files, where opening would trigger a huge, unnecessary coordinate read), so you might be seeing that conflation.
Also, RNetCDF is active and developed in the open, whereas ncdf4 development is closed, seems stalled, and has a poor relationship with CRAN.
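For context, RNetCDF's API really is a thin layer over the C library: you open a handle, pull exactly the variables you want, and close it, with no up-front metadata harvest. A minimal self-contained sketch (it writes a throwaway file first so there's something to read; the dimension and variable names are just illustrative):

``` r
library(RNetCDF)

# Write a throwaway single-variable file so the example is self-contained
tmp <- tempfile(fileext = ".nc")
nc <- create.nc(tmp)
dim.def.nc(nc, "N_LEVELS", 3)
var.def.nc(nc, "PRES", "NC_FLOAT", "N_LEVELS")
var.put.nc(nc, "PRES", c(5, 10, 20))
close.nc(nc)

# Reading pulls only what you ask for: no coordinate scan up front
nc <- open.nc(tmp)
pres <- var.get.nc(nc, "PRES")
close.nc(nc)
pres
```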
Thank you for this...I had no idea about the backstory of ncdf4! I think you're bang on about the cost of grabbing all the metadata: many Argo files are only a few KB, so reading the metadata is most of the cost of reading the file. Over a few thousand profiles that makes a big difference!
This does seem like a pretty huge quality-of-life improvement for some common applications:
``` r
library(argodata)
argo_read_prof_levels2 <- argodata:::argo_read_prof_levels2
library(dplyr)

set.seed(393)
files <- argo_global_prof() %>%
  sample_n(1500) %>%
  argo_download()

system.time(thing <- argo_map(files, argo_read_prof_levels))
#>    user  system elapsed
#>   38.56    2.22   41.17

system.time(thing <- argo_map(files, argo_read_prof_levels2))
#>    user  system elapsed
#>    5.92    2.30   10.50
```
Both look like huge improvements over oce::read.argo() (with the added possibility of only reading a few variables at a time for speed):
``` r
bench::mark(
  oce::read.argo(files[1]),
  argo_read_prof_levels(files[1]),
  argo_read_prof_levels2(files[1]),
  check = F
)
#> # A tibble: 3 x 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 oce::read.argo(files[1])         69.92ms  72.45ms      13.3   386.8KB     2.22
#> 2 argo_read_prof_levels(files[1])  19.05ms  19.89ms      47.9   223.3KB     7.56
#> 3 argo_read_prof_levels2(files[1])  2.52ms   3.01ms     299.     13.2KB     4.21
```
Probably too late by now ... but did you look at ncmeta? The idea was that it gets all the metadata (with efficiency options to control when and what) for general use. Navigating the NetCDF API via RNetCDF, and the hideous nested connection object generated by ncdf4, drove me crazy for years. It led me to the idea of grids (sets of dims) and axes (instances of dims), which are handy for tidync but not in general use as concepts AFAIK.
This package 100% rips off that (your) idea (the first thing I did when I started this job was to run tidync::tidync() on an Argo file and read your vignette!). Now that I look at ncmeta, it's totally what I was after, but I ran into ncdf4 first. As you noted, RNetCDF is a little too low-level, but with a tiny wrapper ( https://github.com/ArgoCanada/argodata/blob/master/R/rnetcdf.R ) it's very fast and leads to well-abstracted code. I'll investigate ncmeta in more detail tomorrow! Argo has some custom "string" dimensions that might mean I have to keep using my tiny wrapper (or maybe you're way ahead of me on this one).
In case you're interested in the design and how I totally ripped off your "grids" concept to abstract Argo: there are four types of files (prof, traj, meta, tech), each of which has a few "grids". There's one read function for each grid and a single read function for all the scalar variables (argo_info()). The grids approach means I never have to hard-code variable names and only have to update the read functions when the dimensions get updated (very rarely); the read functions are also only one line each! There's the argo_read_(prof|traj|meta|tech)_<grid_name>() family of functions (each reads one NetCDF file) and argo_(prof|traj|meta|tech)_<grid_name>() (which applies some editorial tidying and reads more than one file at a time).
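The "select variables by their dimension set" idea can be sketched with plain RNetCDF inquiry calls. This is a hypothetical illustration, not the actual argodata implementation: argo_read_grid() is a made-up name, and the real read functions live in the package.

``` r
library(RNetCDF)

# Read every variable whose dimensions exactly match `grid_dims`,
# so no variable names need to be hard-coded.
argo_read_grid <- function(path, grid_dims) {
  nc <- open.nc(path)
  on.exit(close.nc(nc))
  n_vars <- file.inq.nc(nc)$nvars
  out <- list()
  for (i in seq_len(n_vars) - 1L) {  # NetCDF variable ids are 0-based
    info <- var.inq.nc(nc, i)
    dim_names <- if (info$ndims == 0) {
      character(0)                   # scalar variable: no dimensions
    } else {
      vapply(info$dimids, function(d) dim.inq.nc(nc, d)$name, character(1))
    }
    if (identical(sort(dim_names), sort(grid_dims))) {
      out[[info$name]] <- var.get.nc(nc, i)
    }
  }
  out
}
```

When the dimension names are stable (as they mostly are in Argo), a one-line wrapper per grid is all each file type needs.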
Oh cool, I really have to have a closer look - I've tangled with Argo for 20 years but never needed to really hit it; it's such an R-suitable problem and has practically no "spatial" analog ;)
I'll look at the string dimensions; that's definitely come up, though OTOH I can only think of 'NC_CHAR' variables (which were dimensioned inconsistently, IIRC) and scalar variables, which I mistakenly had as axes. Might be able to get some clarity ...
I only know enough about the string thing to read Argo files...some NC_CHAR variables are intended as "one row per character" and some as "one row per string value" (these have a dummy first dimension that is dropped by default in both ncdf4 and RNetCDF). I can't find any convention on the naming of the dummy first dimension, although in Argo they're named STRINGXX (2, 4, 8 ... 256) and DATE_TIME.
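To make the "one row per character" case concrete, here is a simulated version of what the raw NC_CHAR storage looks like and how it collapses to one string per profile. The values and the STRING8 dimension here are made up; RNetCDF's var.get.nc() performs this collapse for you unless you pass rawchar = TRUE.

``` r
# Simulated NC_CHAR variable dimensioned (STRING8, N_PROF): the dummy
# first dimension STRING8 holds one character per row, one column per
# profile. The platform-number values are invented for illustration.
raw_chars <- matrix(
  c("2", "9", "0", "0", "3", "1", "3", " ",
    "2", "9", "0", "2", "7", "4", "6", " "),
  nrow = 8  # STRING8 comes first
)

# Collapse each column into a single string and trim the space padding,
# which is effectively what happens when the dummy dimension is dropped.
platform_ids <- trimws(apply(raw_chars, 2, paste0, collapse = ""))
platform_ids
#> [1] "2900313" "2902746"
```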
yes the dummy-dropped thing, it stymied me for ages ...
Realizing that the above sounds like I thought you didn't know this stuff already...I just got excited at the chance to nerd out about this stuff!
Oh totally, I'm not crazy across it I just go deep enough sometimes ... rarely have a taste for it anymore 😝
But R needs better NetCDF and NetCDF needs better R, I know that!
Some preliminary work on this suggests that the read functions can be about 10x faster than they are with ncdf4!