ArgoCanada / argodata

Download Argo Ocean Float Data
https://argocanada.github.io/argodata

Use RNetCDF instead of ncdf4 #7

Closed · paleolimbot closed this issue 3 years ago

paleolimbot commented 3 years ago

Some preliminary work on this suggests that the read functions can be about 10x faster than they are with ncdf4!

paleolimbot commented 3 years ago

More work suggests that the difference shrinks as files get bigger: a 700 KB Sprof file reads only about 2x faster with RNetCDF.

paleolimbot commented 3 years ago
library(argodata)
# experimental RNetCDF-based reader (internal, not yet exported)
argo_read_prof_levels2 <- argodata:::argo_read_prof_levels2

nc_tiny <- system.file("cache-test/dac/csio/2900313/profiles/D2900313_000.nc", package = "argodata")
nc_big <- system.file("cache-test/dac/csio/2902746/2902746_Sprof.nc", package = "argodata")

waldo::compare(
  argo_read_prof_levels(nc_tiny),
  argo_read_prof_levels2(nc_tiny)
)
#> `names(old)[1:5]`: "N_PROF" "N_LEVELS"          "PRES" "PRES_QC" "PRES_ADJUSTED"
#> `names(new)[1:5]`:          "N_LEVELS" "N_PROF" "PRES" "PRES_QC" "PRES_ADJUSTED"

waldo::compare(
  argo_read_prof_levels(nc_big),
  argo_read_prof_levels2(nc_big)
)
#> `names(old)[1:5]`: "N_PROF" "N_LEVELS"          "PRES" "PRES_QC" "PRES_ADJUSTED"
#> `names(new)[1:5]`:          "N_LEVELS" "N_PROF" "PRES" "PRES_QC" "PRES_ADJUSTED"

bench::mark(
  argo_read_prof_levels(nc_tiny),
  argo_read_prof_levels2(nc_tiny),
  check = FALSE
)
#> # A tibble: 2 x 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 argo_read_prof_levels(nc_tiny)   20.75ms  21.74ms      45.6   222.9KB     28.1
#> 2 argo_read_prof_levels2(nc_tiny)   2.47ms   2.79ms     337.     12.9KB     17.9

bench::mark(
  argo_read_prof_levels(nc_big),
  argo_read_prof_levels2(nc_big),
  check = FALSE
)
#> # A tibble: 2 x 6
#>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 argo_read_prof_levels(nc_big)    46.2ms   46.6ms      20.9    6.79MB    17.5 
#> 2 argo_read_prof_levels2(nc_big)   25.8ms   27.2ms      35.7    3.76MB     2.10

Created on 2021-04-07 by the reprex package (v0.3.0)

mdsumner commented 3 years ago

Just OTOH: ncdf4 is blazing fast at opening a file and retrieving all of the metadata into one object, but sometimes that metadata is costly to grab (the classic case is GMT files that trigger a huge, unnecessary coordinate read), so you might be seeing that conflation.
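
A minimal sketch of the difference (the file path is hypothetical; these are just each package's standard open/read/close calls):

nc_file <- "D2900313_000.nc"  # hypothetical local Argo profile file

# ncdf4: nc_open() walks every dimension, variable, and attribute up front,
# so the full metadata cost is paid even to read a single variable
nc <- ncdf4::nc_open(nc_file)
pres <- ncdf4::ncvar_get(nc, "PRES")
ncdf4::nc_close(nc)

# RNetCDF: open.nc() just returns a handle; only what you ask for is read
con <- RNetCDF::open.nc(nc_file)
pres <- RNetCDF::var.get.nc(con, "PRES")
RNetCDF::close.nc(con)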

Also, RNetCDF is active and developed in public, whereas ncdf4 development is closed, seems stalled, and has bad relations with CRAN.

paleolimbot commented 3 years ago

Thank you for this... I had no idea about the backstory of ncdf4! I think you're bang on about the cost of grabbing all the metadata: many Argo files are only a few KB, so reading the metadata is effectively the entire cost of reading the file. Over a few thousand profiles that makes a big difference!

paleolimbot commented 3 years ago

This does seem like a pretty huge quality-of-life improvement for some common applications:

library(argodata)
argo_read_prof_levels2 <- argodata:::argo_read_prof_levels2
library(dplyr)
set.seed(393)
files <- argo_global_prof() %>% 
  sample_n(1500) %>% 
  argo_download()

system.time(thing <- argo_map(files, argo_read_prof_levels))
#>    user  system elapsed 
#>   38.56    2.22   41.17
system.time(thing <- argo_map(files, argo_read_prof_levels2))
#>    user  system elapsed 
#>    5.92    2.30   10.50

paleolimbot commented 3 years ago

Looks like these are both huge improvements over oce::read.argo() (with the added option of reading only a few variables at a time for speed):

bench::mark(
  oce::read.argo(files[1]),
  argo_read_prof_levels(files[1]),
  argo_read_prof_levels2(files[1]),
  check = FALSE
)
#> # A tibble: 3 x 6
#>   expression                            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 oce::read.argo(files[1])          69.92ms  72.45ms      13.3   386.8KB     2.22
#> 2 argo_read_prof_levels(files[1])   19.05ms  19.89ms      47.9   223.3KB     7.56
#> 3 argo_read_prof_levels2(files[1])   2.52ms   3.01ms     299.     13.2KB     4.21
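
On the "few variables at a time" point, a minimal sketch of what a targeted read looks like with plain RNetCDF (PRES and TEMP are standard Argo variable names; this is not the argodata API):

# pull just two variables instead of everything in the file;
# nothing else in the file is touched
con <- RNetCDF::open.nc(files[1])
pres <- RNetCDF::var.get.nc(con, "PRES")
temp <- RNetCDF::var.get.nc(con, "TEMP")
RNetCDF::close.nc(con)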

mdsumner commented 3 years ago

Probably too late by now ... but did you look at ncmeta? The idea was that it gets all the metadata (with efficiency options to control when and what) for general use. Navigating the NetCDF API via RNetCDF, and the hideous nested connection object generated by ncdf4, drove me crazy for years. That led me to the idea of grids (sets of dims) and axes (instances of dims), which are handy for tidync but not in general use as concepts AFAIK.
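
A minimal sketch of that grids/axes view, assuming a local Argo profile file (nc_grids() and nc_axes() are ncmeta's entry points for those two concepts):

library(ncmeta)

nc_file <- "D2900313_000.nc"   # hypothetical local Argo profile file
ncmeta::nc_grids(nc_file)  # one row per unique set of dims, with its variables
ncmeta::nc_axes(nc_file)   # one row per dimension instance across variables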

paleolimbot commented 3 years ago

This package 100% rips off that (your) idea (the first thing I did when I started this job was run tidync::tidync() on an Argo file and read your vignette!). Now that I look at ncmeta, it's totally what I was after, but I ran into ncdf4 first. As you noted, RNetCDF is a little too low-level, but with a tiny wrapper (https://github.com/ArgoCanada/argodata/blob/master/R/rnetcdf.R) it's very fast and leads to well-abstracted code. I'll investigate ncmeta in more detail tomorrow! Argo has some custom "string" dimensions that might mean I have to keep using my tiny wrapper (or maybe you're way ahead of me on this one).
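
For flavor, a rough sketch of the shape such a wrapper might take (this is not the actual R/rnetcdf.R code; read_vars_tbl() is made up for illustration):

# hypothetical helper: read named variables into a tibble, closing on exit
read_vars_tbl <- function(path, var_names) {
  con <- RNetCDF::open.nc(path)
  on.exit(RNetCDF::close.nc(con))
  values <- lapply(var_names, function(v) as.vector(RNetCDF::var.get.nc(con, v)))
  names(values) <- var_names
  tibble::as_tibble(values)
}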

In case you're interested in the design and how I totally ripped off your "grids" concept to abstract Argo: there are four types of files (prof, traj, meta, tech), each of which has a few "grids". There's one read function for each grid and a single read function for all the scalar variables (argo_info()). The grids approach means I never have to hard-code variable names and only have to update the read functions when the dimensions get updated (very rarely). They're also only one line each! There's the argo_read_(prof|traj|meta|tech)_<grid_name>() family of functions (each reads one NetCDF file) and argo_(prof|traj|meta|tech)_<grid_name>() (which applies some editorial tidying and reads more than one file at a time).
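
To make the one-line-per-grid idea concrete, a hedged sketch (read_grid() and the _sketch name are hypothetical, and character variables are skipped here because of the string-dimension wrinkle discussed below):

# hypothetical: read every non-character variable whose dimensions are
# exactly `dims` into a tibble; RNetCDF object IDs are zero-based
read_grid <- function(path, dims) {
  con <- RNetCDF::open.nc(path)
  on.exit(RNetCDF::close.nc(con))
  info <- RNetCDF::file.inq.nc(con)
  vars <- lapply(seq_len(info$nvars) - 1L, function(i) RNetCDF::var.inq.nc(con, i))
  dim_name <- function(id) RNetCDF::dim.inq.nc(con, id)$name
  on_grid <- vapply(vars, function(v) {
    v$type != "NC_CHAR" && v$ndims == length(dims) &&
      setequal(vapply(v$dimids, dim_name, character(1)), dims)
  }, logical(1))
  values <- lapply(vars[on_grid], function(v) as.vector(RNetCDF::var.get.nc(con, v$name)))
  names(values) <- vapply(vars[on_grid], `[[`, character(1), "name")
  tibble::as_tibble(values)
}

# each grid reader really is one line
argo_read_prof_levels_sketch <- function(path) read_grid(path, c("N_PROF", "N_LEVELS"))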

mdsumner commented 3 years ago

Oh cool, I really have to take a closer look - I've tangled with Argo for 20 years but never needed to really dig into it. It's such an R-suitable problem, and it has practically no "spatial" analog ;)

I'll look at the string dimensions; that's definitely come up. Though OTOH I can only think of NC_CHAR variables (which were dimensioned inconsistently, IIRC) and scalar variables, which I mistakenly had as axes. I might be able to get some clarity ...

paleolimbot commented 3 years ago

I only know enough about the string thing to read Argo files... some NC_CHAR variables are intended as "one row per character" and some as "one row per string value" (the latter have a dummy first dimension that is dropped by default in both ncdf4 and RNetCDF). I can't find any convention on the naming of that dummy first dimension, although in Argo they're named STRINGXX (2, 4, 8 ... 256) and DATE_TIME.
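
A minimal sketch of the two behaviours with RNetCDF, assuming a local copy of the profile file from earlier (PLATFORM_NUMBER is char PLATFORM_NUMBER(N_PROF, STRING8) in the Argo spec):

con <- RNetCDF::open.nc("D2900313_000.nc")  # hypothetical local copy

# default (rawchar = FALSE): the dummy STRING8 dimension collapses into
# string length, i.e. "one row per string value" -- one string per profile
RNetCDF::var.get.nc(con, "PLATFORM_NUMBER")

# rawchar = TRUE keeps the dummy dimension, returning a raw-byte array with
# one element per character -- what "one row per character" variables need
RNetCDF::var.get.nc(con, "PLATFORM_NUMBER", rawchar = TRUE)

RNetCDF::close.nc(con)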

mdsumner commented 3 years ago

Yes, the dummy-dropped thing - it stymied me for ages ...

paleolimbot commented 3 years ago

Realizing that the above sounds like I thought you didn't know this stuff already... I just got excited at the chance to nerd out about it!

mdsumner commented 3 years ago

Oh totally, I'm not crazy across it, I just go deep enough sometimes ... rarely have a taste for it anymore 😝

But R needs better NetCDF and NetCDF needs better R, I know that!