fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

handle files with different scale_factor and add_offset #212

rsignell-usgs opened this issue 2 years ago

rsignell-usgs commented 2 years ago

It would be nice to handle netcdf files with different scale_factor and add_offset.

I recently used the ECMWF API to generate a bunch of netcdf3 files (in parallel, using dask!) but unfortunately the generated files all have different scale_factor and add_offset.

Here are four files that we would like to virtually aggregate with kerchunk:

14:30 $ aws s3 ls s3://rsignellbucket1/era5_land/ --endpoint https://mghp.osn.xsede.org --no-sign-request
2022-08-18 14:28:56  192215564 conus_2019-12-01.nc
2022-08-18 14:28:58  192215560 conus_2019-12-15.nc
2022-08-18 14:28:58  192215560 conus_2019-12-29.nc
2022-08-18 14:28:59  192215560 conus_2020-01-12.nc
rsignell-usgs commented 2 years ago

Thinking about this some more, I'm not sure this is possible. There is no way in the NetCDF data model (or Zarr data model) to store metadata like add_offset and scale_factor that changes on a per-chunk basis like this, right @martindurant?

martindurant commented 2 years ago

I believe it would require a change in zarr/numcodecs to pass context=, about where we are in the overall array, to the codec. I've wanted this kind of thing for a while (including start/stop to be passed to the storage layer), so I suppose I should just do it, rather than go around asking people if they agree it's a good idea.
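
A rough, hypothetical sketch of what that could look like from the codec side (context= is not part of numcodecs today; the class name, constructor, and the extra decode argument below are invented purely to illustrate the proposal):

import numpy as np
from numcodecs.abc import Codec

class PerChunkScaleOffset(Codec):
    # Hypothetical codec: zarr would pass a `context` identifying the chunk
    # being decoded, and the codec would look up that chunk's scale/offset.
    codec_id = "per_chunk_scale_offset"

    def __init__(self, params_by_chunk, dtype="<i2", astype="<f4"):
        self.params_by_chunk = params_by_chunk  # e.g. {"0.0.0": (scale, offset), ...}
        self.dtype = dtype    # packed integer dtype
        self.astype = astype  # unpacked float dtype

    def decode(self, buf, out=None, context=None):
        scale, offset = self.params_by_chunk[context["chunk_key"]]
        packed = np.frombuffer(bytes(buf), dtype=self.dtype)
        return packed.astype(self.astype) * scale + offset

    def encode(self, buf, context=None):
        raise NotImplementedError("decode-only sketch")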

rsignell-usgs commented 2 years ago

Go Martin Go! It certainly would be handy in this ERA5 API workflow!

rsignell-usgs commented 2 years ago

Just a note that I implemented this workaround to deal with a collection of packed NetCDF3 files with different scale_factor and add_offset variable attributes in each file:

After the dask worker downloads the packed NetCDF3 file, it loads that file into Xarray and computes the bitinformation using xbitinfo. The keepbits for each variable are determined based on 99% retention of real information. The dataset is then bitrounded using that information and stored as 32-bit floats in NetCDF4 before being pushed to S3.
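
Roughly, the per-file step looks like this (a sketch assuming xbitinfo's documented get_bitinformation/get_keepbits/xr_bitround functions; the file names, the "longitude" dimension, and the encoding choices are placeholders):

import xarray as xr
import xbitinfo as xb

# the packed NetCDF3 file the dask worker just downloaded (placeholder name);
# xarray applies scale_factor/add_offset on read, giving floats
ds = xr.open_dataset("conus_2019-12-01.nc")

# estimate real information content and keep enough mantissa bits to retain 99% of it
bitinfo = xb.get_bitinformation(ds, dim="longitude")
keepbits = xb.get_keepbits(bitinfo, inflevel=0.99)

# bitround, then write compressed 32-bit floats to NetCDF4 before uploading to S3
ds_rounded = xb.xr_bitround(ds, keepbits)
encoding = {v: {"dtype": "float32", "zlib": True, "complevel": 4} for v in ds_rounded.data_vars}
ds_rounded.to_netcdf("conus_2019-12-01_f32.nc", engine="netcdf4", encoding=encoding)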

The resulting bitrounded 32-bit compressed NetCDF4 files are 3 times smaller than the packed 8-bit integer NetCDF3 files!

And of course then MultiZarrToZarr works nicely.
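
For reference, the aggregation step is then roughly the following (the bucket paths and the concat dimension are placeholders):

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

urls = [
    "s3://bucket/era5_land/conus_2019-12-01_f32.nc",  # placeholder keys
    "s3://bucket/era5_land/conus_2019-12-15_f32.nc",
]

# one reference set per NetCDF4 file
singles = []
for u in urls:
    with fsspec.open(u, anon=True) as f:
        singles.append(SingleHdf5ToZarr(f, u).translate())

# combine them into one virtual dataset along time
combined = MultiZarrToZarr(
    singles, concat_dims=["time"], remote_protocol="s3", remote_options={"anon": True}
).translate()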

martindurant commented 2 years ago

Well, but you could have saved them as zarr in that case too, right? No one ever said netCDF3 was space efficient...

rsignell-usgs commented 2 years ago

I could have, but since there is no difference in cloud-based access between NetCDF4 and Zarr when using Python/ReferenceFileSystem, and NetCDF4 files are more friendly for R users, NetCDF4 seems the better choice here!
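
(The Python access path is identical either way; a minimal sketch of opening the combined references, with the reference file name as a placeholder:)

import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",        # kerchunk reference file (placeholder)
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)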

martindurant commented 2 years ago

That is an interesting argument that maybe should be said more loudly somewhere.

mikejohnson51 commented 2 years ago

Hi all - @dblodgett-usgs asked me to consider this from an R perspective. For this type of task (lazy read, access, manipulation) I rely on terra and its GDAL bindings.

The TL;DR is that R doesn't have good (any?) Zarr support, but it does tackle the above challenges pretty well for NetCDF/COG/TIF/STAC/THREDDS-type data. I believe GDAL handles the unpacking of the data, so the aggregation example below should apply to that as well.

Below are some brief examples using the above datasets and the public NWM forcings. The workflow relies on prepending the vsis3 (or vsicurl) prefix to data URLs.

ERA5-land data

# load libraries
library(terra);library(dplyr)
# Here is the link to an S3 file with the GDAL vsis3 prefix prepended
url = '/vsis3/era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc'

# Lazy Read
(d = rast(url))
#> Warning: [rast] unknown extent

At this stage we don't have data, just structure.

With it we can look at layer names, dimensions and spatial attributes:

#> class       : SpatRaster 
#> dimensions  : 721, 1440, 744  (nrow, ncol, nlyr)
#> resolution  : 0.0006944444, 0.001386963  (x, y)
#> extent      : 0, 1, 0, 1  (xmin, xmax, ymin, ymax)
#> coord. ref. :  
#> source      : air_pressure_at_mean_sea_level.nc 
#> names       : air_p~vel_1, air_p~vel_2, air_p~vel_3, air_p~vel_4, air_p~vel_5, air_p~vel_6, ...

# Look at layer count and names
nlyr(d)
#> [1] 744

# Look at the "third" dimension
(head(names(d)))
#> [1] "air_pressure_at_mean_sea_level_1" "air_pressure_at_mean_sea_level_2"
#> [3] "air_pressure_at_mean_sea_level_3" "air_pressure_at_mean_sea_level_4"
#> [5] "air_pressure_at_mean_sea_level_5" "air_pressure_at_mean_sea_level_6"

# Look at spatial metadata
ext(d)
#> SpatExtent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
crs(d)
#> [1] ""

Slicing through "time" by layer number or name

Data is only fetched on the call to plot

plot(d[[3]])

With no spatial information we can only subset on the unit grid

Here data is only fetched for the lower-left quadrant on the call to crop

cr = crop(d[[50]], ext(0,.5,0,.5)) 

#> class       : SpatRaster 
#> dimensions  : 361, 720, 1  (nrow, ncol, nlyr)
#> resolution  : 0.0006944444, 0.001386963  (x, y)
#> extent      : 0, 0.5, 0, 0.5006935  (xmin, xmax, ymin, ymax)
#> coord. ref. :  
#> source      : memory 
#> name        : air_pressure_at_mean_sea_level_50 
#> min value   :                          96877.31 
#> max value   :                         103217.8

plot(cr)

We can add the spatial metadata if known:

If we do so, then we can subset across multiple dimensions - say for the state of California at interval "40"

ext(d) = c(-180,180,-90,90)
crs(d) = "EPSG:4326"

AOI = AOI::aoi_get(state = "CA") %>% 
  vect() %>% 
  project(crs(d))

plot(crop(d[[40]], AOI))
plot(AOI, add = TRUE)

Aggregating Files

Let's say we have 2 files, for January and February of 2020, that we want to read as a collection. We can define those URLs and treat them in aggregate.

Remembering that the first of those had 744 layers above, we will read them together and plot layers 744-745:

url = '/vsis3/era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc'
url2 = '/vsis3/era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc'

xx = rast(c(url, url2))
#> Warning: [rast] unknown extent
#> unknown extent

nlyr(xx)
#> [1] 1440

plot(xx[[744:745]])

Created on 2022-08-22 by the reprex package (v2.0.1)

NWM forcing example:

Start with a url endpoint and open a connection:

## URL
u = '/vsis3/noaa-nwm-retrospective-2-1-pds/forcing/1979/197902010000.LDASIN_DOMAIN1'

## Lazy Read
d = rast(u)
#> Warning: [rast] unknown extent

Explore layer slices and spatial properties

nlyr(d)
#> [1] 9
(names(d))
#> [1] "LQFRAC"   "LWDOWN"   "PSFC"     "Q2D"      "RAINRATE" "SWDOWN"   "T2D"     
#> [8] "U2D"      "V2D"
ext(d)
#> SpatExtent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
crs(d)
#> [1] ""

Assign missing spatial metadata:

ext(d) = ext(
  -2303999.62876143,
  2304000.37123857,
  -1920000.70008381,
  1919999.29991619
)
crs(d) = 'PROJCS["Sphere_Lambert_Conformal_Conic",GEOGCS["GCS_Sphere",DATUM["D_Sphere",SPHEROID["Sphere",6370000.0,0.0]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic"],PARAMETER["false_easting",0.0],PARAMETER["false_northing",0.0],PARAMETER["central_meridian",-97.0],PARAMETER["standard_parallel_1",30.0],PARAMETER["standard_parallel_2",60.0],PARAMETER["latitude_of_origin",40.000008],UNIT["Meter",1.0]];-35691800 -29075200 126180232.640845;-100000 10000;-100000 10000;0.001;0.001;0.001;IsHighPrecision'

# Plot (fetch data)
plot(d$LWDOWN)

That's not right!

The data is stored in reverse (top-down in GDAL), so we can flip it and plot:

plot(flip(d$LWDOWN))

plot(flip(crop(d$RAINRATE, project(AOI, crs(d)))))

(screenshot of the resulting plot, 2022-08-22)

So far, we have wrapped a lot of this logic in the package opendap.catalog (name changing soon) to facilitate the discovery of data files, the automation of metadata generation, and the subsetting/aggregating of files.

We have approached this by developing a catalog (https://mikejohnson51.github.io/climateR-catalogs/catalog.json) that stores the spatiotemporal and attribute metadata of discoverable assets, and then helps overcome the oddities of URL definition, applying the right spatial metadata, implementing transformations (e.g. flip), and slicing based on space and time. So far it's proven to be a very useful pattern for our team.

If anyone would like to talk about this kind of stuff I would love it!

Thanks,

Mike

Created on 2022-08-22 by the reprex package (v2.0.1)

rsignell-usgs commented 2 years ago

@mikejohnson51 this is awesome! Definitely should be a blog post somewhere!

rsignell-usgs commented 2 years ago

I showed this to a colleague and he mentioned https://r-spatial.github.io/stars/ as the "xarray" of R. @mikejohnson51 are you familiar with that one?

dblodgett-usgs commented 2 years ago

@rsignell-usgs -- I've been contributing some stuff to stars to help automated reading of CF-compliant sources. (see https://r-spatial.github.io/stars/reference/read_ncdf.html and https://r-spatial.github.io/stars/articles/stars8.html) Meanwhile, the GDAL multi dimension stuff has come along and my contributions are probably not really that needed anymore since stars is meant to primarily be a wrapper over GDAL.

ivirshup commented 2 years ago

Has no one tried to do zarr in R? zaRRR? It is not listed in the "known implementations" over at zarr, but there are Julia, JS, Java...

I believe bioconductor is doing some work on zarr support in R for OME-NGFF microscopy data. @MSanKeys963 may know more on how that's going.

martindurant commented 1 year ago

An alternative way to achieve this occurred to me. If we allow for preffs's model of multiple references per key, to be read and concatenated on load, then we can add any absolute value into the references set:

{
   "variable/0.1.0": [b'\x00\x00\x80?\x00\x00\x00\x00', [url, offset, size]],
   ...
}

for scale 1, offset 0 as float32. The tricky part is that the concatenated buffer would normally be passed to a decompressor before the scale/offset codecs - but not in the case of netCDF3 (which is uncompressed internally). So all we have to do in this case is make a modified scale/offset codec that takes its params from the buffer, something like what the JSON codec does.
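
A minimal sketch of that modified codec (hypothetical, not part of numcodecs; it assumes two little-endian float32 values, scale then offset, prepended to an otherwise uncompressed int16 chunk, as in the reference entry above):

import numpy as np
from numcodecs.abc import Codec
from numcodecs.compat import ensure_bytes

class InlineScaleOffset(Codec):
    # Hypothetical codec: read scale/offset from the first 8 bytes of each chunk.
    codec_id = "inline_scale_offset"

    def __init__(self, dtype="<i2", astype="<f4"):
        self.dtype = dtype    # packed integer dtype of the chunk body
        self.astype = astype  # unpacked float dtype

    def decode(self, buf, out=None):
        buf = ensure_bytes(buf)
        scale, offset = np.frombuffer(buf[:8], dtype="<f4")  # the prepended params
        packed = np.frombuffer(buf[8:], dtype=self.dtype)
        return packed.astype(self.astype) * scale + offset

    def encode(self, buf):
        raise NotImplementedError("decode-only sketch")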

For the general case where there is compression, we would need to make a compression wrapper codec that extracts the values, decompresses the rest, and passes the values back as a tuple (or remakes a combined buffer).

@rabernat, this looks a lot like your idea of including extra per-chunk information in the storage-layer metadata.

alex-s-gardner commented 1 year ago

+1 for supporting changing scale/offset.