fsspec / GSoC-kechunk-2022

MIT License

Use kerchunk to join variables and ensembles!? #8

Open rsignell-usgs opened 2 years ago

rsignell-usgs commented 2 years ago

@peterm790, please take a look at https://discourse.pangeo.io/t/efficient-access-of-ensemble-data-on-aws/2530 and see if you think the current version of kerchunk could handle it!

peterm790 commented 2 years ago

@rsignell-usgs I think this could work. I will have a go at using the file names, as in https://github.com/fsspec/GSoC-kechunk-2022/issues/4, append the ensemble member number to the variable names, which I think would allow for creating a single virtual dataset.
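A minimal sketch of that renaming step, operating directly on the "refs" mapping inside a kerchunk reference JSON (the key layout and the rename_vars helper are my assumptions for illustration, not kerchunk API):

```python
def rename_vars(refs, member, variables):
    # refs: the "refs" mapping from a kerchunk reference JSON,
    # with keys like "streamflow/.zarray" or "streamflow/0.0"
    renamed = {}
    for key, val in refs.items():
        if "/" in key:
            var, rest = key.split("/", 1)
            if var in variables:
                key = f"{var}_{member}/{rest}"
        renamed[key] = val
    return renamed

# hypothetical reference set for one ensemble member
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "streamflow/.zarray": "{}",
    "streamflow/0.0": ["s3://bucket/member_1.nc", 0, 100],
}
print(sorted(rename_vars(refs, 1, ["streamflow"])))
```

Once every member's variables carry a unique suffix, the per-member reference sets no longer collide and can be merged into one virtual dataset.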

peterm790 commented 2 years ago

I suspect there may be a better way to do this, but for now the following will work to create a virtual dataset across all ensemble members: https://nbviewer.org/gist/peterm790/f50179e19ca54de63896d7340bd5c878

This obviously only deals with creating a virtual dataset across the ensemble, not with updating when new runs become available.

martindurant commented 2 years ago

Nice to show it can be done! I'm pretty sure that you can replace the postprocess with coo_map={"VARNAME": ...} where you would fill in the ellipsis with a regex (re.Pattern) that gets the variable name from the filename.

rsignell-usgs commented 2 years ago

@peterm790 nice work making this workflow compact and efficient!

I was wondering whether it would be possible to create an ensemble dimension using kerchunk, instead of the ensemble members being in different variables.

martindurant commented 2 years ago

> would be possible to create an ensemble dimension

This would have labels rather than coordinates? I don't see why not; we would need to specify an object dtype and maybe somewhere specify the encoding for this (JSON would be fine).

peterm790 commented 2 years ago

@rsignell-usgs I think having the ensembles along a dimension would be neater, although I'm not entirely sure whether we can add a dimension using kerchunk? I could maybe modify the crs dimension to achieve this?

This will work, however:

  flist = fs2.glob('/home/peterm790/shared/users/petermarsh/ensemble_NWM/*.json')

  d_ = []
  for file in flist:
      fs = fsspec.filesystem("reference", fo=file,
                             ref_storage_args={'skip_instance_cache': True},
                             remote_protocol='s3', remote_options={'anon': False})
      d_.append(fs.get_mapper(""))

  ds_kerchunk = xr.open_mfdataset(d_, engine="zarr",
                                  backend_kwargs={'consolidated': False},
                                  chunks={}, combine='nested',
                                  concat_dim=[list(range(1, 8))])

martindurant commented 2 years ago

If xarray can concat it, kerchunk can :)

rsignell-usgs commented 2 years ago

In xarray concat it's possible to specify a new dimension name.
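As a small xarray-only sketch of that (the data and names here are made up): passing a named pandas Index as the dim argument gives the new dimension both its name and its labels:

```python
import numpy as np
import pandas as pd
import xarray as xr

# two hypothetical ensemble members sharing identical coordinates
members = [
    xr.Dataset({"flow": ("x", np.arange(3.0))}),
    xr.Dataset({"flow": ("x", np.arange(3.0) + 1)}),
]

# a named pandas Index supplies the new dimension name and its labels
ds = xr.concat(members, dim=pd.Index([1, 2], name="ensemble"))
print(ds.flow.shape)  # (2, 3)
```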

martindurant commented 2 years ago

I added the following test to the codebase to show that kerchunk already supports this:

def test_var_dimension(refs):
    mzz = MultiZarrToZarr([refs["simple1"], refs["simple_var1"]], remote_protocol="memory",
                          coo_map={"varname": "VARNAME", "var": "output"})
    out = mzz.translate()
    z = xr.open_dataset(
        "reference://",
        backend_kwargs={"storage_options": {"fo": out, "remote_protocol": "memory"},
                        "consolidated": False},
        engine="zarr",
        chunks={}
    )
    assert list(z) == ["output"]
    assert list(z.dims) == ["varname", "x", "y"]
    assert list(z.varname) == ["data", "datum"]
    assert z.output.shape == (2, 10, 10)
    assert (z.output.values == arr).all()

We say that dimension "varname" is created from the names of the variables in the input, and that the variable name in the output is to be "output".

martindurant commented 2 years ago

Will leave this open pending a successful example more complex than my test case.

rsignell-usgs commented 2 years ago

Hopefully @peterm790 can try this out on the National Water Model ensemble members!

peterm790 commented 2 years ago

Okay, so I have had some success creating the ensemble dimension using the MultiZarrToZarr regex method:

  ex = re.compile(r'\d+(?=.json)')

  mzz = MultiZarrToZarr(flist,
                        remote_protocol='s3',
                        remote_options={'anon': True},
                        coo_map={'ensemble': ex},
                        concat_dims=['ensemble'],
                        identical_dims=['reference_time', 'time', 'feature_id'])

(Here's the full work flow: https://gist.github.com/peterm790/c8c08a1e23aacfa007844cbc5affef66)

I did, however, have to modify the regex used in MultiZarrToZarr to a search rather than a match: https://github.com/fsspec/kerchunk/commit/a37acc00093372e566cea7ed15962ba97afc912d

I am not sure if that change is necessary; it could well be due to me being quite new to regex, although perhaps an option for the user to set regex='match' or regex='search' could be a good idea?
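The match/search distinction does matter here: re.match only tries the pattern at the start of the string, so it fails on a full path whose member number sits at the end. A quick illustration with a made-up filename (note this version escapes the dot, unlike the pattern above):

```python
import re

fn = "s3://bucket/nwm/ensemble_member_3.json"
pat = re.compile(r"\d+(?=\.json)")

# re.match anchors at position 0, so it finds nothing in a full path...
print(pat.match(fn))           # None
# ...while re.search scans the whole string and finds the member number
print(pat.search(fn).group())  # '3'
```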

Then a final issue is that the above results in feature_id being all NaN. I will see if I can get to the bottom of why this is today.

peterm790 commented 2 years ago

In fact, the above does not work, as all variables are NaN. The following maps all coordinates correctly, but the variables are still all NaN:

  mzz = MultiZarrToZarr(flist,
                        remote_protocol='s3',
                        remote_options={'anon': True},
                        coo_map={'ensemble': ex},
                        concat_dims=['feature_id', 'ensemble'],
                        identical_dims=['reference_time', 'time'])

Which is odd, as the mappings within the final JSONs (for instance ['nudge/0.0.0']) are identical between this and the initial version, which renamed the variables using postprocess.

martindurant commented 2 years ago

The following regex appears to work: r'.*?(\d+).json'.

Also, I see "remote_protocol": "memory"; this should still be "s3", I think, since that is the location of the original data files.

martindurant commented 2 years ago

By the way, you may find it helpful in debugging this kind of thing to turn on logging, e.g.,

  fsspec.utils.setup_logging(logger_name="fsspec.reference")  # the reference file systems
  fsspec.utils.setup_logging(logger_name="s3fs")  # S3 calls

peterm790 commented 2 years ago

Aah right! Sorry, forgetting to change the "remote_protocol" was a very silly error. In that case the following workflow works to create a virtual dataset from the NWM ensemble: https://nbviewer.org/gist/peterm790/6e4826ed83218c111f71c9fe95e69163

rsignell-usgs commented 2 years ago

Very cool Peter!

Just a note to answer an issue I saw you raised in that notebook: that you can't create the JSON for new data without actually accessing the files, because the byte ranges change.

Unless the stored data uses no compression (e.g. NetCDF3), the byte ranges will always be changing, because the blocks compress by different amounts as the values change. And even for NetCDF3, the byte ranges would likely change, because the metadata will be changing. Make sense?
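A quick stdlib illustration of why the offsets drift: two chunks with identical uncompressed sizes compress to very different lengths depending on their content, so every chunk after the first lands at a different byte offset in each new file (the data below is synthetic):

```python
import random
import zlib

random.seed(0)
n = 8000  # same uncompressed chunk size in both cases

calm = bytes(n)                                       # e.g. a dry spell: all zeros
wet = bytes(random.randrange(256) for _ in range(n))  # highly varying values

# identical input lengths, very different compressed lengths,
# so the byte ranges recorded in the reference JSON must be re-read
print(len(zlib.compress(calm)), len(zlib.compress(wet)))
```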

martindurant commented 2 years ago

@rsignell-usgs : I can't remember, is CRS supposed to be considered a coordinate, or is it "identical" in all inputs? It doesn't matter that it has 300 8-byte chunks, but perhaps it's a little untidy.

peterm790 commented 2 years ago

@rsignell-usgs thanks. Yes, this makes sense; a pity, as it would be pretty neat to automatically generate kerchunks for new model runs.

@martindurant CRS is identical across each ensemble member (as well as across time)

rsignell-usgs commented 2 years ago

And yes, as @peterm790 said, CRS will always be just a single value for a variable, and usually just a single value for the entire dataset.

peterm790 commented 2 years ago

Not really sure where I was going with this, but just to explore removing CRS as a variable while keeping the CRS attributes, I set this up using postprocess and the kerchunk.combine.drop() method:

https://nbviewer.org/gist/peterm790/35ca1858db8981c3d4a234f2f48b5975

  def postprocess(refs):
      # grab the original esri_pe_string attribute from the first input file;
      # the .zattrs values are JSON strings, so parse them with ujson.loads
      # rather than eval
      with open(flist[0]) as json_file:
          esri_pe_string = ujson.loads(ujson.load(json_file)['refs']['crs/.zattrs'])['esri_pe_string']
      crs_attrs = ujson.loads(refs['crs/.zattrs'])
      crs_attrs['esri_pe_string'] = esri_pe_string
      refs['crs/.zattrs'] = ujson.dumps(crs_attrs)
      return refs

  mzz = MultiZarrToZarr(flist,
                        remote_protocol='s3',
                        remote_options={'anon': True},
                        coo_map={'ensemble': ex, 'crs': ''},
                        concat_dims=['ensemble'],
                        identical_dims=['feature_id', 'reference_time', 'time'],
                        preprocess=kerchunk.combine.drop('crs'),
                        postprocess=postprocess)

  out = mzz.translate()

Again, not really sure if this is necessary, but maybe a good demonstration of how to remove variables, as well as how to modify attributes, using postprocess.