ExaESM-WP4 / 2022-nczarr-eval

Meta repo for the nczarr eval project
1 stars 0 forks source link

SWM with nczarr outputs #1

Open benbovy opened 2 years ago

benbovy commented 2 years ago

@willirath @martinclaus A few comments after having tried running the SWM model with nczarr outputs:

First, installing libnetcdf and netcdf-fortran from conda-forge seems to work well, so no need to build netcdf from source! (at least for writing zarr datasets in file directories as I don't think the conda-forge packages support S3 yet). libnetcdf is now built with nczarr support (https://github.com/conda-forge/libnetcdf-feedstock/issues/117) and netcdf-fortran just forwards the path to the C library. I had to install netcdf-fortran version 4.5.3 because of this issue: https://github.com/conda-forge/netcdf-fortran-feedstock/issues/71.

Then I used the following model settings in model.namelist (use case 1 from this repo):

  oprefix = "file:///path/to/use_case_1_zarr/nc_",
  osuffix = "#mode=xarray,file",

It didn't work out of the box as nczarr doesn't support unlimited dimensions, which is used in SWM for the time axis.

After editing the SWM source to use an arbitrary dimension size instead (a very dirty fix), I've been able to run the model until the end and read the output dataset with Xarray:

ds = xr.open_dataset("nc_000000000001_eta_out.nc", engine="zarr", consolidated=False)
ds
# <xarray.Dataset>
# Dimensions:    (LATITUDE: 61, LONGITUDE: 101, TIME: 300)
# Coordinates:
#   * LATITUDE   (LATITUDE) float64 15.25 15.75 16.25 16.75 ... 44.25 44.75 45.25
#   * LONGITUDE  (LONGITUDE) float64 -49.75 -49.25 -48.75 ... -0.75 -0.25 0.25
#   * TIME       (TIME) datetime64[ns] 2000-01-31T10:30:00 NaT NaT ... NaT NaT NaT
# Data variables:
#     eta        (TIME, LATITUDE, LONGITUDE) float64 -0.4064 -0.3854 ... nan nan
# Attributes:
#     _NCProperties:  version=2,netcdf=4.8.1,nczarr=2.0.0

However, all stored values for time steps > 1 are NaN, which I guess is because simply changing an unlimited dimension to a fixed size dimension is not enough (I'm not familiar with the netcdf C/Fortran APs, though). @martinclaus do you think it would be possible to avoid using unlimited dimensions in SWM without refactoring its internals too heavily?

martinclaus commented 2 years ago

I think switching to fixed sized dimension should be no problem. The number of output time steps is deterministic for all possible use-cases and the netCDF API does not distinguish between appending to an unlimited dimension or writing to a fixed-length dimension. I will take a look in the next days.

martinclaus commented 2 years ago

Ref https://git.geomar.de/swm/swm/-/issues/42

benbovy commented 2 years ago

After a few more tests, the issue with NaN values for all time steps expect the 1st one seems to be that calling nf90_put_var will not write any data in a zarr chunk that has been already initialized (not sure if that's intentional or not), so we need to ensure that each call will write in a separate chunk, e.g., by setting chunksize = 1 along the time dimension.

In addition to setting the nc_chunksizes parameter in model.namelist, I had to edit some code in the io_module to make it work (also for chunking the time coordinate).

I also edited the code to set fill values using nf90_def_var_fill instead of manually setting a missing_value attribute (to fix Xarray warnings when reading the output zarr dataset).

I can send you a patch file if that's useful (my edits are not super clean, though).

Here's a comparison of outputs in both formats:

SWM test netcdf4 Screenshot 2022-05-05 at 12 31 38

It looks pretty much the same (I don't know yet why the values at the domain boundary didn't get masked by Xarray with the zarr format, those values are all 1e36 while the defined fill value is 1e37).

martinclaus commented 2 years ago

A possible reason for the behaviour could be that the SWM is closing the dataset each time it writes to it. This is controlled by the DIAG_FLUSH macro defined in include/io.h. Worth a try if that avoids having to set the chunksize.

A patch file would be great.

benbovy commented 2 years ago

Yes indeed, DIAG_FLUSH switched off and all elements are correctly filled in a single chunk.

martinclaus commented 2 years ago

I have added an option for fixed sized time dimension output. The image at docker hub is already updated. Just add unlimited_tdim=.false. to the diag_nl namelists.