Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License
511 stars 263 forks source link

Get size of variable along unlimited dimension? #1898

Closed ZedThree closed 2 years ago

ZedThree commented 3 years ago

This is sort of related to #1893, but is it possible add a function to get the size of a variable along an unlimited dimension? It's currently possible to get the the size of the unlimited dimension itself, but it seems like it's not possible to get the current size of a particular variable along the dimension. This means that currently a user needs to keep track of this information themselves, and make it available wherever variables get written.

This is definitely possible for HDF5 files, using H5Sget_simple_extent_dims, and this is certainly used internally -- I think the size of an unlimited dimension is calculated from the maximum extent (in that dimension) of all the variables that use it. So for HDF5 at least, it could just be a matter of wrapping that functionality.

I don't know if it's possible for the other backing file types, which I guess would limit the usefulness of such a function.

DennisHeimbigner commented 3 years ago

In netcdf, each named unlimited dimension always has the same length across all variables. So if you find the current length of the unlimited dimension using e.g. nc_inq_dimlen, then you know the size for all variables. This is a place where netcdf and HDF5 differ and occurs because netcdf has named dimensions and HDF5 does not.

ZedThree commented 3 years ago

Sorry, to be clear I mean "the current extent of a variable in the file along in a given dimension", or perhaps "what is the index of the last record written for a variable".

So while the dimension has the same length, each variable that uses that dimension doesn't. This is a problem when writing different variables in different parts of the program, the user needs to keep track of the latest record for each variable themselves.

Take this rather silly program for instance:

#include <netcdf.h>

int main() {
  int ncid;
  nc_create("test_file.nc", NC_CLOBBER | NC_NETCDF4, &ncid);

  int time_id;
  nc_def_dim(ncid, "time", NC_UNLIMITED, &time_id);

  int dimids[1] = {time_id};

  int var_A_id;
  nc_def_var(ncid, "A", NC_INT, 1, dimids, &var_A_id);

  int var_B_id;
  nc_def_var(ncid, "B", NC_INT, 1, dimids, &var_B_id);

  nc_enddef(ncid);

  size_t count[1] = {1};

  {
    size_t B_start[1] = {0};
    int B = 1;
    nc_put_vara_int(ncid, var_B_id, B_start, count, &B);
  }

  for (int rec = 0; rec < 10; rec++) {
    int A = rec;
    size_t start[1] = {rec};
    nc_put_vara_int(ncid, var_A_id, start, count, &A);
  }

  size_t time_len;
  nc_inq_dimlen(ncid, time_id, &time_len);
  printf("Time dimension is %zu long\n", time_len);

  {
    size_t B_start[1] /* = ??? */;
    int B = 1;
    nc_put_vara_int(ncid, var_B_id, B_start, count, &B);
  }

  nc_close(ncid);

  return 0;
}

This is just to represent writing two variables, A and B, in unrelated parts of the program but using the same unlimited dimension. When it comes to the second call to write B and we want B_start[0] = 1 , but nc_inq_dimlen reports 10. The information about the current extent of B is not accessible, unless we keep track of it ourselves, or use H5Sget_simple_extent_dims directly. To be clear, the information is definitely in the HDF5 file, this is from the output of h5dump:

   DATASET "A" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 10 ) / ( H5S_UNLIMITED ) }
      DATA {
      (0): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
      }
      ATTRIBUTE "DIMENSION_LIST" {
         DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): (DATASET 331 "/time")
         }
      }
   }
   DATASET "B" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) }
      DATA {
      (0): 1
      }
      ATTRIBUTE "DIMENSION_LIST" {
         DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): (DATASET 331 "/time")
         }
      }

This is a place where netcdf and HDF5 differ and occurs because netcdf has named dimensions and HDF5 does not.

I'm not sure this is an important distinction here. HDF5 does have dimension scales, which are functionally very similar to netCDF dimensions, although netCDF does not implement its dimensions using HDF5 dimension scales.

edwardhartnett commented 3 years ago

As @DennisHeimbigner noted, an unlimited dimension in netCDF has one size, the size is the maximum of the sizes of all the variables sharing that unlimited dimension. Yes, the user must keep track of what records have been written to what variables. (In practice, most users will write all variable records at the same time, so they generally all have the same extent.)

One solution for what you are seeking would be to use multiple unlimited dimensions. In this case, netCDF can help keep track of what records have been written.

NetCDF does not internally keep track of what has been written to each var. Instead, when you need to know the length of the unlimited dimension, netCDF checks the current size of all vars with that dimension, and gives you the largest size as the length of the dim. If you try and read a variable that has not been written out to that extent, then netcdf gives you back an array filled with fill value. If you try and read beyond the length of the unlimited dim, you get an error.

ZedThree commented 3 years ago

As a bit more of a concrete use case, we have a user-extensible framework for solving PDEs. We write out the time evolving variables in the library, but users can also write out their own auxiliary variables. Our current implementation does write everything out all at once, so the size is always in lockstep, but the current implementation has several downsides, so we're rewriting it. But now we butt into this problem of needing to keep track of the size of individual variables.

Multiple unlimited dimensions is one solution, but this quickly becomes unwieldy, as well as being backwards incompatible with our post-processing tools.

The workaround I'm using at the minute is tracking this via attributes on the variables.

Instead, when you need to know the length of the unlimited dimension, netCDF checks the current size of all vars with that dimension, and gives you the largest size as the length of the dim.

Exactly, so could this be exposed to the user? This is currently implemented through HDF5 calls in hdf5internal.c. It is possible to get at this information by dropping down to call the HDF5 routines directly, but this requires closing the netcdf file, which has its own complications.

I don't know how this works for netCDF-3 files, so perhaps that makes this not possible, but it would be a useful feature for HDF5-backed files.

DennisHeimbigner commented 3 years ago

The workaround I'm using at the minute is tracking this via attributes on the variables.

This actually seems like the right solution to your problem. What problems does it not solve?

DennisHeimbigner commented 3 years ago

Making this visible at the API level is possible, though it technically breaks the netcdf specification. For netcdf-3, the different sizes cannot be detected, so it would have to report that all variables had the same unlimited size. If forced to do this, I would probably do it using a special attribute, "ActualLength\\<dimname>" say. Remember that we would need to handle the case of a variable with multiple unlimited dimensions and provide attributes for all such dimensions.

DennisHeimbigner commented 3 years ago

This is a problem when writing different variables in different parts of the program, the user needs to keep track of the latest record for each variable themselves. I hope this does not mean that you are writing the file from multiple threads.

edwardhartnett commented 3 years ago

I agree with @DennisHeimbigner that exposing this internal information is a violation of the netCDF shared dimension concept, which is based on a fortran I/O record model, in which vars are always written in lockstep, and the file always includes the space for the vars record, even if it it gets fill value - the space is still in the file.

With HDF5, the file is not in fortran I/O record format. Vars that are not written are not written in any way, and no space exists for the data in the file. It's just not there. So we can have vars without the record data.

But the data model says that the unlimited dim has one size. It can't have different sizes for different vars.

Therefore I would say that the user should keep track of what it has written. I think in most codes, the code is aware of the record number of the data it is writing, which is all it needs to know.

edwardhartnett commented 2 years ago

I believe this issue can be closed.