Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License
511 stars 263 forks source link

Incorrect fill value being used for enums (was: ncdump failure on enum file: dumplib.c; line 952) #982

Open edhartnett opened 6 years ago

edhartnett commented 6 years ago

I found this while writing some tests for parallel. But it also happens in non-parallel mode. This happens on current master.

I have added a test to nc_test4/tst_enum.c which creates a very simple file with one enum variable. Then one value is written.

Here's the test code:

#define ENUM_NAME "cargo"
#define ENUM_VAR_NAME "in_the_hold_of_the_Irish_Rover"
#define NUM_ENUM_FIELDS 8
#define DIMSIZE 4
#define NDIMS1 1
   {
      int ncid, dimid, v1id, typeid;
      int f;
      size_t start[NDIMS1] = {0}, count[NDIMS1] = {1};
      char field_name[NUM_ENUM_FIELDS][NC_MAX_NAME + 1] = {"bags of the best Sligo rags", "barrels of bones",
                                                           "bails of old nanny goats' tails", "barrels of stones",
                                                           "dogs", "hogs", "barrels of porter",
                                                           "sides of old blind horses hides"};
      unsigned long long field_value[NUM_ENUM_FIELDS] = {1000000, 2000000, 3000000, 4000000,
                                                         5000000, 6000000, 7000000, 8000000};
      unsigned long long data = 1000000, data_in;

      /* Create a netcdf-4 file. */
      /*nc_set_log_level(3);*/
      if (nc_create(FILE_NAME, NC_NETCDF4, &ncid)) ERR;

      /* Create a dimension. */
      if (nc_def_dim(ncid, DIM_NAME, DIMSIZE, &dimid)) ERR;

      /* Create an enum type. */
      if (nc_def_enum(ncid, NC_UINT64, ENUM_NAME, &typeid)) ERR;
      for (f = 0; f < NUM_ENUM_FIELDS; f++)
         if (nc_insert_enum(ncid, typeid, field_name[f], &field_value[f])) ERR;

      /* Create one var. */
      if (nc_def_var(ncid, ENUM_VAR_NAME, typeid, NDIMS1, &dimid, &v1id)) ERR;

      /* Write metadata to file. */
      if (nc_enddef(ncid)) ERR;

      /* Write phoney data. */
      if (nc_put_vara(ncid, v1id, start, count, &data)) ERR;

      if (nc_sync(ncid)) ERR;

      /* Read phoney data. */
      if (nc_get_vara(ncid, v1id, start, count, &data_in)) ERR;
      if (data_in != data) ERR;

      /* Close the netcdf file. */
      if (nc_close(ncid)) ERR;
   }

The test succeeds without problem. But when ncdumping the resulting file I get:

netcdf tst_enums {
types:
  uint64 enum cargo {bags\ of\ the\ best\ Sligo\ rags = 1000000, 
      barrels\ of\ bones = 2000000, 
      bails\ of\ old\ nanny\ goats\'\ tails = 3000000, 
      barrels\ of\ stones = 4000000, dogs = 5000000, hogs = 6000000, 
      barrels\ of\ porter = 7000000, 
      sides\ of\ old\ blind\ horses\ hides = 8000000} ;
dimensions:
    dim = 4 ;
variables:
    cargo in_the_hold_of_the_Irish_Rover(dim) ;
data:

NetCDF: Invalid argument
Location: file dumplib.c; line 952
 in_the_hold_of_the_Irish_Rover = bags of the best Sligo rags, 

(Could this be related to the spaces in the enum field names? Or could it be when it is trying to print the default fill value? Because I only write one data element to the array, leaving the rest as fill value. And it is in printing the fist fill value that ncdump fails.)

This code is in nc_test4/tst_enums.c in my currently open PR #979. I have also added a test to the ncdump directory by adding this netCDF file as ref_tst_irish_rover.nc and then added a line (currently commented out) to tst_netcdf4.sh:

echo "*** creating tst_output_irish_rover.cdl from ref_tst_irish_rover.nc..."
#${NCDUMP} ref_tst_irish_rover.nc > tst_output_irish_rover.cdl

This currently fails, since ncdump returns 1 when it encounters this problem.

edhartnett commented 6 years ago

See also #734, which may be related.

DennisHeimbigner commented 6 years ago

Or could it be when it is trying to print the default fill value? Because I only write one data element to the array, leaving the rest as fill value. This turns out to be the problem. There is a semantic flaw in fill value defaulting for enums. The fill value defaults to zero, but it should actually be defaulting to a legitimate enum constant value. Using a value that is out of the enum values is semantically incorrect. So, when ncdump encounters the zero fill value, it cannot convert it to an enum constant and fails as you show. Not sure what is the correct solution. I would probably say that if fill value is defaulted, then the first legitimate enum constant is chosen as the enum fill value. Alternatively, we could choose the laste enum constant.

edhartnett commented 6 years ago

Or we could return error if user does a write of enum that requires a fill value, and no fill value is provided.

DennisHeimbigner commented 6 years ago

That is also a possibility. It occurs to me that it is vary difficult to tell if a fill value is needed until the variable is sync'd to disk and the file is closed. Even then, it requires looking at every value in the variable to decide. This will be true no matter what we do because a user could use e.g. nc_put_var1 to store data at various points in the variable. The only thing I can think of is to make hdf5 track the fill value for us always. This means that all enum typed variables have fill set automatically.

edhartnett commented 6 years ago

OK, I have been pondering and I don't think using the first or last enum value as a fill value is a good idea. It would be seriously misleading.

Agree that it is impossible to know when a fill value is needed. If I define a file, and write no data, then ncdump will need to show fill values.

We could require a fill value for enums on enddef. That is, if the user defines a var of type enum, and does not define an _FillValue attribute, then nc_enddef() fails. Crude but definitive.

DennisHeimbigner commented 6 years ago

Another possibility is to define a special _UNDEFINED enum constant that is implicitly part of every enum type. Sort of the equivalent of NAN for floats. If you ask for the numeric value of the _UNDEFINED, then the first negative integer not used in the enum would be returned.

edhartnett commented 6 years ago

Another thing to consider is a HDF5 file that is not a netCDF-4 file which will be using 0 for fill value, which is the HDF5 default (and a bummer of a default, since 0 is usually a perfectly valid value for many science quantities.)

DennisHeimbigner commented 6 years ago

Good point. That is another argument for an _UNDEFINED enum constant.

edhartnett commented 6 years ago

If we use zero, we will maintain compatibility with all existing cases, and HDF5.

I am not crazy about 0 as a fill value, but I think it's less confusing to just use it like HDF5 in this case, and document it well.

Instead of calling it _UNDEFINED why not _FillValue?

DennisHeimbigner commented 6 years ago

I send a request to hdfgroup to see if they have any insight. The question I would have is: is it possible to create an hdf5 file in which the enum variable has a non-zero value that is not a one of the enum constant values. For example, is it possible to do H5Pset_fill_value using a value that is not one of the enum constants. Calling the illegal value _FillValue would be ok as long as there could never be a value in the variable other than zero (the default fill value) or a value set using H5Pset_fill_value.

DennisHeimbigner commented 6 years ago

I did a quick experiment and it appears to be the case that it is possible to set any arbitrary value as the fill value for an enum typed variable. But this would be ok as long as there is no other way to insert an illegal value into an enum typed variable.

DennisHeimbigner commented 6 years ago

One more thought. For the netcdf API, we can enforce that setting a fill value requires setting it to a legal enum constant value.

edhartnett commented 6 years ago

Here's what I think we should do:

1 - For netCDF API enforce that fill value for enums is a valid value for that enum. 2 - ncdump should be changed to understand that 0 may appear in any enum, even if it is not a valid value for that enum. Sadly this is not ideal if 0 is a valid value, and is also being used by HDF5 as a fill value. But the user must set _FillValue to avoid this. 3 - Documentation of nc_def_enum() to be updated to ensure all this is explained. (Also can mention that VLENs are not supported for parallel I/O writes, as I have just learned.)

@DennisHeimbigner does this make sense? If so, I will undertake 1 and 3 if you will take care of 2.

DennisHeimbigner commented 6 years ago

It is not just ncdump. We need to decide at the API level how to handle zero by, for example, using _FillValue as the enum constant name for it.

edhartnett commented 6 years ago

Well the API takes the easy way out. It does not check for validity of enum values at all. When you do an nc_get_vara() for an enum, and provide an address of an integer array, it gives you integers. It does not check them or guarantee that they are valid. ;-)

So the API already does what we want, I think which is to give zeros where there are zeros, even if that is not a valid value for that enum.

Unfortunately using _FillValue as an enum field name will not work. The problem is that enums are defined as a type, and then used in vars. The vars may or may not have fill values. So nothing can be done at the type level. We cannot add a _FillValue to the enum, because that changes the enum. We would break compatibility with HDF5 native use of enums.

Also we cannot know when the type is defined if the user will later use _FillValue attributes. In which case, adding an extra field called _FillValue will not be helpful.

DennisHeimbigner commented 6 years ago

The problem at the API level occurs when a program reads a set of enum values from a variable and then uses e.g. nc_inq_enum_ident() to get the corresponding name. What do we do when zero is passed in?

edhartnett commented 6 years ago

Ha! Good one. I didn't think of that. OK, we could pass back "_FillValue" or "_UNDEFINED". Which do you prefer?

DennisHeimbigner commented 6 years ago

The hdfgroup has asked for a program showing the problem, so I sent them a simple netcdf program. Let us see what they say.

DennisHeimbigner commented 6 years ago

So I heard from the hdf group and they basically denied there is a problem. So, I guess the only thing remaining for us to do is choose a name for values that have no corresponding name.

edhartnett commented 6 years ago

It's a feature not a bug! ;-)

I will make the modifcations. I will use "_FillValue" as the name, since in all cases, that should be the only reason that one of the enum values is undefined. If you want me to use "UNDEFINED" that's fine too. Whatevs.

I will do this after some of the outstanding PRs get merged. I think I have a bit too many open for @WardF right now. ;-)

WardF commented 6 years ago

From a conversation this morning I think we are leaning towards UNDEFINED :). Thanks, getting to the PR's as quick as I can :).

MTG-Formats commented 2 years ago

My team at EUMETSAT have started to encounter this issue with some of the products we're working on - the default _FillVlaue having no equivalent enum value assigned causing ncdump to crash. This ticket is still open but hasn't seen much action for a few years now. Are there any plans to implement the suggested fix ? Cheers!

DennisHeimbigner commented 2 years ago

This slipped thru the cracks, sorry.

DennisHeimbigner commented 2 years ago

I have a fix PR. Using a variant of Ed's test case (where the variable data is all zero except the first element). ncdump now produces the output below. Is this acceptable solution?

netcdf t {
types:
  uint64 enum cargo {bags\ of\ the\ best\ Sligo\ rags = 1000000, 
      barrels\ of\ bones = 2000000, 
      bails\ of\ old\ nanny\ goats\'\ tails = 3000000, 
      barrels\ of\ stones = 4000000, dogs = 5000000, hogs = 6000000, 
      barrels\ of\ porter = 7000000, 
      sides\ of\ old\ blind\ horses\ hides = 8000000} ;
dimensions:
    dim = 4 ;
variables:
    cargo in_the_hold_of_the_Irish_Rover(dim) ;
data:

 in_the_hold_of_the_Irish_Rover = bags of the best Sligo rags, _UNDEFINED, 
    _UNDEFINED, _UNDEFINED ;
}
DennisHeimbigner commented 2 years ago

Fixed by PR https://github.com/Unidata/netcdf-c/pull/2462

bzah commented 1 year ago

Hi, It seems we are still able to create variables that have a fill_value outside the range of the enum possible values.

I hope it's alright to give an example using the python netCDF4 lib. If needed, I can probably create a C example too:

"""[MRE] invalid fillvalue for variable typed by an enum."""

import netCDF4 as nc

print(nc.__version__)
# --> '1.6.2'             

ds = nc.Dataset("./example.nc", "w")
my_enum = ds.createEnumType("u1","my_enum",{"a":0, "b":1})
ds.createDimension("time", 10)
my_var = ds.createVariable("my_var", my_enum, "time") 
# --> no fill_value defined
my_var[:].data 
# -- > is full of 255
# which is the default fillvalue for unsigne byte (u1),
# but it's not a valid enum value

From my understanding of the thread above, I should not be able to create a variable typed by an enum that has a no fill value or a fill value not part of the enum possible values. Is my understanding correct ?

edit 11/10/2023: This makes ncdump example.nc crashes with a NetCDF: Invalid argument error.

DennisHeimbigner commented 1 year ago

This issue came up a couple of years ago. I pointed out to the HDF5 group that this was possible. They decided that it not of concern to them. So there is only a limited mitigation available to us. I can perhaps solve the problem for our Zarr implementation, but probably not for netcdf-4/HDF5.

bzah commented 1 year ago

Ok, it would still be nice to have a warning about this behavior in the documentation. A warning like : "An enum value matching fill_value is recommended" or "fill_value should be a valid enum value".

Also, I think enums are not yet part of zarr specification.

DennisHeimbigner commented 1 year ago

I can certainly add a warning. As for Zarr, I was working on an enum extension for NCZarr.