nc_def_var_extra() error checking not appropriate for how it is called...

edhartnett commented 6 years ago

nc_def_var_extra() is called by nc_def_var_deflate(), nc_def_var_chunking(), etc.

It checks errors incorrectly. It assumes that all parameters are present, when only a subset ever are. So the checks don't work.

For example, this code should not succeed, but it does:

         /* Create a netcdf-4 file. */
         if (nc_create(FILE_NAME, mode, &ncid)) ERR;
         if (nc_def_dim(ncid, DIM8_NAME, TEST_VAL_42, &dimids[0])) ERR;
         if (nc_def_var(ncid, VAR_NAME8, NC_INT, NDIMS6, dimids, &varid)) ERR;

         /* Set the var to contiguous. */
         if (nc_def_var_chunking(ncid, varid, NC_CONTIGUOUS, NULL)) ERR;

         /* Now defalte can't be set. */
         if (nc_def_var_deflate(ncid, varid, 0, 1, 4)) ERR;

         if (nc_close(ncid)) ERR;

nc_def_var_extra() checks for this condition like this:

   /* Can't turn on contiguous and deflate/fletcher32/szip. */
   if (contiguous)
      if ((*contiguous != NC_CHUNKED && deflate) ||
          (*contiguous != NC_CHUNKED && fletcher32))
         return NC_EINVAL;

This check does not work because first contiguous is set, then deflate in a separate call. So in the call from nc_def_var_deflate(), the contiguous pointer is NULL so the check is not run.

(What ends up happening with the code above is that def_var_deflate() silently changes the contiguous setting to chunked when it turns on deflation. Instead it should return an NC_EINVAL error.)

I will clean this up.

DennisHeimbigner commented 6 years ago

It seems to me that the code you reference: if (contiguous) if ((contiguous != NC_CHUNKED && deflate) || (contiguous != NC_CHUNKED && fletcher32)) return NC_EINVAL; must check also that chunking was set by looking at var->chunking and/or var->contiguous.

edhartnett commented 6 years ago

Yes it must check the var settings not the parameters. There is no way to pass the contiguous and fletcher32 parameters together for example, one is in nc_def_var_chunking() and the other is in nc_def_var_fletcher().

edhartnett commented 6 years ago

OK, now I see that it's not possible for nc_def_var_extra() to check this unless we are will to impose an order on the calling of def_var_chunking and def_var_deflate, which we are not.

Consider this (working) test code from ncdump tst_special_atts.c:

   /* Specify chunking for var2, note only uses CHUNK1 and CHUNK2.
      Also, specify using a checksum for this variable. */
   if (nc_def_var_chunking(ncid, var2id, NC_CHUNKED, chunksizes)) ERR;
   if (nc_def_var_fletcher32(ncid, var2id, NC_FLETCHER32)) ERR;
   if (nc_def_var_endian(ncid, var2id, NC_ENDIAN_BIG)) ERR;

   /* Use compression but no shuffle for var3, also set to big endian */
   if (nc_def_var_deflate(ncid, var3id, NC_NOSHUFFLE, COMPRESS, DEFLATE_LEVEL)) ERR;
   if (nc_def_var_chunking(ncid, var3id, NC_CHUNKED, chunksizes)) ERR;
   if (nc_def_var_endian(ncid, var3id, NC_ENDIAN_LITTLE)) ERR;

In the first case, chunking is turned on before fletcher32.

In the second case, chunking is set after deflate is set. (This causes deflate to turn on chunking silently.)

If I now check that chunking is set before deflate, the second set of code does not work. It demands that I call nc_def_var_chunking() before calling nc_def_var_deflate() or nc_def_var_fletcher32(). The reason is that vars are contiguous by default.

OK, so this can't be checked. I will add some notes to the documentation and we will call it a feature.

edhartnett commented 6 years ago

As I have documented and tested this, I see that it was my original intention for this code, that NC_CHUNKED would be automatically turned on in certain cases.

But as I look at the code, I don't know of proper default chunksizes are computed if fletcher32 or the shuffle filter are turned on. (They are computed properly when deflate is turned on.) I will add some more tests for this...

DennisHeimbigner commented 6 years ago

I think we have to undo one of your previous assumptions. Since filter setting and such require chunking, I think that they must fail if, when e.g. nc_def_var_deflate is called, chunking has not been set. This means that we do indeed force an order on nc_def_var_XXX functions. But not sure that is any worse than requiring the dimensions be set for a variable before chunking is set.

edhartnett commented 6 years ago

If I understand you correctly, you are proposing that we return an error (NC_EINVAL) if I set deflate with nc_def_var_deflate(), without first setting chunking by calling nc_def_var_chunking(). Is that correct? And this would also apply to nc_def_var_fletcher32() and nc_def_var_filter(). (But it does not apply to nc_def_var_endian().)

I am certainly willing, and this would be very easy to implement in code.

But is it wise? I get the attraction - it makes logical sense. But for the last 10 years, this has not been necessary, and if we make it necessary now, then we may break some user code. And really, it is also not a big problem to leave it the way it is, document it, and just move on with life. ;-) It does not cause a lot of contortions in the code to just assume that if you want deflate, you also want chunked data.

IMO chunked vs. contiguous is not really a user feature like compression is. The user cares about compression. because he or she can see that. But chunked vs. contiguous seems like more of an internal HDF5 implementation detail. If I want compression, I want compression, and do whatever is necessary to give it to me. Similarly for checksums. This is the reasoning for the current functioning.

(In some ways it would be easier to switch the default to chunked and not contiguous, but we really can't change that now.)

I suggest we accept the current behavior, document it, and make sure it works reasonably well.

However, if you and @WardF feel differently, I certainly don't feel very strongly about it. So think it over, discuss, and let me know. I will make it do whatever the group consensus is.

DennisHeimbigner commented 6 years ago

All good points. But remember that chunking also has significant performance consequences and the "correct" chunk spec can only be determined by the dataset creator. The net effect of leaving things like they are (or were) is that the required checks must occur at nc_enddef. So the error message is delayed.

WardF commented 6 years ago

After our experience with changing expected behaviors with 4.5.0 (and subsequent reversions) I’m loathe to make changes that might impact people’s workflows. It would be great I’d deflate took, as an argument, a token or some such generated when chunking was specified. That’s just a means to an end I suppose. Since they would be getting errors without setting chunking already I am torn on this. But I guess that’s not the case, chunking with default values is turned on.

For now I think we could leave this issue open to revisit in the future and come back to it after 4.6? In the meantime perhaps we can continue to use the extant workflow that works for people. That way we won’t have to rush 4.7 for the sake of fixing workflows.

edhartnett commented 6 years ago

OK, for now I will thoroughly document the existing behavior, and make sure it is working correctly.

After 4.6.0 is released, if someone has an idea on how to move this forward usefully, I will let them suggest it. I believe we are stuck with the current behavior, but am willing to do anything that you two guys can agree on.

edhartnett commented 6 years ago

BTW @DennisHeimbigner not sure what you mean by:

the required checks must occur at nc_enddef

When deflate is turned on, chunking is turned on, and default chunk sizes are determined. This all happens in nc_def_var_extra(), not in nc_enddef().

Once concern I have is that while fletcher32 and shuffle will also change the data to chunked, I don't see, through code inspection, where they call the function that sets the default chunk sizes. So what the heck chunksizes do they use? I will look into this and see...

edhartnett commented 6 years ago

OK, what happens is that nc4_find_default_chunksizes2() is called in NC4_def_var(), even though contiguous is the default storage. As a result, when nc_def_var_fletcher32() or nc_def_var_deflate() with shuffle is called, and the storage is switched to chunked, then there are already sensible chunksizes available.

I will add some test code to confirm this, then (finally!) close this ticket.

Unidata / netcdf-c

nc_def_var_extra() error checking not appropriate for how it is called... #710