Parallel-NetCDF / PnetCDF

Source code repository of PnetCDF library and utilities
https://parallel-netcdf.github.io
Other
83 stars 23 forks source link

Difference between explicit vs.default tolerance with cdfdiff #78

Closed cponder closed 2 years ago

cponder commented 2 years ago

I expect this command to show any differences that exist between the two files, since the absolute tolerance is 0 (for no difference) and the ratio tolerance is 1 (for no difference):

cdfdiff -t 0,1 2022-02-12.*/Tests/*/history.2010-10-23_00.06.00.nc

It gives this output

DIFF (tolerance): variable "qc" of type "NC_FLOAT" max at element [0, 31595, 11] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qi" of type "NC_FLOAT" max at element [0, 2551, 15] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qs" of type "NC_FLOAT" max at element [0, 250, 14] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qg" of type "NC_FLOAT" max at element [0, 250, 14] of value 0.000000 vs 0.000000
Number of differences in variables: 4

I assume this means that the values are different but you are printing too narrow an output field for me to see this, right? Now when I use this form instead, I'm not assigning any explicit tolerances

cdfdiff  2022-02-12.*/Tests/*/history.2010-10-23_00.06.00.nc

and I get a different list:

DIFF: variable "qc" of type "NC_FLOAT" at element [0, 12870, 37]
DIFF: variable "qi" of type "NC_FLOAT" at element [0, 54, 14]
DIFF: variable "qs" of type "NC_FLOAT" at element [0, 54, 14]
DIFF: variable "qg" of type "NC_FLOAT" at element [0, 250, 14]
DIFF: variable "pressure" of type "NC_FLOAT" at element [0, 250, 16]
DIFF: variable "theta" of type "NC_FLOAT" at element [0, 250, 16]
DIFF: variable "relhum" of type "NC_FLOAT" at element [0, 54, 14]
DIFF: variable "ertel_pv" of type "NC_FLOAT" at element [0, 46, 18]
DIFF: variable "u_pv" of type "NC_FLOAT" at element [0, 9907]
DIFF: variable "v_pv" of type "NC_FLOAT" at element [0, 9907]
DIFF: variable "theta_pv" of type "NC_FLOAT" at element [0, 10477]
DIFF: variable "vort_pv" of type "NC_FLOAT" at element [0, 9907]
DIFF: variable "depv_dt_mix" of type "NC_FLOAT" at element [0, 949, 15]
DIFF: variable "depv_dt_diab" of type "NC_FLOAT" at element [0, 4598, 42]
DIFF: variable "depv_dt_fric" of type "NC_FLOAT" at element [0, 46, 16]
DIFF: variable "depv_dt_diab_pv" of type "NC_FLOAT" at element [0, 9907]
DIFF: variable "depv_dt_fric_pv" of type "NC_FLOAT" at element [0, 9907]
DIFF: variable "precipw" of type "NC_FLOAT" at element [0, 442]
Number of differences in variables: 19

Since it's not showing me the values, I can't tell what the difference is and of what scale, I would expect the default tolerance to be the same as -t 0,1 but evidently it's not.

cponder commented 2 years ago

This brings in the questions, could you please (a) use an output-field wide enough to show the full values (exponential notation would be ok), and (b) print the actual differing values in the default case, so I can get a sense of the scale of the difference?

cponder commented 2 years ago

I'm using PNetCDF 1.12.2.

wkliao commented 2 years ago

Please make the two files available for me to test. Could you also try "ncmpidiff" to see if the same happens?

cponder commented 2 years ago

Running this form

mpirun -np 2 ncmpidiff -t 0,1 2022-02-12.*/Tests/*/history.2010-10-23_00.06.00.nc

gives the short list

DIFF (tolerance): variable "qc" of type "NC_FLOAT" max at element [0, 31595, 11] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qc" of type "NC_FLOAT" max at element [0, 125999, 18] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qi" of type "NC_FLOAT" max at element [0, 2551, 15] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qi" of type "NC_FLOAT" max at element [0, 84133, 15] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qs" of type "NC_FLOAT" max at element [0, 250, 14] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qs" of type "NC_FLOAT" max at element [0, 83234, 15] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qg" of type "NC_FLOAT" max at element [0, 250, 14] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qg" of type "NC_FLOAT" max at element [0, 83234, 15] of value 0.000000 vs 0.000000
Number of differences in variables 8

whereas the default form

mpirun -np 2 ncmpidiff 2022-02-12.*/Tests/*/history.2010-10-23_00.06.00.nc

gives this longer list

DIFF: global attribute "file_id" of type "NC_CHAR" differs first at element 0
DIFF: variable "qv" of type "NC_FLOAT" first at element [0, 82389, 12]
DIFF: variable "qv" of type "NC_FLOAT" first at element [0, 54, 14]
DIFF: variable "qc" of type "NC_FLOAT" first at element [0, 93316, 16]
DIFF: variable "qc" of type "NC_FLOAT" first at element [0, 31595, 11]
DIFF: variable "qi" of type "NC_FLOAT" first at element [0, 82389, 11]
DIFF: variable "qi" of type "NC_FLOAT" first at element [0, 54, 14]
DIFF: variable "qs" of type "NC_FLOAT" first at element [0, 82389, 12]
DIFF: variable "qs" of type "NC_FLOAT" first at element [0, 54, 14]
DIFF: variable "qg" of type "NC_FLOAT" first at element [0, 83234, 15]
DIFF: variable "qg" of type "NC_FLOAT" first at element [0, 250, 14]
DIFF: variable "pressure" of type "NC_FLOAT" first at element [0, 82395, 16]
DIFF: variable "pressure" of type "NC_FLOAT" first at element [0, 250, 16]
DIFF: variable "theta" of type "NC_FLOAT" first at element [0, 82395, 16]
DIFF: variable "theta" of type "NC_FLOAT" first at element [0, 250, 16]
DIFF: variable "relhum" of type "NC_FLOAT" first at element [0, 82389, 12]
DIFF: variable "relhum" of type "NC_FLOAT" first at element [0, 54, 14]
DIFF: variable "ertel_pv" of type "NC_FLOAT" first at element [0, 82394, 15]
DIFF: variable "ertel_pv" of type "NC_FLOAT" first at element [0, 46, 18]
DIFF: variable "u_pv" of type "NC_FLOAT" first at element [0, 100393]
DIFF: variable "u_pv" of type "NC_FLOAT" first at element [0, 9907]
DIFF: variable "v_pv" of type "NC_FLOAT" first at element [0, 100393]
DIFF: variable "v_pv" of type "NC_FLOAT" first at element [0, 9907]
DIFF: variable "theta_pv" of type "NC_FLOAT" first at element [0, 100392]
DIFF: variable "theta_pv" of type "NC_FLOAT" first at element [0, 10477]
DIFF: variable "vort_pv" of type "NC_FLOAT" first at element [0, 9907]
DIFF: variable "vort_pv" of type "NC_FLOAT" first at element [0, 103108]
DIFF: variable "depv_dt_mix" of type "NC_FLOAT" first at element [0, 83234, 16]
DIFF: variable "depv_dt_mix" of type "NC_FLOAT" first at element [0, 949, 15]
DIFF: variable "depv_dt_diab" of type "NC_FLOAT" first at element [0, 23323, 15]
DIFF: variable "depv_dt_diab" of type "NC_FLOAT" first at element [0, 110333, 13]
DIFF: variable "depv_dt_fric" of type "NC_FLOAT" first at element [0, 82394, 15]
DIFF: variable "depv_dt_fric" of type "NC_FLOAT" first at element [0, 46, 16]
DIFF: variable "depv_dt_diab_pv" of type "NC_FLOAT" first at element [0, 9907]
DIFF: variable "depv_dt_diab_pv" of type "NC_FLOAT" first at element [0, 100393]
DIFF: variable "depv_dt_fric_pv" of type "NC_FLOAT" first at element [0, 9907]
DIFF: variable "depv_dt_fric_pv" of type "NC_FLOAT" first at element [0, 100392]
DIFF: variable "precipw" of type "NC_FLOAT" first at element [0, 442]
DIFF: variable "precipw" of type "NC_FLOAT" first at element [0, 82389]
Number of differences in variables 38
cponder commented 2 years ago

So far this web-page won't let me upload the (zipped) input files, it says

is not included in the list

This happens under both Firefox & Chrome. Do you have some other upload-server that I can use?

cponder commented 2 years ago

Looking at this thread here https://github.com/github/hub/issues/1479 it looks like the problem is that my files are too large. I don't have smaller cases from the kinds of simulations I'm running.

wkliao commented 2 years ago

Please let me know if the following patch works for you.

  diff --git a/src/utils/ncmpidiff/cdfdiff.c b/src/utils/ncmpidiff/cdfdiff.c
  index f8273071..65b994ea 100644
  --- a/src/utils/ncmpidiff/cdfdiff.c
  +++ b/src/utils/ncmpidiff/cdfdiff.c
  @@ -65,6 +65,17 @@
       chunk_off += nelems;                                                 \
   }

  +#define PRINTF_VAR_DIFF(itype) {                                         \
  +    int esize = sizeof(itype);                                           \
  +    itype *b1 = (itype*)buf[0];                                          \
  +    itype *b2 = (itype*)buf[1];                                          \
  +    if (esize > 1 && is_little_endian) {                                 \
  +        swapn(b1, 1, esize);                                             \
  +        swapn(b2, 1, esize);                                             \
  +    }                                                                    \
  +    printf("] of value %g vs %g\n", (double)b1[0], (double)b2[0]);       \
  +}
  +
   #define SWAP2B(a) ( (((a) & 0xff) << 8) | \
                       (((a) >> 8) & 0xff) )

  @@ -977,9 +988,19 @@ cmp_vars:
                                      var_name, get_type(xtype[0]), diffStart[0]);
                               for (_i=1; _i<var[0]->ndims; _i++)
                                   printf(", %lld", diffStart[_i]);
  -                            printf("]\n");
                               free(diffStart);
                           }
  +                             if (xtype[0] == NC_CHAR)   PRINTF_VAR_DIFF(char)
  +                        else if (xtype[0] == NC_BYTE)   PRINTF_VAR_DIFF(signed char)
  +                        else if (xtype[0] == NC_UBYTE)  PRINTF_VAR_DIFF(unsigned char)
  +                        else if (xtype[0] == NC_SHORT)  PRINTF_VAR_DIFF(short)
  +                        else if (xtype[0] == NC_USHORT) PRINTF_VAR_DIFF(unsigned short)
  +                        else if (xtype[0] == NC_INT)    PRINTF_VAR_DIFF(int)
  +                        else if (xtype[0] == NC_UINT)   PRINTF_VAR_DIFF(unsigned int)
  +                        else if (xtype[0] == NC_FLOAT)  PRINTF_VAR_DIFF(float)
  +                        else if (xtype[0] == NC_DOUBLE) PRINTF_VAR_DIFF(double)
  +                        else if (xtype[0] == NC_INT64)  PRINTF_VAR_DIFF(long long)
  +                        else if (xtype[0] == NC_UINT64) PRINTF_VAR_DIFF(unsigned long long)
                       }
                       worst = 0;
                       numVarDIFF++;
  @@ -1000,7 +1021,7 @@ cmp_vars:

           /* diff is found when check_tolerance = 1 */
           if (ndims[0] == 0) { /* scalar variable */
  -            printf("DIFF (tolerance): scalar variable \"%s\" of type \"%s\" of value %f vs %f\n",
  +            printf("DIFF (tolerance): scalar variable \"%s\" of type \"%s\" of value %g vs %g\n",
                      var_name, get_type(xtype[0]), worst1, worst2);
           } else {
               int _i;
  @@ -1024,7 +1045,7 @@ cmp_vars:
               if (worst == -1)
                   printf("]\n");
               else
  -                printf("] of value %f vs %f\n", worst1, worst2);
  +                printf("] of value %g vs %g\n", worst1, worst2);
               free(diffStart);
           }
           numVarDIFF++;
cponder commented 2 years ago

With the modifications, I still see a longer list from the default than I do from -t 0,1. Here's a difference in the output

< DIFF (tolerance): variable "qi" of type "NC_FLOAT" max at element [0, 0, 33] of value 2.66454e-15 vs 0
> DIFF: variable "qi" of type "NC_FLOAT" at element [0, 0, 33] of value 0 vs 0

with the parameterized case first and the default case second. I believe I applied the patch correctly, since (a) the patch command didn't give any error-messages when I ran it, and (b) the default case actually prints the compared values, which it didn't before.

wkliao commented 2 years ago

The "relative tolerance ratio" is defined in your PR #53

    error_ratio = -1.0 +  b1[pos] / b2[pos];

So, in the case of b1[pos] is 2.66454e-15 and b2[pos] is a very small number that fprintf format "%g" prints it into 0, the error_ratio can actually be larger than 1.

cponder commented 2 years ago

Ok -- sorry -- I hadn't looked at this in a long time. So likely the lists were matching the whole time. What it was printing out before, though

< DIFF (tolerance): variable "qi" of type "NC_FLOAT" max at element [0, 0, 33] of value 2.66454e-15 vs 0
> DIFF: variable "qi" of type "NC_FLOAT" at element [0, 0, 33] of value 0 vs 0

it looks like the second line was rounding the output to 0 instead of using exponential notation, right?

wkliao commented 2 years ago

That is strange. The second line of outputs should show 2.66454e-15 vs 0. Can you try these two files?

cponder commented 2 years ago

With these two files, I see the output line up:

--> cdfdiff         file1.nc file2.nc
DIFF: variable "var" of type "NC_FLOAT" at element [0] of value 2.66454e-15 vs 0

-->  cdfdiff -t 0,0 file1.nc file2.nc
DIFF (tolerance): variable "var" of type "NC_FLOAT" max at element [0] of value 2.66454e-15 vs 0

My files still show some differences:

--> cdfdiff -t 0,0 2022-02-12.*/Tests/*.60km.CPU.04procs.15min_rad/history.2010-10-23_00.30.00.nc  | fgrep qi
DIFF (tolerance): variable "qi" of type "NC_FLOAT" max at element [0, 0, 33] of value 2.66454e-15 vs 0

--> cdfdiff        2022-02-12.*/Tests/*.60km.CPU.04procs.15min_rad/history.2010-10-23_00.30.00.nc  | fgrep qi
DIFF: variable "qi" of type "NC_FLOAT" at element [0, 0, 33] of value 0 vs 0

One question -- I'm not sure if these had been written using PNetCDF or plain NetCDF, Is the tool supposed to work in both cases?

wkliao commented 2 years ago

Could you please make your two files somewhere available for me to download, so I can use them to debug?

cdfdiff should work for all NetCDF files (CDF-1, 2,5) generated by any software.

cponder commented 2 years ago

See if you can access this:

sftp carl-ftp@corpftp.nvidia.com
RvdG3Wbx
get *.nc

and tell me when you've finished, ok?

wkliao commented 2 years ago

Thanks. I have copied the files. You can safely remove them.

wkliao commented 2 years ago

One question. In your PR #53, I am seeing the difference tolerance ratio, 'error_ratio' is initialized to the command-line option '-t', but is updated to a new value whenever a difference fails the tolerance. See the line below.

129            error_ratio = -1.0 +  b1[pos] / b2[pos];

Similarly for the absolute difference, 'error_difference'.

133            error_difference =  b1[pos] - b2[pos];

My question is shouldn't the two be kept unchanged?

cponder commented 2 years ago

What I'm doing here is, when I find an entry that is flagged by the tolerance, I raise the tolerance requirement to it won't flag any entry "lower" than the one that was just flagged. This doesn't look like a good idea now, because (a) you only list the first failing element in each array, as opposed to the maximally different element in the array, right? And (b) when I adjust the tolerance like this, then it will affect whether other arrays get flagged or not, right?

cponder commented 2 years ago

Also, would you rather that the "ratio" tolerance work differently? I've been thinking about other ways to express it. This form

min(a,b)/max(a,b)

would always be a number between [0,1], and the case where a & b are both 0 we would assign to be 1. And we would set 1 as the default ratio, so a ratio <1 indicates that the values are unequal.

The other corner-case is where the signs don't match, I suppose the relative tolerance should always fail in this case, and you'd use the absolute tolerance to decide whether the difference is enough to care about.

wkliao commented 2 years ago

What I'm doing here is, when I find an entry that is flagged by the tolerance, I raise the tolerance requirement to it won't flag any entry "lower" than the one that was just flagged. This doesn't look like a good idea now, because (a) you only list the first failing element in each array, as opposed to the maximally different element in the array, right? And (b) when I adjust the tolerance like this, then it will affect whether other arrays get flagged or not, right?

Both are correct.

wkliao commented 2 years ago

Also, would you rather that the "ratio" tolerance work differently? I've been thinking about other ways to express it. This form

min(a,b)/max(a,b)

would always be a number between [0,1], and the case where a & b are both 0 we would assign to be 1. And we would set 1 as the default ratio, so a ratio <1 indicates that the values are unequal.

The other corner-case is where the signs don't match, I suppose the relative tolerance should always fail in this case, and you'd use the absolute tolerance to decide whether the difference is enough to care about.

I suggest to use the absolute values to calculate ratio, i.e.

min(|a|,|b|)/max(|a|,|b|)
cponder commented 2 years ago

This would give a false match of (1,-1). If we're going to use a raw ratio like this, I'd prefer that, if the signs don't match, then give it a ratio of 0. But, yeah, in the case where both are negative, we'd want the absolute values there.

wkliao commented 2 years ago

I think the name "ratio" should abide by its common known definition, i.e.

a/b - 1

which means change ratio of a to b is different from b to a. I wish there is a way to allow users to pass in a customized function for calculating ratio.

wkliao commented 2 years ago

Wikipedia page mentions this one below for relative difference. I like it better.

|x - y|/max(|x|, |y|)
cponder commented 2 years ago

That looks like a good formula. It deals with the case where the signs are different, since we just re-scale the distance between the two. The only corner-case is (0,0) that I think we ought to map to 0 since the distance is 0 before re-scaling.

wkliao commented 2 years ago

OK. I pushed the changes using this formula to the master branch. Please give it a try and let me know if it works with you. (You can also wget the file https://github.com/Parallel-NetCDF/PnetCDF/blob/master/src/utils/ncmpidiff/cdfdiff.c)

cponder commented 2 years ago

I see an exact match between the output with & without the -t 0,0, which is what we want. I'm still getting used to the relative-difference metric, but it all looks great! Can you please spin a new release soon?

wkliao commented 2 years ago

I considered this issued resolved and will make a new release. Thanks for a good discussion.

cponder commented 2 years ago

I see you tagged the 1.12.3 here yesterday, But the website https://parallel-netcdf.github.io/wiki/Download.html says that 1.12.3 was posted in January. This is the one with the fix, right?

wkliao commented 2 years ago

I have fixed the date on the website. It should be February 21, 2022. Thanks.