This brings up two questions: could you please (a) use an output field wide enough to show the full values (exponential notation would be OK), and (b) print the actual differing values in the default case, so I can get a sense of the scale of the difference?
I'm using PNetCDF 1.12.2.
Please make the two files available for me to test. Could you also try "ncmpidiff" to see if the same happens?
Running this form
mpirun -np 2 ncmpidiff -t 0,1 2022-02-12.*/Tests/*/history.2010-10-23_00.06.00.nc
gives the short list
DIFF (tolerance): variable "qc" of type "NC_FLOAT" max at element [0, 31595, 11] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qc" of type "NC_FLOAT" max at element [0, 125999, 18] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qi" of type "NC_FLOAT" max at element [0, 2551, 15] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qi" of type "NC_FLOAT" max at element [0, 84133, 15] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qs" of type "NC_FLOAT" max at element [0, 250, 14] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qs" of type "NC_FLOAT" max at element [0, 83234, 15] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qg" of type "NC_FLOAT" max at element [0, 250, 14] of value 0.000000 vs 0.000000
DIFF (tolerance): variable "qg" of type "NC_FLOAT" max at element [0, 83234, 15] of value 0.000000 vs 0.000000
Number of differences in variables 8
whereas the default form
mpirun -np 2 ncmpidiff 2022-02-12.*/Tests/*/history.2010-10-23_00.06.00.nc
gives this longer list
DIFF: global attribute "file_id" of type "NC_CHAR" differs first at element 0
DIFF: variable "qv" of type "NC_FLOAT" first at element [0, 82389, 12]
DIFF: variable "qv" of type "NC_FLOAT" first at element [0, 54, 14]
DIFF: variable "qc" of type "NC_FLOAT" first at element [0, 93316, 16]
DIFF: variable "qc" of type "NC_FLOAT" first at element [0, 31595, 11]
DIFF: variable "qi" of type "NC_FLOAT" first at element [0, 82389, 11]
DIFF: variable "qi" of type "NC_FLOAT" first at element [0, 54, 14]
DIFF: variable "qs" of type "NC_FLOAT" first at element [0, 82389, 12]
DIFF: variable "qs" of type "NC_FLOAT" first at element [0, 54, 14]
DIFF: variable "qg" of type "NC_FLOAT" first at element [0, 83234, 15]
DIFF: variable "qg" of type "NC_FLOAT" first at element [0, 250, 14]
DIFF: variable "pressure" of type "NC_FLOAT" first at element [0, 82395, 16]
DIFF: variable "pressure" of type "NC_FLOAT" first at element [0, 250, 16]
DIFF: variable "theta" of type "NC_FLOAT" first at element [0, 82395, 16]
DIFF: variable "theta" of type "NC_FLOAT" first at element [0, 250, 16]
DIFF: variable "relhum" of type "NC_FLOAT" first at element [0, 82389, 12]
DIFF: variable "relhum" of type "NC_FLOAT" first at element [0, 54, 14]
DIFF: variable "ertel_pv" of type "NC_FLOAT" first at element [0, 82394, 15]
DIFF: variable "ertel_pv" of type "NC_FLOAT" first at element [0, 46, 18]
DIFF: variable "u_pv" of type "NC_FLOAT" first at element [0, 100393]
DIFF: variable "u_pv" of type "NC_FLOAT" first at element [0, 9907]
DIFF: variable "v_pv" of type "NC_FLOAT" first at element [0, 100393]
DIFF: variable "v_pv" of type "NC_FLOAT" first at element [0, 9907]
DIFF: variable "theta_pv" of type "NC_FLOAT" first at element [0, 100392]
DIFF: variable "theta_pv" of type "NC_FLOAT" first at element [0, 10477]
DIFF: variable "vort_pv" of type "NC_FLOAT" first at element [0, 9907]
DIFF: variable "vort_pv" of type "NC_FLOAT" first at element [0, 103108]
DIFF: variable "depv_dt_mix" of type "NC_FLOAT" first at element [0, 83234, 16]
DIFF: variable "depv_dt_mix" of type "NC_FLOAT" first at element [0, 949, 15]
DIFF: variable "depv_dt_diab" of type "NC_FLOAT" first at element [0, 23323, 15]
DIFF: variable "depv_dt_diab" of type "NC_FLOAT" first at element [0, 110333, 13]
DIFF: variable "depv_dt_fric" of type "NC_FLOAT" first at element [0, 82394, 15]
DIFF: variable "depv_dt_fric" of type "NC_FLOAT" first at element [0, 46, 16]
DIFF: variable "depv_dt_diab_pv" of type "NC_FLOAT" first at element [0, 9907]
DIFF: variable "depv_dt_diab_pv" of type "NC_FLOAT" first at element [0, 100393]
DIFF: variable "depv_dt_fric_pv" of type "NC_FLOAT" first at element [0, 9907]
DIFF: variable "depv_dt_fric_pv" of type "NC_FLOAT" first at element [0, 100392]
DIFF: variable "precipw" of type "NC_FLOAT" first at element [0, 442]
DIFF: variable "precipw" of type "NC_FLOAT" first at element [0, 82389]
Number of differences in variables 38
So far this web page won't let me upload the (zipped) input files; it says
is not included in the list
This happens under both Firefox & Chrome. Do you have some other upload-server that I can use?
Looking at this thread, https://github.com/github/hub/issues/1479, it looks like the problem is that my files are too large. I don't have smaller cases from the kinds of simulations I'm running.
Please let me know if the following patch works for you.
diff --git a/src/utils/ncmpidiff/cdfdiff.c b/src/utils/ncmpidiff/cdfdiff.c
index f8273071..65b994ea 100644
--- a/src/utils/ncmpidiff/cdfdiff.c
+++ b/src/utils/ncmpidiff/cdfdiff.c
@@ -65,6 +65,17 @@
chunk_off += nelems; \
}
+#define PRINTF_VAR_DIFF(itype) { \
+ int esize = sizeof(itype); \
+ itype *b1 = (itype*)buf[0]; \
+ itype *b2 = (itype*)buf[1]; \
+ if (esize > 1 && is_little_endian) { \
+ swapn(b1, 1, esize); \
+ swapn(b2, 1, esize); \
+ } \
+ printf("] of value %g vs %g\n", (double)b1[0], (double)b2[0]); \
+}
+
#define SWAP2B(a) ( (((a) & 0xff) << 8) | \
(((a) >> 8) & 0xff) )
@@ -977,9 +988,19 @@ cmp_vars:
var_name, get_type(xtype[0]), diffStart[0]);
for (_i=1; _i<var[0]->ndims; _i++)
printf(", %lld", diffStart[_i]);
- printf("]\n");
free(diffStart);
}
+ if (xtype[0] == NC_CHAR) PRINTF_VAR_DIFF(char)
+ else if (xtype[0] == NC_BYTE) PRINTF_VAR_DIFF(signed char)
+ else if (xtype[0] == NC_UBYTE) PRINTF_VAR_DIFF(unsigned char)
+ else if (xtype[0] == NC_SHORT) PRINTF_VAR_DIFF(short)
+ else if (xtype[0] == NC_USHORT) PRINTF_VAR_DIFF(unsigned short)
+ else if (xtype[0] == NC_INT) PRINTF_VAR_DIFF(int)
+ else if (xtype[0] == NC_UINT) PRINTF_VAR_DIFF(unsigned int)
+ else if (xtype[0] == NC_FLOAT) PRINTF_VAR_DIFF(float)
+ else if (xtype[0] == NC_DOUBLE) PRINTF_VAR_DIFF(double)
+ else if (xtype[0] == NC_INT64) PRINTF_VAR_DIFF(long long)
+ else if (xtype[0] == NC_UINT64) PRINTF_VAR_DIFF(unsigned long long)
}
worst = 0;
numVarDIFF++;
@@ -1000,7 +1021,7 @@ cmp_vars:
/* diff is found when check_tolerance = 1 */
if (ndims[0] == 0) { /* scalar variable */
- printf("DIFF (tolerance): scalar variable \"%s\" of type \"%s\" of value %f vs %f\n",
+ printf("DIFF (tolerance): scalar variable \"%s\" of type \"%s\" of value %g vs %g\n",
var_name, get_type(xtype[0]), worst1, worst2);
} else {
int _i;
@@ -1024,7 +1045,7 @@ cmp_vars:
if (worst == -1)
printf("]\n");
else
- printf("] of value %f vs %f\n", worst1, worst2);
+ printf("] of value %g vs %g\n", worst1, worst2);
free(diffStart);
}
numVarDIFF++;
With the modifications, I still see a longer list from the default than I do from -t 0,1. Here's a difference in the output:
< DIFF (tolerance): variable "qi" of type "NC_FLOAT" max at element [0, 0, 33] of value 2.66454e-15 vs 0
> DIFF: variable "qi" of type "NC_FLOAT" at element [0, 0, 33] of value 0 vs 0
with the parameterized case first and the default case second.
I believe I applied the patch correctly, since (a) the patch command didn't give any error messages when I ran it, and (b) the default case actually prints the compared values, which it didn't before.
The "relative tolerance ratio" is defined in your PR #53
error_ratio = -1.0 + b1[pos] / b2[pos];
So, in the case where b1[pos] is 2.66454e-15 and b2[pos] is a very small number that the printf format "%g" prints as 0, the error_ratio can actually be larger than 1.
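For illustration, here is a standalone sketch (not part of cdfdiff, using hypothetical values for b1[pos] and b2[pos]) showing how that formula blows up well past 1 whenever b2[pos] is much smaller than b1[pos]:

#include <stdio.h>

int main(void) {
    /* hypothetical values: b1 as reported above, b2 a tiny nonzero number */
    double b1 = 2.66454e-15, b2 = 1.0e-30;
    double error_ratio = -1.0 + b1 / b2;   /* the formula quoted above */
    printf("b1 = %g, b2 = %g, error_ratio = %g\n", b1, b2, error_ratio);
    return 0;
}

With these values error_ratio comes out around 2.7e15, far above the default ratio tolerance of 1, even though both values print as 0.000000 under "%f".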
OK -- sorry -- I hadn't looked at this in a long time, so likely the lists were matching the whole time. Looking at what it was printing out before, though:
< DIFF (tolerance): variable "qi" of type "NC_FLOAT" max at element [0, 0, 33] of value 2.66454e-15 vs 0
> DIFF: variable "qi" of type "NC_FLOAT" at element [0, 0, 33] of value 0 vs 0
it looks like the second line was rounding the output to 0 instead of using exponential notation, right?
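As a side note, here is a two-line sketch (not from cdfdiff) of how "%f" versus "%g" render a value of this magnitude:

#include <stdio.h>

int main(void) {
    float v = 2.66454e-15f;
    printf("%f\n", v);   /* prints 0.000000     -- rounded away           */
    printf("%g\n", v);   /* prints 2.66454e-15  -- exponential notation   */
    return 0;
}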
That is strange. The second line of output should show 2.66454e-15 vs 0. Can you try these two files?
With these two files, I see the output line up:
--> cdfdiff file1.nc file2.nc
DIFF: variable "var" of type "NC_FLOAT" at element [0] of value 2.66454e-15 vs 0
--> cdfdiff -t 0,0 file1.nc file2.nc
DIFF (tolerance): variable "var" of type "NC_FLOAT" max at element [0] of value 2.66454e-15 vs 0
My files still show some differences:
--> cdfdiff -t 0,0 2022-02-12.*/Tests/*.60km.CPU.04procs.15min_rad/history.2010-10-23_00.30.00.nc | fgrep qi
DIFF (tolerance): variable "qi" of type "NC_FLOAT" max at element [0, 0, 33] of value 2.66454e-15 vs 0
--> cdfdiff 2022-02-12.*/Tests/*.60km.CPU.04procs.15min_rad/history.2010-10-23_00.30.00.nc | fgrep qi
DIFF: variable "qi" of type "NC_FLOAT" at element [0, 0, 33] of value 0 vs 0
One question -- I'm not sure if these had been written using PNetCDF or plain NetCDF. Is the tool supposed to work in both cases?
Could you please make your two files somewhere available for me to download, so I can use them to debug?
cdfdiff should work for all NetCDF files (CDF-1, 2, and 5) generated by any software.
See if you can access this:
sftp carl-ftp@corpftp.nvidia.com
RvdG3Wbx
get *.nc
and tell me when you've finished, ok?
Thanks. I have copied the files. You can safely remove them.
One question. In your PR #53, I see that the difference tolerance ratio 'error_ratio' is initialized from the command-line option '-t', but is updated to a new value whenever a difference fails the tolerance. See the line below.
129 error_ratio = -1.0 + b1[pos] / b2[pos];
Similarly for the absolute difference, 'error_difference'.
133 error_difference = b1[pos] - b2[pos];
My question is shouldn't the two be kept unchanged?
What I'm doing here is, when I find an entry that is flagged by the tolerance, I raise the tolerance requirement so it won't flag any entry "lower" than the one that was just flagged. This doesn't look like a good idea now, because (a) you only list the first failing element in each array, as opposed to the maximally different element in the array, right? And (b) when I adjust the tolerance like this, it will affect whether other arrays get flagged or not, right?
Also, would you rather that the "ratio" tolerance work differently? I've been thinking about other ways to express it. This form
min(a,b)/max(a,b)
would always be a number in [0,1], and to the case where a & b are both 0 we would assign the value 1. We would set 1 as the default ratio, so a ratio < 1 indicates that the values are unequal.
The other corner case is where the signs don't match; I suppose the relative tolerance should always fail in that case, and you'd use the absolute tolerance to decide whether the difference is enough to care about.
What I'm doing here is, when I find an entry that is flagged by the tolerance, I raise the tolerance requirement so it won't flag any entry "lower" than the one that was just flagged. This doesn't look like a good idea now, because (a) you only list the first failing element in each array, as opposed to the maximally different element in the array, right? And (b) when I adjust the tolerance like this, it will affect whether other arrays get flagged or not, right?
Both are correct.
Also, would you rather that the "ratio" tolerance work differently? I've been thinking about other ways to express it. This form
min(a,b)/max(a,b)
would always be a number in [0,1], and to the case where a & b are both 0 we would assign the value 1. We would set 1 as the default ratio, so a ratio < 1 indicates that the values are unequal.
The other corner case is where the signs don't match; I suppose the relative tolerance should always fail in that case, and you'd use the absolute tolerance to decide whether the difference is enough to care about.
I suggest using the absolute values to calculate the ratio, i.e.
min(|a|,|b|)/max(|a|,|b|)
This would give a false match for (1, -1). If we're going to use a raw ratio like this, I'd prefer that, if the signs don't match, we give it a ratio of 0. But, yeah, in the case where both values are negative, we'd want the absolute values there.
I think the name "ratio" should abide by its commonly known definition, i.e.
a/b - 1
which means the change ratio of a to b is different from that of b to a. I wish there were a way to allow users to pass in a customized function for calculating the ratio.
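As a rough sketch of that idea (purely hypothetical, not an existing cdfdiff interface), a user-supplied metric could be passed in as a C function pointer:

#include <stdio.h>

/* hypothetical: a user-selectable metric for comparing two values */
typedef double (*ratio_fn)(double a, double b);

/* the "change ratio" definition mentioned above */
static double change_ratio(double a, double b) { return a / b - 1.0; }

int main(void) {
    ratio_fn metric = change_ratio;   /* could be swapped for any other definition */
    printf("change ratio of 1.1 to 1.0: %g\n", metric(1.1, 1.0));   /* ~0.1 */
    return 0;
}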
The Wikipedia page on relative difference mentions the formula below; I like it better.
|x - y|/max(|x|, |y|)
That looks like a good formula. It deals with the case where the signs are different, since we just re-scale the distance between the two values. The only corner case is (0,0), which I think we ought to map to 0, since the distance is 0 before re-scaling.
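For concreteness, here is a minimal sketch of that metric as I read the discussion (an illustration only, not the code that was committed), including the (0,0) corner case:

#include <math.h>
#include <stdio.h>

/* relative difference |x - y| / max(|x|, |y|), with (0,0) mapped to 0 */
static double relative_difference(double x, double y)
{
    double scale = fmax(fabs(x), fabs(y));
    if (scale == 0.0) return 0.0;               /* both values are exactly 0 */
    return fabs(x - y) / scale;
}

int main(void) {
    printf("%g\n", relative_difference(2.66454e-15, 0.0));   /* 1: fully different */
    printf("%g\n", relative_difference(-1.0, 1.0));          /* 2: opposite signs  */
    printf("%g\n", relative_difference(0.0, 0.0));           /* 0: identical       */
    return 0;
}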
OK. I pushed the changes using this formula to the master branch. Please give it a try and let me know if it works for you. (You can also wget the file https://github.com/Parallel-NetCDF/PnetCDF/blob/master/src/utils/ncmpidiff/cdfdiff.c)
I see an exact match between the output with & without -t 0,0, which is what we want.
I'm still getting used to the relative-difference metric, but it all looks great!
Can you please spin a new release soon?
I consider this issue resolved and will make a new release. Thanks for a good discussion.
I see you tagged 1.12.3 here yesterday, but the website https://parallel-netcdf.github.io/wiki/Download.html says that 1.12.3 was posted in January. This is the one with the fix, right?
I have fixed the date on the website. It should be February 21, 2022. Thanks.
I expect this command to show any differences that exist between the two files, since the absolute tolerance is 0 (for no difference) and the ratio tolerance is 1 (for no difference):
It gives this output
I assume this means that the values are different but you are printing too narrow an output field for me to see it, right? Now when I use this form instead, not assigning any explicit tolerances,
and I get a different list:
Since it's not showing me the values, I can't tell what the difference is or at what scale. I would expect the default tolerance to be the same as
-t 0,1
but evidently it's not.