LLNL / H5Z-ZFP

A registered ZFP compression plugin for HDF5

Enable CMake Testing #116

Closed brtnfld closed 1 year ago

brtnfld commented 1 year ago

We now have most of the CMake testing issues resolved. However, one issue we are seeing is the tests

test-lib-rate-read-32 test-rate-read-32

fail due to the values exceeding the error tolerances, https://github.com/LLNL/H5Z-ZFP/actions/runs/4940539669/jobs/8832341432

Setting: max_reldiff=1e-07 set maximum relative diff
Getting: 2: Relative Diffs: 1 values are different; actual-max-reldiff = 2.62382e-07

@lindstro, do we know if this is a known ZFP issue, or are we assuming something in our tests that may be different on Windows?

markcmiller86 commented 1 year ago

> @lindstro, do we know if this is a known ZFP issue, or are we assuming something in our tests that may be different on Windows?

@brtnfld I wouldn't necessarily vouch too awful much for my original coding of tolerance values for those tests. My main objective was to simply assure ZFP compressed results were in the ballpark of what one would expect in terms of either size of resulting file or other heuristics. I think in this case, we have a read client which reads back into memory the ZFP compressed HDF5 data and compares that with original data using either absolute or relative differencing.

I think if we have to fudge the test thresholds up a bit, to get a pass, I would not be surprised. I simply examined what I was observing on my macOS system in the way of difference thresholds and then picked a threshold that was as small as possible but nonetheless accepted whatever I was seeing in my manual observations of diffs.
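The relative-differencing check described above can be sketched as follows. This is a minimal illustration, not the project's test client: the function names, the zero-value fallback, and the sample data are assumptions, and the actual definition of relative difference used by the tests may differ.

```python
def max_relative_diff(original, decoded):
    """Return the largest |a - b| / |a| over all pairs.

    Assumption: where the original value is 0, fall back to the
    absolute difference so the ratio stays defined.
    """
    worst = 0.0
    for a, b in zip(original, decoded):
        diff = abs(a - b)
        rel = diff / abs(a) if a != 0 else diff
        worst = max(worst, rel)
    return worst


def within_tolerance(original, decoded, max_reldiff):
    """True when every value is within the given relative tolerance."""
    return max_relative_diff(original, decoded) <= max_reldiff


# Hypothetical data, not taken from the failing CI run.
original = [1.0, 2.0, 4.0]
decoded = [1.0, 2.0, 4.000001]
observed = max_relative_diff(original, decoded)
```

With these made-up values, the observed maximum is about 2.5e-07, so a threshold of 1e-07 rejects the data while 1e-06 accepts it. That mirrors the situation in the failing tests, where a single value exceeded a threshold that had been picked from observations on one platform.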

brtnfld commented 1 year ago

Sounds good. I will loosen the tolerances for the Windows tests.

lindstro commented 1 year ago

> @lindstro, do we know if this is a known ZFP issue, or are we assuming something in our tests that may be different on Windows?

@brtnfld The log file is very long and I don't know what I should be looking for. What is max_reldiff, how does it relate to zfp's compression modes, and what are expected and observed behaviors?

brtnfld commented 1 year ago

I've adjusted the testing tolerance for Windows.

The only remaining issue is random segfaults on Windows for ctest. It happens randomly for any of the tests, not just one test randomly segfaulting. So, to get the GitHub actions to pass, I repeat the ctest until all the tests pass (max 5 times).
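The retry-until-pass workaround can be sketched like this. It is an illustration under assumptions, not the logic in the project's workflow file: `run_until_pass` and its `runner` parameter are hypothetical names, and the real CI step may be written as a shell loop instead.

```python
import subprocess


def run_until_pass(cmd, max_attempts=5, runner=subprocess.run):
    """Run cmd until it exits with status 0 or max_attempts is reached.

    Returns the 1-based number of the attempt that succeeded,
    or 0 if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
    return 0
```

A call such as `run_until_pass(["ctest", "--output-on-failure"])` would rerun the whole suite up to five times. CTest 3.17 and later also offers `ctest --repeat until-pass:5`, which reruns only the individual tests that failed. Either approach hides an intermittent crash rather than fixing it, so the attempt count is worth logging.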

markcmiller86 commented 1 year ago

> It happens randomly for any of the tests, not just one test randomly segfaulting. So, to get the GitHub actions to pass, I repeat the ctest until all the tests pass (max 5 times).

Hmmm...sounds bad. Is it seen only on Windows? Would a valgrind run on Linux shed any light?

brtnfld commented 1 year ago

> > @lindstro, do we know if this is a known ZFP issue, or are we assuming something in our tests that may be different on Windows?

> @brtnfld The log file is very long and I don't know what I should be looking for. What is max_reldiff, how does it relate to zfp's compression modes, and what are expected and observed behaviors?

@lindstro, Mark mentioned that the tolerances were ballpark values, so I'm going to assume it is OK to loosen the tolerances on Windows.

brtnfld commented 1 year ago

I've only encountered it on Windows.

lindstro commented 1 year ago

> > > @lindstro, do we know if this is a known ZFP issue, or are we assuming something in our tests that may be different on Windows?

> > @brtnfld The log file is very long and I don't know what I should be looking for. What is max_reldiff, how does it relate to zfp's compression modes, and what are expected and observed behaviors?

> @lindstro, Mark mentioned that the tolerances were ballpark values, so I'm going to assume it is OK to loosen the tolerances on Windows.

@brtnfld I actually don't know what Mark's tests do, so it is difficult for me to advise. That said, if zfp is fed the exact same input data, then we would expect bit-for-bit identical outputs across platforms. One challenge is producing identical input data if it's generated in floating point, as rounding modes and compiler optimizations can introduce differences.

We've been doing some recent work on portable test data generation that will go into the zfp CLI. We might be able to piggyback on that, though I suspect this code won't be ready for another month or two.
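One way to sidestep platform-dependent input data is to generate it with integer arithmetic only, then compare byte-level digests across platforms. The sketch below is an assumption-laden illustration and not the generator planned for the zfp CLI: `portable_samples`, `digest`, the seed, and the choice of a linear congruential generator are all hypothetical.

```python
import hashlib
import struct


def portable_samples(n, seed=12345):
    """Generate n doubles from integer arithmetic plus one exact scaling.

    A 32-bit linear congruential generator supplies the integers.
    Dividing a 24-bit integer by 2**16 is exact in IEEE 754 binary64,
    so neither the rounding mode nor the math library affects the result.
    """
    state = seed
    samples = []
    for _ in range(n):
        # LCG constants from Numerical Recipes.
        state = (1664525 * state + 1013904223) % 2**32
        samples.append((state >> 8) / 65536.0)
    return samples


def digest(samples):
    """SHA-256 over the little-endian binary64 bytes of the samples."""
    raw = struct.pack("<%dd" % len(samples), *samples)
    return hashlib.sha256(raw).hexdigest()
```

Running `digest(portable_samples(1024))` on each platform and comparing the hex strings would show whether the inputs are bit-for-bit identical before any compression is involved. If they match and the decompressed outputs still differ, the difference lies downstream of data generation.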

markcmiller86 commented 1 year ago

We have a Windows developer/expert I might be able to have take a look at it with the toolchain she has access to on Windows.

@biagas would you have any time/resources to help debug a problem we're having here on Windows? The problem is that we're seeing intermittent failures in ctest of this HDF5 compression filter but only on Windows. You can see how we build on Windows by looking at the CI logic, https://github.com/LLNL/H5Z-ZFP/blob/eb544004350b717fe7f27a0d9ad3e521ce71e308/.github/workflows/main.yml#L83-L121

If it's encouraging in any way, this is relevant to the Silo, VisIt and ECP Ascent projects :wink:.

markcmiller86 commented 1 year ago

> @brtnfld I actually don't know what Mark's tests do, so it is difficult for me to advise.

The tests here are not aimed at testing the ZFP library. They are aimed at testing that H5Z-ZFP behaves as expected. For example, the same HDF5 file written with compression should be smaller than the one written without, and when we look at diffs between compressed and original (uncompressed) data, all the differences we encounter should be below a ballpark tolerance. Even for ZFP's rate mode, I just selected a tolerance after examining the data, concluding it looked as expected, and then picking a tolerance value that is as small as possible but larger than anything I found in my examination.

brtnfld commented 1 year ago

@markcmiller86, let us know if we need to look into the segfault issue.

markcmiller86 commented 1 year ago

I think I'd like to do some valgrind runs on Linux just to rule out anything obvious. If that doesn't produce anything worth addressing, then I'll go ahead and merge.

markcmiller86 commented 1 year ago

FYI...unrelated to this PR but just mentioning here...I am finding a leak of properties. I didn't implement the callback for deleting the customized property list content the filter is creating. I will fix this in another PR shortly. Other than that, I am not yet seeing any memory faults of any kind.

markcmiller86 commented 1 year ago

So, in my testing on Linux with valgrind, I found only a leak (which I've submitted a PR to fix). I found a UMR being reported too but that was only during printing of cd_vals array from generic interface for precision mode. And, that was due to the fact that one entry in cd_vals for that mode is unused. I don't think it has anything to do with Windows failures.

valgrind of filter as library

quartz2306{miller86}527: valgrind ./test_write_lib zfpmode=3 acc=0.001
==3326628== Memcheck, a memory error detector
==3326628== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3326628== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3326628== Command: ./test_write_lib zfpmode=3 acc=0.001
==3326628== 
    ifile=""                                  set input filename
    ofile="test_zfp.h5"                      set output filename

ZFP compression paramaters...
    zfpmode=3        (1=rate,2=prec,3=acc,4=expert,5=reversible)
    rate=4                                set rate for rate mode
    acc=0.001                     set accuracy for accuracy mode
    prec=11                     set precision for precision mode
    minbits=0                        set minbits for expert mode
    maxbits=4171                     set maxbits for expert mode
    maxprec=64                       set maxprec for expert mode
    minexp=-1074                      set minexp for expert mode

1D dataset generation arguments...
    npoints=1024             set number of points for 1D dataset
    noise=0.001         set amount of random noise in 1D dataset
    amp=17.7             set amplitude of sinusoid in 1D dataset
    chunk=256                      set chunk size for 1D dataset
    doint=0                              also do integer 1D data

Advanced cases...
    highd=0                                4D w/2D chunk example
    sixd=0                             run 6D extendable example
    zfparr=0                requires ZFP>=0.5.4 with CFP enabled
    help=0                                     this help message
==3326628== 
==3326628== HEAP SUMMARY:
==3326628==     in use at exit: 0 bytes in 0 blocks
==3326628==   total heap usage: 2,770 allocs, 2,770 frees, 975,886 bytes allocated
==3326628== 
==3326628== All heap blocks were freed -- no leaks are possible
==3326628== 
==3326628== For lists of detected and suppressed errors, rerun with: -s
==3326628== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

valgrind of filter as plugin

quartz2306{miller86}528: env HDF5_PLUGIN_PATH=`pwd`/../install/plugin valgrind ./test_write_plugin zfpmode=2 prec=5
==3326873== Memcheck, a memory error detector
==3326873== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3326873== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3326873== Command: ./test_write_plugin zfpmode=2 prec=5
==3326873== 
    ifile=""                                  set input filename
    ofile="test_zfp.h5"                      set output filename

ZFP compression paramaters...
    zfpmode=2        (1=rate,2=prec,3=acc,4=expert,5=reversible)
    rate=4                                set rate for rate mode
    acc=0                         set accuracy for accuracy mode
    prec=5                      set precision for precision mode
    minbits=0                        set minbits for expert mode
    maxbits=4171                     set maxbits for expert mode
    maxprec=64                       set maxprec for expert mode
    minexp=-1074                      set minexp for expert mode

1D dataset generation arguments...
    npoints=1024             set number of points for 1D dataset
    noise=0.001         set amount of random noise in 1D dataset
    amp=17.7             set amplitude of sinusoid in 1D dataset
    chunk=256                      set chunk size for 1D dataset
    doint=0                              also do integer 1D data

Advanced cases...
    highd=0                                4D w/2D chunk example
    sixd=0                             run 6D extendable example
    zfparr=0                requires ZFP>=0.5.4 with CFP enabled

3 cd_values=2,0,5,
    help=0                                     this help message
==3326873== 
==3326873== HEAP SUMMARY:
==3326873==     in use at exit: 0 bytes in 0 blocks
==3326873==   total heap usage: 2,760 allocs, 2,760 frees, 1,001,986 bytes allocated
==3326873== 
==3326873== All heap blocks were freed -- no leaks are possible
==3326873== 
==3326873== For lists of detected and suppressed errors, rerun with: -s
==3326873== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

brtnfld commented 1 year ago

I see that Windows's actions passed the first time through with the Valgrind PR. Could be luck, but a good sign.