LLNL / H5Z-ZFP

A registered ZFP compression plugin for HDF5
Other
50 stars 22 forks source link

unit test fails #137

Open shaomeng opened 8 months ago

shaomeng commented 8 months ago

I was following the installation steps to compile HDF5, ZFP, and H5Z-ZFP on my system. However, the make check fails with a datatype error:

If using HDF5-1.8, make sure you have patched repack
cc -c print_h5repack_farg.c -o print_h5repack_farg.o -fPIC -I../src
cc print_h5repack_farg.o -o print_h5repack_farg 

h5repack -n     -f UD=32013,0,4,3,0,3539053052,1062232653 ...HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 0:
  #000: /home/shaomeng/Git/hdf5/CMake-hdf5-1.14.3/hdf5-1.14.3/src/H5T.c line 2052 in H5Tget_class(): not a datatype
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 0:
  #000: /home/shaomeng/Git/hdf5/CMake-hdf5-1.14.3/hdf5-1.14.3/src/H5T.c line 2052 in H5Tget_class(): not a datatype
    major: Invalid arguments to routine
    minor: Inappropriate type

(Note: all other tests succeed except for this one. The full make check output is attached: h5z-zfp.log).

I have HDF5 version 1.14.3, ZFP from the main branch as of 2/24/2024, and H5Z-ZFP from the main branch as of 2/24/2024. I'm not sure if the failed checks are consequential or not.

lindstro commented 8 months ago

I can confirm that h5repack gives this error (with 8-bit word size, as needed). I've not patched my h5repack, but presumably one shouldn't have to with recent versions of HDF5--I, too, have 1.14.3 installed.

@markcmiller86 Any ideas?

markcmiller86 commented 8 months ago

The patch is only necessary for 1.8. Probably something wrong with the testing logic itself. I will take a look but not this week.

markcmiller86 commented 8 months ago

So, after a quick look, I don't see anything wrong with testing logic. The test is intended to h5repack the test file, mesh.h5 producing mesh_repack.h5 which should be smaller than the original by at least 20%. The fact that h5repack itself is reporting an issue suggests that it doesn't like a datatype its encountered in mesh.h5. But, all previous versions of HDF5 have been fine with that file. What does h5dump mesh.h5 or h5ls -vlrd mesh.h5 produce on either your system, @lindstro or your system @shaomeng using the HDF5 1.14.3 you have installed there?

shaomeng commented 8 months ago

Hi Mark, I have the following output from h5dump mesh.h5 (attached txt file). mesh.h5.txt

markcmiller86 commented 8 months ago

So, something has me confused but in test/Makefile here...

https://github.com/LLNL/H5Z-ZFP/blob/092190c38d5760b3212ce4d96cced767b90aa189/test/Makefile#L382-L385

there is a redirection of both stdout and stderr with 2>&1 1>/dev/null which means we should NOT SEE any HDF5 error logging anywhere. But, we are. Why? It might be worth removing that rediction and doing make test again and see what else it indicates is going wrong.

Next, it appears to complete the h5repack with 0 return value because it is NOT exiting the test here...

https://github.com/LLNL/H5Z-ZFP/blob/092190c38d5760b3212ce4d96cced767b90aa189/test/Makefile#L385-L389

Instead, it is exiting at the check of the size ratio. So, I think h5repack has encountered some unknown issue with the datatypes in mesh.h5 and refused to compress the file. You could confirm by looking for an mesh_repack.h5 and see if it is smaller (or different using h5diff) than mesh.h5.

lindstro commented 8 months ago

It produces the kind of output you might expect, with no warnings or errors. Output attached.

h5dump.txt.gz h5ls.txt.gz

lindstro commented 8 months ago

Could it be that H5T_STD_U8LE and H5T_STD_I8LE are not handled? zfp supports only int32, int64, float, and double.

Here's part of the output I'm seeing:

h5repack -n     -f UD=32013,0,4,3,0,3539053052,1062232653 ...HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 0:
  #000: /tmp/hdf5-20240118-4051-ziphrl/hdf5-1.14.3/src/H5T.c line 2052 in H5Tget_class(): not a datatype
    major: Invalid arguments to routine
    minor: Inappropriate type
...
 [FAILED (size ratio) ]
markcmiller86 commented 8 months ago

Hmm...I think you could be right. In the mesh.h5 most of the datasets are int, float or double. However, there are a few there which are not...

[scratlantis:zfp_filter/H5Z-ZFP/test] miller86% h5ls -vlr mesh.h5 | grep -B 4 signed
/Pressure                Dataset {10/10, 20/20, 30/30}
    Location:  1:42412
    Links:     1
    Storage:   24000 logical bytes, 24000 allocated bytes, 100.00% utilization
    Type:      native unsigned int
--
/Pressure_2D             Dataset {20/20, 30/30}
    Location:  1:1400
    Links:     1
    Storage:   600 logical bytes, 600 allocated bytes, 100.00% utilization
    Type:      native unsigned char
--
/VelocityX_2D            Dataset {21/21, 31/31}
    Location:  1:1672
    Links:     1
    Storage:   651 logical bytes, 651 allocated bytes, 100.00% utilization
    Type:      native signed char

What happens for you if you adjust the Makefile to read...

             -l X,Y,Z,Indexes:CHUNK=217 \
             -l Indexes2:CHUNK=1517 \
             -l Pressure2,Pressure3:CHUNK=10x20x5 \
             -l Stress,Velocity,Stress2,Velocity2,Stress3,Velocity3,VelocityZ,VelocityZ2,VelocityZ3:CHUNK=11x21x1x1 \
             -l XY:CHUNK=651x1 \
             -l XYZ:CHUNK=217x1 \
             -l XYZ2:CHUNK=1617x1 \
             -l XYZ3:CHUNK=33x1 \

which removes the request to repack Pressure, Pressure_2D and VelocityX_2D?

markcmiller86 commented 8 months ago

Well, I think its just VelocityX_2D and Pressure_2D that are the culprits. Interesting thing...earlier versons of HDF5 I think were silently skipping these or maybe h5repack continued past them in earlier versions of HDF5 but now maybe stops upon encountering the error?

lindstro commented 8 months ago

What happens for you if you adjust the Makefile to read...

Same errors as far as I can tell.

lindstro commented 8 months ago

FWIW, the test passes if I leave out HDF5_PLUGIN_PATH=../src/plugin. Not sure what HDF5 does in this case when requesting zfp compression.

markcmiller86 commented 8 months ago

So, I just tried with master on HDF5-1.14.3 with gcc on macOS and do not see this error. I did a make clean first.

markcmiller86 commented 8 months ago

FWIW, the test passes if I leave out HDF5_PLUGIN_PATH=../src/plugin. Not sure what HDF5 does in this case when requesting zfp compression.

It should fail the size ratio if you do that. Mine does....

Plugin precision tests ...................................... [PASSED] 
If using HDF5-1.8, make sure you have patched repack

h5repack -n     -f UD=32013,0,4,3,0,3539053052,1062232653 ... [FAILED (size ratio) ]

./test_write_plugin zfpmode=5 ............................... [PASSED] 

h5diff -v -d 0.00001 test_zfp_le.h5 test_zfp_be.h5 compressed compressed ........ [PASSED] 

h5dump bigendian.h5 ............................................................. [PASSED] 

I wonder if it is somehow finding an older version of the plugin filter code somewhere in your path, ld lib var or something?

markcmiller86 commented 8 months ago

BTW...I am working from H5Z-ZFP master branch and ZFP develop branch.

shaomeng commented 8 months ago

So, I just tried with master on HDF5-1.14.3 with gcc on macOS and do not see this error. I did a make clean first.

I tried make clean and also the remove of the request to repack Pressure, Pressure_2D and VelocityX_2D, it's still giving the same error.

In my case, I'm pretty sure that this is the only version of H5Z-ZFP I have; I just started to interact with this piece of software a few days ago ;) (And yes, I'm also using H5Z-ZFP master, ZFP develop.) I'm having this problem on ubuntu 22.04 though.

lindstro commented 8 months ago

Some potential differences: I'm building H5Z-ZFP using GNU make and AppleClang 15.0.0. I'm using the master branch of both H5Z-ZFP and zfp (I had some vague recollection that there was a weird interaction between H5Z-ZFP and the zfp develop branch, but maybe that was pre 1.0.0).

I wonder if it is somehow finding an older version of the plugin filter code somewhere in your path, ld lib var or something?

LD_LIBRARY_PATH is not set, and I can confirm using DYLD_PRINT_LIBRARIES the correct path to the plugin.

markcmiller86 commented 8 months ago

@lindstro or @shaomeng when you get a chance, can you attach a copy of mesh_repack.h5 you get (assuming you do indeed get one...I think you are because its failing the size ratio check, not the first step in performing the repack)?

shaomeng commented 8 months ago

@lindstro or @shaomeng when you get a chance, can you attach a copy of mesh_repack.h5 you get (assuming you do indeed get one...I think you are because its failing the size ratio check, not the first step in performing the repack)?

Hi Mark, I do get one! Please see the attachment mesh_repack.h5.tar.gz

lindstro commented 8 months ago

So my mesh_repack.h5 does not contain any zfp magic numbers (0x0570667a in little endian), suggesting that compression is not even being invoked.

markcmiller86 commented 8 months ago

So my mesh_repack.h5 does not contain any zfp magic numbers (0x0570667a in little endian), suggesting that compression is not even being invoked.

Well, is the mesh_repack.h5 you are looking at one you generated by disabling setting of HDF5_PLUGIN_PATH as you mentioned above

markcmiller86 commented 8 months ago

Hi Mark, I do get one! Please see the attachment mesh_repack.h5.tar.gz

Thanks. That file appears to be an exact copy of mesh.h5 with no datasets compressed. So, the repack operation produced an output but no compression was employed. I don't think h5repack will return non-zero return value in this case. I will have to check. But, pretty sure it returned 0 implying the operation succeeded. So, it went on to compare the sizes of the files and then failed the size ratio as @shaomeng original log shows (e.g. [FAILED (size ratio) ]).

So, the question is, why isn't it applying the compression? The warning messages about not a data type are certainly a clue. I am still trying to get to bottom of it.

@lindstro is your mac Intel or Apple chips?

markcmiller86 commented 8 months ago

I am thinking difference in behavior may have to do with how we have respectively compiled HDF5 1.14.3. Can each of you also past your CMake or autoconf command-line for how you configured HDF5?

lindstro commented 8 months ago

Well, is the mesh_repack.h5 you are looking at one you generated by disabling setting of HDF5_PLUGIN_PATH as you mentioned https://github.com/LLNL/H5Z-ZFP/issues/137#issuecomment-1967545327

No, the path is correct and the plugin is loaded:

dyld[60726]: <FBD8E492-073B-3FDF-A6D2-3FE5D0E53BEA> /.../H5Z-ZFP/src/plugin/libh5zzfp.so
dyld[60726]: <1B084037-309A-3541-8B17-20022829C01E> /opt/homebrew/Cellar/hdf5/1.14.3/lib/libhdf5.310.3.0.dylib
dyld[60726]: <BC330C52-DB91-346D-830B-514BDC7F1043> /.../H5Z-ZFP/zfp/lib/libzfp.1.0.1.dylib

@lindstro is your mac Intel or Apple chips?

I have an Apple M2 Max.

I am thinking difference in behavior may have to do with how we have respectively compiled HDF5 1.14.3. Can each of you also past your CMake or autoconf command-line for how you configured HDF5?

I couldn't get CMake to override the path to zfp (I tried ZFP_DIR); I don't know how it's finding zfp under my home directory. So I used GNU make:

make ZFP_HOME=`pwd`/zfp HDF5_HOME=/opt/homebrew/Cellar/hdf5/1.14.3 HDF5_PLUGIN_PATH=`pwd`/src/plugin all
markcmiller86 commented 8 months ago

@jhendersonHDF wanted to ask if you could have a looksee here briefly and see if anything looks familiar. It kinda sounds like this HDF forum issue but we're not seg-faulting. Its just like h5repack is deciding it cannot invoke the requested filtering operations due to some issue in recognizing the datatype.

For reference, here is the command that works for me but fails for the other two users here...

env \
LD_LIBRARY_PATH=/Users/miller86/silo/hdf5-1.14.3/build_serial/myinstall/lib: \
HDF5_PLUGIN_PATH=../src/plugin
/Users/miller86/silo/hdf5-1.14.3/build_serial/myinstall/bin/h5repack \
-n -f UD=32013,1,4,3,0,3539053052,1062232653 \
-l X,Y,Z,Indexes:CHUNK=217 \
-l Indexes2:CHUNK=1517 \
-l Pressure,Pressure2,Pressure3:CHUNK=10x20x5 \
-l Pressure_2D:CHUNK=10x20 \
-l Stress,Velocity,Stress2,Velocity2,Stress3,Velocity3,VelocityZ,VelocityZ2,VelocityZ3:CHUNK=11x21x1x1 \
-l VelocityX_2D:CHUNK=21x31 \
-l XY:CHUNK=651x1 \
-l XYZ:CHUNK=217x1 \
-l XYZ2:CHUNK=1617x1 \
-l XYZ3:CHUNK=33x1 \
mesh.h5 mesh_repack.h5
jhendersonHDF commented 8 months ago

@markcmiller86 Is it possible that you built a shared library only version of HDF5 while others here have built either a static only version or a static + shared version? This looks suspiciously similar to problems I've seen in the past where using an HDF5 tool which is statically linked to HDF5 along with a dynamically loaded plugin that is linked to the shared version of the library causes two HDF5s to be loaded into memory and results in very odd issues. If an HDF5 build was built with both static and shared, there will be two versions of each tool, e.g. h5dump and h5dump-shared and the shared versions are the correct ones to use in this case. This has been a problem often enough that we're finally correcting this in https://github.com/HDFGroup/hdf5/pull/4046 by only ever building one set of tools and linking to the shared library by default if available.

markcmiller86 commented 8 months ago

@jhendersonHDF thanks for the quick look. I do indeed have shared and static build.

otool -L ~/silo/hdf5-1.14.3/build_serial/myinstall/bin/h5repack 
/Users/miller86/silo/hdf5-1.14.3/build_serial/myinstall/bin/h5repack:
    /Users/miller86/silo/hdf5-1.14.3/build_serial/myinstall/lib/libhdf5.310.dylib (compatibility version 314.0.0, current version 314.0.0)
    /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.11)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.100.3)
shaomeng commented 8 months ago

I am thinking difference in behavior may have to do with how we have respectively compiled HDF5 1.14.3. Can each of you also past your CMake or autoconf command-line for how you configured HDF5?

Hi Mark, I followed this instruction to build HDF5, and I didn't change anything besides the installation prefix. I do see that I have both h5repack and h5repack-shared in the installed bin directory, so it might be the issue??

markcmiller86 commented 8 months ago

@lindstro I have intel Mac but I do have an M1 I can try to test on. Also, my 1.0.0 zfp lib by default builds static, libzfp.a. When H5Z-ZFP is built, it pulls that statically into the .so file that gets created for the HDF5 filter/plugin. Thats how it is "finding" zfp. Its statically built into the plugin.

jhendersonHDF commented 8 months ago

@markcmiller86 Interesting that your h5repack seems to be linking to the shared library version of HDF5 if you have a static + shared build; I would've expected it to be linking to the static library. But since your version is linking to the shared library, that's likely why you have no issues. For others here, I would try editing what's necessary to make the tests use h5repack-shared instead and see if that works. By default, HDF5's CMake should build both static and shared libs and you'll end up with this problem since the expected name for the tool (h5repack, for instance) actually points to the statically-linked tool.

lindstro commented 8 months ago

I used homebrew to install HDF5.

otool -L `where h5repack`
/opt/homebrew/bin/h5repack:
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1336.61.1)
    /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.12)
    /opt/homebrew/opt/libaec/lib/libsz.2.dylib (compatibility version 2.0.0, current version 2.0.1)

Evidently it was linked to the static library.

markcmiller86 commented 8 months ago

Evidently it was linked to the static library.

Ok, so thats a problem. A statically linked tool cannot load a plugin. So, the test cannot be properly executed. So, we should detect that and just fail associated tests.

This must be the explanation for failed test here.

jhendersonHDF commented 8 months ago

So, we should detect that and just fail associated tests.

You might also be able to detect if the "-shared" version of a tool exists and use that if available.

lindstro commented 8 months ago

So, we should detect that and just fail associated tests.

You might also be able to detect if the "-shared" version of a tool exists and use that if available.

Indeed, the test passes with h5repack-shared.

markcmiller86 commented 8 months ago

@shaomeng see if this branch, mcm86-28feb24-static-hdf5-failures works for you.

markcmiller86 commented 8 months ago

@jhendersonHDF thanks for your attention and time...probably saved me a few hours 💪🏻

shaomeng commented 8 months ago

@shaomeng see if this branch, mcm86-28feb24-static-hdf5-failures works for you.

Hi Mark, the new branch still fails, with the same errors.

If I manually change the command in make check from h5repack to h5repack-shared then the tests pass.

jhendersonHDF commented 8 months ago

@markcmiller86 You probably need to check if the static lib is present rather than checking if the shared lib is missing. If static is present at all then the regular named tools should end up linked statically. If the static lib is present you should either skip the test like in that PR or prefer the -shared tool when the static lib is present and the regular tool when it's not.

jhendersonHDF commented 8 months ago

That is, I guess you should aim for HDF5_IS_SHARED_ONLY rather than HDF5_IS_STATIC_ONLY

markcmiller86 commented 8 months ago

Ok, thanks for info @jhendersonHDF and @shaomeng. I will return to this next week. Other fires to put out now.

markcmiller86 commented 8 months ago

Something else that really troubles me here is that many of the tests use HDF5 tools (h5diff, h5dump, and h5repack) to confirm tests passed or failed. So, IMHO, if the test involving h5repack failed for @shaomeng because it was a statically linked executable, then many of the other tests should have failed for the same reason. Yet, his .log file shows them all passing. I worry that somehow the testing logic is getting defeated...that those tests also should have failed but the detection logic isn't catching that.

markcmiller86 commented 8 months ago

Ok, yeah, I am seeing some issues with test logic.

jhendersonHDF commented 8 months ago

@markcmiller86 I mostly use CMake these days, but as far as I can tell the default in Autotools is to always build both shared and static (same as CMake), but link to the shared library by default if available. I believe the "-shared" tools concept has always been unique to the CMake builds, which complicates things further. When https://github.com/HDFGroup/hdf5/pull/4046 is merged, CMake and Autotools should have identical behavior, but that doesn't really help the current situation.