darshan-hpc / darshan

Darshan I/O characterization tool

Problems when instrumenting MPI applications with HDF5 at runtime #989

Open arcturus5340 opened 1 month ago

arcturus5340 commented 1 month ago

When attempting to instrument DLIO at runtime as follows:

$ env LD_PRELOAD=/home/user/darshan/darshan-runtime/install/lib/libdarshan.so mpirun -np 8 python -m src.dlio_benchmark workload=cosmoflow

I get the following error:

/bin/sh: symbol lookup error: /home/user/darshan/darshan-runtime/install/lib/libdarshan.so: undefined symbol: H5FDperform_init

I installed Darshan as follows:

$ ./configure --with-log-path=/home/user/darshan-logs --with-jobid-env=PBS_JOBID CC=mpicc --prefix=/home/user/darshan/darshan-runtime/install --enable-hdf5-mod --with-hdf5=/cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/HDF5/1.14.0-iimpi-2022a
$ make
$ make install

And in the output I got:

------------------------------------------------------------------------------
   Darshan Runtime Version 3.4.4 configured with the following features:
           MPI C compiler                - icc
           GCC-compatible compiler       - yes
           NULL          module support  - yes
           POSIX         module support  - yes
           STDIO         module support  - yes
           DXT           module support  - yes
           MPI-IO        module support  - yes
           AUTOPERF MPI  module support  - no
           AUTOPERF XC   module support  - no
           HDF5          module support  - yes (using HDF5 1.14.0)
           PnetCDF       module support  - no
           BG/Q          module support  - no
           Lustre        module support  - yes
           MDHIM         module support  - no
           HEATMAP       module support  - yes
           Memory alignment in bytes     - 8
           Log file env variables        - N/A
           Location of Darshan log files - /home/user/darshan-logs
           Job ID env variable           - PBS_JOBID
           MPI-IO hints                  - romio_no_indep_rw=true;cb_nodes=4

Which means that HDF5 was recognized during configuration (otherwise, how would it know the version?).

Next is the output of ldd libdarshan.so, which may prove useful:

    linux-vdso.so.1 (0x00007ffffb752000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x000014d1553bb000)
    librt.so.1 => /lib64/librt.so.1 (0x000014d1551b3000)
    libdl.so.2 => /lib64/libdl.so.2 (0x000014d154faf000)
    liblustreapi.so.1 => /lib64/liblustreapi.so.1 (0x000014d154d6f000)
    libm.so.6 => /lib64/libm.so.6 (0x000014d1549ed000)
    libz.so.1 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/zlib/1.2.12-GCCcore-11.3.0/lib/libz.so.1 (0x000014d15577c000)
    libmpifort.so.12 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.6.0-intel-compilers-2022.1.0/mpi/2021.6.0/lib/libmpifort.so.12 (0x000014d154639000)
    libmpi.so.12 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.6.0-intel-compilers-2022.1.0/mpi/2021.6.0/lib/release/libmpi.so.12 (0x000014d152df1000)
    libc.so.6 => /lib64/libc.so.6 (0x000014d152a2c000)
    /lib64/ld-linux-x86-64.so.2 (0x000014d1555db000)
    libjson-c.so.4 => /lib64/libjson-c.so.4 (0x000014d15281c000)
    liblnetconfig.so.4 => /lib64/liblnetconfig.so.4 (0x000014d1525f7000)
    libyaml-0.so.2 => /lib64/libyaml-0.so.2 (0x000014d1523d7000)
    libreadline.so.7 => /lib64/libreadline.so.7 (0x000014d152188000)
    libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x000014d151f84000)
    libgcc_s.so.1 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/GCCcore/11.3.0/lib64/libgcc_s.so.1 (0x000014d15575e000)

I will note that running DLIO + HDF5 without Darshan does not cause any problems:

$ mpirun -np 8 python -m src.dlio_benchmark workload=cosmoflow
[INFO] 2024-05-01T12:00:00.000000 Running DLIO with 8 process(es) [/rwthfs/rz/cluster/home/user/dlio_benchmark/src/dlio_benchmark.py:102]
...

I also tried running Darshan with a simple program using HDF5 (code here) and had no problems doing so. So the issue may be related to the fact that Darshan does not track H5FDperform_init.

shanedsnyder commented 1 month ago

Could you try our latest release (3.4.5) and see if you still have the issue? We reworked something in our HDF5 module that I think may resolve this issue.

arcturus5340 commented 1 month ago

I repeated the installation process as described above and now a different but similar error occurred:

/bin/sh: symbol lookup error: /home/kr166361/darshan/darshan-runtime/install/lib/libdarshan.so: undefined symbol: H5Eset_auto2

shanedsnyder commented 1 month ago

Hmm, maybe there's something still not quite right with how Darshan's HDF5 module interacts with HDF5 libraries at runtime. We've seen similar issues that we've tried to address in recent releases, but we may need to rethink things again. I'll see if I can reproduce this with DLIO and think more about it.

I think you could probably avoid the issue entirely by modifying your LD_PRELOAD setting to additionally reference the HDF5 library: export LD_PRELOAD=/path/to/libdarshan.so:/path/to/libhdf5.so
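Concretely, a sketch using the paths from this thread (the exact libhdf5.so location under the HDF5 prefix is an assumption; adjust to your site):

```shell
# Preload Darshan and the HDF5 runtime library together, so the loader can
# resolve Darshan's undefined H5* symbols. Paths below come from the
# configure line earlier in this thread; substitute your own.
DARSHAN_LIB=/home/user/darshan/darshan-runtime/install/lib/libdarshan.so
HDF5_LIB=/cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/HDF5/1.14.0-iimpi-2022a/lib/libhdf5.so
export LD_PRELOAD="$DARSHAN_LIB:$HDF5_LIB"

# then run the workload as before:
# mpirun -np 8 python -m src.dlio_benchmark workload=cosmoflow
```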

arcturus5340 commented 1 month ago

Thanks for your help!

hariharan-devarajan commented 1 month ago

I repeated the installation process as described above and now a different but similar error occurred:

/bin/sh: symbol lookup error: /home/kr166361/darshan/darshan-runtime/install/lib/libdarshan.so: undefined symbol: H5Eset_auto2

So, DLIO installs h5py, which is compiled against a specific HDF5 library, and you compiled Darshan against a specific HDF5 library as well. I suspect the HDF5 version behind h5py and the version Darshan wants might be different, causing this issue.

Docs: how to make sure you install h5py with the correct HDF5.

The main idea is to make sure that the HDF5 h5py was compiled with matches the one Darshan was compiled with.

arcturus5340 commented 1 month ago

I updated h5py as per the link you provided and reinstalled DLIO; the version of HDF5 used by the package then matched the one used by Darshan:

user@login18-2:~/dlio_benchmark[1011]$ python
Python 3.10.4 (main, Aug  9 2023, 13:18:35) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import h5py
>>> h5py.version.hdf5_version
'1.14.0'

However, in spite of this, the error persisted:

mpiexec: symbol lookup error: /home/user/darshan/darshan-runtime/install/lib/libdarshan.so: undefined symbol: H5Eset_auto2

hariharan-devarajan commented 1 month ago

Does ldd on libdarshan.so show an HDF5 .so, and if so, is it the same one you need? If not, you can LD_PRELOAD the HDF5 .so as well, alongside libdarshan.so.

arcturus5340 commented 1 month ago

No, there is no hdf5 in the ldd output:

$ ldd libdarshan.so
        linux-vdso.so.1 (0x000014ffadb02000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000014ffad6b6000)
        librt.so.1 => /lib64/librt.so.1 (0x000014ffad4ae000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000014ffad2aa000)
        liblustreapi.so.1 => /lib64/liblustreapi.so.1 (0x000014ffad06a000)
        libm.so.6 => /lib64/libm.so.6 (0x000014ffacce8000)
        libz.so.1 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/zlib/1.2.12-GCCcore-11.3.0/lib/libz.so.1 (0x000014ffada6d000)
        libmpifort.so.12 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.6.0-intel-compilers-2022.1.0/mpi/2021.6.0/lib/libmpifort.so.12 (0x000014ffac934000)
        libmpi.so.12 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.6.0-intel-compilers-2022.1.0/mpi/2021.6.0/lib/release/libmpi.so.12 (0x000014ffab0ec000)
        libc.so.6 => /lib64/libc.so.6 (0x000014ffaad27000)
        /lib64/ld-linux-x86-64.so.2 (0x000014ffad8d6000)
        libjson-c.so.4 => /lib64/libjson-c.so.4 (0x000014ffaab17000)
        liblnetconfig.so.4 => /lib64/liblnetconfig.so.4 (0x000014ffaa8f2000)
        libyaml-0.so.2 => /lib64/libyaml-0.so.2 (0x000014ffaa6d2000)
        libreadline.so.7 => /lib64/libreadline.so.7 (0x000014ffaa483000)
        libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x000014ffaa27f000)
        libgcc_s.so.1 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/GCCcore/11.3.0/lib64/libgcc_s.so.1 (0x000014ffada51000)
        libtinfo.so.6 => /lib64/libtinfo.so.6 (0x000014ffaa052000)

However, after adding the path to HDF5 in LD_PRELOAD, everything works fine. Thanks!

hariharan-devarajan commented 1 month ago

I think if you compile Darshan with HDF5, the HDF5 .so should be linked into libdarshan.so. Maybe it is still a bug. @shanedsnyder thoughts?

shanedsnyder commented 1 month ago

I'll have to dig into it more, but you may be on to something @hariharan-devarajan -- some improper linking of HDF5 could be leading to this error. It is a little tricky though, in that we really don't want the HDF5 library Darshan is using to override what the user wants. E.g., if Darshan was built against a 1.12.x version of HDF5, but the user is trying to build an app against a newer 1.14.x version, then we obviously need to be careful that the 1.12.x libraries aren't used at runtime. I think that's part of the reason that ldd doesn't show HDF5 libraries, as we are intentionally hoping the user provides them at link time. Perhaps this leads to different behavior depending on whether LD_PRELOAD is used or whether Darshan is directly linked into the application (which can't be done with Python).

I'll leave the issue open so I don't forget to investigate. In the meantime, being careful to set LD_PRELOAD to point to both libraries seems to be the way to go.

hariharan-devarajan commented 1 month ago

Additionally, consider incorrect linking at runtime. I think you need an ABI compatibility check (e.g., via libtool versioning) to ensure they match. In general, if you use the C interface of HDF5, a version mismatch won't break things, but I think you should link Darshan with the HDF5 it was compiled with; otherwise it confuses people about which version Darshan needs (or was compiled with). HDF5 also has macros, as I remember, to do a check at runtime as well. I believe this would need some work to make sure the stack has a consistent view of the libraries to be loaded/needed.
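The runtime check mentioned above is what HDF5's H5check_version() does at library initialization: roughly, the major.minor version the application was compiled against must match the library loaded at runtime, while the patch (release) number may differ. That policy can be sketched as (illustrative only, not HDF5's actual code):

```shell
# Sketch of an HDF5-style version compatibility check: major.minor must
# match; the patch (release) number may differ.
abi_compatible() {
  # $1 = version compiled against, $2 = version loaded at runtime
  # ${x%.*} strips the trailing ".patch", leaving "major.minor"
  [ "${1%.*}" = "${2%.*}" ]
}

abi_compatible "1.14.0" "1.14.3" && echo "1.14.0 vs 1.14.3: compatible"
abi_compatible "1.12.2" "1.14.0" || echo "1.12.2 vs 1.14.0: incompatible"
```

By this policy the 1.12.x-vs-1.14.x scenario from the earlier comment would be rejected at startup, which is exactly why the stack needs a consistent view of which HDF5 library gets loaded.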