anderkve / gambit_np


HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0 [...] evict on close is currently not supported in parallel HDF5] #8

Closed fzeiser closed 3 years ago

fzeiser commented 3 years ago

Trying to run GAMBIT on Fram, I get the following error:

(py3.8) [fabiobz@login2.FRAM /cluster/projects/nn9464k/progs/gambit_np]$ ./gambit -rf yaml_files/NuclearBit_demo.yaml 

GAMBIT 1.5.0
http://gambit.hepforge.org

HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5Pfapl.c line 4536 in H5Pset_evict_on_close(): evict on close is currently not supported in parallel HDF5
    major: Property lists
    minor: Feature is unsupported
Abort(2664079) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1421): 
MPIDU_bc_table_create(311)...: 
Logger was never initialised! Creating default log messenger...
[login2:187279:0:187279] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid: 187279) ====
 0 0x0000000000050ba5 ucs_debug_print_backtrace()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/debug/debug.c:625
 1 0x00000000000d0b52 std::_Rb_tree_insert_and_rebalance()  /cluster/work/users/vegarde/build/GCCcore/9.3.0/system-system/gcc-9.3.0/stage3_obj/x86_64-pc-linux-gnu/libstdc++-v3/src/c++98/../../../../../libstdc++-v3/src/c++98/tree.cc:235
 2 0x0000000000dc5457 Gambit::exception::exception()  ???:0
 3 0x0000000000dcd940 Gambit::error::error()  ???:0
 4 0x0000000000e2fee2 Gambit::utils_error()  ???:0
 5 0x0000000000dd944c Gambit::GMPI::Comm::Comm()  ???:0
 6 0x0000000000e3fa1e Gambit::Utils::runtime_scratch[abi:cxx11]()  ???:0
 7 0x0000000000dad035 Gambit::Logging::LogMaster::emit_backlog()  :0
 8 0x0000000000dab8b3 Gambit::Logging::LogMaster::~LogMaster()  :0
 9 0x0000000000039c99 __run_exit_handlers()  :0
10 0x0000000000039ce7 __GI_exit()  :0
11 0x0000000000a57206 MPL_exit()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/src/mpl/../../../../src/mpl/src/msg/mpl_msg.c:90
12 0x00000000001837df MPID_Abort()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_globals.c:153
13 0x0000000000279107 MPIR_Handle_fatal_error()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/errhan/errutil.c:457
14 0x0000000000400e40 PMPI_Init()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/init/init.c:157
15 0x0000000000de8b2b Gambit::GMPI::Init()  ???:0
16 0x00000000004ffd5c main()  :0
17 0x0000000000022505 __libc_start_main()  ???:0
18 0x00000000004ffbe9 _start()  ???:0
=================================
Segmentation fault (core dumped)

Compiler settings:

module load Python/3.8.2-GCCcore-9.3.0 CMake/3.16.4-GCCcore-9.3.0 Boost/1.72.0-iimpi-2020a h5py/2.10.0-intel-2020a-Python-3.8.2 GSL/2.6-GCC-9.3.0 matplotlib/3.2.1-intel-2020a-Python-3.8.2 SciPy-bundle/2020.03-intel-2020a-Python-3.8.2 ScaLAPACK/2.1.0-gompi-2020a Eigen/3.3.7 iimpi/2020a intel/2020a
module remove OpenMPI/4.0.3-GCC-9.3.0
source /cluster/projects/nn9464k/progs/py3.8/bin/activate

export PYTHON_ENV=/cluster/projects/nn9464k/progs/py3.8
export PYTHON_EXECUTABLE=$PYTHON_ENV/bin/python3.8
export PYTHON_INCLUDE_DIR=/cluster/software/Python/3.8.2-GCCcore-9.3.0/include/python3.8
export PYTHON_LIBRARY=/cluster/software/Python/3.8.2-GCCcore-9.3.0/lib/libpython3.8.so.1.0
export EIGEN3_INCLUDE_DIR=/cluster/software/Eigen/3.3.7/include

cmake -D PYTHON_EXECUTABLE=$PYTHON_EXECUTABLE -D PYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR -D PYTHON_LIBRARY=$PYTHON_LIBRARY -Ditch="Mathematica;great;ColliderBit;CosmoBit;DarkBit;DecayBit;ExampleBit_A;ExampleBit_B;FlavBit;NeutrinoBit;PrecisionBit;SpecBit" -DWITH_MPI=ON -DWITH_HEPMC=Off -DWITH_ROOT=Off -D EIGEN3_INCLUDE_DIR=$EIGEN3_INCLUDE_DIR -D CMAKE_C_COMPILER=mpiicc -D CMAKE_CXX_COMPILER=mpiicpc -D CMAKE_Fortran_COMPILER=mpiifort ..

fzeiser commented 3 years ago

It might be wrong to set the compilers to mpiicc instead of icc (...), but I'm not sure whether this is the reason for the error. I initially did this when I had the problems in #7, but as I found out, that was probably not the issue / not necessary.

Any ideas @anderkve?

fzeiser commented 3 years ago

E.g. which version of HDF5 do you use on the cluster where you run GAMBIT?

fzeiser commented 3 years ago

Recompiled with icc instead of mpiicc, but, as somewhat expected, that did not help.

fzeiser commented 3 years ago

Interesting :)

https://github.com/anderkve/gambit_np/blob/3183fff753376523730eda1920711f221224a9b5/Printers/src/printers/hdf5printer/hdf5tools.cpp#L53-L60

Also attaching the CMakeCache.txt, with the relevant section on the HDF5 library:

//HDF5 C Wrapper compiler.  Used only to detect HDF5 compile flags.
HDF5_C_COMPILER_EXECUTABLE:FILEPATH=/cluster/software/HDF5/1.10.6-iimpi-2020a/bin/h5pcc

//Path to a library.
HDF5_C_LIBRARY_dl:FILEPATH=/usr/lib64/libdl.so

//Path to a library.
HDF5_C_LIBRARY_hdf5:FILEPATH=/cluster/software/HDF5/1.10.6-iimpi-2020a/lib/libhdf5.so

//Path to a library.
HDF5_C_LIBRARY_iomp5:FILEPATH=/cluster/software/imkl/2020.1.217-iimpi-2020a/lib/intel64/libiomp5.so

//Path to a library.
HDF5_C_LIBRARY_m:FILEPATH=/usr/lib64/libm.so

//Path to a library.
HDF5_C_LIBRARY_pthread:FILEPATH=/usr/lib64/libpthread.so

//Path to a library.
HDF5_C_LIBRARY_sz:FILEPATH=/cluster/software/Szip/2.1.1-GCCcore-9.3.0/lib/libsz.so

//Path to a library.
HDF5_C_LIBRARY_z:FILEPATH=/cluster/software/zlib/1.2.11-GCCcore-9.3.0/lib/libz.so

//HDF5 file differencing tool.
HDF5_DIFF_EXECUTABLE:FILEPATH=/cluster/software/HDF5/1.10.6-iimpi-2020a/bin/h5diff

//The directory containing a CMake configuration file for HDF5.
HDF5_DIR:PATH=HDF5_DIR-NOTFOUND
anderkve commented 3 years ago

So the mpiicc vs icc choice should not be the problem, I think. On the current cluster where I run GAMBIT I use icc (etc), and I'm pretty sure that's what we've done on other clusters as well.

A more likely source of the problem is the fact that the hdf5 library can be built in either parallel or serial mode. It looks like the version you are using on Fram is the parallel one.

The hdf5 output in GAMBIT is serial (everything is written to a single hdf5 file), and we have had problems with this system running with the parallel build of hdf5. (We've had other problems as well, but that's a different story... :P ) Is there an option to use the serial hdf5 on fram? Or if not, perhaps you can build it yourself?

On the cluster I currently run GAMBIT, I use hdf5 1.8.20 (serial), and on my laptop I use hdf5 1.10.6+repack-2 (serial).

One more thing: If you are running with many MPI processes, and each parameter point is very fast, the hdf5 file writing can become a bottleneck. In this case it can help to increase the buffer length of the GAMBIT Printer system. Just add a buffer_length option in the list of hdf5 printer options in the yaml file. (The default value for the hdf5 printer is buffer_length: 1000)
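For concreteness, the relevant Printer block in the yaml file would then look roughly like this (a sketch following the layout of the standard example yaml files; the output_file and group values here are just illustrative):

Printer:
  printer: hdf5
  options:
    output_file: "results.hdf5"   # illustrative file name
    group: "/data"                # illustrative group name
    buffer_length: 10000          # default is 1000; increase it if file writing becomes a bottleneck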

fzeiser commented 3 years ago

The hdf5 output in GAMBIT is serial (everything is written to a single hdf5 file), and we have had problems with this system running with the parallel build of hdf5. (We've had other problems as well, but that's a different story... :P ) Is there an option to use the serial hdf5 on fram? Or if not, perhaps you can build it yourself?

Thanks, this confirms my suspicion. I currently use a parallel build of hdf5. I will try to compile it in serial mode myself (it doesn't seem to be provided as a "standard"). I just really wanted to hear from you what system you use before I spent a variable (hopefully, but not necessarily, short) amount of time on getting the prerequisites for the hdf5 compilation :).

anderkve commented 3 years ago

Yeah, makes sense. :) Hope it's not too horrible to get working...

fzeiser commented 3 years ago

I built HDF5 without the parallel option (setting usempi=False in the easybuild toolchain). I also checked with h5cc -showconfig that the Parallel HDF5 feature is disabled.

However, I now receive the following error:

(py3.8) [fabiobz@login2.FRAM /cluster/projects/nn9464k/progs/gambit_np]$ ./gambit -rf yaml_files/NuclearBit_demo.yaml 

GAMBIT 1.5.0
http://gambit.hepforge.org

Abort(2664079) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1421): 
MPIDU_bc_table_create(311)...: 
Logger was never initialised! Creating default log messenger...

Gambit has encountered an uncaught error during initialisation.

Check the output logs for details.

(Check your yaml file if you can't recall where the logs are.)

what(): GAMBIT error
ERROR: A problem has been raised by one of the utility codes.
Error creating Comm object (wrapper for MPI communicator)! MPI has not been initialised!
Raised at: line 59 in function Gambit::GMPI::Comm::Comm() of /cluster/projects/nn9464k/progs/gambit_np/Utils/src/mpiwrapper.cpp.

Something that I find odd here is that the error comes up even if I only request the cout logger in gambit. Does that make sense / could this still be related to the hdf5 issue? I realized that I didn't check whether there is an in-between mode, building with mpi support but without parallel (though I wouldn't know what that would mean :).

Here's the CMakeCache.txt.

[+ As suspected from the HDF5 path used previously (very first comments), the previous HDF5 version I used had Parallel HDF5: yes]

fzeiser commented 3 years ago

Even though I don't know why, it may well be related to my HDF5 installation (and h5py). At least I see that I can't run primitive examples like this with the current combination of HDF5 and h5py.
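For reference, a minimal write/read round trip of the kind I mean (a sketch, not the exact linked example; the file name is arbitrary):

import numpy as np
import h5py

# write a small dataset, then read it back
with h5py.File("h5py_smoketest.hdf5", "w") as f:
    f.create_dataset("test", data=np.arange(10))

with h5py.File("h5py_smoketest.hdf5", "r") as f:
    print(f["test"][:])   # expect [0 1 2 ... 9]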

fzeiser commented 3 years ago

Reinstalled h5py (and unloaded the module beforehand). Now both the C and Python examples from HDF5 work. Will try to recompile GAMBIT again (...).

fzeiser commented 3 years ago

Unfortunately, it didn't help to have the updated HDF5 and h5py libraries. -- Though, as I said, the error looks somewhat different. Any ideas on what I can try in order to pinpoint the error?

anderkve commented 3 years ago

About the hdf5 serial/parallel: It is perhaps equivalent to adjusting the usempi setting like you did, but I noticed that the "official" hdf5 configure option is --enable-parallel: https://support.hdfgroup.org/HDF5/faq/parallel.html

Though, as you say, the fact that you get the above error even with the cout printer hints that the current problem is no longer related to hdf5... I don't think I've ever seen the error above before. :P

I noticed the above error happens when you call GAMBIT directly as ./gambit -rf yaml_files/NuclearBit_demo.yaml. What happens if you call it with mpirun or mpiexec? E.g. mpiexec -np 2 ./gambit -rf yaml_files/NuclearBit_demo.yaml.

Also, I assume silly mpi tests like just doing mpiexec -np 2 echo "hello" work as expected?
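A test that actually initialises MPI is a bit stronger than echo. For instance, if mpi4py happens to be installed in the virtualenv (it is not needed by GAMBIT, this is just a quick independent check), something like the sketch below should print two different ranks:

# save as mpi_hello.py, then run: mpiexec -np 2 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("hello from rank", comm.Get_rank(), "of", comm.Get_size())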

anderkve commented 3 years ago

Also, from the CMakeCache.txt file it looks like a lot of the loaded packages have been compiled with gcc (the string GCCcore-9.3.0 appears in a lot of paths), including Python 3.8. It's probably not related, but given that you probably have loaded some Intel compiler module, I would have expected the module system to use packages compiled with the Intel compilers...

One fallback option you could try is of course to not use the Intel compilers, but rather plain ol' gcc. Would expect slightly worse performance, but might be less hassle to get working.

fzeiser commented 3 years ago

About the hdf5 serial/parallel: It is perhaps equivalent to adjusting the usempi setting like you did, but I noticed that the "official" hdf5 configure option is --enable-parallel: https://support.hdfgroup.org/HDF5/faq/parallel.html

As far as I've read, newer versions of hdf5 should automatically recognize whether MPI support is available and set the flags accordingly.

Anyhow, from the easybuild config I get the following:

        # MPI and C++ support enabled requires --enable-unsupported, because this is untested by HDF5
        # also returns False if MPI is not supported by this toolchain
        if self.toolchain.options.get('usempi', None):
            self.cfg.update('configopts', "--enable-unsupported --enable-parallel")
            mpich_mpi_families = [toolchain.INTELMPI, toolchain.MPICH, toolchain.MPICH2, toolchain.MVAPICH2]
            if self.toolchain.mpi_family() in mpich_mpi_families:
                self.cfg.update('buildopts', 'CXXFLAGS="$CXXFLAGS -DMPICH_IGNORE_CXX_SEEK"')
            # Skip MPI cxx extensions to avoid hard dependency
            if self.toolchain.mpi_family() == toolchain.OPENMPI:
                self.cfg.update('buildopts', 'CXXFLAGS="$CXXFLAGS -DOMPI_SKIP_MPICXX"')
        else:
            self.cfg.update('configopts', "--disable-parallel")

As mentioned, I checked with h5cc -showconfig that the Parallel HDF5 feature is disabled (as expected from my config file).

Though, as you say, the fact that you get the above error even with the cout printer hints that the current problem is no longer related to hdf5... I don't think I've ever seen the error above before. :P

I noticed the above error happens when you call GAMBIT directly as ./gambit -rf yaml_files/NuclearBit_demo.yaml. What happens if you call it with mpirun or mpiexec? E.g. mpiexec -np 2 ./gambit -rf yaml_files/NuclearBit_demo.yaml.

The error is just slightly different, but it also includes something about Python :/

(py3.8) [fabiobz@login3.FRAM /cluster/projects/nn9464k/progs/gambit_np]$ mpiexec -np 2 ./gambit -rf yaml_files/NuclearBit_demo.yaml
Fatal Python error: init_sys_streams: <stdin> is a directory, cannot continue
Python runtime state: core initialized

Current thread 0x00002aff7b610500 (most recent call first):
<no Python frame>

GAMBIT 1.5.0
http://gambit.hepforge.org

Logger was never initialised! Creating default log messenger...
terminate called after throwing an instance of 'Gambit::exception'
  what():  GAMBIT error
ERROR: A problem has been raised by one of the utility codes.
Error creating Comm object (wrapper for MPI communicator)! MPI has not been initialised!
Raised at: line 59 in function Gambit::GMPI::Comm::Comm() of /cluster/projects/nn9464k/progs/gambit_np/Utils/src/mpiwrapper.cpp.
[...]

Also, I assume silly mpi tests like just doing mpiexec -np 2 echo "hello" work as expected?

Yes, they do

fzeiser commented 3 years ago

Also, from the CMakeCache.txt file it looks like a lot of the loaded packages have been compiled with gcc (the string GCCcore-9.3.0 appears in a lot of paths), including Python 3.8. It's probably not related, but given that you probably have loaded some Intel compiler module, I would have expected the module system to use packages compiled with the Intel compilers...

One fallback option you could try is of course to not use the Intel compilers, but rather plain ol' gcc. Would expect slightly worse performance, but might be less hassle to get working.

Yes, I can give it a try. The reason that I have several things compiled with gcc and others with intel is that I tried to use the precompiled modules existing on Fram as far as possible. I can try to find a full set of gcc modules and recompile GAMBIT with them. Will report on the status.

fzeiser commented 3 years ago

Finally, I'm starting to get there. CMakeCache.txt

It runs almost as I'd like. Whether I use ./gambit [...] or mpiexec -np 2 [...], gambit will take forever(?) in Calling MPI_Finalize.... This is regardless of whether I use the hdf5 printer. I killed the process after 2 or 3 minutes.

Here is the output (when asking for the cout printer only):

(py3.8) [fabiobz@login3.FRAM /cluster/projects/nn9464k/progs/gambit_np]$ ./gambit -rf yaml_files/NuclearBit_demo.yaml

GAMBIT 1.5.0
http://gambit.hepforge.org

Descriptions are missing for the following models:
   GenericModel5
   GenericModel10
   GenericModel15
   GenericModel20
   GSFModel20
   GSF_GLO_CT_Model20
   GSF_EGLO_CT_Model20
   GSF_MGLO_CT_Model20
   GSF_GH_CT_Model20
   GSF_constantM1
   NLDModelCT_and_discretes
   NLDModelBSFG_and_discretes
Please add descriptions of these to /cluster/projects/nn9464k/progs/gambit_np/config/models.dat
Descriptions are missing for the following capabilities:
   GSFModel20_parameters
   GSF_EGLO_CT_Model20_parameters
   GSF_GH_CT_Model20_parameters
   GSF_GLO_CT_Model20_parameters
   GSF_MGLO_CT_Model20_parameters
   GSF_constantM1_parameters
   GenericModel10_parameters
   GenericModel15_parameters
   GenericModel20_parameters
   GenericModel5_parameters
   NLDModelBSFG_and_discretes_parameters
   NLDModelCT_and_discretes_parameters
   gledeliBE_1_0_init
   gledeliBE_get_results
   gledeliBE_run
   gledeliBE_set_model_names
   gledeliBE_set_model_pars
   gledeliLogLike
   gledeliResults
   zeroLogLike
Please add descriptions of these to /cluster/projects/nn9464k/progs/gambit_np/config/capabilities.dat

Starting GAMBIT
----------
Running in MPI-parallel mode with 1 processes
----------
Running with 1 OpenMP threads per MPI process (set by the environment variable OMP_NUM_THREADS).
YAML file: yaml_files/NuclearBit_demo.yaml
Initialising logger...  log_debug_messages = true; log messages tagged as 'Debug' WILL be logged. 
WARNING: This may lead to very large log files!
Resolving dependencies and backend requirements.  Hang tight...
...done!
Starting scan.
ScannerBit is waiting for all MPI processes to join the scan...
  All processes ready!
Entering random sampler.
  number of points to calculate:  5
MPI process rank: 0
[...]

0, 5: pointID: 5
0, 5: MPIrank: 0
ScannerBit is waiting for all MPI processes to report their shutdown condition...

GAMBIT has finished successfully!

Calling MPI_Finalize...
anderkve commented 3 years ago

Interestingly, I find only MPIrank: 0 in the output. I would have expected ranks 0 and 1 when using -np 2

Thanks -- I remember spotting this issue in GAMBIT a little while ago, but we haven't had time to fix it yet. It seems to only be a problem with the cout printer; the logs are still correct and show how parameter points were actually distributed across the MPI processes. (The hdf5 printer should work fine.)

gambit will take forever(?) in Calling MPI_Finalize...

This sounds familiar. Will have a look in the main GAMBIT repo and emails to see if this has come up before.

anderkve commented 3 years ago

Which scanner are you using when you see the MPI_Finalize problem?

fzeiser commented 3 years ago

Which scanner are you using when you see the MPI_Finalize problem?

I use random.

fzeiser commented 3 years ago

Interestingly, I find only MPIrank: 0 in the output. I would have expected ranks 0 and 1 when using -np 2

Thanks -- I remember spotting this issue in GAMBIT a little while ago, but we haven't had time to fix it yet. It seems to only be a problem with the cout printer; the logs are still correct and show how parameter points were actually distributed across the MPI processes. (The hdf5 printer should work fine.)

Yes, it probably does - if GAMBIT were to finalize and then write out the hdf5 file...

fzeiser commented 3 years ago

Trying to run a different scanner, diver, but I get the following error:

Starting GAMBIT
----------
Running in MPI-parallel mode with 2 processes
----------
Running with 2 OpenMP threads per MPI process (set by the environment variable OMP_NUM_THREADS).
YAML file: yaml_files/NuclearBit_demo.yaml
Initialising logger...  log_debug_messages = true; log messages tagged as 'Debug' WILL be logged. 
WARNING: This may lead to very large log files!
Resolving dependencies and backend requirements.  Hang tight...
...done!
Starting scan.
ScannerBit is waiting for all MPI processes to join the scan...
  All processes ready!

 FATAL ERROR

GAMBIT has exited with fatal exception: GAMBIT error
ERROR: A problem has been raised by ScannerBit.
Cannot load /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/lib/libscanner_diver_1.0.4.so:  /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/installed/diver/1.0.4/lib/libdiver.so: undefined symbol: __svml_exp2

Raised at: line 136 in function const std::map<Gambit::type_index, void*>& Gambit::Scanner::Plugins::Plugin_Interface_Base::initPlugin(const string&, const string&, const plug_args& ...) [with plug_args = {unsigned int, Gambit::Scanner::Factory_Base}; std::string = std::__cxx11::basic_string<char>] of /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/include/gambit/ScannerBit/plugin_interface.hpp.

 FATAL ERROR

GAMBIT has exited with fatal exception: GAMBIT error
ERROR: A problem has been raised by ScannerBit.
Cannot load /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/lib/libscanner_diver_1.0.4.so:  /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/installed/diver/1.0.4/lib/libdiver.so: undefined symbol: __svml_exp2

Raised at: line 136 in function const std::map<Gambit::type_index, void*>& Gambit::Scanner::Plugins::Plugin_Interface_Base::initPlugin(const string&, const string&, const plug_args& ...) [with plug_args = {unsigned int, Gambit::Scanner::Factory_Base}; std::string = std::__cxx11::basic_string<char>] of /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/include/gambit/ScannerBit/plugin_interface.hpp.
Calling MPI_Finalize...
[1611830459.506291] [login2:127955:0]          mpool.c:43   UCX  WARN  object 0x2aff40cf3fc0 was not returned to mpool ucp_am_bufs
[1611830459.506448] [login2:127954:0]          mpool.c:43   UCX  WARN  object 0x2af74994afc0 was not returned to mpool ucp_am_bufs
^C^C(py3.8) [fabiobz@login2.FRAM /cluster/projects/nn9464k/progs/gambit_np]$ htop -u fabiobz
fzeiser commented 3 years ago

So, here are the latest results:

Final part of the screen output for diver on NuclearBit:

Total log-likelihood: -98687.484

0, 105: LogLike: -98687.484
0, 105: pointID: 105
0, 105: MPIrank: 0
Total log-likelihood: -747255.12

0, 105: LogLike: -747255.12
0, 105: pointID: 105
0, 105: MPIrank: 0
 =============================
 Number of civilisations:   1
 Best final vector:   0.59316962200323409       0.23693094182044908       0.25139732878193072       0.92610804446128703       0.85300449678139900        7.3303319158575964E-002  0.22891069697816635       0.44147227512403214       0.97552919102304880       0.76517830515739937       0.70755640175378609       0.85544590077746074       0.87329605199479787       0.75578007719621954       0.67124449566223388       0.63825974634857174       0.43559964516341404       0.77788819559293121        8.7104682688863094E-002  0.23344629862184224       0.61485082727911233       0.98532603580493405     
 Value at best final vector:    559615.70340478991     
 Total Function calls:          210
 Total seconds for process   0:      14.45
 Total seconds for process   1:      14.48
Diver run finished!
ScannerBit is waiting for all MPI processes to report their shutdown condition...
Diver run finished!

GAMBIT has finished successfully!

Calling MPI_Finalize...

"Funnily" enough the log message finished with:

--<>--<>--<>--<>--<>--<>--<>--
(Fri Jan 29 11:19:21 2021)(15.8083 [s])(Rank 0)[Default]:
Calling MPI_Finalize...
--<>--<>--<>--<>--<>--<>--<>--
(Fri Jan 29 11:19:21 2021)(15.9657 [s])(Rank 0)[Default]:
MPI successfully finalized!
--<>--<>--<>--<>--<>--<>--<>--
fzeiser commented 3 years ago

Trying to run spartan [after rebuilding with ExampleBit_A]. Note that I get the error even though I deleted the run folder in advance:

 FATAL ERROR

GAMBIT has exited with fatal exception: GAMBIT error
ERROR: A problem has occurred in the printer utilities.
Failed to open existing HDF5 file, then failed to create new one! (/cluster/projects/nn9464k/progs/gambit_np/runs/spartan/samples//results.hdf5). The file may exist but be unreadable. You can check this by trying to inspect it with the 'h5ls' command line tool.
Raised at: line 236 in function hid_t Gambit::Printers::HDF5::openFile(const string&, bool, bool&, char) of /cluster/projects/nn9464k/progs/gambit_np/Printers/src/printers/hdf5printer/hdf5tools.cpp.
Calling MPI_Finalize...

GAMBIT has finished successfully!

Calling MPI_Finalize...

fzeiser commented 3 years ago

@anderkve: After the test with spartan I went back to NuclearBit:

I can successfully finish the run if I select printer: none. In contrast to spartan, it would not finish for cout with diver before, but with the current compilation printer: cout works with both diver and random. So probably it's really an issue "just" with the hdf5 printer/compilation/...!

I'm very sorry for all the fuss. I promise that I tried to keep things clean and reproducible, but obviously I didn't manage.

fzeiser commented 3 years ago

Even though I don't know why, it may well be related to my HDF5 installation (and h5py). At least I see that I can't run primitive examples like this with the current combination of HDF5 and h5py.

Just checked, and this is still true. Do you have an idea of what exactly about hdf5 I could test?

anderkve commented 3 years ago

Hmm, this is very strange. Good thing it's narrowed down to hdf5/compilation at least.

First, note that we don't actually need h5py to run GAMBIT, so if that module puts any constraints on what versions of the hdf5 module you can use, I'd try to not load it.

So given that you now use only GNU-versions of the modules, I guess you are running the HDF5/1.10.6-gompi-2020a module? ( I'm looking at the list here: https://documentation.sigma2.no/software/installed_software/fram_modules.html ) You could perhaps try to use a different version? On the previous two clusters we've used GAMBIT on, we have used hdf5 modules named hdf5/1.8.20 (pretty sure this was an Intel-compiled one) and hdf5/1.10.4--intelmpi--2018--binary. I notice the hdf5 v1.8.20 is missing on Fram, but perhaps try out one of the 1.8.19 ones? (I'm guessing the foss versions are GNU-compiled versions?)

I will keep looking in GAMBIT to see if I can think of anything. I'm really confused by this one...

fzeiser commented 3 years ago

Hmm, this is very strange. Good thing it's narrowed down to hdf5/compilation at least.

First, note that we don't actually need h5py to run GAMBIT, so if that module puts any constraints on what versions of the hdf5 module you can use, I'd try to not load it.

Ehm, don't know, but I guess we need at least hdf5 working, right? I'll check once again that h5py is currently working in its principal features... -> Checked: it seems to work for some primitive examples.

So given that you now use only GNU-versions of the modules, I guess you are running the HDF5/1.10.6-gompi-2020a module? ( I'm looking at the list here: https://documentation.sigma2.no/software/installed_software/fram_modules.html ) You could perhaps try to use a different version? On the previous two clusters we've used GAMBIT on, we have used hdf5 modules named hdf5/1.8.20 (pretty sure this was an Intel-compiled one) and hdf5/1.10.4--intelmpi--2018--binary. I notice the hdf5 v1.8.20 is missing on Fram, but perhaps try out one of the 1.8.19 ones? (I'm guessing the foss versions are GNU-compiled versions?)

I've been writing a lot, but if you look at the latest CMakeCache, it contains the lines:

//HDF5 C Wrapper compiler.  Used only to detect HDF5 compile flags.
HDF5_C_COMPILER_EXECUTABLE:FILEPATH=/cluster/projects/nn9464k/progs/easybuild/software/HDF5/1.10.6-GCC-9.3.0/bin/h5cc

which shows that I use a (recompiled) version of HDF5/1.10.6 that I compiled with GCC-9.3.0. The precompiled versions on Fram have the parallel option, which I thought you said would not work. So I just recompiled it; as I mentioned, the "basic" functionality works with it. I can easily try hdf5/1.8.20 and see if it makes a difference.

I will keep looking in GAMBIT to see if I can think of anything. I'm really confused by this one...

Thanks, and sorry for the work!

anderkve commented 3 years ago

Ehm, don't know, but I guess we need at least hdf5 working, right?

Yep. And given that h5py seems to work, the problem almost certainly is not connected to the h5py module. Thanks for checking again.

which shows that I use a (recompiled) version of HDF5/1.10.6 that I compiled with GCC-9.3.0.

Ah, of course, sorry! :P I forgot that you had recompiled this yourself to get the serial option.

The precompiled versions on Fram have the parallel option, which I thought you said would not work.

At least we've had trouble getting this to work on other clusters -- then the serial version has been the solution. (Which also makes sense, since the hdf5 printer in GAMBIT only writes serially.) But of course, every new cluster has its own quirks...

fzeiser commented 3 years ago

which shows that I use a (recompiled) version of HDF5/1.10.6 that I compiled with GCC-9.3.0. The precompiled versions on Fram have the parallel option, which I thought you said would not work. So I just recompiled it; as I mentioned, the "basic" functionality works with it. I can easily try hdf5/1.8.20 and see if it makes a difference.

Unfortunately I get the same behavior with hdf5/1.8.20, which I had just compiled in serial mode for Fram.

I double-checked h5py. There is a test suite, and all the tests run fine (albeit after installing pytest-mpi :), a known issue):

import h5py
h5py.run_tests()

I reinstalled h5py to make sure it's installed based on the correct hdf5 version etc. But no change there, either.

But of course, every new cluster has its own quirks...

Yes :disappointed:

fzeiser commented 3 years ago

Btw: I get a lot of warnings of this type when compiling with gcc:

In file included from /cluster/software/OpenMPI/4.0.3-GCC-9.3.0/include/openmpi/ompi/mpi/cxx/mpicxx.h:277,
                 from /cluster/software/OpenMPI/4.0.3-GCC-9.3.0/include/mpi.h:2868,
                 from /cluster/projects/nn9464k/progs/gambit_np/Utils/include/gambit/Utils/mpiwrapper.hpp:53,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/include/gambit/Printers/baseprintermanager.hpp:25,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/include/gambit/Printers/printer_id_tools.hpp:18,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/include/gambit/Printers/basebaseprinter.hpp:46,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/include/gambit/Printers/baseprinter.hpp:29,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/include/gambit/Printers/printers/sqlitereader.hpp:22,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/src/printers/sqliteprinter/retrieve_overloads.cpp:20:
/cluster/software/OpenMPI/4.0.3-GCC-9.3.0/include/openmpi/ompi/mpi/cxx/op_inln.h: In member function ‘virtual void MPI::Op::Init(void (*)(const void*, void*, int, const MPI::Datatype&), bool)’:
/cluster/software/OpenMPI/4.0.3-GCC-9.3.0/include/openmpi/ompi/mpi/cxx/op_inln.h:121:46: warning: cast between incompatible function types from ‘void (*)(void*, void*, int*, ompi_datatype_t**, void (*)(void*, void*, int*, ompi_datatype_t**))’ to ‘void (*)(void*, void*, int*, ompi_datatype_t**)’ [-Wcast-function-type]
  121 |     (void)MPI_Op_create((MPI_User_function*) ompi_mpi_cxx_op_intercept,
      |                                              ^~~~~~~~~~~~~~~~~~~~~~~~~
/cluster/software/OpenMPI/4.0.3-GCC-9.3.0/include/openmpi/ompi/mpi/cxx/op_inln.h:123:59: warning: cast between incompatible function types from ‘void (*)(const void*, void*, int, const MPI::Datatype&)’ to ‘void (*)(void*, void*, int*, ompi_datatype_t**)’ [-Wcast-function-type]
  123 |     ompi_op_set_cxx_callback(mpi_op, (MPI_User_function*) func);
fzeiser commented 3 years ago

Wait, wait wait: I think it has just finished as it should! :rocket: :astonished: :grinning:

I think the following change did the trick: compiling h5py against the HDF5 library version that I want/have, then recompiling GAMBIT with this. I still can't believe it :).
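For reference, one generic way to build h5py from source against a specific HDF5 installation looks roughly like this (a sketch with illustrative paths and version pin, not a transcript of my exact commands):

# point the h5py build at the chosen (serial) HDF5 installation; the path is illustrative
export HDF5_DIR=/path/to/serial/hdf5
# force a from-source build of h5py against that library (version pinned just as an example)
pip install --no-binary=h5py h5py==2.10.0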

I don't think there is any big advantage to a more up-to-date hdf5 version than 1.8.20, is there? So I'll just leave the configs as they are, now that it works :)

fzeiser commented 3 years ago

Thank you very much for your help @anderkve!

anderkve commented 3 years ago

Great stuff! Thanks a lot for figuring this out!