Closed: fzeiser closed this issue 3 years ago
It might be wrong to set the compilers to `mpiicc` instead of `icc` (...) but I am not sure whether this is the reason. I initially did this when I had the problems in #7, but as I found out, that was probably not the issue / not necessary.
Any ideas @anderkve?
E.g. what version of HDF5 do you use on the cluster you run gambit?
Recompiled with `icc` instead of `mpiicc`, but as somewhat expected, that did not help.
Interesting :)
Attaching also the CMakeCache.txt with the relevant section on the HDF5 library:
```
//HDF5 C Wrapper compiler. Used only to detect HDF5 compile flags.
HDF5_C_COMPILER_EXECUTABLE:FILEPATH=/cluster/software/HDF5/1.10.6-iimpi-2020a/bin/h5pcc
//Path to a library.
HDF5_C_LIBRARY_dl:FILEPATH=/usr/lib64/libdl.so
//Path to a library.
HDF5_C_LIBRARY_hdf5:FILEPATH=/cluster/software/HDF5/1.10.6-iimpi-2020a/lib/libhdf5.so
//Path to a library.
HDF5_C_LIBRARY_iomp5:FILEPATH=/cluster/software/imkl/2020.1.217-iimpi-2020a/lib/intel64/libiomp5.so
//Path to a library.
HDF5_C_LIBRARY_m:FILEPATH=/usr/lib64/libm.so
//Path to a library.
HDF5_C_LIBRARY_pthread:FILEPATH=/usr/lib64/libpthread.so
//Path to a library.
HDF5_C_LIBRARY_sz:FILEPATH=/cluster/software/Szip/2.1.1-GCCcore-9.3.0/lib/libsz.so
//Path to a library.
HDF5_C_LIBRARY_z:FILEPATH=/cluster/software/zlib/1.2.11-GCCcore-9.3.0/lib/libz.so
//HDF5 file differencing tool.
HDF5_DIFF_EXECUTABLE:FILEPATH=/cluster/software/HDF5/1.10.6-iimpi-2020a/bin/h5diff
//The directory containing a CMake configuration file for HDF5.
HDF5_DIR:PATH=HDF5_DIR-NOTFOUND
```
So the `mpiicc` vs `icc` choice should not be the problem, I think. On the current cluster where I run GAMBIT I use `icc` (etc.), and I'm pretty sure that's what we've done on other clusters as well.
A more likely source of problems is the fact that the hdf5 library can be built in parallel or serial mode. It looks like the version you are using on fram is the parallel one.
The hdf5 output in GAMBIT is serial (everything is written to a single hdf5 file), and we have had problems with this system running with the parallel build of hdf5. (We've had other problems as well, but that's a different story... :P ) Is there an option to use the serial hdf5 on fram? Or if not, perhaps you can build it yourself?
On the cluster I currently run GAMBIT, I use hdf5 1.8.20 (serial), and on my laptop I use hdf5 1.10.6+repack-2 (serial).
One more thing: If you are running with many MPI processes, and each parameter point is very fast, the hdf5 file writing can become a bottleneck. In this case it can help to increase the buffer length of the GAMBIT Printer system. Just add a `buffer_length` option in the list of hdf5 printer options in the yaml file. (The default value for the hdf5 printer is `buffer_length: 1000`.)
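For reference, a sketch of how such an option could sit in the Printer section of a GAMBIT yaml file (the option name is from the text above; the file name and exact layout are placeholders and may differ between GAMBIT versions):

```yaml
Printer:
  printer: hdf5
  options:
    output_file: "results.hdf5"   # placeholder name
    buffer_length: 10000          # default is 1000; a larger buffer means fewer, bigger writes
```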
> The hdf5 output in GAMBIT is serial (everything is written to a single hdf5 file), and we have had problems with this system running with the parallel build of hdf5. (We've had other problems as well, but that's a different story... :P ) Is there an option to use the serial hdf5 on fram? Or if not, perhaps you can build it yourself?
Thanks, this confirms my suspicion. I currently use a parallel build of hdf5. I will try to compile it in serial mode myself (it doesn't seem to be provided as a "standard"). I just really wanted to check with you what system you use before I spent a variable amount (hopefully, but not necessarily, short) of time getting the prerequisites for the hdf5 compilation :).
Yeah, makes sense. :) Hope it's not too horrible to get working...
I built HDF5 without the `parallel` option (setting `usempi = False` in the `easybuild` toolchain). I also checked with `h5cc -showconfig` that the `Parallel HDF5` feature is disabled.
However, now I receive the following error:
```
(py3.8) [fabiobz@login2.FRAM /cluster/projects/nn9464k/progs/gambit_np]$ ./gambit -rf yaml_files/NuclearBit_demo.yaml
GAMBIT 1.5.0
http://gambit.hepforge.org
Abort(2664079) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1421):
MPIDU_bc_table_create(311)...:
Logger was never initialised! Creating default log messenger...
Gambit has encountered an uncaught error during initialisation.
Check the output logs for details.
(Check your yaml file if you can't recall where the logs are.)
what(): GAMBIT error
ERROR: A problem has been raised by one of the utility codes.
Error creating Comm object (wrapper for MPI communicator)! MPI has not been initialised!
Raised at: line 59 in function Gambit::GMPI::Comm::Comm() of /cluster/projects/nn9464k/progs/gambit_np/Utils/src/mpiwrapper.cpp.
```
Something that I find odd here is that the error comes up even if I only request the `cout` logger in gambit. Does that make sense / could this still be related to the `hdf5` issue? I realized that I didn't check whether there is an in-between mode, building with mpi support but without `parallel` (though I wouldn't know what that should mean :).
Here's the CMakeCache.txt. [+ As suspected from the HDF5 path used previously (very first comments), the previous HDF5 version I used had `Parallel HDF5: yes`]
Even though I don't know why, it may well be related to my `HDF5` installation (and `h5py`). At least I see that I can't run primitive examples like this with the current combination of `HDF5` and `h5py`.
Reinstalled `h5py` (and unloaded the module before). Now both the C and python examples from hdf5 work. Will try to recompile gambit again (...).
Unfortunately, it didn't help to have the updated `HDF5` and `h5py` libraries. -- Though, as I said, the error looks somewhat different. Any ideas on what I can try to pinpoint the error?
About the hdf5 serial/parallel: It is perhaps equivalent to adjusting the `usempi` setting like you did, but I noticed that the "official" hdf5 configure option is `--enable-parallel`: https://support.hdfgroup.org/HDF5/faq/parallel.html
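For anyone following along, a from-source build of a serial HDF5 could look roughly like this (a sketch; the install prefix is a placeholder, and parallel mode is off by default in the configure script anyway, so the flag mainly documents intent):

```shell
./configure CC=gcc --prefix=$HOME/opt/hdf5-serial --disable-parallel
make -j4 && make install
# verify the build: the output should contain "Parallel HDF5: no"
$HOME/opt/hdf5-serial/bin/h5cc -showconfig | grep -i "parallel hdf5"
```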
Though, as you say, the fact that you get the above error even with the `cout` printer hints that the current problem is no longer related to hdf5... I don't think I've ever seen the error above before. :P
I noticed the above error happens when you call GAMBIT directly as `./gambit -rf yaml_files/NuclearBit_demo.yaml`. What happens if you call it with `mpirun` or `mpiexec`? E.g. `mpiexec -np 2 ./gambit -rf yaml_files/NuclearBit_demo.yaml`.
Also, I assume silly mpi tests like just doing `mpiexec -np 2 echo "hello"` work as expected?
Also, from the `CMakeCache.txt` file it looks like a lot of the loaded packages have been compiled with `gcc` (the string `GCCcore-9.3.0` appears in a lot of paths), including Python 3.8. It's probably not related, but given that you probably have loaded some Intel compiler module, I would have expected the module system to use packages compiled with the Intel compilers...
One fallback option you could try is of course to not use the Intel compilers, but rather plain ol' gcc. Would expect slightly worse performance, but might be less hassle to get working.
> About the hdf5 serial/parallel: It is perhaps equivalent to adjusting the `usempi` setting like you did, but I noticed that the "official" hdf5 configure option is `--enable-parallel`: https://support.hdfgroup.org/HDF5/faq/parallel.html
As far as I read, for newer versions `hdf5` would/should automatically recognize whether one has mpi support and set the flags accordingly. Anyhow, from the easybuild config I get the following:
```python
# MPI and C++ support enabled requires --enable-unsupported, because this is untested by HDF5
# also returns False if MPI is not supported by this toolchain
if self.toolchain.options.get('usempi', None):
    self.cfg.update('configopts', "--enable-unsupported --enable-parallel")
    mpich_mpi_families = [toolchain.INTELMPI, toolchain.MPICH, toolchain.MPICH2, toolchain.MVAPICH2]
    if self.toolchain.mpi_family() in mpich_mpi_families:
        self.cfg.update('buildopts', 'CXXFLAGS="$CXXFLAGS -DMPICH_IGNORE_CXX_SEEK"')
    # Skip MPI cxx extensions to avoid hard dependency
    if self.toolchain.mpi_family() == toolchain.OPENMPI:
        self.cfg.update('buildopts', 'CXXFLAGS="$CXXFLAGS -DOMPI_SKIP_MPICXX"')
else:
    self.cfg.update('configopts', "--disable-parallel")
```
And I checked with `h5cc -showconfig` that the `Parallel HDF5` feature is disabled (as expected from my config file).
> Though, as you say, the fact that you get the above error even with the `cout` printer hints that the current problem is no longer related to hdf5... I don't think I've ever seen the error above before. :P
>
> I noticed the above error happens when you call GAMBIT directly as `./gambit -rf yaml_files/NuclearBit_demo.yaml`. What happens if you call it with `mpirun` or `mpiexec`? E.g. `mpiexec -np 2 ./gambit -rf yaml_files/NuclearBit_demo.yaml`.
The error is just slightly different, but also includes something about Python :/
```
(py3.8) [fabiobz@login3.FRAM /cluster/projects/nn9464k/progs/gambit_np]$ mpiexec -np 2 ./gambit -rf yaml_files/NuclearBit_demo.yaml
Fatal Python error: init_sys_streams: <stdin> is a directory, cannot continue
Python runtime state: core initialized
Current thread 0x00002aff7b610500 (most recent call first):
<no Python frame>
GAMBIT 1.5.0
http://gambit.hepforge.org
Logger was never initialised! Creating default log messenger...
terminate called after throwing an instance of 'Gambit::exception'
what(): GAMBIT error
ERROR: A problem has been raised by one of the utility codes.
Error creating Comm object (wrapper for MPI communicator)! MPI has not been initialised!
Raised at: line 59 in function Gambit::GMPI::Comm::Comm() of /cluster/projects/nn9464k/progs/gambit_np/Utils/src/mpiwrapper.cpp.
[...]
```
> Also, I assume silly mpi tests like just doing `mpiexec -np 2 echo "hello"` work as expected?
Yes, they do
> Also, from the `CMakeCache.txt` file it looks like a lot of the loaded packages have been compiled with `gcc` (the string `GCCcore-9.3.0` appears in a lot of paths), including Python 3.8. It's probably not related, but given that you probably have loaded some Intel compiler module, I would have expected the module system to use packages compiled with the Intel compilers... One fallback option you could try is of course to not use the Intel compilers, but rather plain ol' gcc. Would expect slightly worse performance, but might be less hassle to get working.
Yes, I can give it a try. The reason that I have several things compiled with `gcc` and others with `intel` is that I tried to use precompiled modules existing on fram as far as possible. I can try to find a full set of `gcc` modules and recompile gambit with them. Will report on the status.
Finally I'm starting to get there. CMakeCache.txt (with `hdf5` built without the parallel option). It runs almost as I'd like: whether using `./gambit [...]` or `mpiexec -np 2 [...]`, gambit will take forever(?) in `Calling MPI_Finalize...`. This is regardless of whether I use the `hdf5` printer. I killed the process after 2 or 3 minutes.

Here is the output (when asking for the cout printer only):
```
(py3.8) [fabiobz@login3.FRAM /cluster/projects/nn9464k/progs/gambit_np]$ ./gambit -rf yaml_files/NuclearBit_demo.yaml
GAMBIT 1.5.0
http://gambit.hepforge.org
Descriptions are missing for the following models:
GenericModel5
GenericModel10
GenericModel15
GenericModel20
GSFModel20
GSF_GLO_CT_Model20
GSF_EGLO_CT_Model20
GSF_MGLO_CT_Model20
GSF_GH_CT_Model20
GSF_constantM1
NLDModelCT_and_discretes
NLDModelBSFG_and_discretes
Please add descriptions of these to /cluster/projects/nn9464k/progs/gambit_np/config/models.dat
Descriptions are missing for the following capabilities:
GSFModel20_parameters
GSF_EGLO_CT_Model20_parameters
GSF_GH_CT_Model20_parameters
GSF_GLO_CT_Model20_parameters
GSF_MGLO_CT_Model20_parameters
GSF_constantM1_parameters
GenericModel10_parameters
GenericModel15_parameters
GenericModel20_parameters
GenericModel5_parameters
NLDModelBSFG_and_discretes_parameters
NLDModelCT_and_discretes_parameters
gledeliBE_1_0_init
gledeliBE_get_results
gledeliBE_run
gledeliBE_set_model_names
gledeliBE_set_model_pars
gledeliLogLike
gledeliResults
zeroLogLike
Please add descriptions of these to /cluster/projects/nn9464k/progs/gambit_np/config/capabilities.dat
Starting GAMBIT
----------
Running in MPI-parallel mode with 1 processes
----------
Running with 1 OpenMP threads per MPI process (set by the environment variable OMP_NUM_THREADS).
YAML file: yaml_files/NuclearBit_demo.yaml
Initialising logger... log_debug_messages = true; log messages tagged as 'Debug' WILL be logged.
WARNING: This may lead to very large log files!
Resolving dependencies and backend requirements. Hang tight...
...done!
Starting scan.
ScannerBit is waiting for all MPI processes to join the scan...
All processes ready!
Entering random sampler.
number of points to calculate: 5
MPI process rank: 0
[...]
0, 5: pointID: 5
0, 5: MPIrank: 0
ScannerBit is waiting for all MPI processes to report their shutdown condition...
GAMBIT has finished successfully!
Calling MPI_Finalize...
```
[...] `MPIEXEC_EXECUTABLE`. But the same error appears when I use the correct one.

Interestingly, I find only `MPIrank: 0` in the output. I would have expected ranks 0 and 1 when using `-np 2`.

```
--<>--<>--<>--<>--<>--<>--<>--
(Mon Jan 25 22:09:39 2021)(5.28042 [s])(Rank 0)[Default,Core][Debug]:
Returning control to ScannerBit
--<>--<>--<>--<>--<>--<>--<>--
(Mon Jan 25 22:09:39 2021)(5.29138 [s])(Rank 0)[Default]:
GAMBIT run completed successfully.
--<>--<>--<>--<>--<>--<>--<>--
(Mon Jan 25 22:09:39 2021)(5.29148 [s])(Rank 0)[Default,Core][Info]:
NO_MORE_MESSAGES code broadcast to all processes
--<>--<>--<>--<>--<>--<>--<>--
(Mon Jan 25 22:09:39 2021)(5.29151 [s])(Rank 0)[Default,Core][Debug]:
Receiving all shutdown messages
--<>--<>--<>--<>--<>--<>--<>--
(Mon Jan 25 22:09:39 2021)(5.29155 [s])(Rank 0)[Default,Core][Info]:
Cleaning up shutdown message send buffers
--<>--<>--<>--<>--<>--<>--<>--
(Mon Jan 25 22:09:39 2021)(5.29158 [s])(Rank 0)[Default]:
All shutdown messages successfully Recv'd on this process!
--<>--<>--<>--<>--<>--<>--<>--
(Mon Jan 25 22:09:39 2021)(5.29161 [s])(Rank 0)[Default]:
Calling MPI_Finalize...
--<>--<>--<>--<>--<>--<>--<>--
(Mon Jan 25 22:09:39 2021)(5.50061 [s])(Rank 0)[Default]:
MPI successfully finalized!
--<>--<>--<>--<>--<>--<>--<>--
```
> Interestingly, I find only `MPIrank: 0` in the output. I would have expected ranks 0 and 1 when using `-np 2`

Thanks -- I remember spotting this issue in GAMBIT a little while ago, but we haven't had time to fix it yet. It seems to only be a problem with the cout printer; the logs are still correct and show how parameter points were actually distributed across the MPI processes. (The hdf5 printer should work fine.)
> gambit will take forever(?) in `Calling MPI_Finalize...`
This sounds familiar. Will have a look in the main GAMBIT repo and emails to see if this has come up before.
Which scanner are you using when you see the `MPI_Finalize` problem?
> Which scanner are you using when you see the `MPI_Finalize` problem?
I use `random`.
> Interestingly, I find only `MPIrank: 0` in the output. I would have expected ranks 0 and 1 when using `-np 2`
>
> Thanks -- I remember spotting this issue in GAMBIT a little while ago, but we haven't had time to fix it yet. It seems to only be a problem with the cout printer; the logs are still correct and show how parameter points were actually distributed across the MPI processes. (The hdf5 printer should work fine.)
Yes, probably it does -- if gambit were to finalize and then write out the `hdf5` file...
Trying to run a different scanner, `diver`, but getting the following error:
```
Starting GAMBIT
----------
Running in MPI-parallel mode with 2 processes
----------
Running with 2 OpenMP threads per MPI process (set by the environment variable OMP_NUM_THREADS).
YAML file: yaml_files/NuclearBit_demo.yaml
Initialising logger... log_debug_messages = true; log messages tagged as 'Debug' WILL be logged.
WARNING: This may lead to very large log files!
Resolving dependencies and backend requirements. Hang tight...
...done!
Starting scan.
ScannerBit is waiting for all MPI processes to join the scan...
All processes ready!
FATAL ERROR
GAMBIT has exited with fatal exception: GAMBIT error
ERROR: A problem has been raised by ScannerBit.
Cannot load /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/lib/libscanner_diver_1.0.4.so: /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/installed/diver/1.0.4/lib/libdiver.so: undefined symbol: __svml_exp2
Raised at: line 136 in function const std::map<Gambit::type_index, void*>& Gambit::Scanner::Plugins::Plugin_Interface_Base::initPlugin(const string&, const string&, const plug_args& ...) [with plug_args = {unsigned int, Gambit::Scanner::Factory_Base}; std::string = std::__cxx11::basic_string<char>] of /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/include/gambit/ScannerBit/plugin_interface.hpp.
FATAL ERROR
GAMBIT has exited with fatal exception: GAMBIT error
ERROR: A problem has been raised by ScannerBit.
Cannot load /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/lib/libscanner_diver_1.0.4.so: /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/installed/diver/1.0.4/lib/libdiver.so: undefined symbol: __svml_exp2
Raised at: line 136 in function const std::map<Gambit::type_index, void*>& Gambit::Scanner::Plugins::Plugin_Interface_Base::initPlugin(const string&, const string&, const plug_args& ...) [with plug_args = {unsigned int, Gambit::Scanner::Factory_Base}; std::string = std::__cxx11::basic_string<char>] of /cluster/projects/nn9464k/progs/gambit_np/ScannerBit/include/gambit/ScannerBit/plugin_interface.hpp.
Calling MPI_Finalize...
[1611830459.506291] [login2:127955:0] mpool.c:43 UCX WARN object 0x2aff40cf3fc0 was not returned to mpool ucp_am_bufs
[1611830459.506448] [login2:127954:0] mpool.c:43 UCX WARN object 0x2af74994afc0 was not returned to mpool ucp_am_bufs
^C^C(py3.8) [fabiobz@login2.FRAM /cluster/projects/nn9464k/progs/gambit_np]$ htop -u fabiobz
```
Redoing the `cmake ..` step in between: no changes.

```
[ 50%] Performing build step for 'multinest_3.11'
f951: Fatal Error: Reading module ‘nested’ at line 1 column 2: Unexpected EOF
```

I will try to fix this before I try anything else. My build system was not as reproducible as I thought :disappointed:
So, here are the latest results:

- The error (`f951: Fatal Error: Reading module ‘nested’`) is gone (as expected).
- [...] hdf5 file is produced -- regardless of whether I set `buffer_length=1` or not.
- Final part of the output screen for `diver` on `NuclearBit`:
```
Total log-likelihood: -98687.484
0, 105: LogLike: -98687.484
0, 105: pointID: 105
0, 105: MPIrank: 0
Total log-likelihood: -747255.12
0, 105: LogLike: -747255.12
0, 105: pointID: 105
0, 105: MPIrank: 0
=============================
Number of civilisations: 1
Best final vector: 0.59316962200323409 0.23693094182044908 0.25139732878193072 0.92610804446128703 0.85300449678139900 7.3303319158575964E-002 0.22891069697816635 0.44147227512403214 0.97552919102304880 0.76517830515739937 0.70755640175378609 0.85544590077746074 0.87329605199479787 0.75578007719621954 0.67124449566223388 0.63825974634857174 0.43559964516341404 0.77788819559293121 8.7104682688863094E-002 0.23344629862184224 0.61485082727911233 0.98532603580493405
Value at best final vector: 559615.70340478991
Total Function calls: 210
Total seconds for process 0: 14.45
Total seconds for process 1: 14.48
Diver run finished!
ScannerBit is waiting for all MPI processes to report their shutdown condition...
Diver run finished!
GAMBIT has finished successfully!
Calling MPI_Finalize...
```
"Funnily" enough the log message finished with:
```
--<>--<>--<>--<>--<>--<>--<>--
(Fri Jan 29 11:19:21 2021)(15.8083 [s])(Rank 0)[Default]:
Calling MPI_Finalize...
--<>--<>--<>--<>--<>--<>--<>--
(Fri Jan 29 11:19:21 2021)(15.9657 [s])(Rank 0)[Default]:
MPI successfully finalized!
--<>--<>--<>--<>--<>--<>--<>--
```
Trying to run spartan [after rebuilding with `ExampleBit_A`]. Note that I get the error even though I deleted the `run` folder in advance:
```
FATAL ERROR
GAMBIT has exited with fatal exception: GAMBIT error
ERROR: A problem has occurred in the printer utilities.
Failed to open existing HDF5 file, then failed to create new one! (/cluster/projects/nn9464k/progs/gambit_np/runs/spartan/samples//results.hdf5). The file may exist but be unreadable. You can check this by trying to inspect it with the 'h5ls' command line tool.
Raised at: line 236 in function hid_t Gambit::Printers::HDF5::openFile(const string&, bool, bool&, char) of /cluster/projects/nn9464k/progs/gambit_np/Printers/src/printers/hdf5printer/hdf5tools.cpp.
Calling MPI_Finalize...
```
- There is a `results.hdf5`, which I cannot open with `h5ls` or so.
- If I replace the `hdf5` printer by `cout`, GAMBIT will finish successfully with:
```
ScannerBit is waiting for all MPI processes to report their shutdown condition...
GAMBIT has finished successfully!
Calling MPI_Finalize...
```
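Incidentally, when `h5ls` refuses a file, a quick stdlib-only sanity check is to look for the 8-byte HDF5 signature at the start of the file (a sketch; a matching signature only means the header looks like HDF5, not that the file is intact):

```python
# Stdlib-only sanity check: does a file start with the 8-byte HDF5
# file signature?  A match does not prove the file is readable or
# uncorrupted; it only rules out "this is not an HDF5 file at all".

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def looks_like_hdf5(path):
    """Return True if `path` begins with the HDF5 file signature."""
    with open(path, "rb") as f:
        return f.read(8) == HDF5_SIGNATURE
```

For example, `looks_like_hdf5("runs/spartan/samples/results.hdf5")`. Note that unusual but valid HDF5 files can place the superblock at a later 512-byte-aligned offset, so this simple check can give false negatives.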
@anderkve: After the test with spartan I went back to `NuclearBit`:

- I can successfully finish the run if I select `printer: none`, but with the current compilation it will not finish for `cout` with `diver`.
- In contrast, for spartan, `printer: cout` works with both `diver` and `random`.

So probably it's really an issue "just" with the hdf5 printer/compilation/...!
I'm very sorry for all the fuss. I promise that I tried to keep things clean and reproducible, but obviously I didn't manage.
> Even though I don't know why, it may well be related to my `HDF5` installation (and `h5py`). At least I see that I can't run primitive examples like this with the current combination of `HDF5` and `h5py`.
Just checked and this is still true. Do you have an idea of what exactly about hdf5 I could test?
Hmm, this is very strange. Good thing it's narrowed down to hdf5/compilation at least.
First, note that we don't actually need `h5py` to run GAMBIT, so if that module puts any constraints on which versions of the `hdf5` module you can use, I'd try not loading it.
So given that you now use only GNU versions of the modules, I guess you are running the `HDF5/1.10.6-gompi-2020a` module? (I'm looking at the list here: https://documentation.sigma2.no/software/installed_software/fram_modules.html) You could perhaps try a different version? On the previous two clusters we've used GAMBIT on, we used hdf5 modules named `hdf5/1.8.20` (pretty sure this was an Intel-compiled one) and `hdf5/1.10.4--intelmpi--2018--binary`. I notice hdf5 v1.8.20 is missing on Fram, but perhaps try out one of the 1.8.19 ones? (I'm guessing the `foss` versions are GNU-compiled ones?)
I will keep looking in GAMBIT to see if I can think of anything. I'm really confused by this one...
> Hmm, this is very strange. Good thing it's narrowed down to hdf5/compilation at least.
>
> First, note that we don't actually need `h5py` to run GAMBIT, so if that module puts any constraints on which versions of the `hdf5` module you can use, I'd try not loading it.
Ehm, I don't know, but I guess we need at least hdf5 working, right? I'll check once again that `h5py` is currently working in its principal features...

-> Checked: It seems to work for some primitive example.
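For reference, a primitive round-trip test of the kind meant here could look like this (a sketch; it assumes `h5py` and `numpy` are importable in the active environment, and the file name is a throwaway):

```python
# Minimal h5py round-trip: write a small dataset, read it back,
# and compare.  If this fails while ordinary file I/O works,
# suspect a mismatch between the h5py build and the HDF5 library.
import os
import tempfile

import h5py
import numpy as np

def hdf5_roundtrip():
    """Return True if a dataset survives a write/read cycle."""
    data = np.arange(10)
    path = os.path.join(tempfile.mkdtemp(), "roundtrip.h5")
    with h5py.File(path, "w") as f:        # create the file
        f.create_dataset("x", data=data)
    with h5py.File(path, "r") as f:        # reopen read-only
        ok = bool((f["x"][:] == data).all())
    os.remove(path)
    return ok
```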
> So given that you now use only GNU versions of the modules, I guess you are running the `HDF5/1.10.6-gompi-2020a` module? (I'm looking at the list here: https://documentation.sigma2.no/software/installed_software/fram_modules.html) You could perhaps try a different version? On the previous two clusters we've used GAMBIT on, we used hdf5 modules named `hdf5/1.8.20` (pretty sure this was an Intel-compiled one) and `hdf5/1.10.4--intelmpi--2018--binary`. I notice hdf5 v1.8.20 is missing on Fram, but perhaps try out one of the 1.8.19 ones? (I'm guessing the `foss` versions are GNU-compiled ones?)
I've been writing a lot, but if you look at the latest `CMakeCache`, it contains the lines:
```
//HDF5 C Wrapper compiler. Used only to detect HDF5 compile flags.
HDF5_C_COMPILER_EXECUTABLE:FILEPATH=/cluster/projects/nn9464k/progs/easybuild/software/HDF5/1.10.6-GCC-9.3.0/bin/h5cc
```
which shows that I use a (recompiled) version of `HDF5/1.10.6` that I compiled with `GCC-9.3.0`. The precompiled versions on fram have the parallel option, which I thought you said would not work. So I just recompiled it; as I mentioned, the "basic" functionality works with it. I can easily try to use `hdf5/1.8.20` and see if it makes a difference.
> I will keep looking in GAMBIT to see if I can think of anything. I'm really confused by this one...
Thanks, and sorry for the work!
> Ehm, I don't know, but I guess we need at least hdf5 working, right?
Yep. And given that `h5py` seems to work, the problem is almost certainly not connected to the h5py module. Thanks for checking again.
> which shows that I use a (recompiled) version of HDF5/1.10.6 that I compiled with GCC-9.3.0.
Ah, of course, sorry! :P I forgot that you had recompiled this yourself to get the serial option.
> The precompiled versions on fram have the parallel option, which I thought you said will not work.
At least we've had trouble getting this to work on other clusters -- there the serial version has been the solution. (Which also makes sense, since the hdf5 printer in GAMBIT only writes serially.) But of course, every new cluster has its own quirks...
> which shows that I use a (recompiled) version of HDF5/1.10.6 that I compiled with GCC-9.3.0. The precompiled versions on fram have the parallel option, which I thought you said will not work. So I just recompiled it; as I mentioned the "basic" functionality works with it. I can easily try to use hdf5/1.8.20 and see if it makes a difference.
Unfortunately I get the same behavior with `hdf5/1.8.20`, which I had just compiled in serial mode for fram.
I double-checked `h5py`. There is a test suite, and all the tests run (albeit after installing `pytest-mpi` :), a known issue):
```python
import h5py
h5py.run_tests()
```
I reinstalled h5py to make sure it's installed against the correct `hdf5` version etc. But no change there, either.
> But of course, every new cluster has its own quirks...
Yes :disappointed:
Btw: I get a lot of warnings of this type when compiling with gcc:
```
In file included from /cluster/software/OpenMPI/4.0.3-GCC-9.3.0/include/openmpi/ompi/mpi/cxx/mpicxx.h:277,
                 from /cluster/software/OpenMPI/4.0.3-GCC-9.3.0/include/mpi.h:2868,
                 from /cluster/projects/nn9464k/progs/gambit_np/Utils/include/gambit/Utils/mpiwrapper.hpp:53,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/include/gambit/Printers/baseprintermanager.hpp:25,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/include/gambit/Printers/printer_id_tools.hpp:18,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/include/gambit/Printers/basebaseprinter.hpp:46,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/include/gambit/Printers/baseprinter.hpp:29,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/include/gambit/Printers/printers/sqlitereader.hpp:22,
                 from /cluster/projects/nn9464k/progs/gambit_np/Printers/src/printers/sqliteprinter/retrieve_overloads.cpp:20:
/cluster/software/OpenMPI/4.0.3-GCC-9.3.0/include/openmpi/ompi/mpi/cxx/op_inln.h: In member function ‘virtual void MPI::Op::Init(void (*)(const void*, void*, int, const MPI::Datatype&), bool)’:
/cluster/software/OpenMPI/4.0.3-GCC-9.3.0/include/openmpi/ompi/mpi/cxx/op_inln.h:121:46: warning: cast between incompatible function types from ‘void (*)(void*, void*, int*, ompi_datatype_t**, void (*)(void*, void*, int*, ompi_datatype_t**))’ to ‘void (*)(void*, void*, int*, ompi_datatype_t**)’ [-Wcast-function-type]
  121 |     (void)MPI_Op_create((MPI_User_function*) ompi_mpi_cxx_op_intercept,
      |                                              ^~~~~~~~~~~~~~~~~~~~~~~~~
/cluster/software/OpenMPI/4.0.3-GCC-9.3.0/include/openmpi/ompi/mpi/cxx/op_inln.h:123:59: warning: cast between incompatible function types from ‘void (*)(const void*, void*, int, const MPI::Datatype&)’ to ‘void (*)(void*, void*, int*, ompi_datatype_t**)’ [-Wcast-function-type]
  123 |     ompi_op_set_cxx_callback(mpi_op, (MPI_User_function*) func);
```
Wait, wait, wait: I think it has just finished as it should! :rocket: :astonished: :grinning:
I think the following change did the trick: compiling `h5py` against the HDF5 library version that I want/have, then recompiling gambit with this. I still can't believe it :).
I don't think there is any big advantage of a more up-to-date hdf5 version than `1.8.20`, is there? So I'll just leave the configs as they are, now that it works :)
Thank you very much for your help @anderkve!
Great stuff! Thanks a lot for figuring this out!
Trying to run `gambit` on fram I get the following error: [...] compiler setting: [...]