Closed rrsettgast closed 2 years ago
The HDF5 1.12 release broke backwards compatibility, but includes a macro layer for back-compat.
https://www.hdfgroup.org/2020/03/release-of-hdf5-1-12-0-newsletter-172/
We will need to specify something along the lines of -DH5_USE_110_API:BOOL=ON as a CMake option (probably setting it in GeosxOptions.cmake makes the most sense) until we update our API usage to match 1.12.x.
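A minimal sketch of how that could be wired in (the exact placement and the option name GEOSX_USE_HDF5_110_API are my assumptions; H5_USE_110_API is consumed as a preprocessor define by the HDF5 headers):

```cmake
# Sketch only: opt GEOSX into the HDF5 1.10-compatible API until our calls
# are ported to 1.12.x.  Equivalent to passing -DH5_USE_110_API:BOOL=ON.
option( GEOSX_USE_HDF5_110_API "Use the HDF5 1.10 compatibility API" ON )
if( GEOSX_USE_HDF5_110_API )
  # HDF5's headers check this define and map unversioned calls onto the
  # 1.10-era signatures.
  add_compile_definitions( H5_USE_110_API )
endif()
```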
Aside from that, I'm all for the update; we've encountered quite a few issues with HDF5 and the LC filesystems / filesystem configuration. Hopefully they've found ways to mitigate some of these in more recent releases.
Do you know if this actually affects us? We seem to pass all integrated tests.
Looking at the actual 1.12 -> 1.10 macro list: https://portal.hdfgroup.org/display/HDF5/API+Compatibility+Macros
It's relatively short, and I don't think we use any of them directly in the TimeHistory functionality, so we might be fine actually.
@TotoGaz Are you building HDF5 on total machines, or using existing installations?
@XL64 Are you able to test this on P3, please? Sorry @rrsettgast, I did not see this before... 😢
I built the TPLs with no issue, but when I try to build GEOSX I get:
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/dataRepository/ConduitRestart.cpp(67): error: class "conduit::Node" has no member "fetch_child"
Maybe I need to test a specific branch of GEOSX?
By the way, on the develop version of TPL/GEOSX I still have a build issue with benchmarkReduceKernels.cpp. At build time I get:
/data_local/sw/cuda/11.3.0/include/vector_types.h(421): error: calling a __host__ function("RAJA::detail::ReduceOMP<double, ::RAJA::reduce::sum<double> > ::~ReduceOMP") from a __host__ __device__ function("") is not allowed
(the same error is repeated six times)
6 errors detected in the compilation of "/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.cpp".
If I comment out the OpenMP part of that test, I get at the link step:
CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::subscriptViewRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:151: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::subscriptViewKernel(LvArray::ArrayView<double const, 1, 0, long, LvArray::ChaiBuffer> const&)'
CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::rajaViewRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:163: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::RAJAViewKernel(RAJA::View<double const, RAJA::detail::LayoutBase_impl<camp::int_seq<long, 0l>, long, 0l>, double const*> const&)'
CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::fortranSliceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:145: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::fortranSliceKernel(LvArray::ArraySlice<double const, 1, 0, long>)'
CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::pointerRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:169: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::pointerKernel(double const*, long)'
CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::subscriptSliceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:157: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::subscriptSliceKernel(LvArray::ArraySlice<double const, 1, 0, long>)'
CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::fortranViewRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:139: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::fortranViewKernel(LvArray::ArrayView<double const, 1, 0, long, LvArray::ChaiBuffer> const&)'
My current solution is to build only the geosx target with make geosx.
Yes, this branch https://github.com/GEOSX/GEOSX/pull/1874 seems to fix the conduit::Node fetch_child change.
By the way, on the develop version of TPL/GEOSX I still have a build issue with benchmarkReduceKernels.cpp. At build time I get: ... My current solution is to build only the geosx target with make geosx.
Is this a new problem or is it some old issue (like I seem to understand)?
Indeed, it is unrelated to this.
Ok, I am leaving for one week now. If it is not merged, I can try it when I am back, but as my develop version of TPL/GEOSX does not build completely, I don't expect it to work with this change (would be nice ;)).
Maybe we can update RAJA... or there is a compilation flag that we can add to allow the call.
That error seems to be an OpenMP reduce destructor being called from a device kernel, which really looks like a compilation-flag misconfiguration of some sort.
Looking further, https://github.com/GEOSX/LvArray/blob/ab0d3025d2962aad10457b7bf7f5b6d266232276/benchmarks/benchmarkReduceKernels.cpp#L24 is just being instantiated with an OpenMP policy. This means that, when configuring with both OpenMP and CUDA, a version will be instantiated with a host-device lambda, which will be compiled by nvcc, and on that pass I would indeed expect the device version of the lambda to capture an OpenMP reduce. Hmmmm.
We could sanity-check this by redefining LVARRAY_HOST_DEVICE to be just __host__ (or nothing) inside the ifdef https://github.com/GEOSX/LvArray/blob/ab0d3025d2962aad10457b7bf7f5b6d266232276/benchmarks/benchmarkReduceKernels.cpp#L84, then redefining it back to __host__ __device__ when exiting the ifdef. If that passed, it would confirm that this is the issue.
The OMP reduce destructor isn't decorated (as I would expect it not to be) https://github.com/LLNL/RAJA/blob/9e12cc2a1460b0523ff100ceae9b98a2446ad41a/include/RAJA/policy/openmp/reduce.hpp#L56
This error has popped up many times before, and it's pretty standard usage, so I think it's something that should be fixed on RAJA's end.
As part of our effort to troubleshoot the locking problem we are seeing on LC, I built a version with updated HDF5, Silo, and Conduit. I don't know if there is a reason NOT to do this update. Please let me know if there might be potential problems updating any of these.