GEOS-DEV / thirdPartyLibs

Repository to build the GEOSX third-party libraries

update to hdf 1.12.1, silo 4.11, and conduit 0.8.2 #184

Closed rrsettgast closed 2 years ago

rrsettgast commented 2 years ago

As part of our effort to troubleshoot the locking problem we are seeing on LC, I built a version with updated HDF5, Silo, and Conduit. I don't know if there is a reason NOT to do this update. Please let me know if there might be potential problems updating any of these.

wrtobin commented 2 years ago

The HDF5 1.12 release broke backwards compatibility, but includes a macro layer for back-compat.

https://www.hdfgroup.org/2020/03/release-of-hdf5-1-12-0-newsletter-172/

We will need to specify something along the lines of -DH5_USE_110_API:BOOL=ON as a CMake option (probably setting it in GeosxOptions.cmake makes the most sense) until we update our API usage to match 1.12.x.
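For reference, H5_USE_110_API is ultimately a preprocessor macro checked by HDF5's public headers, so one way to wire it up is a sketch like the following (the GEOSX_USE_HDF5_110_API option name is invented for illustration; GEOSX's build system may plumb this differently):

```cmake
# Sketch: define the HDF5 compatibility macro for all targets until the call
# sites are ported to the 1.12 API. H5_USE_110_API is checked by HDF5's public
# headers and maps the changed functions back to their 1.10 signatures.
option( GEOSX_USE_HDF5_110_API "Use the HDF5 1.10 compatibility API" ON )
if( GEOSX_USE_HDF5_110_API )
  add_compile_definitions( H5_USE_110_API )
endif()
```

Alternatively, the macro can be passed once at configure time, e.g. `cmake -DH5_USE_110_API:BOOL=ON ...`, as suggested above.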

Aside from that, I'm all for the update; we've encountered quite a few issues with HDF5 and the LC filesystems / filesystem configuration. Hopefully they've found ways to mitigate some of these in more current releases.

rrsettgast commented 2 years ago

> The HDF5 1.12 release broke backwards compatibility, but includes a macro layer for back-compat.
>
> https://www.hdfgroup.org/2020/03/release-of-hdf5-1-12-0-newsletter-172/
>
> We will need to specify something along the lines of -DH5_USE_110_API:BOOL=ON as a CMake option (probably setting it in GeosxOptions.cmake makes the most sense) until we update our API usage to match 1.12.x.
>
> Aside from that, I'm all for the update; we've encountered quite a few issues with HDF5 and the LC filesystems / filesystem configuration. Hopefully they've found ways to mitigate some of these in more current releases.

Do you know if this actually affects us? We seem to pass all integrated tests.

wrtobin commented 2 years ago

> The HDF5 1.12 release broke backwards compatibility, but includes a macro layer for back-compat. https://www.hdfgroup.org/2020/03/release-of-hdf5-1-12-0-newsletter-172/ We will need to specify something along the lines of -DH5_USE_110_API:BOOL=ON as a CMake option (probably setting it in GeosxOptions.cmake makes the most sense) until we update our API usage to match 1.12.x. Aside from that, I'm all for the update; we've encountered quite a few issues with HDF5 and the LC filesystems / filesystem configuration. Hopefully they've found ways to mitigate some of these in more current releases.

> Do you know if this actually affects us? We seem to pass all integrated tests.

Looking at the actual 1.12 -> 1.10 macro list: https://portal.hdfgroup.org/display/HDF5/API+Compatibility+Macros

It's relatively short, and I don't think we use any of them directly in the TimeHistory functionality, so we might actually be fine.
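One quick way to check that claim is to grep the source tree for the function families on the compatibility-macro list (the families below are from the HDF Group's 1.10 → 1.12 list; the path is a guess at where the TimeHistory code lives):

```shell
# Search GEOSX sources for HDF5 calls whose signatures changed in 1.12
# (function families taken from the API compatibility macro list linked above).
grep -rnE 'H5(Oget_info|Ovisit|Lget_info|Literate|Lvisit|Sencode|Rdereference)' \
    src/coreComponents/fileIO/timeHistory/ 2>/dev/null \
  || echo "no affected HDF5 calls found"
```

If this prints nothing but the fallback message, none of the renamed entry points are used directly and the compatibility macro layer should be unnecessary.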

rrsettgast commented 2 years ago

@TotoGaz Are you building HDF5 on Total machines, or using existing installations?

TotoGaz commented 2 years ago

@XL64 Are you able to test this on P3 please? Sorry @rrsettgast I did not see this before... 😢

XL64 commented 2 years ago

I built the TPL with no issue, but when I try to build GEOSX I get: /appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/dataRepository/ConduitRestart.cpp(67): error: class "conduit::Node" has no member "fetch_child". Maybe I need to test a specific branch of GEOSX?

XL64 commented 2 years ago

By the way, on the develop version of TPL/GEOSX I still have a build issue with benchmarkReduceKernels.cpp. At build time I get:

/data_local/sw/cuda/11.3.0/include/vector_types.h(421): error: calling a __host__ function("RAJA::detail::ReduceOMP<double,  ::RAJA::reduce::sum<double> > ::~ReduceOMP") from a __host__ __device__ function("") is not allowed

(the same error is repeated six times)

6 errors detected in the compilation of "/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.cpp".

If I comment out the OpenMP part of that test, I get this at the link step:

CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::subscriptViewRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:151: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::subscriptViewKernel(LvArray::ArrayView<double const, 1, 0, long, LvArray::ChaiBuffer> const&)'
CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::rajaViewRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:163: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::RAJAViewKernel(RAJA::View<double const, RAJA::detail::LayoutBase_impl<camp::int_seq<long, 0l>, long, 0l>, double const*> const&)'
CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::fortranSliceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:145: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::fortranSliceKernel(LvArray::ArraySlice<double const, 1, 0, long>)'
CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::pointerRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:169: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::pointerKernel(double const*, long)'
CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::subscriptSliceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:157: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::subscriptSliceKernel(LvArray::ArraySlice<double const, 1, 0, long>)'
CMakeFiles/benchmarkReduce.dir/benchmarkReduce.cpp.o: In function `void LvArray::benchmarking::fortranViewRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(benchmark::State&)':
/appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/LvArray/benchmarks/benchmarkReduceKernels.hpp:139: undefined reference to `LvArray::benchmarking::ReduceRAJA<RAJA::PolicyBaseT<(RAJA::Policy)4, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >::fortranViewKernel(LvArray::ArrayView<double const, 1, 0, long, LvArray::ChaiBuffer> const&)'

My current workaround is to build only the geosx target with make geosx.

TotoGaz commented 2 years ago

> I built the TPL with no issue, but when I try to build GEOSX I get: /appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/dataRepository/ConduitRestart.cpp(67): error: class "conduit::Node" has no member "fetch_child". Maybe I need to test a specific branch of GEOSX?

Yes, this branch https://github.com/GEOSX/GEOSX/pull/1874 seems to fix the Node::fetch_child change.
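For context, Conduit 0.8 removed Node::fetch_child in favor of the more explicitly named fetch_existing, so the fix is presumably a rename along these lines (an illustrative sketch, not the actual diff from that PR; the variable names are invented):

```diff
 // In ConduitRestart.cpp, roughly:
-conduit::Node & child = node.fetch_child( name );
+conduit::Node & child = node.fetch_existing( name );
```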

TotoGaz commented 2 years ago

> By the way, on the develop version of TPL/GEOSX I still have a build issue with benchmarkReduceKernels.cpp. At build time I get: ... My current workaround is to build only the geosx target with make geosx.

Is this a new problem or is it some old issue (like I seem to understand)?

XL64 commented 2 years ago

> > By the way, on the develop version of TPL/GEOSX I still have a build issue with benchmarkReduceKernels.cpp. At build time I get: ... My current workaround is to build only the geosx target with make geosx.
>
> Is this a new problem or is it some old issue (like I seem to understand)?

Indeed, it is unrelated to this.

XL64 commented 2 years ago

> > I built the TPL with no issue, but when I try to build GEOSX I get: /appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/dataRepository/ConduitRestart.cpp(67): error: class "conduit::Node" has no member "fetch_child". Maybe I need to test a specific branch of GEOSX?
>
> Yes, this branch GEOSX/GEOSX#1874 seems to fix the Node::fetch_child change.

Ok, I am leaving for one week now. If it is not merged, I can try it when I am back, but since my develop version of TPL/GEOSX does not build completely, I don't expect it to work with this change (would be nice ;)).

rrsettgast commented 2 years ago

> > > I built the TPL with no issue, but when I try to build GEOSX I get: /appli_RD/LACOSTE/GEOSX/GEOSX2/src/coreComponents/dataRepository/ConduitRestart.cpp(67): error: class "conduit::Node" has no member "fetch_child". Maybe I need to test a specific branch of GEOSX?
> >
> > Yes, this branch GEOSX/GEOSX#1874 seems to fix the Node::fetch_child change.
>
> Ok, I am leaving for one week now. If it is not merged, I can try it when I am back, but since my develop version of TPL/GEOSX does not build completely, I don't expect it to work with this change (would be nice ;)).

Maybe we can update RAJA... or there is a compilation flag that we can add to allow the call.

wrtobin commented 2 years ago

That error seems to be an OpenMP reduce destructor being called from a device kernel, which really looks like a compilation-flag misconfiguration of some sort.

Looking further, https://github.com/GEOSX/LvArray/blob/ab0d3025d2962aad10457b7bf7f5b6d266232276/benchmarks/benchmarkReduceKernels.cpp#L24 is just being instantiated with an OpenMP policy. That means that when configuring with both OpenMP and CUDA, a version will be instantiated with a host-device lambda, which will be compiled by nvcc, and on that pass I would indeed expect the device version of the lambda to capture an OpenMP reduce. Hmmmm.

We could sanity-check this by redefining LVARRAY_HOST_DEVICE to just __host__ (or nothing) inside the ifdef https://github.com/GEOSX/LvArray/blob/ab0d3025d2962aad10457b7bf7f5b6d266232276/benchmarks/benchmarkReduceKernels.cpp#L84, then redefining it back to __host__ __device__ when exiting the ifdef. If that passed, it would confirm that this is the issue.

The OMP reduce destructor isn't decorated (as I would expect it not to be): https://github.com/LLNL/RAJA/blob/9e12cc2a1460b0523ff100ceae9b98a2446ad41a/include/RAJA/policy/openmp/reduce.hpp#L56

corbett5 commented 2 years ago

> That error seems to be an OpenMP reduce destructor being called from a device kernel, which really looks like a compilation-flag misconfiguration of some sort.
>
> Looking further, https://github.com/GEOSX/LvArray/blob/ab0d3025d2962aad10457b7bf7f5b6d266232276/benchmarks/benchmarkReduceKernels.cpp#L24 is just being instantiated with an OpenMP policy. That means that when configuring with both OpenMP and CUDA, a version will be instantiated with a host-device lambda, which will be compiled by nvcc, and on that pass I would indeed expect the device version of the lambda to capture an OpenMP reduce. Hmmmm.
>
> We could sanity-check this by redefining LVARRAY_HOST_DEVICE to just __host__ (or nothing) inside the ifdef https://github.com/GEOSX/LvArray/blob/ab0d3025d2962aad10457b7bf7f5b6d266232276/benchmarks/benchmarkReduceKernels.cpp#L84, then redefining it back to __host__ __device__ when exiting the ifdef. If that passed, it would confirm that this is the issue.
>
> The OMP reduce destructor isn't decorated (as I would expect it not to be): https://github.com/LLNL/RAJA/blob/9e12cc2a1460b0523ff100ceae9b98a2446ad41a/include/RAJA/policy/openmp/reduce.hpp#L56

This error has popped up many times before, and it's pretty standard usage, so I think it's something that should be fixed on RAJA's end.