GEOS-DEV / GEOS

GEOS Simulation Framework
GNU Lesser General Public License v2.1

Could not run testRestartBasic nor testRestartExtended #643

Closed. TotoGaz closed this issue 4 years ago

TotoGaz commented 4 years ago

I am on SLES 11 with gcc 8.3.0 and openmpi-2.1.6, using GEOSX develop:0d01262450090e3b8ddcb844a6c9c6c42c431a6e and the following submodules:

> git submodule status
 2f1b1d334c8725947bd5d78ed454ed7ca2ae39b2 integratedTests (remotes/origin/feature/useFullStress)
 96419df7bcc43804f0a1d5baa0a13a35673f1ddb src/cmake/blt (v0.2.0-119-g96419df)
 1fa348ecb9efcd11a76c67d2de24907665023ead src/coreComponents/cxx-utilities (heads/develop)
 a1fe721c40ce7e66c5b3eee54b602a411c3b9b4a src/coreComponents/fileIO/coupling/hdf5_interface (heads/master)
 67ecf1b19c73c2e1c26a8f0e9d27e3984a741804 src/externalComponents/GEOSX_PTP (heads/master)
 e6b341819dd3fba6e8ce5b53fe5c8c3dbc2269ad src/externalComponents/PAMELA (heads/master)
 33118bf21521d2ac5d2e9b1b740f4e4376a86a00 src/externalComponents/PVTPackage (remotes/origin/feature/corbett/mesh-maps)

I could not run the testRestartBasic and testRestartExtended unit tests.

    Start 32: testRestartBasic

32: Test command: /work206/workrd/DEV/gazzola/GEOSX/GEOSX/build-pandev14-relwithdebinfo/tests/testRestartBasic
32: Test timeout computed to be: 1500
32: [==========] Running 32 tests from 32 test cases.
32: [----------] Global test environment set-up.
32: [----------] 1 test from SingleWrapperTest/0, where TypeParam = int
32: [ RUN      ] SingleWrapperTest/0.WriteAndRead
32: unknown file: Failure
32: C++ exception with description "
32: {
32:   "file": "/work206/workrd/DEV/gazzola/GEOSX/thirdPartyLibs/build-pandev14-release/conduit/src/conduit/src/libs/relay/conduit_relay_io_hdf5.cpp",
32:   "line": 1888,
32:   "message": "HDF5 Error code-1 Error opening HDF5 file for writing: testRestartBasic_SingleWrapperTest.root"
32: }
32: " thrown in the test body.
32: [  FAILED  ] SingleWrapperTest/0.WriteAndRead, where TypeParam = int (2 ms)
32: [----------] 1 test from SingleWrapperTest/0 (2 ms total)
32: 
32: [----------] 1 test from SingleWrapperTest/1, where TypeParam = double
32: [ RUN      ] SingleWrapperTest/1.WriteAndRead
32: unknown file: Failure
32: C++ exception with description "
32: {
32:   "file": "/work206/workrd/DEV/gazzola/GEOSX/thirdPartyLibs/build-pandev14-release/conduit/src/conduit/src/libs/relay/conduit_relay_io_hdf5.cpp",
32:   "line": 1888,
32:   "message": "HDF5 Error code-1 Error opening HDF5 file for writing: testRestartBasic_SingleWrapperTest.root"
32: }
32: " thrown in the test body.
32: [  FAILED  ] SingleWrapperTest/1.WriteAndRead, where TypeParam = double (1 ms)
32: [----------] 1 test from SingleWrapperTest/1 (1 ms total)
32: 

SNIP

32: [----------] 1 test from SingleWrapperTest/31, where TypeParam = geosx::mapBase<long, int, std::integral_constant<bool, false> >
32: [ RUN      ] SingleWrapperTest/31.WriteAndRead
32: unknown file: Failure
32: C++ exception with description "
32: {
32:   "file": "/work206/workrd/DEV/gazzola/GEOSX/thirdPartyLibs/build-pandev14-release/conduit/src/conduit/src/libs/relay/conduit_relay_io_hdf5.cpp",
32:   "line": 1888,
32:   "message": "HDF5 Error code-1 Error opening HDF5 file for writing: testRestartBasic_SingleWrapperTest.root"
32: }
32: " thrown in the test body.
32: [  FAILED  ] SingleWrapperTest/31.WriteAndRead, where TypeParam = geosx::mapBase<long, int, std::integral_constant<bool, false> > (0 ms)
32: [----------] 1 test from SingleWrapperTest/31 (0 ms total)
32: 
32: [----------] Global test environment tear-down
32: [==========] 32 tests from 32 test cases ran. (29 ms total)
32: [  PASSED  ] 0 tests.
32: [  FAILED  ] 32 tests, listed below:
32: [  FAILED  ] SingleWrapperTest/0.WriteAndRead, where TypeParam = int
32: [  FAILED  ] SingleWrapperTest/1.WriteAndRead, where TypeParam = double
32: [  FAILED  ] SingleWrapperTest/2.WriteAndRead, where TypeParam = R1TensorT<3>

SNIP

32: [  FAILED  ] SingleWrapperTest/30.WriteAndRead, where TypeParam = geosx::mapBase<long, int, std::integral_constant<bool, true> >
32: [  FAILED  ] SingleWrapperTest/31.WriteAndRead, where TypeParam = geosx::mapBase<long, int, std::integral_constant<bool, false> >
32: 
32: 32 FAILED TESTS
1/1 Test #32: testRestartBasic .................***Failed    0.35 sec

For the extended version, here is what I get:

test 33
    Start 33: testRestartExtended

33: Test command: /work206/workrd/DEV/gazzola/GEOSX/GEOSX/build-pandev14-relwithdebinfo/tests/testRestartExtended
33: Test timeout computed to be: 1500
33: [==========] Running 1 test from 1 test case.
33: [----------] Global test environment set-up.
33: [----------] 1 test from testRestartExtended
33: [ RUN      ] testRestartExtended.testRestartExtended
33: unknown file: Failure
33: C++ exception with description "
33: {
33:   "file": "/work206/workrd/DEV/gazzola/GEOSX/thirdPartyLibs/build-pandev14-release/conduit/src/conduit/src/libs/relay/conduit_relay_io_hdf5.cpp",
33:   "line": 1888,
33:   "message": "HDF5 Error code-1 Error opening HDF5 file for writing: testRestartExtended.root"
33: }
33: " thrown in the test body.
33: [  FAILED  ] testRestartExtended.testRestartExtended (11 ms)
33: [----------] 1 test from testRestartExtended (11 ms total)
33: 
33: [----------] Global test environment tear-down
33: [==========] 1 test from 1 test case ran. (12 ms total)
33: [  PASSED  ] 0 tests.
33: [  FAILED  ] 1 test, listed below:
33: [  FAILED  ] testRestartExtended.testRestartExtended
33: 
33:  1 FAILED TEST
1/1 Test #33: testRestartExtended ..............***Failed    0.33 sec

If I try to run a simple Laplace case on 1 core, I also get the following (I do not know if this is related):

DBCreate: File not found or invalid permissions: siloFiles/data/plot_00000001.000
DBCreate: File not found or invalid permissions: siloFiles/plot_00000001
DBPutUcdmesh: File was closed or never opened/created.: siloFiles/plot_00000001
DBPutZonelist2: File was closed or never opened/created.: siloFiles/plot_00000001
DBGetDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBSetDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBPutMultimesh: File was closed or never opened/created.: siloFiles/plot_00000001
DBSetDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBPutMaterial: File was closed or never opened/created.: siloFiles/plot_00000001
DBGetDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBSetDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBPutMultimat: File was closed or never opened/created.: siloFiles/plot_00000001
DBSetDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBMkDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBMkDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBSetDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBGetDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBInqMeshtype: File was closed or never opened/created.: siloFiles/plot_00000001
DBSetDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBInqMeshtype: File was closed or never opened/created.: siloFiles/plot_00000001
DBSetDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBInqMeshtype: File was closed or never opened/created.: siloFiles/plot_00000001
DBSetDir: File was closed or never opened/created.: siloFiles/plot_00000001
DBSetDir: File was closed or never opened/created.: siloFiles/plot_00000001
****************************************************************************************************
[ERROR in line 2699 of file /work206/workrd/DEV/gazzola/GEOSX/GEOSX/src/coreComponents/fileIO/silo/SiloFile.cpp]
unhandled case in SiloFile::WriteDataField A

** StackTrace of 12 frames **
Frame 1: axom::slic::logErrorMessage(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)
Frame 2: void geosx::SiloFile::WriteMaterialDataField<double, R2SymTensorT<3> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, LvArray::Array<LvArray::Array<LvArray::ArrayView<R2SymTensorT<3> const, 2, 1, long, LvArray::ChaiBuffer>, 1, camp::int_seq<long, 0l>, long, LvArray::ChaiBuffer>, 1, camp::int_seq<long, 0l>, long, LvArray::ChaiBuffer> const&, geosx::ElementRegionBase const*, int, int, double, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, LvArray::Array<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, 1, camp::int_seq<long, 0l>, long, LvArray::ChaiBuffer> const&)
Frame 3: void geosx::SiloFile::WriteMaterialDataField2d<double, R2SymTensorT<3> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, geosx::ElementRegionBase const*, int, int, double, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, LvArray::Array<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, 1, camp::int_seq<long, 0l>, long, LvArray::ChaiBuffer> const&)
Frame 4: geosx::SiloFile::WriteMaterialMapsFullStorage(geosx::ElementRegionBase const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, LvArray::Array<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, 1, camp::int_seq<long, 0l>, long, LvArray::ChaiBuffer> const&, int, double)
Frame 5: geosx::SiloFile::WriteElementMesh(geosx::ElementRegionBase const*, geosx::NodeManager const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, double**, long long const*, char const*, int, double, bool&)
Frame 6: geosx::SiloFile::WriteMeshLevel(geosx::MeshLevel const*, int, double, bool)
Frame 7: geosx::SiloOutput::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 8: geosx::EventBase::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 9: geosx::EventManager::Run(geosx::dataRepository::Group*)
Frame 10: main
Frame 11: __libc_start_main
Frame 12: bin/geosx() [0x403f55]
=====

rrsettgast commented 4 years ago

Looks like you are unable to create the silo/hdf files. Are you able to test the hdf that you have built?

TotoGaz commented 4 years ago

I've been able to perform other tests. It turns out that the version I've built does not work on the Lustre file systems of Pangea, but the same version works fine on NFS or tmpfs, for example.

On hickory, all versions work seamlessly with the Lustre file system...

Have you experienced or heard about strange behaviors with Lustre (or exotic Lustre configurations)?
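
One way to narrow this down is to compare the file system type, and whether a plain file lock can be taken, in a directory where the tests fail and in one where they pass. The sketch below is only an illustration: it assumes GNU coreutils `stat` and the util-linux `flock` tool are available, `fs_type` and `can_flock` are hypothetical helper names, and a refused lock is only one possible explanation for the errors above.

```shell
#!/bin/sh
# Report the file system type behind a directory (GNU coreutils `stat`).
fs_type() {
    stat -f -c %T "${1:-.}"
}

# Try to take and release an flock(2) lock on a scratch file in the directory,
# using the util-linux `flock` tool. Returns non-zero if the scratch file
# cannot be created or if the lock is refused.
can_flock() {
    probe="${1:-.}/.flock_probe_$$"
    : > "$probe" || return 1
    flock -n "$probe" true
    status=$?
    rm -f "$probe"
    return "$status"
}

# Compare the current directory (e.g. the build tree) with /tmp.
for dir in . /tmp; do
    if can_flock "$dir"; then lock="ok"; else lock="refused"; fi
    echo "$dir: type=$(fs_type "$dir") flock=$lock"
done
```

If the failing directory reports a different type and a refused lock while the working one does not, the file system configuration is a plausible suspect rather than the HDF5 build itself.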

rrsettgast commented 4 years ago

@TotoGaz can you tell me where to find gcc8 on pangea2?

TotoGaz commented 4 years ago

> @TotoGaz can you tell me where to find gcc8 on pangea2?

I could not find it so I had to recompile it myself. Can you access /workrd/DEV/gazzola/GEOSX/source_me.sh? It should define the env for gcc/g++/gfortran, but also cmake and openmpi 2.1.6 recompiled with this gcc 8.

In the middle of the shell file, there is a bloody module load intel-compxe/18.0.3.222 so that I have the MKL (I had neither BLAS nor LAPACK otherwise). It's not good, but I do not think it will influence our current issue (I may be mistaken).

In the same folder you will find GEOSX.tar and thirdpartyLibs.tar that you can use instead of fighting with git/git-lfs/proxy. They are clean checkouts of a recent version of GEOSX. Use them wisely ;-)

Note also that I did try GEOSX before and after https://github.com/GEOSX/GEOSX/commit/603c4d1671ac08dc9b455c0d6d5af0198f9f6742 (I thought that it may have an influence). I had read/write issues too (in Sidre though).

rrsettgast commented 4 years ago

@TotoGaz Try running after exporting this:

export HDF5_USE_FILE_LOCKING=FALSE

rrsettgast commented 4 years ago

@TotoGaz Also, why SLES11 and not RHEL7 on Pangea2?

rrsettgast commented 4 years ago

@TotoGaz I built using icc19 on RHEL7. Ran the sedov problem successfully with hdf5 output after exporting that env variable.

TotoGaz commented 4 years ago

> @TotoGaz Try running after exporting this:
>
> export HDF5_USE_FILE_LOCKING=FALSE

Great, it's working. I could run the unit tests and a simple Laplace case. I'll deal with Integrated tests a little later.
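
For anyone who prefers not to export the variable for the whole session, it can also be scoped to a single command. This is a minimal sketch: `run_without_hdf5_locking` is a hypothetical helper name, and the commented-out `ctest -R testRestart` line assumes the tests are launched through ctest from the build directory.

```shell
#!/bin/sh
# Run one command with HDF5 file locking disabled, without changing
# the environment of the calling shell.
run_without_hdf5_locking() {
    env HDF5_USE_FILE_LOCKING=FALSE "$@"
}

# Show what the child process sees.
run_without_hdf5_locking sh -c 'echo "HDF5_USE_FILE_LOCKING=$HDF5_USE_FILE_LOCKING"'

# Assumed usage from the build directory (matches both restart tests by name):
# run_without_hdf5_locking ctest -R testRestart
```

Scoping the variable this way keeps locking enabled for other HDF5 applications that run from the same shell.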

> Also, why SLES11 and not RHEL7 on Pangea2?

I don't really know. I bet this was the best overall vendor answer during the call for tenders. The global IT and the HPC IT teams are somewhat separate.

TotoGaz commented 4 years ago

Thanks for your help @rrsettgast! I'm closing the issue.