Open wkliao opened 1 year ago
Hi @wkliao, I just tried running e3sm_io with your command on Perlmutter, using the latest vol-async (85c37d4) and the HDF5 1.14.2 release, and everything seems to be fine. Can you try again with these versions?
Yes, I'll do some more testing and release a new version today.
@wkliao, I just released v1.8, please let me know if you find any issue.
I am getting a test program hanging problem when running Cache and Async VOLs without Log VOL. The test program is group.cpp, which simply creates 2 HDF5 group objects. The GitHub action output can be found here, along with all environment variables used in the test.
There is no error message. The test was terminated as it ran out of time.
Just realized that the failed test program was not using Log VOL. It uses only Cache and Async VOLs. I have revised my previous post accordingly.
Hi @wkliao, I tried the Log VOL group test and other basic tests on Perlmutter, and they all ran successfully with Cache and Async VOL, so I'm not sure what went wrong there. Can you try running the test again? Is there a verbose mode that can print out where it got stuck?
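One way to make a silent hang easier to spot in CI is to wrap the test in a time limit, so it fails quickly with a recognizable exit code instead of running until the whole job is killed. The following is only a minimal sketch: `run_with_limit` is a hypothetical helper name, and it assumes GNU coreutils `timeout` is available on the runner.

```shell
# Hypothetical helper: run a command under a time limit (in seconds) and
# report a suspected hang. Assumes GNU coreutils `timeout`, which exits
# with status 124 when the limit is reached.
run_with_limit() {
    limit="$1"
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HANG: '$*' exceeded ${limit}s" >&2
    fi
    return "$rc"
}
```

For example, `run_with_limit 300 mpiexec -n 1 ./group` would flag the group test after five minutes; the executable name `./group` is an assumption here, not taken from the workflow.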
As this failure happened on GitHub Actions, I suggest creating a new workflow in Async VOL to test group.cpp only. Please use the following software versions.
MPICH_VERSION: 4.1.2
HDF5_VERSION: 1.14.2
ARGOBOTS_VERSION: 1.1
ASYNC_VOL_VERSION: 1.8
Cache VOL: master branch
You can reuse part of the yaml file. Note that testing group.cpp requires no Log VOL.
I reran the same GitHub workflow and it failed (hung) at a different test program: https://github.com/DataLib-ECP/vol-log-based/actions/runs/6159901278/job/16737743501
The test uses the following environment variables. Could you please check whether they are OK?
export ABT_DIR=${GITHUB_WORKSPACE}/Argobots
export ASYNC_DIR=${GITHUB_WORKSPACE}/Async
export CACHE_DIR=${GITHUB_WORKSPACE}/Cache
export HDF5_DIR=${GITHUB_WORKSPACE}/HDF5
export HDF5_ROOT=${HDF5_DIR}
export HDF5_PLUGIN_PATH=${CACHE_DIR}/lib:${ASYNC_DIR}/lib
export LD_LIBRARY_PATH=${CACHE_DIR}/lib:${ASYNC_DIR}/lib:${ABT_DIR}/lib:${HDF5_DIR}/lib:${LD_LIBRARY_PATH}
export HDF5_VOL_CONNECTOR="cache_ext config=${GITHUB_WORKSPACE}/cache.cfg;under_vol=512;under_info={under_vol=0;under_info={}}"
export MPICH_MAX_THREAD_SAFETY=multiple
export HDF5_USE_FILE_LOCKING=FALSE
export HDF5_ASYNC_DISABLE_DSET_GET=0
# Start async execution at file close time
export HDF5_ASYNC_EXE_FCLOSE=1
# Start async execution at group close time
export HDF5_ASYNC_EXE_GCLOSE=1
# Start async execution at dataset close time
export HDF5_ASYNC_EXE_DCLOSE=1
export TEST_NATIVE_VOL_ONLY=1
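A quick sanity check on settings like these is to confirm that every directory listed in the colon-separated variables actually exists on the runner. This is a sketch only; `missing_dirs` is a hypothetical helper and not part of any of the VOL packages.

```shell
# Hypothetical helper: print each entry of a colon-separated path list
# that is not an existing directory. Empty entries are skipped.
missing_dirs() {
    old_ifs=$IFS
    IFS=:
    for d in $1; do
        if [ -n "$d" ] && [ ! -d "$d" ]; then
            echo "$d"
        fi
    done
    IFS=$old_ifs
    return 0
}
```

Calling `missing_dirs "$HDF5_PLUGIN_PATH"` or `missing_dirs "$LD_LIBRARY_PATH"` after the exports prints nothing when all listed directories are present.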
The HDF5_ASYNC_EXE_* ones are not necessary, but they should be harmless. I'll try setting up an environment the same as the GitHub Actions runner and find out what is causing the hang.
I have a new vol-async 1.8.1 release, which seems to fix the hang issue. However, there are new errors with "Test stacking Log VOL on top of Cache VOL only - make check". Based on the name, it doesn't seem to use Async VOL, so I'm not sure what went wrong.
The error message says Cache VOL requires the test programs to call MPI_Init_thread() instead of MPI_Init(). Is this true for Cache VOL?
[CACHE_VOL] ERROR: cache VOL requires MPI to be initialized with MPI_THREAD_MULTIPLE. Please use MPI_Init_thread
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
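To find which test programs would trigger this message, one option is to scan the test sources for calls to MPI_Init() with no MPI_Init_thread() call. The sketch below is a rough text search, not a parser; `find_plain_mpi_init` is a hypothetical helper name, and a match in a comment or string would be counted as well.

```shell
# Hypothetical helper: list source files that call MPI_Init(...) but
# never mention MPI_Init_thread. The pattern requires "(" right after
# MPI_Init (optionally preceded by whitespace), so MPI_Init_thread(
# does not match it.
find_plain_mpi_init() {
    for f in "$@"; do
        if grep -Eq 'MPI_Init[[:space:]]*\(' "$f" &&
           ! grep -q 'MPI_Init_thread' "$f"; then
            echo "$f"
        fi
    done
}
```

It could be run as `find_plain_mpi_init tests/*.cpp`, where the `tests/` path is an assumption about the repository layout.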
The hanging problem reappeared in E3SM-IO. It happened when using Async I/O 1.8.1 + Cache VOLs, without Log VOL. I ran it twice. The hang first occurred in the G case and then in the I case. See the GitHub action log at https://github.com/Parallel-NetCDF/E3SM-IO/actions/runs/6242093110
I think Huihuo has been updating Cache VOL actively, so it is probably better for the E3SM-IO tests to use the release version. @zhenghh04, do you see the hanging problem with your tests? Is this related to the group and file close issue we talked about yesterday?
Currently, there are no release versions of Cache VOL. I have made a request; see https://github.com/hpc-io/vol-cache/issues/22.
@wkliao if you like, you can try the previous v1.2 release: https://github.com/hpc-io/vol-cache/releases/tag/v1.2.
I'll push a new release soon.
I see the hang issue with the F case. Basically, it stops at the H5VLfile_close call.
Hi @zhenghh04, I can see 3 tags and 3 pre-releases in Cache VOL. You can actually make 1.2 an official release before making release 1.3. I suggest also making tags 1.0 and 1.1 official releases, which will make the release history look formal.
I am using the develop branch of vol-async (73a870d) to test the E3SM-IO benchmark. One of the tests failed. The failing command was run on 1 MPI process, but the same command runs fine with 16 processes.
Below are the related env variables.
Here is the run command.
Part of the GDB trace is given below.