HDFGroup / vol-async

Asynchronous I/O for HDF5
https://hdf5-vol-async.readthedocs.io

E3SM-IO failed on 1-process run #37

Open · wkliao opened this issue 1 year ago

wkliao commented 1 year ago

I am using the develop branch of vol-async (commit 73a870d) to test the E3SM-IO benchmark. One of the tests failed. The failing command runs on 1 MPI process, but the same command runs fine with 16 processes.

Below are the related env variables.

HDF5_PLUGIN_PATH=$HOME/ASYNC_VOL/lib
HDF5_VOL_CONNECTOR=async under_vol=0;under_info={}
LD_LIBRARY_PATH=$HOME/ASYNC_VOL/lib:$HOME/Argobots/1.1/lib:$HOME/HDF5/1.14.1-2-thread/lib
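
Note that the HDF5_VOL_CONNECTOR value contains a semicolon, so when it is set in a shell it must be quoted; a minimal sketch:

  # Quote the value so the shell does not treat ';' as a command separator.
  export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"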

Here is the run command.

e3sm_io -k -r 2 -y 2 datasets/map_f_case_16p.h5 -o blob_f_out.h5 -a hdf5 -x blob

Part of the GDB trace is given below.

#26 0x00007f717436f218 in H5D__write (count=count@entry=1, dset_info=dset_info@entry=0x7f71565fff00)
    at ../../hdf5-1.14.1-2/src/H5Dio.c:745
#27 0x00007f71745b1f61 in H5VL__native_dataset_write (count=1, obj=<optimized out>, 
    mem_type_id=<optimized out>, mem_space_id=0x1922630, file_space_id=0x191b230, dxpl_id=<optimized out>, 
    buf=0x191c130, req=0x0) at ../../hdf5-1.14.1-2/src/H5VLnative_dataset.c:407
#28 0x00007f717459db47 in H5VL__dataset_write (cls=<optimized out>, req=0x0, buf=0x191c130, 
    dxpl_id=792633534417207497, file_space_id=0x191b230, mem_space_id=0x1922630, mem_type_id=0x191a430, 
    obj=0x1915350, count=1) at ../../hdf5-1.14.1-2/src/H5VLcallback.c:2236
#29 H5VLdataset_write (count=1, obj=0x1915350, connector_id=648518346341351424, mem_type_id=0x191a430, 
    mem_space_id=0x1922630, file_space_id=0x191b230, dxpl_id=792633534417207497, buf=0x191c130, req=0x0)
    at ../../hdf5-1.14.1-2/src/H5VLcallback.c:2396
#30 0x00007f71725a8ef0 in async_dataset_write_fn (foo=0x1a335a0)
    at /homes/wkliao/ASYNC_VOL/vol-async/src/h5_async_vol.c:9712
#31 0x00007f717238104a in ABTD_ythread_func_wrapper (p_arg=0x7f71566001e0)
    at ../../argobots-1.1/src/arch/abtd_ythread.c:21
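
A backtrace like the one above can be captured by attaching GDB to the hung process; a minimal sketch (the process ID is a placeholder):

  # attach to the hung e3sm_io process and print all thread stacks
  gdb -batch -ex "thread apply all bt" -p <pid>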
houjun commented 1 year ago

Hi @wkliao, I just tried running e3sm_io with your command on Perlmutter, with the latest vol-async (85c37d4) and the HDF5 1.14.2 release, and everything seems to be fine. Can you try again with these versions?

wkliao commented 1 year ago

The latest 85c37d4 appears to fix the problem. Thanks for the fix.

FYI, Async VOL is tested automatically whenever new commits are pushed to the E3SM-IO and Log VOL repositories.

Any plan to make a new release?

houjun commented 1 year ago

Yes, I'll do some more testing and release a new version today.

houjun commented 1 year ago

@wkliao, I just released v1.8. Please let me know if you find any issues.

wkliao commented 1 year ago

I am getting a test-program hang when running the Cache and Async VOLs together, without Log VOL. The test program is group.cpp, which simply creates 2 HDF5 group objects, and the GitHub Actions output can be found here. All environment variables used in the test can also be found there.

There is no error message. The test was terminated as it ran out of time.
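
For context, a minimal sketch of what such a test does (illustrative only, not the actual group.cpp source; the file and group names are made up, and the Cache/Async VOLs are selected through the environment variables rather than in code):

  #include <mpi.h>
  #include "hdf5.h"

  int main(int argc, char **argv) {
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

      /* Open the file for parallel access; the stacked VOLs come from
       * HDF5_VOL_CONNECTOR, so no VOL-specific calls are needed here. */
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
      hid_t file = H5Fcreate("group_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      /* Create 2 group objects, as group.cpp does. */
      hid_t g1 = H5Gcreate2(file, "g1", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
      hid_t g2 = H5Gcreate2(file, "g2", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
      H5Gclose(g1); /* with HDF5_ASYNC_EXE_GCLOSE=1, queued async ops may start here */
      H5Gclose(g2);

      H5Fclose(file); /* a hang at close would point at outstanding async work */
      H5Pclose(fapl);
      MPI_Finalize();
      return 0;
  }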

wkliao commented 1 year ago

Just realized that the failed test program was not using Log VOL. It uses only Cache and Async VOLs. I have revised my previous post accordingly.

houjun commented 1 year ago

Hi @wkliao, I tried the Log VOL group test and the other basic tests on Perlmutter, and they all ran successfully with the Cache and Async VOLs. I'm not sure what went wrong there. Can you try running the test again? Is there a verbose mode that can print out where it got stuck?

wkliao commented 1 year ago

As this failure happened in GitHub Actions, I suggest creating a new workflow in Async VOL that tests group.cpp only. Please use the following software versions.

   MPICH_VERSION: 4.1.2
   HDF5_VERSION: 1.14.2
   ARGOBOTS_VERSION: 1.1
   ASYNC_VOL_VERSION: 1.8
   Cache VOL: master branch

You can reuse part of the YAML file. Note that testing group.cpp requires no Log VOL.

wkliao commented 1 year ago

I reran the same GitHub workflow and it failed (hung) at a different test program: https://github.com/DataLib-ECP/vol-log-based/actions/runs/6159901278/job/16737743501

The test uses the following environment variables. Could you please check whether they are OK?

  export ABT_DIR=${GITHUB_WORKSPACE}/Argobots
  export ASYNC_DIR=${GITHUB_WORKSPACE}/Async
  export CACHE_DIR=${GITHUB_WORKSPACE}/Cache
  export HDF5_DIR=${GITHUB_WORKSPACE}/HDF5
  export HDF5_ROOT=${HDF5_DIR}
  export HDF5_PLUGIN_PATH=${CACHE_DIR}/lib:${ASYNC_DIR}/lib
  export LD_LIBRARY_PATH=${CACHE_DIR}/lib:${ASYNC_DIR}/lib:${ABT_DIR}/lib:${HDF5_DIR}/lib:${LD_LIBRARY_PATH}
  export HDF5_VOL_CONNECTOR="cache_ext config=${GITHUB_WORKSPACE}/cache.cfg;under_vol=512;under_info={under_vol=0;under_info={}}"
  export MPICH_MAX_THREAD_SAFETY=multiple
  export HDF5_USE_FILE_LOCKING=FALSE
  export HDF5_ASYNC_DISABLE_DSET_GET=0
  # Start async execution at file close time
  export HDF5_ASYNC_EXE_FCLOSE=1
  # Start async execution at group close time
  export HDF5_ASYNC_EXE_GCLOSE=1
  # Start async execution at dataset close time
  export HDF5_ASYNC_EXE_DCLOSE=1
  export TEST_NATIVE_VOL_ONLY=1
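
For reference, the HDF5_VOL_CONNECTOR string above stacks three connectors; a breakdown, assuming 512 is Async VOL's registered connector value and 0 is the native VOL:

  cache_ext config=...          <- Cache VOL on top
    under_vol=512               <- Async VOL underneath
      under_vol=0;under_info={} <- native VOL at the bottom
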
houjun commented 1 year ago

The HDF5_ASYNC_EXE_* ones are not necessary, but they should be harmless. I'll try setting up an environment the same as the GitHub Actions runner and find out what is causing the hang.

houjun commented 1 year ago

I have made a new vol-async 1.8.1 release, which seems to fix the hang issue. However, there are new errors with "Test stacking Log VOL on top of Cache VOL only - make check". Based on the name, it doesn't seem to use Async VOL, so I'm not sure what went wrong.

wkliao commented 1 year ago

The error message says Cache VOL requires the test programs to call MPI_Init_thread() instead of MPI_Init(). Is this true for Cache VOL?

 [CACHE_VOL] ERROR: cache VOL requires MPI to be initialized with MPI_THREAD_MULTIPLE. Please use MPI_Init_thread
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
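
For reference, this is what the error message asks for; a minimal sketch of the initialization it expects (the check on provided is optional but useful):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int provided;
      /* Request full thread support instead of calling plain MPI_Init(). */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_MULTIPLE) {
          fprintf(stderr, "MPI_THREAD_MULTIPLE not supported (got %d)\n", provided);
          MPI_Abort(MPI_COMM_WORLD, 1);
      }
      /* ... HDF5 calls through the stacked VOLs go here ... */
      MPI_Finalize();
      return 0;
  }
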
wkliao commented 1 year ago

The hanging problem re-appeared in E3SM-IO. It happened when using Async VOL 1.8.1 + Cache VOL, without Log VOL. I ran it twice: the first hang occurred at the G case and the second at the I case. See the GitHub Actions log at https://github.com/Parallel-NetCDF/E3SM-IO/actions/runs/6242093110

houjun commented 1 year ago

I think Huihuo has been updating Cache VOL actively; it's probably better for the E3SM-IO tests to use a release version. @zhenghh04, do you see the hanging problem with your tests? Is this related to the group and file close issue we talked about yesterday?

wkliao commented 1 year ago

Currently, there are no release versions of Cache VOL. I have made a request; see https://github.com/hpc-io/vol-cache/issues/22.

zhenghh04 commented 1 year ago

@wkliao if you like, you can try the previous v1.2 release: https://github.com/hpc-io/vol-cache/releases/tag/v1.2.

I'll push a new release soon.

zhenghh04 commented 1 year ago

> I think Huihuo has been updating Cache VOL actively; it's probably better for the E3SM-IO tests to use a release version. @zhenghh04, do you see the hanging problem with your tests? Is this related to the group and file close issue we talked about yesterday?

I see the hang issue with the F case. Basically, it stops at the H5VLfile_close call.

wkliao commented 1 year ago

Hi @zhenghh04, I can see 3 tags and 3 pre-releases in Cache VOL. You could make 1.2 an official release before making release 1.3. I suggest also making tags 1.0 and 1.1 official releases, which will make the release history look more formal.