HDFGroup / vol-async

Asynchronous I/O for HDF5
https://hdf5-vol-async.readthedocs.io

signal SIGABRT in testing #33

Open brtnfld opened 1 year ago

brtnfld commented 1 year ago

While developing the Fortran async tests in HDF5, I'm seeing an issue with H5Aopen_async_f (backtrace below).

Sometimes the test fails and sometimes it does not. I'm running on 6 ranks.

The test is basically doing the following:


    CALL h5fopen_async_f(filename, H5F_ACC_RDWR_F, file_id, es_id, hdferror, access_prp = fapl_id )
    CALL check("h5fopen_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists0)
    CALL H5Aexists_async_f(file_id, attr_name, f_ptr, es_id, hdferror)
    CALL check("H5Aexists_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists1)
    CALL H5Aexists_async_f(file_id, TRIM(attr_name)//"00", f_ptr, es_id, hdferror)
    CALL check("H5Aexists_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists2)
    CALL H5Aexists_by_name_async_f(file_id, "/", attr_name, f_ptr, es_id, hdferror)
    CALL check("H5Aexists_by_name_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists3)
    CALL H5Aexists_by_name_async_f(file_id, "/", TRIM(attr_name)//"00", f_ptr, es_id, hdferror)
    CALL check("H5Aexists_by_name_async_f",hdferror, total_error)

    CALL H5Aopen_async_f(file_id, attr_name, attr_id0, es_id, hdferror)  ! <--- fails here
    CALL check("H5Aopen_async_f", hdferror, total_error)
async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.
async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7f5c5f7734e2 in ???
#1  0x7f5c5f772675 in ???
#2  0x7f5c5e280d4f in ???
#3  0x7f5c5e280cbb in ???
#4  0x7f5c5e282354 in ???
#5  0x7f5c5e278cb9 in ???
#6  0x7f5c5e278d41 in ???
#7  0x7f5c60813ec7 in H5F__get_objects_cb
        at ../../src/H5Fint.c:631
#8  0x7f5c608e3555 in H5I__iterate_cb
        at ../../src/H5Iint.c:1526
#9  0x7f5c608e4eb2 in H5I_iterate
        at ../../src/H5Iint.c:1592
#10  0x7f5c60813dc0 in H5F__get_objects
        at ../../src/H5Fint.c:599
#11  0x7f5c608173a0 in H5F_get_obj_count
        at ../../src/H5Fint.c:475
#12  0x7f5c60920b98 in H5O__attr_find_opened_attr
        at ../../src/H5Oattribute.c:661
#13  0x7f5c60921f31 in H5O__attr_open_by_name
        at ../../src/H5Oattribute.c:473
#14  0x7f5c606fcacc in H5A__open
        at ../../src/H5Aint.c:535
#15  0x7f5c60b04368 in H5VL__native_attr_open
        at ../../src/H5VLnative_attr.c:154
#16  0x7f5c60ae073d in H5VL__attr_open
        at ../../src/H5VLcallback.c:1104
#17  0x7f5c60ae8827 in H5VLattr_open
        at ../../src/H5VLcallback.c:1175
#18  0x7f5c60d8527b in async_attr_open_fn
        at /home/brtnfld/work/vol-async/src/h5_async_vol.c:5675
#19  0x7f5c5c1bbc97 in ???
#20  0x7f5c5c1c1e98 in ???
#21  0xffffffffffffffff in ???
brtnfld commented 1 year ago

The branch is https://github.com/brtnfld/hdf5/tree/ASYNC_F

To run the test:

#!/bin/bash
export ABT_DIR=$HOME/work/argobots/build/argobots/
export HDF5_DIR=$HOME/work/hdf5.brtnfld/build/hdf5

export LD_LIBRARY_PATH="$HDF5_DIR/lib64:$HOME/packages/szip-2.1.1/szip/lib64:$ABT_DIR/lib64:$LD_LIBRARY_PATH"
export HDF5_PLUGIN_PATH="$HOME/work/vol-async/build/lib"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"

mpiexec -n 6 ./async_test
brtnfld commented 1 year ago

I'm also seeing periodic hangs with 8 ranks, but that is probably a separate issue:

#0  0x00007f5178db5890 in pool_pop_shared () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#1  0x00007f5178db9aea in sched_run () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#2  0x00007f5178da59b9 in thread_main_sched_func () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#3  0x00007f5178db3c98 in ABTD_ythread_func_wrapper () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#4  0x00007f5178da5469 in ABTD_ythread_context_func_wrapper () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#5  0x0000000000000000 in ?? ()
houjun commented 1 year ago

@brtnfld, can you attach your full test code file here?

brtnfld commented 1 year ago

async.F90.gz

It is also here: https://github.com/brtnfld/hdf5/blob/ASYNC_F/fortran/testpar/async.F90

Line 252 is where the issue occurs.

houjun commented 1 year ago

Got it. Is there a C version of this test code?

brtnfld commented 1 year ago

No, only Fortran.

houjun commented 1 year ago

@brtnfld I'm able to reproduce the error. After some debugging, this appears to be an old issue that I thought had already been resolved in HDF5. Either it has recurred or my earlier testing did not cover this case well.

Basically, the issue comes from HDF5 checking in H5Oattribute.c whether an attribute is already open. It does not seem to like the future IDs used by the async VOL when some have already been created/opened and some have not. I found two workarounds that avoid this error:

  1. Comment out lines 473-479 and 512 of H5Oattribute.c. This way HDF5 won't check for already-opened attributes, and things will be fine.
  2. Run "export HDF5_ASYNC_EXE_FCLOSE=1" before you run the test program. This forces the async VOL not to start executing the I/O operations until ESwait or Fclose is called, so the attribute IDs are true future IDs that have not yet been filled by the async VOL.
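
The second workaround can be wrapped in a small shell helper so the variable is set only for the test run. This is a sketch, not part of vol-async: `run_with_fclose_workaround` is a hypothetical name, and the commented-out `mpiexec` line mirrors the run script posted earlier in this thread.

```shell
#!/bin/bash
# Hypothetical helper (not part of vol-async): run the given command with
# HDF5_ASYNC_EXE_FCLOSE=1 set only in that command's environment.
run_with_fclose_workaround() {
    # Per the workaround above, this defers async execution until
    # ESwait or Fclose is called.
    env HDF5_ASYNC_EXE_FCLOSE=1 "$@"
}

# Example usage (commented out so sourcing this file has no side effects):
# run_with_fclose_workaround mpiexec -n 6 ./async_test
```

Using `env` keeps the setting out of the calling shell, so later runs without the helper still exercise the default behavior.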

I forget whether it was Neil or Jordan who looked at this issue before. Can you check with them and see if there is a better solution?

Also, the test code seems to always segfault at the end:

nid00074:testpar$ srun -n 6 ./async_test
H5ES API tests                                                                          PASSED
H5A async API tests                                                                     PASSED
srun: error: nid00074: tasks 0-4: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=520944.42
srun: error: nid00074: task 5: Segmentation fault
brtnfld commented 1 year ago

Thanks, I'll ask Jordan and Neil. I've not seen that segmentation fault before, though I've only run it on a local desktop.

brtnfld commented 1 year ago

BTW, even if I add an ESwait after the last exists, it still fails.

houjun commented 1 year ago

@brtnfld does setting the environment variable work for you? I don't think adding an ESwait would help; the issue seems to come from HDF5 checking the cached attribute.

brtnfld commented 1 year ago

Yes, HDF5_ASYNC_EXE_FCLOSE fixes the issue.

fortnern commented 1 year ago

@houjun could you share more details of your debugging? Looking through the future ID code I'm having trouble understanding how this could happen.

houjun commented 1 year ago

Hi @fortnern, I have tried two things in my debugging that seem to fix this issue. The first is to comment out the code in the HDF5 library (lines 473-479 and 512 of H5Oattribute.c) so that HDF5 doesn't check whether an attribute is already open. The second is in vol-async: I can delay the execution of all the attribute operations to a later time (e.g. at file close time). My guess at the cause is that something goes wrong when the library checks its cached attributes: either it doesn't like the future ID or there's some interference from vol-async. The interference seems unlikely, though, as only one thread can perform HDF5 operations at a time when thread safety is turned on.

houjun commented 1 year ago

@brtnfld @fortnern, can you check if the latest develop branch fixes all the Fortran test issues?

brtnfld commented 1 year ago

It passes most of the time, but running it over and over, I can sometimes get it to fail with:


async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7f48829964e2 in ???
#1  0x7f4882995675 in ???
#2  0x7f4880518cff in ???
#3  0x7f4880518c6b in ???
#4  0x7f488051a304 in ???
#5  0x7f4880510c69 in ???
#6  0x7f4880510cf1 in ???
#7  0x7f4883466b0a in H5F__get_objects_cb
        at ../../src/H5Fint.c:631
#8  0x7f4883536331 in H5I__iterate_cb
        at ../../src/H5Iint.c:1526
#9  0x7f4883537c5a in H5I_iterate
        at ../../src/H5Iint.c:1592
#10  0x7f4883466a03 in H5F__get_objects
        at ../../src/H5Fint.c:599
#11  0x7f4883469ff4 in H5F_get_obj_count
        at ../../src/H5Fint.c:475
#12  0x7f4883573ffd in H5O__attr_find_opened_attr
        at ../../src/H5Oattribute.c:661
#13  0x7f488357539b in H5O__attr_open_by_name
        at ../../src/H5Oattribute.c:473
#14  0x7f488334df3f in H5A__open
        at ../../src/H5Aint.c:535
#15  0x7f488375e753 in H5VL__native_attr_open
        at ../../src/H5VLnative_attr.c:158
#16  0x7f488373aaac in H5VL__attr_open
        at ../../src/H5VLcallback.c:1104
#17  0x7f4883742b96 in H5VLattr_open
        at ../../src/H5VLcallback.c:1175
#18  0x7f47f20c2737 in async_attr_open_fn
        at /home/brtnfld/work/vol-async/src/h5_async_vol.c:5772
#19  0x7f47f209bc97 in ???
#20  0x7f47f20a1e78 in ???
#21  0xffffffffffffffff in ???
houjun commented 1 year ago

@brtnfld I think this is probably the same issue I mentioned earlier with the opened attribute. Did you set "export HDF5_ASYNC_EXE_FCLOSE=1"? In my previous debugging, the issue seemed to come from searching the cached attributes in the library. My guess is that the (filled) future ID is not handled properly by the library. I'll see if I can find out more this week.

brtnfld commented 1 year ago

That was my mistake. It got removed when I edited the run script. With it set, all the tests pass.
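
Since the failure is intermittent, one way to check a workaround over many runs is a loop like the sketch below. `count_failures` is a hypothetical helper, and it assumes the launcher returns a non-zero exit status when a rank aborts. A hang would not be detected without adding a timeout.

```shell
#!/bin/bash
# Hypothetical helper: run a command N times and print how many runs
# exited with a non-zero status.
count_failures() {
    local runs=$1
    shift
    local failures=0
    local i
    for ((i = 0; i < runs; i++)); do
        # Discard the command's output; only the exit status matters here.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    done
    echo "$failures"
}

# Example usage (commented out so sourcing this file has no side effects):
# export HDF5_ASYNC_EXE_FCLOSE=1
# count_failures 50 mpiexec -n 6 ./async_test
```

Comparing the count with and without HDF5_ASYNC_EXE_FCLOSE set gives a rough picture of how often the assertion triggers.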