Open brtnfld opened 1 year ago
The branch is https://github.com/brtnfld/hdf5/tree/ASYNC_F
To run the test:
#!/bin/bash
export ABT_DIR=$HOME/work/argobots/build/argobots/
export HDF5_DIR=$HOME/work/hdf5.brtnfld/build/hdf5
export LD_LIBRARY_PATH="$HDF5_DIR/lib64:$HOME/packages/szip-2.1.1/szip/lib64:$ABT_DIR/lib64:$LD_LIBRARY_PATH"
export HDF5_PLUGIN_PATH="$HOME/work/vol-async/build/lib"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"
mpiexec -n 6 ./async_test
I'm also seeing periodic hangs with 8 ranks, but that is probably a separate issue:
#0 0x00007f5178db5890 in pool_pop_shared () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#1 0x00007f5178db9aea in sched_run () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#2 0x00007f5178da59b9 in thread_main_sched_func () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#3 0x00007f5178db3c98 in ABTD_ythread_func_wrapper () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#4 0x00007f5178da5469 in ABTD_ythread_context_func_wrapper () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#5 0x0000000000000000 in ?? ()
@brtnfld , can you add your full test code file here?
It is also here: https://github.com/brtnfld/hdf5/blob/ASYNC_F/fortran/testpar/async.F90
line 252 is the issue.
Got it. Is there a C version of this test code?
No, only Fortran.
@brtnfld I'm able to reproduce the error. After some debugging, this appears to be an old issue that I thought had already been resolved in HDF5; either it has recurred or my earlier testing did not cover this case well.
Basically, the issue comes from HDF5 checking whether an attribute is already open in H5Oattribute.c. It does not seem to handle the future IDs used by the async VOL when some attributes are already created/opened and others are not. I found two workarounds that avoid this error:
I forget whether it was Neil or Jordan who looked at this issue before. Can you check with them and see if there is a better solution?
Also, the test code seems to always segfault at the end:
nid00074:testpar$ srun -n 6 ./async_test
H5ES API tests PASSED
H5A async API tests PASSED
srun: error: nid00074: tasks 0-4: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=520944.42
srun: error: nid00074: task 5: Segmentation fault
Thanks, I'll ask Jordan and Neil. I've not seen that segmentation fault before, though I've only run it on a local desktop.
BTW, even if I add an ESwait after the last exists, it still fails.
@brtnfld does setting the environment variable work for you? I don't think adding an ESwait would help; the issue seems to come from HDF5 checking the cached attribute.
Yes, HDF5_ASYNC_EXE_FCLOSE fixes the issue.
@houjun could you share more details of your debugging? Looking through the future ID code I'm having trouble understanding how this could happen.
Hi @fortnern, I tried two things in my debugging that seem to fix this issue. The first is to comment out the code in the HDF5 library (lines 473-479 and 512 of H5Oattribute.c) so that HDF5 doesn't check whether an attribute is already open. The second is in vol-async: I can delay the execution of all attribute operations to a later time (e.g. file close time). My guess at the cause is that something goes wrong when the library checks its cached attributes: either it doesn't like the future ID or there is some interference from vol-async. The interference seems unlikely, though, since only one thread can perform HDF5 operations at a time when thread safety is turned on.
@brtnfld @fortnern, can you check if the latest develop branch fixes all the Fortran test issues?
It passes most of the time, but running it over and over, I can sometimes get it to fail with:
async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x7f48829964e2 in ???
#1 0x7f4882995675 in ???
#2 0x7f4880518cff in ???
#3 0x7f4880518c6b in ???
#4 0x7f488051a304 in ???
#5 0x7f4880510c69 in ???
#6 0x7f4880510cf1 in ???
#7 0x7f4883466b0a in H5F__get_objects_cb
at ../../src/H5Fint.c:631
#8 0x7f4883536331 in H5I__iterate_cb
at ../../src/H5Iint.c:1526
#9 0x7f4883537c5a in H5I_iterate
at ../../src/H5Iint.c:1592
#10 0x7f4883466a03 in H5F__get_objects
at ../../src/H5Fint.c:599
#11 0x7f4883469ff4 in H5F_get_obj_count
at ../../src/H5Fint.c:475
#12 0x7f4883573ffd in H5O__attr_find_opened_attr
at ../../src/H5Oattribute.c:661
#13 0x7f488357539b in H5O__attr_open_by_name
at ../../src/H5Oattribute.c:473
#14 0x7f488334df3f in H5A__open
at ../../src/H5Aint.c:535
#15 0x7f488375e753 in H5VL__native_attr_open
at ../../src/H5VLnative_attr.c:158
#16 0x7f488373aaac in H5VL__attr_open
at ../../src/H5VLcallback.c:1104
#17 0x7f4883742b96 in H5VLattr_open
at ../../src/H5VLcallback.c:1175
#18 0x7f47f20c2737 in async_attr_open_fn
at /home/brtnfld/work/vol-async/src/h5_async_vol.c:5772
#19 0x7f47f209bc97 in ???
#20 0x7f47f20a1e78 in ???
#21 0xffffffffffffffff in ???
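Since the failure above is intermittent, a small loop can make it easier to catch. The following is a minimal sketch, not part of the HDF5 test suite: `run_until_fail` is a hypothetical helper name, and the run count and launch command in the usage comment are assumptions taken from the run script earlier in this thread.

```shell
#!/bin/bash
# Hypothetical helper (not from HDF5 or vol-async): rerun a command up to
# max_runs times and stop at the first non-zero exit status.
run_until_fail() {
    local max_runs=$1
    shift
    local i
    for ((i = 1; i <= max_runs; i++)); do
        if ! "$@"; then
            # Report which iteration failed so the run can be correlated with logs.
            echo "run $i failed" >&2
            return 1
        fi
    done
    echo "all $max_runs runs passed"
    return 0
}

# Assumed usage, matching the launch line from the run script in this thread:
#   run_until_fail 100 mpiexec -n 6 ./async_test
```

Stopping at the first failure keeps that run's output at the end of the log.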
@brtnfld I think this is probably the same issue I mentioned earlier with the opened attribute. Did you set "export HDF5_ASYNC_EXE_FCLOSE=1"? In my previous debugging, the issue seemed to come from searching the cached attributes in the library; my guess is that the (filled) future ID is not handled properly by the library. I'll see if I can find out more this week.
That was my mistake. It got removed when I edited the run script. With it set, all the tests pass.
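For reference, a minimal sketch of a run script that keeps the workaround in place, assuming the same paths and connector string as the script earlier in this thread. The `require_async_workaround` guard is a hypothetical helper, not part of HDF5 or vol-async; it only refuses to launch when the variable has been dropped, which is the mistake described above.

```shell
#!/bin/bash
# Paths and connector string are copied from the run script earlier in this
# thread; adjust them for your own build.
export HDF5_PLUGIN_PATH="$HOME/work/vol-async/build/lib"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"
# Workaround confirmed in this thread for the H5Aopen_async_f failure.
export HDF5_ASYNC_EXE_FCLOSE=1

# Hypothetical guard: fail early if the workaround variable is missing.
require_async_workaround() {
    if [ "${HDF5_ASYNC_EXE_FCLOSE:-}" != "1" ]; then
        echo "HDF5_ASYNC_EXE_FCLOSE=1 is not set; refusing to run" >&2
        return 1
    fi
}

# Launch only when the guard passes and the test binary is present.
if require_async_workaround && [ -x ./async_test ]; then
    mpiexec -n 6 ./async_test
fi
```

The remaining exports from the original script (ABT_DIR, HDF5_DIR, LD_LIBRARY_PATH) are omitted here for brevity and would still be needed.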
As I'm developing the FORTRAN async tests in HDF5, I'm seeing an issue with H5Aopen_async_f (backtrace below)
Sometimes the test fails and sometimes it does not. I'm running on 6 ranks.
It is basically doing: