HDFGroup / vol-async

Asynchronous I/O for HDF5
https://hdf5-vol-async.readthedocs.io

Summit crash with hdf5-iotest and > 1 node #16

Closed brtnfld closed 2 years ago

brtnfld commented 2 years ago

When I run hdf5-iotest on more than one node, it crashes with the backtrace below. It works fine on a single node.

#0  0x000020001ac6bfb4 in ABT_thread_create () from /ccs/home/brtnfld/packages/argobots/build/argobots//lib/libabt.so.1
#1  0x0000200003d98870 in push_task_to_abt_pool (qhead=0x4b22fed0, pool=0x4b2a1980) at h5_async_vol.c:2249
#2  0x0000200003db98e4 in async_file_open (qtype=REGULAR, aid=0x4b22fed0, name=0x7fffdc00e840 "hdf5_iotest.h5", flags=0, fapl_id=792633534417208627, dxpl_id=792633534417207304, req=0x0) at h5_async_vol.c:13253
#3  0x0000200003dd4b3c in H5VL_async_file_open (name=0x7fffdc00e840 "hdf5_iotest.h5", flags=0, fapl_id=792633534417207316, dxpl_id=792633534417207304, req=0x0) at h5_async_vol.c:22141
#4  0x00002000004a85e4 in H5VL__file_open (name=<optimized out>, name@entry=0x7fffdc00e840 "hdf5_iotest.h5", flags=flags@entry=0, fapl_id=<optimized out>, fapl_id@entry=792633534417207316, dxpl_id=<optimized out>, 
    dxpl_id@entry=792633534417207304, req=<optimized out>, req@entry=0x0, cls=<optimized out>, cls=<optimized out>) at ../../src/H5VLcallback.c:3497
#5  0x00002000004b199c in H5VL_file_open (connector_prop=0x7fffdc00e440, name=0x7fffdc00e840 "hdf5_iotest.h5", flags=<optimized out>, fapl_id=792633534417207316, dxpl_id=792633534417207304, req=0x0) at ../../src/H5VLcallback.c:3646
#6  0x000020000025346c in H5F__open_api_common (filename=filename@entry=0x7fffdc00e840 "hdf5_iotest.h5", flags=flags@entry=0, fapl_id=<optimized out>, fapl_id@entry=792633534417207316, token_ptr=token_ptr@entry=0x0)
    at ../../src/H5F.c:795
#7  0x0000200000255c38 in H5Fopen_async (app_file=0x1000f878 "../../src/read_test.c", app_func=0x1000fbc8 "read_test", app_line=<optimized out>, filename=0x7fffdc00e840 "hdf5_iotest.h5", flags=<optimized out>, 
    fapl_id=792633534417207316, es_id=0) at ../../src/H5F.c:880
#8  0x0000000010009284 in ?? ()
#9  0x000000001000820c in ?? ()
#10 0x00002000008b4078 in generic_start_main.isra () from /lib64/power9/libc.so.6
#11 0x00002000008b4264 in __libc_start_main () from /lib64/power9/libc.so.6
#12 0x0000000000000000 in ?? ()

brtnfld commented 2 years ago

The VOL tests pass when using two nodes.

houjun commented 2 years ago

@brtnfld Can you try "export ABT_THREAD_STACKSIZE=100000"? This is probably due to the known issue with the Argobots thread stack size.
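
A minimal sketch of applying that setting in a job script, assuming a POSIX shell. The value 100000 comes from the comment above. The launch line is a hypothetical placeholder and is left commented out, since the actual launcher and arguments depend on the system. On a multi-node run, the variable presumably needs to reach every rank, so it is worth checking how your launcher forwards the environment.

```shell
# Raise the Argobots thread stack size before starting the application
# (value taken from the suggestion in this thread).
export ABT_THREAD_STACKSIZE=100000

# Confirm the variable is exported to child processes.
env | grep '^ABT_THREAD_STACKSIZE='

# Hypothetical launch line: replace with your system's launcher and arguments.
# <launcher> ./hdf5_iotest <config-file>
```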

brtnfld commented 2 years ago

That fixed it. You can close the issue.