HDFGroup / vol-async

Asynchronous I/O for HDF5
https://hdf5-vol-async.readthedocs.io

async_test_multifile.exe fails with segmentation fault #22

Closed: BenWibking closed this issue 2 years ago

BenWibking commented 2 years ago

Hi,

I am running on an x86-64 Linux cluster with OpenMPI, and I built vol-async following the instructions in the README, but the tests do not complete successfully:

$ make check_serial
python3 ./pytest.py
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
ERROR: Test async_test_multifile.exe : returned non-zero exit status= -11 aborting test
run_cmd= ./async_test_multifile.exe
pytest was unsuccessful

The backtrace is:

$ cat async_vol_test.err
[gadi-login-07:3639707:0:3639707] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x110)
==== backtrace (tid:3639707) ====
 0 0x0000000000012c20 .annobin_sigaction.c()  sigaction.c:0
 1 0x0000000000007f5c get_n_running_task_in_queue_obj()  /home/120/bw0729/vol-async/src/h5_async_vol.c:2138
 2 0x0000000000008f0c H5VL_async_request_wait()  /home/120/bw0729/vol-async/src/h5_async_vol.c:24279
 3 0x000000000045238a H5VL__request_wait()  /home/120/bw0729/hdf5/src/H5VLcallback.c:6435
 4 0x00000000004653f6 H5VL_request_wait()  /home/120/bw0729/hdf5/src/H5VLcallback.c:6469
 5 0x0000000000177597 H5ES__wait_cb()  /home/120/bw0729/hdf5/src/H5ESint.c:669
 6 0x0000000000178ce2 H5ES__list_iterate()  /home/120/bw0729/hdf5/src/H5ESlist.c:171
 7 0x00000000001786a4 H5ES__wait()  /home/120/bw0729/hdf5/src/H5ESint.c:754
 8 0x0000000000174130 H5ESwait()  /home/120/bw0729/hdf5/src/H5ES.c:342
 9 0x000000000040129a main()  /home/120/bw0729/vol-async/test/async_test_multifile.c:61
10 0x0000000000023493 __libc_start_main()  ???:0
11 0x000000000040106e _start()  ???:0
=================================
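For reference, the crash is inside H5ESwait(), which the test calls after queueing its asynchronous file and dataset operations on an event set. Below is a minimal sketch of that pattern; it is illustrative only, not the actual contents of async_test_multifile.c, and it assumes the async VOL connector is loaded via the HDF5_PLUGIN_PATH / HDF5_VOL_CONNECTOR environment variables.

#include <stdlib.h>
#include <mpi.h>
#include "hdf5.h"

int main(int argc, char **argv)
{
    int provided;
    /* vol-async needs full thread support from MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    hid_t es_id = H5EScreate();                 /* event set collecting async ops */

    hsize_t dims[1] = {1024};
    hid_t space_id  = H5Screate_simple(1, dims, NULL);
    int  *buf       = calloc(dims[0], sizeof(int));

    /* queue a file create, dataset create, and write without blocking */
    hid_t file_id = H5Fcreate_async("test_0.h5", H5F_ACC_TRUNC,
                                    H5P_DEFAULT, H5P_DEFAULT, es_id);
    hid_t dset_id = H5Dcreate_async(file_id, "dset0", H5T_NATIVE_INT, space_id,
                                    H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT, es_id);
    H5Dwrite_async(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                   H5P_DEFAULT, buf, es_id);
    H5Dclose_async(dset_id, es_id);
    H5Fclose_async(file_id, es_id);

    /* ... overlap compute here ... */

    size_t  num_in_progress;
    hbool_t op_failed;
    /* block until every queued operation has completed -- the segfault in the
       backtrace above happens under this call, in the async VOL's request-wait path */
    H5ESwait(es_id, H5ES_WAIT_FOREVER, &num_in_progress, &op_failed);

    H5ESclose(es_id);
    H5Sclose(space_id);
    free(buf);
    MPI_Finalize();
    return 0;
}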
BenWibking commented 2 years ago

For completeness, this is the stdout:

$ cat async_vol_test.out
Compute/sleep for 1 seconds...
Create file [./test_0.h5]
Write dset 0
Write dset 1
  Observed write dset time: 0.000074
  Observed write attr time: 0.008439
  Observed total write time: 0.009069
H5ESwait start
H5ESwait done
Compute/sleep for 1 seconds...
Create file [./test_1.h5]
Write dset 0
Write dset 1
  Observed write dset time: 0.000071
  Observed write attr time: 0.004971
  Observed total write time: 0.005413
H5ESwait start
houjun commented 2 years ago

Hi @BenWibking , what version of HDF5 and OpenMPI are you using?

houjun commented 2 years ago

It seems like the problem is with OpenMPI and its MPI_THREAD_MULTIPLE support (based on this post). Can you run "ompi_info" on your cluster and check whether MPI_THREAD_MULTIPLE is supported?
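(One quick runtime check, independent of ompi_info: request MPI_THREAD_MULTIPLE from MPI_Init_thread and print the level it actually grants. This is a generic sketch, not part of vol-async.)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    /* provided may be lower than requested if the MPI build or transport lacks support */
    printf("requested MPI_THREAD_MULTIPLE (%d), provided level = %d\n",
           MPI_THREAD_MULTIPLE, provided);
    MPI_Finalize();
    return provided == MPI_THREAD_MULTIPLE ? 0 : 1;
}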

BenWibking commented 2 years ago

MPI_THREAD_MULTIPLE is supported in the OpenMPI 4.1.3 build I'm using:

$ ompi_info | grep MPI_THREAD_MULTIPLE
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)

I'm using commit a80897ee4944ff6008bfb3b93619ebcb58a070d1 from the HDF5 repo.

houjun commented 2 years ago

Can you try running the test manually, both with and without mpirun ("mpirun -np 1 ./async_test_multifile.exe" and "./async_test_multifile.exe"), to see if the error occurs?

BenWibking commented 2 years ago

Edit: I did not set the environment variables correctly for this run. This is using the wrong HDF5 install.
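(For context, the vol-async README loads the connector through environment variables roughly like the following; $VOL_DIR, $H5_DIR, and $ABT_DIR are placeholders for the local vol-async, HDF5, and Argobots build directories.)

$ export LD_LIBRARY_PATH=$VOL_DIR/src:$H5_DIR/lib:$ABT_DIR/lib:$LD_LIBRARY_PATH
$ export HDF5_PLUGIN_PATH="$VOL_DIR/src"
$ export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"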

Both of those work:

[bw0729@gadi-login-03 test]$ mpirun -np 1 ./async_test_multifile.exe
Compute/sleep for 1 seconds...
Create file [./test_0.h5]
Write dset 0
Write dset 1
  Observed write dset time: 0.006957
  Observed write attr time: 0.008377
  Observed total write time: 0.044399
H5ESwait start
H5ESwait done
Compute/sleep for 1 seconds...
Create file [./test_1.h5]
Write dset 0
Write dset 1
  Observed write dset time: 0.006853
  Observed write attr time: 0.004882
  Observed total write time: 0.015456
H5ESwait start
H5ESwait done
Compute/sleep for 1 seconds...
Create file [./test_2.h5]
Write dset 0
Write dset 1
  Observed write dset time: 0.006733
  Observed write attr time: 0.004852
  Observed total write time: 0.014861
H5ESwait start
H5ESwait done
Total execution time: 3.135021
Finalize time: 0.000000
[bw0729@gadi-login-03 test]$ ./async_test_multifile.exe
Compute/sleep for 1 seconds...
Create file [./test_0.h5]
Write dset 0
Write dset 1
  Observed write dset time: 0.006415
  Observed write attr time: 0.008508
  Observed total write time: 0.022484
H5ESwait start
H5ESwait done
Compute/sleep for 1 seconds...
Create file [./test_1.h5]
Write dset 0
Write dset 1
  Observed write dset time: 0.007626
  Observed write attr time: 0.004874
  Observed total write time: 0.018154
H5ESwait start
H5ESwait done
Compute/sleep for 1 seconds...
Create file [./test_2.h5]
Write dset 0
Write dset 1
  Observed write dset time: 0.006863
  Observed write attr time: 0.004849
  Observed total write time: 0.017120
H5ESwait start
H5ESwait done
Total execution time: 3.109642
Finalize time: 0.000000

Very strangely, the pytest runner for that test also now works (I did not recompile anything):

[bw0729@gadi-login-03 test]$ make check_serial
python3 ./pytest.py
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
Test # 3 : async_test_multifile.exe PASSED
Test # 4 : async_test_serial_event_set.exe PASSED
ERROR: Test async_test_serial_event_set_error_stack.exe : returned non-zero exit status= 255 aborting test
run_cmd= ./async_test_serial_event_set_error_stack.exe
pytest was unsuccessful
houjun commented 2 years ago

Can you try running the other tests manually as well? "make check*" uses a Python script to run the tests, and it could be the Python environment causing the error. Alternatively, you can build vol-async with CMake and run the tests with "ctest".
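(A rough sketch of that alternative route; the exact CMake cache variables depend on where HDF5 and Argobots are installed, so the paths below are placeholders.)

$ mkdir build && cd build
$ cmake .. -DCMAKE_PREFIX_PATH="$H5_DIR;$ABT_DIR"
$ make
$ ctest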