UCL / STIR

Software for Tomographic Image Reconstruction
http://stir.sourceforge.net/
Other
114 stars 95 forks source link

GHA: MacOS ctest failures with OpenMP #1036

Open KrisThielemans opened 2 years ago

KrisThielemans commented 2 years ago

Our MacOS test job now fails consistently, test_PoissonLogLikelihoodWithLinearModelForMeanAndProjData, test_blocks_on_cylindrical_projectors, test_FBP2D, test_DataSymmetriesForBins_PET_CartesianGrid etc now all fail really soon with

Subprocess aborted***Exception:   0.62 sec

see e.g. https://github.com/UCL/STIR/runs/6446997086?check_suite_focus=true#step:8:19

However, this occurs also after simply re-running the same job, e.g. the above job is from a re-run of the 5.0.1 commit, which did work originally.

so, this is no a STIR code change, but something else.

KrisThielemans commented 2 years ago

I've used tmate to ssh into the process. It turns out that this is due to STIR_CONFIG_DIR problems. After setting it to the location in the src/config dir, it works fine.

I thought this was addressed in #1026 and in particular 5af1f0defc762f01c47908eb3b7876100e6e416f (and indeed it was).

I find that on my Windows system, I have the same problem. It's running CMake 3.22.3. I think MacOS GHA is now running CMake 3.22.1.

a CMake problem?

KrisThielemans commented 2 years ago

I've tested this in a tiny repo, and there everything works https://github.com/KrisThielemans/GHA-tests/commit/c77bc32deabc6360cc1d83a4d2fdaa252b930afe

KrisThielemans commented 2 years ago

turns out I forgot to add STIR_CONFIG_DIR to the "manual" tests, https://github.com/UCL/STIR/blob/faf0f70fd8bcaeea9d9695bbde8d52afcc9e9906/src/recon_test/CMakeLists.txt#L136 this is true for other tests as well. see e.g. https://github.com/UCL/STIR/runs/6478973644?check_suite_focus=true#step:7:589

PR in progress

KrisThielemans commented 2 years ago

While the above problem with STIR_CONFIG_DIR is true, it is unrelated to the crashes.

It also turns out that crashes do not occur in debug mode, see the test_OSMAPOSL run passing here https://github.com/KrisThielemans/STIR-1/runs/6484447529?check_suite_focus=true#step:9:105

There is no cmake or compiler version change between the successful and failing job.

cmake version 3.23.1
gcc-11 (Homebrew GCC 11.3.0) 11.3.0
KrisThielemans commented 2 years ago

In fact, checking log files, there seems to be no difference in system/dependencies between the 2 jobs. Weird

KrisThielemans commented 2 years ago

some pointers when going in the RelWithDebInfo build via tmate

lldb test_FBP2D gave (sometimes)

thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x000000010000b1c2 test_FBP2D`stir::VoxelsOnCartesianGrid<float>::~VoxelsOnCartesianGrid(this=0x0000000100f04aa0) at VectorWithOffset.inl:83:5
   80     // as begin_allocated_memory is == 0 in that case, and delete[] 0 doesn't do anything
   81     // (I think). Anyway, we're on the safe side now...
   82     if (this->owns_memory_for_data() && this->capacity() != 0)
-> 83       delete[] this->begin_allocated_memory; 
   84   }
   85  
   86   template <class T>
Target 0: (test_FBP2D) stopped.
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x000000010000b1c2 test_FBP2D`stir::VoxelsOnCartesianGrid<float>::~VoxelsOnCartesianGrid(this=0x0000000100f04aa0) at VectorWithOffset.inl:83:5
    frame #1: 0x000000010006e592 test_FBP2D`stir::ForwardProjectorByBin::set_input(this=0x0000000100f04430, density=<unavailable>) at shared_ptr_base.h:168:16
    frame #2: 0x0000000100016002 test_FBP2D`stir::ReconstructionTests<stir::DiscretisedDensity<3, float> >::construct_input_data(this=0x00007ffeefbfdd10) at ReconstructionTests.h:160:29
    frame #3: 0x00000001000153d6 test_FBP2D`stir::TestFBP2D::run_tests(this=0x00007ffeefbfdd10) at test_FBP2D.cxx:77:31
    frame #4: 0x0000000100299146 test_FBP2D`main(argc=<unavailable>, argv=<unavailable>) at test_FBP2D.cxx:129:23
    frame #5: 0x00007fff205f2f3d libdyld.dylib`start + 1
    frame #6: 0x00007fff205f2f3d libdyld.dylib`start + 1

just running test_DataSymmetriesForBins_PET_CartesianGrid

bash-3.2$ src/recon_test/test_DataSymmetriesForBins_PET_CartesianGrid 
Tests for DataSymmetriesForBins_PET_CartesianGrid
Testing span=1

INFO: Detected view-mashing factor 24 from the number of views (8) and the number of detectors per ring (384).
Adjusting the azimuthal angle offset accordingly (an extra offset of 10.7813 degrees)
        Tests with usual image size

INFO: Determined voxel size by dividing default_bin_size (3.12932) by zoom
test_DataSymmetriesForBins_PET_CartesianGrid(5595,0x106255e00) malloc: Incorrect checksum for freed object 0x7fefadc0a738: probably modified after being freed.
Corrupt value: 0x0
test_DataSymmetriesForBins_PET_CartesianGrid(5595,0x106255e00) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

and lldb

lldb) r
Process 5602 launched: '/Users/runner/work/STIR-1/STIR-1/build/src/recon_test/test_DataSymmetriesForBins_PET_CartesianGrid' (x86_64)
Tests for DataSymmetriesForBins_PET_CartesianGrid
Testing span=1

INFO: Detected view-mashing factor 24 from the number of views (8) and the number of detectors per ring (384).
Adjusting the azimuthal angle offset accordingly (an extra offset of 10.7813 degrees)
        Tests with usual image size

INFO: Determined voxel size by dividing default_bin_size (3.12932) by zoom

WARNING: Disabling symmetries as image is rotated due to phi offset of the scanner.
Process 5602 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1019048de)
    frame #0: 0x00007fff20403bd4 libsystem_malloc.dylib`tiny_free_list_add_ptr + 122
libsystem_malloc.dylib`tiny_free_list_add_ptr:
->  0x7fff20403bd4 <+122>: movw   %r15w, -0x2(%rdx,%rax)
    0x7fff20403bda <+128>: jmp    0x7fff20403be5            ; <+139>
    0x7fff20403bdc <+130>: testw  %r15w, %r15w
    0x7fff20403be0 <+134>: jne    0x7fff20403bea            ; <+144>
Target 0: (test_DataSymmetriesForBins_PET_CartesianGrid) stopped.
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1019048de)
  * frame #0: 0x00007fff20403bd4 libsystem_malloc.dylib`tiny_free_list_add_ptr + 122
    frame #1: 0x00007fff2040133f libsystem_malloc.dylib`tiny_malloc_from_free_list + 1632
    frame #2: 0x00007fff2040075d libsystem_malloc.dylib`tiny_malloc_should_clear + 233
    frame #3: 0x00007fff203ff667 libsystem_malloc.dylib`szone_malloc_should_clear + 66
    frame #4: 0x00007fff20418f3b libsystem_malloc.dylib`_malloc_zone_malloc + 118
    frame #5: 0x0000000100999cd8 libstdc++.6.dylib`operator new(unsigned long) + 24
    frame #6: 0x00000001000aa9a1 test_DataSymmetriesForBins_PET_CartesianGrid`::find_relation_between_coordinate_systems(num_planes_per_scanner_ring=0x00000001018071a4, num_planes_per_axial_pos=0x00000001018071a8, axial_pos_to_z_offset=0x00000001018071d8, proj_data_info_cyl_ptr=0x0000000100e07b70, cartesian_grid_info_ptr=0x0000000100e07fc0) at VectorWithOffset.inl:381:15
    frame #7: 0x00000001000ab30b test_DataSymmetriesForBins_PET_CartesianGrid`stir::DataSymmetriesForBins_PET_CartesianGrid::DataSymmetriesForBins_PET_CartesianGrid(this=0x0000000101807180, proj_data_info_ptr=0x00007ffeefbfd848, image_info_ptr=0x00007ffeefbfd860, do_symmetry_90degrees_min_phi_v=<unavailable>, do_symmetry_180degrees_min_phi_v=<unavailable>, do_symmetry_swap_segment_v=<unavailable>, do_symmetry_swap_s_v=false, do_symmetry_shift_z=false) at DataSymmetriesForBins_PET_CartesianGrid.cxx:362:45
    frame #8: 0x00000001000b3462 test_DataSymmetriesForBins_PET_CartesianGrid`stir::ProjMatrixByBinUsingRayTracing::set_up(this=0x00007ffeefbfd6f0, proj_data_info_ptr_v=<unavailable>, density_info_ptr=0x00007ffeefbfd860) at ProjMatrixByBinUsingRayTracing.cxx:292:68
    frame #9: 0x000000010000d2ee test_DataSymmetriesForBins_PET_CartesianGrid`stir::DataSymmetriesForBins_PET_CartesianGridTests::run_tests_all_symmetries(this=0x00007ffeefbfdd30, proj_data_info_sptr=0x00007ffeefbfdd50, density_sptr=0x00007ffeefbfdc40) at test_DataSymmetriesForBins_PET_CartesianGrid.cxx:407:32
    frame #10: 0x00000001000107bb test_DataSymmetriesForBins_PET_CartesianGrid`stir::DataSymmetriesForBins_PET_CartesianGridTests::run_tests_for_1_projdata(this=0x00007ffeefbfdd30, proj_data_info_sptr=0x00007ffeefbfdd50) at test_DataSymmetriesForBins_PET_CartesianGrid.cxx:703:27
    frame #11: 0x0000000100011283 test_DataSymmetriesForBins_PET_CartesianGrid`stir::DataSymmetriesForBins_PET_CartesianGridTests::run_tests(this=0x00007ffeefbfdd30) at test_DataSymmetriesForBins_PET_CartesianGrid.cxx:815:26
    frame #12: 0x000000010029c560 test_DataSymmetriesForBins_PET_CartesianGrid`main(argc=<unavailable>, argv=<unavailable>) at test_DataSymmetriesForBins_PET_CartesianGrid.cxx:852:18
    frame #13: 0x00007fff205f2f3d libdyld.dylib`start + 1
KrisThielemans commented 2 years ago

turns out there still are some failures with Debug https://github.com/UCL/STIR/runs/6539520942?check_suite_focus=true

The following tests FAILED:
      1 - test_DataSymmetriesForBins_PET_CartesianGrid (Subprocess aborted)
      5 - test_PoissonLogLikelihoodWithLinearModelForMeanAndProjData (Subprocess aborted)
      6 - test_OSMAPOSL_ray_tracing_matrix (SEGFAULT)
     12 - test_proj_data_info_subsets (Subprocess aborted)
KrisThielemans commented 2 years ago

note that test_OSMAPOSL_ray_tracing_matrix and test_DataSymmetriesForBins_PET_CartesianGrid did work here: https://github.com/UCL/STIR/runs/6538440399?check_suite_focus=true, so intermittent error (the other 2 weren't run in that job)

KrisThielemans commented 2 years ago

Unexpectedly. #901 's latest test-run still works for the MacOS job. Of course, this could be because the error is intermittent.

That PR is based on master at b8956388b5110b47667c2ea499321ae31748d082 (i.e. after merging #1017).

Checking GHA, the last run on master that worked was for b6f1c2cd84d3645b4e17cdb6e7ced1c8d5b962b5 (merge of #1026). Next update was 770048aa07ef3b97c5094c46b1beb304c28bb5 (just setting version number) which as noted above, the run worked originally, but not on the 2nd attempt.

As the error is intermittent, it is hard to know exactly when this started happening.

KrisThielemans commented 2 years ago

could be race condition, see disabling OpenMP in #1055

KrisThielemans commented 2 years ago

1055 tests were fine. I re-ran them 4 times. So I've merged that in.

KrisThielemans commented 2 years ago

@robbietuk do you still have a MacOS build with OpenMP? Is ctest ok?

robbietuk commented 2 years ago

@robbietuk do you still have a MacOS build with OpenMP? Is ctest ok?

Testing...

robbietuk commented 2 years ago

It looks good on my machine but I dont recall having any issues before #1055