Open KrisThielemans opened 2 years ago
I've used tmate
to ssh into the process. It turns out that this is due to STIR_CONFIG_DIR
problems. After setting it to the location in the src/config
dir, it works fine.
I thought this was addressed in #1026 and in particular 5af1f0defc762f01c47908eb3b7876100e6e416f (and indeed it was).
I find that on my Windows system, I have the same problem. It's running CMake 3.22.3. I think MacOS GHA is now running CMake 3.22.1.
a CMake problem?
I've tested this in a tiny repo, and there everything works https://github.com/KrisThielemans/GHA-tests/commit/c77bc32deabc6360cc1d83a4d2fdaa252b930afe
turns out I forgot to add STIR_CONFIG_DIR
to the "manual" tests, https://github.com/UCL/STIR/blob/faf0f70fd8bcaeea9d9695bbde8d52afcc9e9906/src/recon_test/CMakeLists.txt#L136
this is true for other tests as well.
see e.g. https://github.com/UCL/STIR/runs/6478973644?check_suite_focus=true#step:7:589
PR in progress
While the above problem with STIR_CONFIG_DIR
is true, it is unrelated to the crashes.
It also turns out that crashes do not occur in debug mode, see the test_OSMAPOSL
run passing here https://github.com/KrisThielemans/STIR-1/runs/6484447529?check_suite_focus=true#step:9:105
There is no cmake or compiler version change between the successful and failing job.
cmake version 3.23.1
gcc-11 (Homebrew GCC 11.3.0) 11.3.0
In fact, checking log files, there seems to be no difference in system/dependencies between the 2 jobs. Weird
some pointers when going in the RelWithDebInfo
build via tmate
lldb test_FBP2D
gave (sometimes)
thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x000000010000b1c2 test_FBP2D`stir::VoxelsOnCartesianGrid<float>::~VoxelsOnCartesianGrid(this=0x0000000100f04aa0) at VectorWithOffset.inl:83:5
80 // as begin_allocated_memory is == 0 in that case, and delete[] 0 doesn't do anything
81 // (I think). Anyway, we're on the safe side now...
82 if (this->owns_memory_for_data() && this->capacity() != 0)
-> 83 delete[] this->begin_allocated_memory;
84 }
85
86 template <class T>
Target 0: (test_FBP2D) stopped.
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
* frame #0: 0x000000010000b1c2 test_FBP2D`stir::VoxelsOnCartesianGrid<float>::~VoxelsOnCartesianGrid(this=0x0000000100f04aa0) at VectorWithOffset.inl:83:5
frame #1: 0x000000010006e592 test_FBP2D`stir::ForwardProjectorByBin::set_input(this=0x0000000100f04430, density=<unavailable>) at shared_ptr_base.h:168:16
frame #2: 0x0000000100016002 test_FBP2D`stir::ReconstructionTests<stir::DiscretisedDensity<3, float> >::construct_input_data(this=0x00007ffeefbfdd10) at ReconstructionTests.h:160:29
frame #3: 0x00000001000153d6 test_FBP2D`stir::TestFBP2D::run_tests(this=0x00007ffeefbfdd10) at test_FBP2D.cxx:77:31
frame #4: 0x0000000100299146 test_FBP2D`main(argc=<unavailable>, argv=<unavailable>) at test_FBP2D.cxx:129:23
frame #5: 0x00007fff205f2f3d libdyld.dylib`start + 1
frame #6: 0x00007fff205f2f3d libdyld.dylib`start + 1
just running test_DataSymmetriesForBins_PET_CartesianGrid
bash-3.2$ src/recon_test/test_DataSymmetriesForBins_PET_CartesianGrid
Tests for DataSymmetriesForBins_PET_CartesianGrid
Testing span=1
INFO: Detected view-mashing factor 24 from the number of views (8) and the number of detectors per ring (384).
Adjusting the azimuthal angle offset accordingly (an extra offset of 10.7813 degrees)
Tests with usual image size
INFO: Determined voxel size by dividing default_bin_size (3.12932) by zoom
test_DataSymmetriesForBins_PET_CartesianGrid(5595,0x106255e00) malloc: Incorrect checksum for freed object 0x7fefadc0a738: probably modified after being freed.
Corrupt value: 0x0
test_DataSymmetriesForBins_PET_CartesianGrid(5595,0x106255e00) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6
and lldb
lldb) r
Process 5602 launched: '/Users/runner/work/STIR-1/STIR-1/build/src/recon_test/test_DataSymmetriesForBins_PET_CartesianGrid' (x86_64)
Tests for DataSymmetriesForBins_PET_CartesianGrid
Testing span=1
INFO: Detected view-mashing factor 24 from the number of views (8) and the number of detectors per ring (384).
Adjusting the azimuthal angle offset accordingly (an extra offset of 10.7813 degrees)
Tests with usual image size
INFO: Determined voxel size by dividing default_bin_size (3.12932) by zoom
WARNING: Disabling symmetries as image is rotated due to phi offset of the scanner.
Process 5602 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1019048de)
frame #0: 0x00007fff20403bd4 libsystem_malloc.dylib`tiny_free_list_add_ptr + 122
libsystem_malloc.dylib`tiny_free_list_add_ptr:
-> 0x7fff20403bd4 <+122>: movw %r15w, -0x2(%rdx,%rax)
0x7fff20403bda <+128>: jmp 0x7fff20403be5 ; <+139>
0x7fff20403bdc <+130>: testw %r15w, %r15w
0x7fff20403be0 <+134>: jne 0x7fff20403bea ; <+144>
Target 0: (test_DataSymmetriesForBins_PET_CartesianGrid) stopped.
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1019048de)
* frame #0: 0x00007fff20403bd4 libsystem_malloc.dylib`tiny_free_list_add_ptr + 122
frame #1: 0x00007fff2040133f libsystem_malloc.dylib`tiny_malloc_from_free_list + 1632
frame #2: 0x00007fff2040075d libsystem_malloc.dylib`tiny_malloc_should_clear + 233
frame #3: 0x00007fff203ff667 libsystem_malloc.dylib`szone_malloc_should_clear + 66
frame #4: 0x00007fff20418f3b libsystem_malloc.dylib`_malloc_zone_malloc + 118
frame #5: 0x0000000100999cd8 libstdc++.6.dylib`operator new(unsigned long) + 24
frame #6: 0x00000001000aa9a1 test_DataSymmetriesForBins_PET_CartesianGrid`::find_relation_between_coordinate_systems(num_planes_per_scanner_ring=0x00000001018071a4, num_planes_per_axial_pos=0x00000001018071a8, axial_pos_to_z_offset=0x00000001018071d8, proj_data_info_cyl_ptr=0x0000000100e07b70, cartesian_grid_info_ptr=0x0000000100e07fc0) at VectorWithOffset.inl:381:15
frame #7: 0x00000001000ab30b test_DataSymmetriesForBins_PET_CartesianGrid`stir::DataSymmetriesForBins_PET_CartesianGrid::DataSymmetriesForBins_PET_CartesianGrid(this=0x0000000101807180, proj_data_info_ptr=0x00007ffeefbfd848, image_info_ptr=0x00007ffeefbfd860, do_symmetry_90degrees_min_phi_v=<unavailable>, do_symmetry_180degrees_min_phi_v=<unavailable>, do_symmetry_swap_segment_v=<unavailable>, do_symmetry_swap_s_v=false, do_symmetry_shift_z=false) at DataSymmetriesForBins_PET_CartesianGrid.cxx:362:45
frame #8: 0x00000001000b3462 test_DataSymmetriesForBins_PET_CartesianGrid`stir::ProjMatrixByBinUsingRayTracing::set_up(this=0x00007ffeefbfd6f0, proj_data_info_ptr_v=<unavailable>, density_info_ptr=0x00007ffeefbfd860) at ProjMatrixByBinUsingRayTracing.cxx:292:68
frame #9: 0x000000010000d2ee test_DataSymmetriesForBins_PET_CartesianGrid`stir::DataSymmetriesForBins_PET_CartesianGridTests::run_tests_all_symmetries(this=0x00007ffeefbfdd30, proj_data_info_sptr=0x00007ffeefbfdd50, density_sptr=0x00007ffeefbfdc40) at test_DataSymmetriesForBins_PET_CartesianGrid.cxx:407:32
frame #10: 0x00000001000107bb test_DataSymmetriesForBins_PET_CartesianGrid`stir::DataSymmetriesForBins_PET_CartesianGridTests::run_tests_for_1_projdata(this=0x00007ffeefbfdd30, proj_data_info_sptr=0x00007ffeefbfdd50) at test_DataSymmetriesForBins_PET_CartesianGrid.cxx:703:27
frame #11: 0x0000000100011283 test_DataSymmetriesForBins_PET_CartesianGrid`stir::DataSymmetriesForBins_PET_CartesianGridTests::run_tests(this=0x00007ffeefbfdd30) at test_DataSymmetriesForBins_PET_CartesianGrid.cxx:815:26
frame #12: 0x000000010029c560 test_DataSymmetriesForBins_PET_CartesianGrid`main(argc=<unavailable>, argv=<unavailable>) at test_DataSymmetriesForBins_PET_CartesianGrid.cxx:852:18
frame #13: 0x00007fff205f2f3d libdyld.dylib`start + 1
turns out there still are some failures with Debug
https://github.com/UCL/STIR/runs/6539520942?check_suite_focus=true
The following tests FAILED:
1 - test_DataSymmetriesForBins_PET_CartesianGrid (Subprocess aborted)
5 - test_PoissonLogLikelihoodWithLinearModelForMeanAndProjData (Subprocess aborted)
6 - test_OSMAPOSL_ray_tracing_matrix (SEGFAULT)
12 - test_proj_data_info_subsets (Subprocess aborted)
note that test_OSMAPOSL_ray_tracing_matrix
and test_DataSymmetriesForBins_PET_CartesianGrid
did work here: https://github.com/UCL/STIR/runs/6538440399?check_suite_focus=true, so intermittent error (the other 2 weren't run in that job)
Unexpectedly. #901 's latest test-run still works for the MacOS job. Of course, this could be because the error is intermittent.
That PR is based on master
at b8956388b5110b47667c2ea499321ae31748d082 (i.e. after merging #1017).
Checking GHA, the last run on master
that worked was for b6f1c2cd84d3645b4e17cdb6e7ced1c8d5b962b5 (merge of #1026). Next update was 770048aa07ef3b97c5094c46b1beb304c28bb5 (just setting version number) which as noted above, the run worked originally, but not on the 2nd attempt.
As the error is intermittent, it is hard to know exactly when this started happening.
could be race condition, see disabling OpenMP in #1055
@robbietuk do you still have a MacOS build with OpenMP? Is ctest
ok?
@robbietuk do you still have a MacOS build with OpenMP? Is
ctest
ok?
Testing...
It looks good on my machine but I dont recall having any issues before #1055
Our MacOS test job now fails consistently,
test_PoissonLogLikelihoodWithLinearModelForMeanAndProjData
,test_blocks_on_cylindrical_projectors
,test_FBP2D
,test_DataSymmetriesForBins_PET_CartesianGrid
etc now all fail really soon withsee e.g. https://github.com/UCL/STIR/runs/6446997086?check_suite_focus=true#step:8:19
However, this occurs also after simply re-running the same job, e.g. the above job is from a re-run of the 5.0.1 commit, which did work originally.
so, this is no a STIR code change, but something else.