DiamondLightSource / durin

BSD 3-Clause "New" or "Revised" License

Sets return status != 0 in XDS / SEGV? #17

Closed graeme-winter closed 3 years ago

graeme-winter commented 4 years ago

Though the plugin "works fine", I keep getting

 TASK     cpu time (sec)    elapsed wall-clock time (sec)
   1        3647.1                   554.5
 [generic_data_plugin] - INFO - 'call generic_close()'

 Total elapsed wall-clock time for XDS      557.7 sec
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
xds_par            000000010321A8B4  Unknown               Unknown  Unknown
libsystem_platfor  00007FFF70F49B5D  Unknown               Unknown  Unknown

at the end of processing, which is untidy at best.

graeme-winter commented 4 years ago

OK, this appears to manifest only when using xds_par, not xds, so it is almost certainly to do with (lack of) thread safety.

graeme-winter commented 4 years ago

 Total elapsed wall-clock time for XDS        7.7 sec
Process 78252 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x101e741e0)
    frame #0: 0x0000000101e741e0
error: memory read failed for 0x101e74000
  thread #3, stop reason = EXC_BAD_ACCESS (code=1, address=0x101e741e0)
    frame #0: 0x0000000101e741e0
error: memory read failed for 0x101e74000
Target 0: (xds_par) stopped.

graeme-winter commented 4 years ago

Looks like the error is purely inside the thread library?

 WEAK SPOTS OMITTED                                      0
 NUMBER OF DIFFRACTION SPOTS ACCEPTED                  216

 total elapsed wall-clock time for COLSPOT       2.1 sec

 TASK     cpu time (sec)    elapsed wall-clock time (sec)
   1          13.3                     2.0
 [generic_data_plugin] - INFO - 'call generic_close()'

 Total elapsed wall-clock time for XDS        7.8 sec
Process 78877 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x101e96390)
    frame #0: 0x0000000101e96390
error: memory read failed for 0x101e96200
Target 0: (xds_par) stopped.
(lldb) thread list
Process 78877 stopped
  thread #1: tid = 0x1290416, 0x00007fff70e959de libsystem_kernel.dylib`__ulock_wait + 10, queue = 'com.apple.main-thread'
* thread #2: tid = 0x129042d, 0x0000000101e96390, stop reason = EXC_BAD_ACCESS (code=1, address=0x101e96390)
  thread #3: tid = 0x129042e, 0x00007fff70e9686a libsystem_kernel.dylib`__psynch_cvwait + 10
  thread #4: tid = 0x129042f, 0x00007fff70e9686a libsystem_kernel.dylib`__psynch_cvwait + 10
  thread #5: tid = 0x1290430, 0x00007fff70e9686a libsystem_kernel.dylib`__psynch_cvwait + 10
  thread #6: tid = 0x1290431, 0x00007fff70e9686a libsystem_kernel.dylib`__psynch_cvwait + 10
  thread #7: tid = 0x1290432, 0x00007fff70e9686a libsystem_kernel.dylib`__psynch_cvwait + 10
  thread #8: tid = 0x1290433, 0x00007fff70e9686a libsystem_kernel.dylib`__psynch_cvwait + 10
(lldb) t 2
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x101e96390)
    frame #0: 0x0000000101e96390
error: memory read failed for 0x101e96200
(lldb) thread backtrace
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x101e96390)
  * frame #0: 0x0000000101e96390
    frame #1: 0x00007fff70f52660 libsystem_pthread.dylib`_pthread_tsd_cleanup + 476
    frame #2: 0x00007fff70f55655 libsystem_pthread.dylib`_pthread_exit + 70
    frame #3: 0x00007fff70f522f6 libsystem_pthread.dylib`_pthread_body + 137
    frame #4: 0x00007fff70f55249 libsystem_pthread.dylib`_pthread_start + 66
    frame #5: 0x00007fff70f5140d libsystem_pthread.dylib`thread_start + 13

The full stack for the thread which gave EXC_BAD_ACCESS is within libsystem_pthread 🤔
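
For context, frame #1 in the backtrace, `_pthread_tsd_cleanup`, is the point during thread exit where destructors registered for thread-specific data are called. Below is a minimal sketch of that mechanism in C, using only standard POSIX calls. The names `demo_key`, `demo_destructor`, `demo_worker` and `run_tsd_demo` are hypothetical and chosen for illustration. Whether a destructor of this kind is what sits behind the unreadable frame #0 address is an assumption, not something established in this thread.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t demo_key;     /* hypothetical key, for illustration only */
static int destructor_calls = 0;   /* counts how often the destructor ran */

/* Called by the thread library during thread exit, once for each thread
 * that stored a non-NULL value under demo_key. */
static void demo_destructor(void *value) {
    destructor_calls++;
    free(value);
}

static void *demo_worker(void *arg) {
    (void)arg;
    void *buffer = malloc(16);
    if (buffer != NULL) {
        pthread_setspecific(demo_key, buffer);
    }
    /* Returning from the start routine begins thread exit, which is when
     * the destructor above is invoked. */
    return NULL;
}

/* Start one worker, wait for it, and return the number of destructor calls
 * (or -1 if the key or the thread could not be created). */
int run_tsd_demo(void) {
    pthread_t thread;
    destructor_calls = 0;
    if (pthread_key_create(&demo_key, demo_destructor) != 0) {
        return -1;
    }
    if (pthread_create(&thread, NULL, demo_worker, NULL) != 0) {
        pthread_key_delete(demo_key);
        return -1;
    }
    pthread_join(thread, NULL);   /* returns only after exit cleanup is done */
    pthread_key_delete(demo_key);
    return destructor_calls;
}
```

The destructor is reached through a stored function pointer, so the thread library needs the code it points at to still be mapped when the thread exits.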

graeme-winter commented 4 years ago

Of course, the process which fails is actually mcolspot_par, not xds_par, so I intercepted it in forkxds and ran it inside a debugger. Same end result, though:

Grey-Area durin-segv :) $ lldb `which mcolspot_par`
(lldb) target create "/Users/graeme/xtal/XDS/mcolspot_par"
Current executable set to '/Users/graeme/xtal/XDS/mcolspot_par' (x86_64).
(lldb) run
Process 80562 launched: '/Users/graeme/xtal/XDS/mcolspot_par' (x86_64)
2^D
] master_file=/Volumes/Blue/Data/i03-ins-fdp-small-i/insu_13_1_master.h5
 [generic_data_plugin] - INFO - generic_open
       + library          = </Users/graeme/xtal/XDS/durin-plugin.so>
       + template_name    = <
 /Volumes/Blue/Data/i03-ins-fdp-small-i/insu_13_1_master.h5>
       + dll_filename     = </Users/graeme/xtal/XDS/durin-plugin.so>
       + image_data_filename   = <
 /Volumes/Blue/Data/i03-ins-fdp-small-i/insu_13_1_master.h5>
 [generic_data_plugin] - INFO - generic_get_header
INFO(1:5)=vendor/major version/minor version/patch/timestamp=   1   0   0   0          -1
 generic_getfrm: data are from Dectris
 [generic_data_plugin] - INFO - 'call generic_close()'
Process 80562 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x10fdf9390)
    frame #0: 0x000000010fdf9390
error: memory read failed for 0x10fdf9200
Target 0: (mcolspot_par) stopped.
(lldb) thread list
Process 80562 stopped
  thread #1: tid = 0x12c5251, 0x00007fff70e959de libsystem_kernel.dylib`__ulock_wait + 10, queue = 'com.apple.main-thread'
* thread #2: tid = 0x12c527c, 0x000000010fdf9390, stop reason = EXC_BAD_ACCESS (code=1, address=0x10fdf9390)
(lldb) thread backtrace
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x10fdf9390)
  * frame #0: 0x000000010fdf9390
    frame #1: 0x00007fff70f52660 libsystem_pthread.dylib`_pthread_tsd_cleanup + 476
    frame #2: 0x00007fff70f55655 libsystem_pthread.dylib`_pthread_exit + 70
    frame #3: 0x00007fff70f522f6 libsystem_pthread.dylib`_pthread_body + 137
    frame #4: 0x00007fff70f55249 libsystem_pthread.dylib`_pthread_start + 66
    frame #5: 0x00007fff70f5140d libsystem_pthread.dylib`thread_start + 13

graeme-winter commented 4 years ago

https://strucbio.biologie.uni-konstanz.de/xdswiki/index.php/LIB

Should try with a threaded driver application to see if the problem can be reproduced.

graeme-winter commented 3 years ago

The page above indicates that "this just happens" 🤔

Also, the driver code is only available in Fortran, which is not super helpful...

graeme-winter commented 3 years ago

Compiled the test_generic_host program and I can reproduce the issue. With OMP_NUM_THREADS=1 the program works correctly:

Grey-Area driver :( $ OMP_NUM_THREADS=1 ./test_generic_host < driver.in 
 enter parameter of LIB= keyword:
 enter parameter of NAME_TEMPLATE_OF_DATA_FRAMES= keyword:
 enter parameters of the DATA_RANGE= keyword:
 master_file=/Users/graeme/data/i03-screen19/Protk_1/Protk_1_8_master.h5
 [generic_data_plugin] - INFO - generic_open
       + library          = </Users/graeme/xtal/XDS/durin-plugin.so>
       + template_name    = </Users/graeme/data/i03-screen19/Protk_1/Protk_1_8_master.h5>
       + dll_filename     = </Users/graeme/xtal/XDS/durin-plugin.so>
       + image_data_filename   = </Users/graeme/data/i03-screen19/Protk_1/Protk_1_8_master.h5>
 [generic_data_plugin] - INFO - generic_get_header
nx,ny,nbyte,qx,qy,number_of_frames=  4148  4362     2  0.000075  0.000075   150
INFO(1:5)=vendor/major version/minor version/patch/timestamp=   1   0   0   0          -1
 generic_getfrm: data are from Dectris
 average counts:   1.62359679    
 [generic_data_plugin] - INFO - 'call generic_close()'

but with OMP_NUM_THREADS=2 I get a SEGV:

Grey-Area driver :) $ OMP_NUM_THREADS=2 ./test_generic_host < driver.in 
 enter parameter of LIB= keyword:
 enter parameter of NAME_TEMPLATE_OF_DATA_FRAMES= keyword:
 enter parameters of the DATA_RANGE= keyword:
 master_file=/Users/graeme/data/i03-screen19/Protk_1/Protk_1_8_master.h5
 [generic_data_plugin] - INFO - generic_open
       + library          = </Users/graeme/xtal/XDS/durin-plugin.so>
       + template_name    = </Users/graeme/data/i03-screen19/Protk_1/Protk_1_8_master.h5>
       + dll_filename     = </Users/graeme/xtal/XDS/durin-plugin.so>
       + image_data_filename   = </Users/graeme/data/i03-screen19/Protk_1/Protk_1_8_master.h5>
 [generic_data_plugin] - INFO - generic_get_header
nx,ny,nbyte,qx,qy,number_of_frames=  4148  4362     2  0.000075  0.000075   150
INFO(1:5)=vendor/major version/minor version/patch/timestamp=   1   0   0   0          -1
 generic_getfrm: data are from Dectris
 average counts:   1.62359643    
 [generic_data_plugin] - INFO - 'call generic_close()'

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x10dd589cc
#1  0x10dd57dc5
#2  0x7fff6bcb45fc
Segmentation fault: 11

graeme-winter commented 3 years ago

Now resolved on the XDS side in the latest release, by changing the manner in which the plugin code is unloaded at the end of execution.
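
The thread does not say what exactly changed in XDS. As a general illustration only, one ordering that avoids unloading code while threads may still reach it is to join every worker before calling `dlclose`. The sketch below assumes a host that loads a shared library with `dlopen`; the names `worker` and `run_workers_then_unload` are hypothetical, and the worker body is a placeholder rather than a real plugin call.

```c
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>

#define MAX_WORKERS 16

/* Placeholder worker: a real driver would read frames through the plugin here. */
static void *worker(void *arg) {
    int *done = arg;
    *done = 1;
    return NULL;
}

/*
 * Load a shared library, run n_threads workers, and unload the library only
 * after every worker has been joined. Passing NULL as path opens the running
 * program itself, which allows trying the sketch without a plugin.
 * Returns 0 on success and -1 on failure.
 */
int run_workers_then_unload(const char *path, int n_threads) {
    pthread_t threads[MAX_WORKERS];
    int done[MAX_WORKERS] = {0};
    int started = 0;
    int status = 0;

    if (n_threads < 1 || n_threads > MAX_WORKERS) {
        return -1;
    }

    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    for (int i = 0; i < n_threads; i++) {
        if (pthread_create(&threads[i], NULL, worker, &done[i]) != 0) {
            status = -1;
            break;
        }
        started++;
    }

    /* Join first, so thread-exit cleanup finishes while the library is mapped. */
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        if (!done[i]) {
            status = -1;
        }
    }

    /* Unload last, once no worker can still reach code in the library. */
    if (dlclose(handle) != 0) {
        status = -1;
    }
    return status;
}
```

A call such as `run_workers_then_unload(NULL, 2)` exercises the ordering without any plugin; passing the path to a real plugin would additionally need the plugin's own entry points to be looked up with `dlsym`.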