
anTraX: high throughput tracking of color-tagged insects
https://antrax.readthedocs.io/
GNU General Public License v3.0

What causes "Segmentation violation" during the tracking stage? #19

Closed janamach closed 3 years ago

janamach commented 3 years ago

I ran into an error while running tracking on multiple files on HPC and I was able to reproduce the same error by starting the job with an identical command, which was:

antrax track H1CN0304/ --movlist 1-5,13,14,21-39,43,45,49-59 --hpc --hpc-options partition=single,email=jana.mach@bio.uni-freiburg.de,cpus=10,mem-per-cpu=5000,time=09:00:00

Both times, all instances failed with a similar error (see below). However, with a slightly different command all movies were tracked successfully:

antrax track H1CN0304/ --movlist 1-5,13,14,21-39,43,45,49-59 --hpc --hpc-options partition=fat,email=jana.mach@bio.uni-freiburg.de,cpus=8,mem-per-cpu=5000,time=11:00:00

The difference between the two commands was the partition used (single vs fat) and the number of cpus (10 vs 8). The partition called single is usually less busy and by default I can run many more tasks in parallel. I was wondering if there was an explanation for the error with the first command? Am I overdoing it with the number of CPUs? Still trying to get a feeling for this :-)

(base) [fr_jm1121@uc2n994 ~]$ cat ants/H1CN0304/antrax/logs2/matlab_track_m_1.log 
07:18:55 -I- Reading video information from file
07:19:02 -I- Linking method is link_blobs
07:19:03 -I- Starting the frame loop
07:19:03 -I- opening video file H1CN0304/videos//01_12/H1CN0304_01.mp4
07:23:01 -I- Finished tracking frame #1000 (1000/90000)
07:26:58 -I- Finished tracking frame #2000 (2000/90000)
07:30:46 -I- Finished tracking frame #3000 (3000/90000)
07:34:38 -I- Finished tracking frame #4000 (4000/90000)
07:37:55 -I- Finished tracking frame #5000 (5000/90000)
07:41:23 -I- Finished tracking frame #6000 (6000/90000)
07:45:00 -I- Finished tracking frame #7000 (7000/90000)
07:48:56 -I- Finished tracking frame #8000 (8000/90000)
07:53:02 -I- Finished tracking frame #9000 (9000/90000)
07:56:45 -I- Finished tracking frame #10000 (10000/90000)
08:00:17 -I- Finished tracking frame #11000 (11000/90000)
08:03:32 -I- Finished tracking frame #12000 (12000/90000)
08:06:59 -I- Finished tracking frame #13000 (13000/90000)
08:10:45 -I- Finished tracking frame #14000 (14000/90000)
08:14:49 -I- Finished tracking frame #15000 (15000/90000)
08:18:59 -I- Finished tracking frame #16000 (16000/90000)
08:23:09 -I- Finished tracking frame #17000 (17000/90000)
08:27:27 -I- Finished tracking frame #18000 (18000/90000)
08:31:56 -I- Finished tracking frame #19000 (19000/90000)
08:35:45 -I- Finished tracking frame #20000 (20000/90000)
08:39:40 -I- Finished tracking frame #21000 (21000/90000)
08:44:15 -I- Finished tracking frame #22000 (22000/90000)
08:48:02 -I- Finished tracking frame #23000 (23000/90000)
08:51:45 -I- Finished tracking frame #24000 (24000/90000)
08:55:47 -I- Finished tracking frame #25000 (25000/90000)
09:00:07 -I- Finished tracking frame #26000 (26000/90000)
09:04:35 -I- Finished tracking frame #27000 (27000/90000)
09:08:14 -I- Finished tracking frame #28000 (28000/90000)
09:12:14 -I- Finished tracking frame #29000 (29000/90000)
09:16:03 -I- Finished tracking frame #30000 (30000/90000)
09:20:33 -I- Finished tracking frame #31000 (31000/90000)
09:25:10 -I- Finished tracking frame #32000 (32000/90000)
09:28:56 -I- Finished tracking frame #33000 (33000/90000)
09:32:44 -I- Finished tracking frame #34000 (34000/90000)

--------------------------------------------------------------------------------
       Segmentation violation detected at Fri Mar 26 09:36:04 2021 +0100
--------------------------------------------------------------------------------

Configuration:
  Crash Decoding           : Disabled - No sandbox or build area path
  Crash Mode               : continue (default)
  Default Encoding         : UTF-8
  Deployed                 : true
  GNU C Library            : 2.28 stable
  Graphics Driver          : Unknown hardware 
  Graphics card 1          : 0x102b ( 0x102b ) 0x538 Version 0.0.0.0 (0-0-0)
  Java Version             : Java 1.8.0_181-b13 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
  MATLAB Architecture      : glnxa64
  MATLAB Entitlement ID    : Unknown
  MATLAB Root              : /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96
  MATLAB Version           : 9.6.0.1472908 (R2019a) Update 9
  OpenGL                   : hardware
  Operating System         : "Red Hat Enterprise Linux release 8.2 (Ootpa)"
  Process ID               : 1450793
  Processor ID             : x86 Family 6 Model 85 Stepping 7, GenuineIntel
  Session Key              : 9ae81582-ff3f-4ec2-a5e5-5e2a1ecbd7d3
  Static TLS mitigation    : Enabled: Full
  Window System            : The X.Org Foundation (12004000), display 10.0.3.226:14.0

Fault Count: 1

Abnormal termination:
Segmentation violation

Register State (from fault):
  RAX = 00000000000000c0  RBX = 0000148f4e8d1000
  RCX = 00000000dc40bce8  RDX = 000000008d000500
  RSP = 0000148eafffbc60  RBP = 000000000000000a
  RSI = 0000000000000062  RDI = 00000000dc40bce8

   R8 = 0000000000000467   R9 = 0000148f527fc710
  R10 = 000000000000000b  R11 = 0000000000000202
  R12 = 0000000000000000  R13 = 0000000000000000
  R14 = 0000148f527fd6b8  R15 = 0000148e30e620b0

  RIP = 0000148f531c20b6  EFL = 0000000000010202

   CS = 0033   FS = 0000   GS = 0000

Stack Trace (from fault):
[  0] 0x0000148f531c20b6                        /lib64/ld-linux-x86-64.so.2+00041142
[  1] 0x0000148f531c2c31                        /lib64/ld-linux-x86-64.so.2+00044081
[  2] 0x0000148f52571ddc                                   /lib64/libc.so.6+01277404
[  3] 0x0000148f52fb5328                                  /lib64/libdl.so.2+00004904
[  4] 0x0000148f52572414                                   /lib64/libc.so.6+01278996 _dl_catch_exception+00000132
[  5] 0x0000148f525724d3                                   /lib64/libc.so.6+01279187 _dl_catch_error+00000051
[  6] 0x0000148f52fb5939                                  /lib64/libdl.so.2+00006457
[  7] 0x0000148f52fb5393                                  /lib64/libdl.so.2+00005011 dlsym+00000099
[  8] 0x0000148f4b2b6054 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libut.so+00376916 utFindSymbolInLibrary+00000308
[  9] 0x0000148e54097bf5 /pfs/data5/home/fr/fr_fr/fr_jm1121/.mcrCache9.6/antrax0/toolbox/images/images/private/../../../../bin/glnxa64/libmwhalideBinder.so+00011253 _ZN12HalideBinder16getMethodPointerEPKcS1_+00000613
[ 10] 0x0000148e5409c982 /pfs/data5/home/fr/fr_fr/fr_jm1121/.mcrCache9.6/antrax0/toolbox/images/images/private/morphmex_halide.mexa64+00006530 mexFunction+00000978
[ 11] 0x0000148f488f7d60 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmex.so+00544096
[ 12] 0x0000148f488f8d73 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmex.so+00548211
[ 13] 0x0000148f488e471c /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmex.so+00464668
[ 14] 0x0000148f482d457f /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_dispatcher.so+01082751 _ZN8Mfh_file20dispatch_file_commonEMS_FviPP11mxArray_tagiS2_EiS2_iS2_+00000207
[ 15] 0x0000148f482d607e /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_dispatcher.so+01089662
[ 16] 0x0000148f482d65c1 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_dispatcher.so+01091009 _ZN8Mfh_file8dispatchEiPSt10unique_ptrI11mxArray_tagN6matrix6detail17mxDestroy_deleterEEiPPS1_+00000033
[ 17] 0x0000148f467a09ea /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+13859306
[ 18] 0x0000148f467a637f /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+13882239
[ 19] 0x0000148f4689e971 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+14899569
[ 20] 0x0000148f4680b044 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+14295108
[ 21] 0x0000148f46831a3d /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+14453309
[ 22] 0x0000148f46112700 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+06985472
[ 23] 0x0000148f460fca92 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+06896274
[ 24] 0x0000148f46100f73 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+06913907
[ 25] 0x0000148f46660571 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+12547441
[ 26] 0x0000148f4678ef97 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+13787031
[ 27] 0x0000148f4678f3d5 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+13788117
[ 28] 0x0000148f482d457f /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_dispatcher.so+01082751 _ZN8Mfh_file20dispatch_file_commonEMS_FviPP11mxArray_tagiS2_EiS2_iS2_+00000207
[ 29] 0x0000148f482d6713 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_dispatcher.so+01091347 _ZN8Mfh_file19dispatch_with_reuseEiPP11mxArray_tagiS2_+00000323
[ 30] 0x0000148f468cdf2e /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+15093550
[ 31] 0x0000148f4660c8d6 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+12204246
[ 32] 0x0000148f4660ca4c /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+12204620
[ 33] 0x0000148f466ace48 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+12861000
[ 34] 0x0000148f466ade78 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_lxe.so+12865144
[ 35] 0x0000148f4989ac60 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwm_interpreter.so+01289312 _Z44inCallFcnWithTrapInDesiredWSAndPublishEventsiPP11mxArray_tagiS1_PKcbP15inWorkSpace_tag+00000080
[ 36] 0x0000148f48a4ee7d /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwiqm.so+00712317 _ZN3iqm15BaseFEvalPlugin7executeEP15inWorkSpace_tag+00000525
[ 37] 0x0000148f4a2ef749 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwmcr.so+00943945
[ 38] 0x0000148f48a43f2d /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwiqm.so+00667437
[ 39] 0x0000148f48a26eba /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwiqm.so+00548538
[ 40] 0x0000148f48a27b2f /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwiqm.so+00551727
[ 41] 0x0000148f4a2d2e8e /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwmcr.so+00827022
[ 42] 0x0000148f4a2d3648 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwmcr.so+00829000
[ 43] 0x0000148f4a2ccd92 /pfs/data5/home/fr/fr_fr/fr_jm1121/MATLAB/v96/bin/glnxa64/libmwmcr.so+00802194
[ 44] 0x0000148f52d9c2de                             /lib64/libpthread.so.0+00033502
[ 45] 0x0000148f52535e83                                   /lib64/libc.so.6+01031811 clone+00000067
[ 46] 0x0000000000000000                                   <unknown-module>+00000000

This error was detected while a MEX-file was running. If the MEX-file
is not an official MathWorks function, please examine its source code
for errors. Please consult the External Interfaces Guide for information
on debugging MEX-files.
** This crash report has been saved to disk as /home/fr/fr_fr/fr_jm1121/matlab_crash_dump.1450793-1 **

MATLAB is exiting because of fatal error
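
A rough way to read the progress lines in the log above: the timestamps give the tracking rate, and from that an estimate of the total runtime to compare against the requested `time=` limit. The sketch below is only an illustration. `estimate_runtime` and the regular expression are hypothetical helpers written for this particular log format, not part of anTraX.

```python
import re
from datetime import datetime, timedelta

# Matches progress lines such as:
#   07:23:01 -I- Finished tracking frame #1000 (1000/90000)
PROGRESS = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) -I- Finished tracking frame #\d+ \((\d+)/(\d+)\)"
)


def estimate_runtime(log_text):
    """Return (frames_per_second, estimated_total_runtime) for a tracking log.

    The rate is computed between the first and the last progress line.
    """
    points = []
    for line in log_text.splitlines():
        match = PROGRESS.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            points.append((stamp, int(match.group(2)), int(match.group(3))))
    if len(points) < 2:
        raise ValueError("need at least two progress lines")

    (t0, f0, total), (t1, f1, _) = points[0], points[-1]
    elapsed = (t1 - t0).total_seconds()
    if elapsed < 0:
        # Assumption: the log crossed midnight at most once.
        elapsed += 24 * 3600
    if elapsed == 0:
        raise ValueError("progress lines have identical timestamps")

    rate = (f1 - f0) / elapsed
    return rate, timedelta(seconds=total / rate)


if __name__ == "__main__":
    # First and last progress lines from the log in this thread.
    sample = (
        "07:23:01 -I- Finished tracking frame #1000 (1000/90000)\n"
        "09:32:44 -I- Finished tracking frame #34000 (34000/90000)\n"
    )
    rate, total = estimate_runtime(sample)
    print(f"{rate:.2f} frames/s, estimated total runtime {total}")
```

For the log above (frame 1000 at 07:23:01, frame 34000 at 09:32:44) this works out to roughly 4.2 frames per second, or about 6 hours for all 90000 frames, so the 09:00:00 time limit was probably not what stopped the job.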

asafgal commented 3 years ago

MATLAB segmentation faults are a pain to debug :(

As it happens only on a specific partition, with the same data and code, I would say something in the machines on that partition causes them to not play nicely with MATLAB. It also happens mid-tracking, so my guess is a memory issue of some sort, although anTraX memory consumption is fixed during execution and doesn't "explode". There is not much I can do to debug this, and honestly I would say go with the partition you know works...

And yes, your settings are overkill. The tracking step is a two-thread program: one thread reads frames from the video and the other analyzes them. Therefore, it never uses more than 2 CPUs. Usually one thread has to wait for the other, so you'll see CPU usage between 100% and 200%. Memory usage is more variable and depends on factors such as frame size, number of blobs detected, etc. 4GB per task is usually more than enough. If you want to optimize, you can check the actual memory usage (either on your local machine or your cluster) and adapt your settings.
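
One possible way to check the actual memory usage on a local Linux machine is to run a single tracking task as a child process and read its peak resident set size afterwards. This is a minimal sketch using only the Python standard library. `peak_child_memory_mb` is a hypothetical helper, and the `antrax track` invocation in the comment is an assumption about how a single movie would be run locally (it reuses the `--movlist` option from the commands above, without `--hpc`). It also assumes the tracking process is started and waited for by the command being measured.

```python
import resource
import subprocess
import sys


def peak_child_memory_mb(cmd):
    """Run cmd to completion and return the peak resident set size in MB.

    The value comes from RUSAGE_CHILDREN, so it is the largest peak among
    all child processes this Python process has waited for so far.
    """
    subprocess.run(cmd, check=True)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Stand-in command that allocates about 50 MB. For a real measurement,
    # replace it with the tracking command, for example (assumed form):
    #   ["antrax", "track", "H1CN0304/", "--movlist", "1"]
    demo = [sys.executable, "-c", "x = b'x' * 50_000_000"]
    print(f"peak memory: {peak_child_memory_mb(demo):.0f} MB")
```

The measured peak, plus some headroom, would then be a reasonable starting point for the `mem-per-cpu` value, together with `cpus=2` for the two tracking threads.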

janamach commented 3 years ago

I agree that this error is too rare to be worth debugging, but it would be good to have an idea of what to avoid so that it doesn't happen again. I used that same partition again for tracking, but had fewer tasks running in parallel, and the segmentation error did not occur. I wonder if I created a bottleneck somewhere by running too many tasks at the same time.

I remember you mentioned before that tracking was a two-thread thing, and I somehow ignored it :-/ I guess I left many CPUs idle for no reason, not good. :-/

On the upside, I did manage to go through all the steps on HPC, from installing all dependencies locally to generating CSVs from my own data. This is a big achievement for me, as the whole concept of HPC was largely unknown to me up until now. Thank you again for all the help and for being so patient with me :)