ethz-asl / kalibr

The Kalibr visual-inertial calibration toolbox

Illegal instruction (core dumped) when running `rosrun kalibr kalibr_calibrate_cameras` #631

Closed: cod3monk3y closed this issue 10 months ago

cod3monk3y commented 1 year ago

When I run a command such as `rosrun kalibr kalibr_calibrate_cameras --model pinhole-equi --bag $1 --topics /gopro/image_raw --target april_7x6_7cm_1p4.yaml --show-extraction --bag-freq 10`, I always get "Illegal instruction (core dumped)". This happens with every variation of the calibration commands and camera models I've tried so far.

I am running Ubuntu 18.04 with ROS Melodic, and I built Kalibr from source (latest as of ~7/12/2023, git hash f581b27...).

The video frames show up in the visualizer, with all AprilTags visible in every frame. The AprilGrid is a custom 7x6 target with 7 cm tags and a 0.2 spacing ratio (1.4 cm).


The code runs Optimizer2.cpp twice. The first pass uses the options maxIterations: 200, convergenceDeltaX: 0.001, convergenceDeltaJ: 1. It takes ~9 iterations and produces reasonable projection and distortion coefficients. (The GoPro video is 1080p and was downscaled by 0.5 during conversion to a ROS bag.)

Projection initialized to: [ 433.20917666  433.91020906  480.2953882   265.47819541]
Distortion initialized to: [ 0.04030825  0.06855689 -0.079548    0.03388686]

On the second pass through Optimizer2, the options are maxIterations: 50, convergenceDeltaX: 0.001, convergenceDeltaJ: 0.001, and it crashes on the first iteration.

I've narrowed the crash down to this line in the `LinearSolver::solve` method of aslam_incremental_calibration/incremental_calibration/src/core/LinearSolver.cpp:

const int status = SuiteSparseQR_numeric<double>(qrTolerance, A_l, _factor, &_cholmod);

The stack trace from the core dump confirms that the fault occurs inside SuiteSparseQR (frame #0 is OpenBLAS's `dlarfg_`, called from libspqr):

$ coredumpctl -o core.x dump /usr/bin/python2.7
           PID: 18034 (python)
           UID: 1000 (ubuntu)
           GID: 1000 (ubuntu)
        Signal: 4 (ILL)
     Timestamp: Thu 2023-07-13 14:41:20 UTC (3min 8s ago)
  Command Line: python /home/ubuntu/kalibr_workspace/src/kalibr/aslam_offline_calibration/kalibr/python/kalibr_calibrate_cameras --model pinhole-equi --bag GX010467.bag --topics /gopro/image_raw --target april_7x6_7cm_1p4.yaml --show-extraction --bag-freq 10
    Executable: /usr/bin/python2.7
 Control Group: /user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service
          Unit: user@1000.service
     User Unit: gnome-terminal-server.service
         Slice: user-1000.slice
     Owner UID: 1000 (ubuntu)
       Boot ID: ---
    Machine ID: ---
      Hostname: ---
       Storage: /var/lib/systemd/coredump/core.python.1000.eec1a6110adb41c88d51ecb9f94a6445.18034.1689259280000000.lz4
       Message: Process 18034 (python) of user 1000 dumped core.

                Stack trace of thread 18034:
                #0  0x00007f1f34ddc321 dlarfg_ (libopenblas.so.0)
                #1  0x00007f1f363fcccb _Z10spqr_frontIdElllldllPT_PlPcS1_S1_PdS4_P21cholmod_common_struct (libspqr.so.2)
                #2  0x00007f1f363f94c1 _Z11spqr_kernelIdEvlP9spqr_blobIT_E (libspqr.so.2)
                #3  0x00007f1f363fed6f _Z14spqr_factorizeIdEP12spqr_numericIT_EPP21cholmod_sparse_structldlP13spqr_symbolicP21cholmod_common_struct (libspqr.so.2)
                #4  0x00007f1f363f3be2 _Z21SuiteSparseQR_numericIdEidP21cholmod_sparse_structP27SuiteSparseQR_factorizationIT_EP21cholmod_common_struct (libspqr.so.2)
                #5  0x00007f1eb23f76ad n/a (/home/ubuntu/kalibr_workspace/devel/lib/libincremental_calibration.so)

The core dump was obtained with:

$ sudo apt-get install systemd-coredump
$ coredumpctl list | tail
$ coredumpctl -o core.x dump /usr/bin/python2.7

I didn't find many (if any) issues related to core dumps, either in the kalibr repo or in SuiteSparse.

Any recommendations on what to try next? I'm going to try the docker container as my next step.
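Some context on the signal: SIGILL (signal 4) is raised when the CPU is asked to execute an instruction it does not implement. Here, frame #0 is OpenBLAS's `dlarfg_`. One plausible cause, not confirmed anywhere in this thread, is an OpenBLAS build that emits SIMD instructions (for example AVX2 or AVX-512) that the host CPU or VM does not expose. The sketch below assumes GCC or Clang on x86/x86-64 and uses the compiler builtin `__builtin_cpu_supports` to report which of those extensions the running CPU advertises. `simd_support` and `cpu_has` are hypothetical helpers, not part of kalibr, SuiteSparse, or OpenBLAS.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: which x86 SIMD extensions does the *running* CPU advertise?
// Uses the GCC/Clang builtin __builtin_cpu_supports, which only accepts string
// literals, so each feature is listed explicitly. Returns an empty list on
// non-x86 targets, where these feature names do not apply.
std::vector<std::pair<std::string, bool>> simd_support() {
    std::vector<std::pair<std::string, bool>> result;
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();  // make sure the CPU feature data is initialized
    result.emplace_back("sse4.2", __builtin_cpu_supports("sse4.2") != 0);
    result.emplace_back("avx", __builtin_cpu_supports("avx") != 0);
    result.emplace_back("fma", __builtin_cpu_supports("fma") != 0);
    result.emplace_back("avx2", __builtin_cpu_supports("avx2") != 0);
    result.emplace_back("avx512f", __builtin_cpu_supports("avx512f") != 0);
#endif
    return result;
}

// True only if `feature` was queried above and the CPU reports it.
bool cpu_has(const std::string& feature) {
    assert(!feature.empty());
    for (const auto& f : simd_support()) {
        if (f.first == feature) return f.second;
    }
    return false;
}
```

To test that hypothesis, call `simd_support()` from a small `main` on the affected machine, print each name/flag pair, and compare the result against the CPU target the installed OpenBLAS was built for. This does not by itself identify the faulting instruction.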

cod3monk3y commented 1 year ago

I also tried wrapping all the cholmod frees in LinearSolver.cpp in a safe-free idiom:

#define SAFE_cholmod_l_free_sparse(a) do { if (a) { cholmod_l_free_sparse(&(a), &_cholmod); (a) = nullptr; } } while (0)
#define SAFE_cholmod_l_free_dense(a) do { if (a) { cholmod_l_free_dense(&(a), &_cholmod); (a) = nullptr; } } while (0)

and added a few missing NULL-pointer initializations, to rule out a double free or a use-after-free.

But this didn't solve the problem.
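As an aside on the idiom itself, the same guard can be written as a small function template instead of a macro. This avoids macro pitfalls such as evaluating the argument more than once. This is a sketch, not kalibr code: `safe_free` is a hypothetical name, and `free_fn` stands in for a CHOLMOD free routine so that the example compiles without CHOLMOD.

```cpp
#include <cassert>

// Hypothetical function-template version of the SAFE_cholmod_l_free_* macros.
// free_fn is any callable taking T** (e.g. a lambda wrapping a CHOLMOD
// free routine). It is only invoked when p is non-null, and p is reset
// afterwards, so calling safe_free twice on the same handle is harmless.
template <typename T, typename FreeFn>
void safe_free(T*& p, FreeFn free_fn) {
    if (p != nullptr) {
        free_fn(&p);  // library-style free that receives the address of the handle
        p = nullptr;  // redundant if free_fn already nulls *pp, but explicit
    }
    assert(p == nullptr);  // postcondition: the handle cannot dangle
}
```

With CHOLMOD, a call might look like `safe_free(A, [&](cholmod_sparse** pp) { cholmod_l_free_sparse(pp, &_cholmod); });`, where `A` is a `cholmod_sparse*`. Note that `cholmod_l_free_sparse` already sets the handle it receives to NULL, so the explicit reset mainly documents intent.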

cod3monk3y commented 1 year ago

This runs fine in the Docker image built from Dockerfile_ros1_18_04.

That workaround is fine for me, but I'd still like to know if anyone has clues on how to fix the illegal instruction / core dump with a from-source build of kalibr.

goldbattle commented 11 months ago

Do you have a bag that reproduces this? It's tough to diagnose because it depends on the versions of all the dependencies on your system. Does this happen with any of the example bags from the wiki?