draios / sysdig

Linux system exploration and troubleshooting tool with first class support for containers
http://www.sysdig.com/

Immediate segfaults on Debian & Ubuntu #1475

Closed ig0rsky closed 4 years ago

ig0rsky commented 5 years ago

Stacktrace:

#0  0x0000000000651494 in sinsp_evt::get_ts() ()
#1  0x000000000072a1b5 in sinsp::next(sinsp_evt**) ()
#2  0x000000000061d834 in do_inspect(sinsp*, unsigned long, unsigned long, bool, bool, bool, bool, sinsp_filter*, std::vector<summary_table_entry, std::allocator<summary_table_entry> >&, sinsp_evt_formatter*) ()
#3  0x0000000000620871 in sysdig_init(int, char**) ()
#4  0x000000000060ad40 in main ()

Module: sysdig Version: 0.26.1 Kernel: 4.9.0-8-amd64 (x86_64)

Tested on Debian stretch and buster, using the automatic installer in both cases. I get a segfault almost immediately after starting sysdig on the command line. The same thing happens with csysdig.

The segfaults do not happen if I write to a file. I think it has to do with not being able to write to stdout fast enough when there is too much output at once, because the segfault occurs only when the system is heavily stressed with containers. :) @fntlnz

EDIT: https://github.com/draios/sysdig/releases/tag/0.26.5 Fixes this issue.

fntlnz commented 5 years ago

Hi @ig0rsky, I have a question: did you build Sysdig yourself, or was it installed from a distributed package/binary?

Can you post a crash dump? It could be helpful to analyze the situation with gdb

ig0rsky commented 5 years ago

Hey @fntlnz, I didn't build it myself. I installed it using the automatic installer.

necsf commented 5 years ago

I encountered the same problem and do not know how to solve it.

#0  0x0000000000651494 in sinsp_evt::get_ts() ()
[Current thread is 1 (Thread 0x7f42ab2fb880 (LWP 120))]
(gdb) bt
#0  0x0000000000651494 in sinsp_evt::get_ts() ()
#1  0x000000000072a1b5 in sinsp::next(sinsp_evt**) ()
#2  0x000000000061d834 in do_inspect(sinsp*, unsigned long, unsigned long, bool, bool, bool, bool, sinsp_filter*, std::vector<summary_table_entry, std::allocator<summary_table_entry> >&, sinsp_evt_formatter*) ()
#3  0x0000000000620871 in sysdig_init(int, char**) ()
#4  0x000000000060ad40 in main ()
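The two backtraces posted so far appear to list the same frames. A minimal Python sketch of how that comparison could be made explicit is below; it is not part of sysdig, the names `FRAME_RE` and `frame_symbols` are hypothetical, and the sample traces are abbreviated copies of the ones above (frames 2 and 3 are left out for brevity).

```python
import re

# Matches gdb frame lines such as
#   "#1  0x000000000072a1b5 in sinsp::next(sinsp_evt**) ()"
# and captures the frame number plus the symbol up to its first "(".
# FRAME_RE and frame_symbols are illustrative names, not sysdig or gdb APIs.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?([^\s(]+)")


def frame_symbols(backtrace: str) -> list:
    """Return the symbol of each frame line, innermost frame first."""
    symbols = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            symbols.append(match.group(2))
    return symbols


# Abbreviated copies of the two reports (frames 2 and 3 omitted).
first_report = """\
#0  0x0000000000651494 in sinsp_evt::get_ts() ()
#1  0x000000000072a1b5 in sinsp::next(sinsp_evt**) ()
#4  0x000000000060ad40 in main ()
"""
second_report = """\
[Current thread is 1 (Thread 0x7f42ab2fb880 (LWP 120))]
(gdb) bt
#0  0x0000000000651494 in sinsp_evt::get_ts() ()
#1  0x000000000072a1b5 in sinsp::next(sinsp_evt**) ()
#4  0x000000000060ad40 in main ()
"""

# True when both reports crash through the same chain of functions.
print(frame_symbols(first_report) == frame_symbols(second_report))
```

Comparing symbols rather than raw lines ignores gdb prompts and thread banners, and it also tolerates the differing addresses of a rebuilt binary.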

gnosek commented 4 years ago

@ig0rsky, @necsf, can you share some details about your setup? How big is your machine? What was the workload that you were running that made sysdig crash? How did you run sysdig?

dnwe commented 4 years ago

I was hitting a similar problem on a k8s worker with the pre-built sysdig 0.26.4 .deb on Ubuntu 18.04 (bionic) so I compiled my own debug release from the tag in git to try and get a full stacktrace that might help.

Here's the segfault and full backtrace after just running gdb --args /usr/bin/sysdig

Thread 1 "sysdig" received signal SIGSEGV, Segmentation fault.
0x0000555555bdda1a in sinsp_evt::get_ts (this=0x5555568c7a30) at /usr/src/sysdig-0.26.4/userspace/libsinsp/event.h:224
(gdb) set print elements 0
(gdb) bt full
#0  0x0000555555bdda1a in sinsp_evt::get_ts (this=0x5555568c7a30) at /usr/src/sysdig-0.26.4/userspace/libsinsp/event.h:224
No locals.
#1  0x0000555555d2eceb in sinsp::next (this=0x5555568c79b0, puevt=0x7fffffffd480) at /usr/src/sysdig-0.26.4/userspace/libsinsp/sinsp.cpp:1129
        evt = 0x5555568c7a30
        res = 0
        ts = 1573057163999999999
        nfdr = 0
        __PRETTY_FUNCTION__ = "virtual int32_t sinsp::next(sinsp_evt**)"
#2  0x0000555555b8e4d0 in do_inspect (inspector=0x5555568c79b0, cnt=18446744073709551615, duration_to_tot_ns=0, quiet=false, json=false, do_flush=false, print_progress=false, display_filter=0x0,
    summary_table=std::vector of length 0, capacity 0, formatter=0x7fffffffd860) at /usr/src/sysdig-0.26.4/userspace/sysdig/sysdig.cpp:602
        retval = {m_nevts = 504, m_time = 0}
        res = 0
        ev = 0x5555568c7eb0
        line = "17 16:19:23.970951000 0 container:0ecc14b0e538 (-1) > container json={\"container\":{\"Mounts\":[{\"Destination\":\"/host/proc\",\"Mode\":\"\",\"Propagation\":\"private\",\"RW\":false,\"Source\":\"/proc\"},{\"Destination\":\"/host/sys\",\"Mode\":\"\",\"Propagation\":\"private\",\"RW\":false,\"Source\":\"/sys\"},{\"Destination\":\"/tmp\",\"Mode\":\"\",\"Propagation\":\"private\",\"RW\":true,\"Source\":\"/var/data/kubelet/pods/d79e215c-ff1e-11e9-bb0a-121c3d9fc0af/volumes/kubernetes.io~empty-dir/tmp-volume\"},{\"Destination\":\"/var/run/secrets/kubernetes.io/serviceaccount\",\"Mode\":\"\",\"Propagation\":\"private\",\"RW\":false,\"Source\":\"/var/data/kubelet/pods/d79e215c-ff1e-11e9-bb0a-121c3d9fc0af/volumes/kubernetes.io~secret/default-token-fhhzk\"},{\"Destination\":\"/etc/hosts\",\"Mode\":\"\",\"Propagation\":\"private\",\"RW\":true,\"Source\":\"/var/data/kubelet/pods/d79e215c-ff1e-11e9-bb0a-121c3d9fc0af/etc-hosts\"},{\"Destination\":\"/dev/termination-log\",\"Mode\":\"\",\"Propagation\":\"private\",\"RW\":true,\"Source\":\"/var/data/kubelet/pods/d79e215c-ff1e-11e9-bb0a-121c3d9fc0af/containers/foo/c11679b1\"}],\"cpu_period\":20000,\"cpu_quota\":8000,\"cpu_shares\":102,\"cpuset_cpu_count\":0,\"env\":[],\"id\":\"0ecc14b0e538\",\"image\":\"example.com/namespace/foo:49\",\"imagedigest\":\"sha256:a31fb785174959a88c10e8c79185aef340c30a62a61e1169e03c982b8cd82665\",\"imageid\":\"d4568965b5e65d6e54171078cdd56fcdd1bc312cb4f239326f96c354dcf1e208\",\"imagerepo\":\"example.com/namespace/foo\",\"imagetag\":\"49\",\"ip\":\"172.30.35.85\",\"is_pod_sandbox\":false,\"labels\":{\"io.kubernetes.container.name\":\"foo\",\"io.kubernetes.pod.name\":\"foo-v992z\",\"io.kubernetes.pod.namespace\":\"default\",\"io.kubernetes.pod.uid\":\"d79e215c-ff1e-11e9-bb0a-121c3d9fc0af\"},\"memory_limit\":268435456,\"metadata_deadline\":0,\"name\":\"foo\",\"port_mappings\":[],\"privileged\":false,\"swap_limit\":268435456,\"type\":7}}\n "
        last_printed_progress_pct = 0
        duration_start = 1573057163465164000
#3  0x0000555555b913be in sysdig_init (argc=1, argv=0x7fffffffdad8) at /usr/src/sysdig-0.26.4/userspace/sysdig/sysdig.cpp:1594
        cstats = {n_evts = 140737488344840, n_drops = 7, n_drops_buffer = 0, n_drops_pf = 93825012470792, n_drops_bug = 2096, n_preemptions = 18446744073709534808, n_suppressed = 32, n_tids_suppressed = 343597383809}
        j = 0
        formatter = {m_tokens = std::vector of length 15, capacity 16 = {{first = "evt.num", second = 0x5555569108a0}, {first = "", second = 0x555556911f60}, {first = "evt.outputtime", second = 0x555556912710}, {first = "",
              second = 0x555556913d90}, {first = "evt.cpu", second = 0x555556914520}, {first = "", second = 0x555556915c60}, {first = "proc.name", second = 0x555556916410}, {first = "", second = 0x555556916ad0}, {first = "thread.tid",
              second = 0x555556917150}, {first = "", second = 0x555556917ac0}, {first = "evt.dir", second = 0x555556918190}, {first = "", second = 0x555556919780}, {first = "evt.type", second = 0x555556919f50}, {first = "",
              second = 0x55555691b520}, {first = "evt.info", second = 0x55555691bc90}}, m_tokenlens = std::vector of length 15, capacity 16 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, m_inspector = 0x5555568c79b0,
          m_require_all_values = false, m_chks_to_free = std::vector of length 15, capacity 16 = {0x5555569108a0, 0x555556911f60, 0x555556912710, 0x555556913d90, 0x555556914520, 0x555556915c60, 0x555556916410, 0x555556916ad0,
            0x555556917150, 0x555556917ac0, 0x555556918190, 0x555556919780, 0x555556919f50, 0x55555691b520, 0x55555691bc90}, m_root = {static nullRef = @0x555556304e10, static null = {static nullRef = @0x555556304e10,
              static null = <same as static member of an already seen type>, static minLargestInt = -9223372036854775808, static maxLargestInt = 9223372036854775807, static maxLargestUInt = 18446744073709551615,
              static minInt = -2147483648, static maxInt = 2147483647, static maxUInt = 4294967295, static minInt64 = -9223372036854775808, static maxInt64 = 9223372036854775807, static maxUInt64 = 18446744073709551615, value_ = {
                int_ = 0, uint_ = 0, real_ = 0, bool_ = false, string_ = 0x0, map_ = 0x0}, type_ = Json::nullValue, allocated_ = 0, comments_ = 0x0}, static minLargestInt = -9223372036854775808,
            static maxLargestInt = 9223372036854775807, static maxLargestUInt = 18446744073709551615, static minInt = -2147483648, static maxInt = 2147483647, static maxUInt = 4294967295, static minInt64 = -9223372036854775808,
            static maxInt64 = 9223372036854775807, static maxUInt64 = 18446744073709551615, value_ = {int_ = 140737488345464, uint_ = 140737488345464, real_ = 6.953355807347658e-310, bool_ = 120, string_ = 0x7fffffffd978 "",
              map_ = 0x7fffffffd978}, type_ = Json::nullValue, allocated_ = 0, comments_ = 0x0}, m_writer = {<Json::Writer> = {_vptr.Writer = 0x5555567c11f8 <vtable for Json::FastWriter+16>}, document_ = "",
            yamlCompatiblityEnabled_ = false}}
        filter = ""
        res = {m_res = 0, m_next_run_args = std::vector of length 0, capacity 0}
        inspector = 0x5555568c79b0
        infiles = std::vector of length 0, capacity 0
        outfile = ""
        op = -1
        cnt = 18446744073709551615
        quiet = false
        is_filter_display = false
        verbose = false
        list_flds = false
        list_flds_markdown = false
        print_progress = false
        compress = false
        event_buffer_format = sinsp_evt::PF_NORMAL
        display_filter = 0x0
        duration = 0.35439100000000001
        duration_to_tot = 0
        cinfo = {m_nevts = 0, m_time = 0}
        output_format = "*%evt.num %evt.outputtime %evt.cpu %proc.name (%thread.tid) %evt.dir %evt.type %evt.info"
        snaplen = 0
        long_index = 0
        n_filterargs = 0
        jflag = false
        unbuf_flag = false
        filter_proclist_flag = false
        cname = ""
        summary_table = std::vector of length 0, capacity 0
        k8s_api = 0x0
        k8s_api_cert = 0x0
        mesos_api = 0x0
        force_tracers_capture = false
        page_faults = false
        bpf = false
        bpf_probe = ""
        suppress_comms = std::set with 0 elements
        cri_socket_path = ""
        duration_seconds = 0
        rollover_mb = 0
        file_limit = 0
        event_limit = 0
        long_options = {{name = 0x5555562a7a36 "print-ascii", has_arg = 0, flag = 0x0, val = 65}, {name = 0x5555562a7a42 "print-base64", has_arg = 0, flag = 0x0, val = 98}, {name = 0x5555562a7a4f "bpf", has_arg = 2, flag = 0x0,
            val = 66}, {name = 0x5555562a7a53 "chisel", has_arg = 1, flag = 0x0, val = 99}, {name = 0x5555562a7a5a "list-chisels", has_arg = 0, flag = 0x0, val = 0}, {name = 0x5555562a7a67 "cri", has_arg = 1, flag = 0x0, val = 0}, {
            name = 0x5555562a7a6b "cri-timeout", has_arg = 1, flag = 0x0, val = 0}, {name = 0x5555562a7a77 "displayflt", has_arg = 0, flag = 0x0, val = 100}, {name = 0x5555562a7a82 "debug", has_arg = 0, flag = 0x0, val = 68}, {
            name = 0x5555562a7a88 "exclude-users", has_arg = 0, flag = 0x0, val = 69}, {name = 0x5555562a7a96 "event-limit", has_arg = 1, flag = 0x0, val = 101}, {name = 0x5555562a7aa2 "fatfile", has_arg = 0, flag = 0x0, val = 70}, {
            name = 0x5555562a7aaa "filter-proclist", has_arg = 0, flag = 0x0, val = 0}, {name = 0x5555562a7aba "seconds", has_arg = 1, flag = 0x0, val = 71}, {name = 0x5555562a7ac2 "help", has_arg = 0, flag = 0x0, val = 104}, {
            name = 0x5555562a7ac7 "chisel-info", has_arg = 1, flag = 0x0, val = 105}, {name = 0x5555562a7ad3 "file-size", has_arg = 1, flag = 0x0, val = 67}, {name = 0x5555562a7add "json", has_arg = 0, flag = 0x0, val = 106}, {
            name = 0x5555562a7ae2 "k8s-api", has_arg = 1, flag = 0x0, val = 107}, {name = 0x5555562a7aea "k8s-api-cert", has_arg = 1, flag = 0x0, val = 75}, {name = 0x5555562a7af7 "large-environment", has_arg = 0, flag = 0x0, val = 0}, {
            name = 0x5555562a7b09 "list", has_arg = 0, flag = 0x0, val = 108}, {name = 0x5555562a7b0e "list-events", has_arg = 0, flag = 0x0, val = 76}, {name = 0x5555562a7b1a "list-markdown", has_arg = 0, flag = 0x0, val = 0}, {
            name = 0x5555562a7b28 "mesos-api", has_arg = 1, flag = 0x0, val = 109}, {name = 0x5555562a7b32 "numevents", has_arg = 1, flag = 0x0, val = 110}, {name = 0x5555562a7b3c "page-faults", has_arg = 0, flag = 0x0, val = 0}, {
            name = 0x5555562a7b48 "progress", has_arg = 1, flag = 0x0, val = 80}, {name = 0x5555562a7b51 "print", has_arg = 1, flag = 0x0, val = 112}, {name = 0x5555562a7b57 "quiet", has_arg = 0, flag = 0x0, val = 113}, {
            name = 0x5555562a7b5d "resolve-ports", has_arg = 0, flag = 0x0, val = 82}, {name = 0x5555562a7b6b "readfile", has_arg = 1, flag = 0x0, val = 114}, {name = 0x5555562a7b74 "snaplen", has_arg = 1, flag = 0x0, val = 115}, {
            name = 0x5555562a7b7c "summary", has_arg = 0, flag = 0x0, val = 83}, {name = 0x5555562a7b84 "suppress-comm", has_arg = 1, flag = 0x0, val = 85}, {name = 0x5555562a7b92 "timetype", has_arg = 1, flag = 0x0, val = 116}, {
            name = 0x5555562a7b9b "force-tracers-capture", has_arg = 1, flag = 0x0, val = 84}, {name = 0x5555562a7bb1 "unbuffered", has_arg = 0, flag = 0x0, val = 0}, {name = 0x5555562a7bbc "verbose", has_arg = 0, flag = 0x0, val = 118},
          {name = 0x5555562a7bc4 "version", has_arg = 0, flag = 0x0, val = 0}, {name = 0x5555562a7bcc "writefile", has_arg = 1, flag = 0x0, val = 119}, {name = 0x5555562a7bd6 "limit", has_arg = 1, flag = 0x0, val = 87}, {
            name = 0x5555562a7bdc "print-hex", has_arg = 0, flag = 0x0, val = 120}, {name = 0x5555562a7be6 "print-hex-ascii", has_arg = 0, flag = 0x0, val = 88}, {name = 0x5555562a7bf6 "compress", has_arg = 0, flag = 0x0, val = 122}, {
            name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
#4  0x0000555555b9232f in main (argc=1, argv=0x7fffffffdad8) at /usr/src/sysdig-0.26.4/userspace/sysdig/sysdig.cpp:1692
        res = {m_res = 0, m_next_run_args = std::vector of length 0, capacity 0}

dnwe commented 4 years ago

@fntlnz FYI: I can't seem to reproduce this with a package built off current master (586e8fc), so it seems this issue has been fixed by one of the changes on master.

nathan-b commented 4 years ago

@gnosek do you think this could be https://github.com/draios/sysdig/pull/1528?

dnwe commented 4 years ago

@nathan-b thanks for the hint. It looks like I can cleanly apply 1528.patch on top of 0.26.4, so I'll build and test that now

patching file userspace/libsinsp/sinsp.cpp
Hunk #1 succeeded at 1006 (offset -3 lines).
Hunk #2 succeeded at 1161 (offset -3 lines).
patching file userspace/libsinsp/sinsp.h

dnwe commented 4 years ago

@nathan-b I can confirm that if I build my own "0.26.5" by taking https://github.com/draios/sysdig/archive/0.26.4.tar.gz and applying 1528.patch over the top then the segfault issue is fixed for me and sysdig runs smoothly

Is this something that could be put out as an official release with a cherry-pick onto a release branch?

nathan-b commented 4 years ago

Thanks for doing the test! That may be something we can do. I'll check.

dnwe commented 4 years ago

@nathan-b any update on a new release version?

nathan-b commented 4 years ago

Sorry for the delay. We are working on it, I promise. I can't give an ETA at this time but it is in progress.

dnwe commented 4 years ago

@nathan-b 🎉 thank you

https://github.com/draios/sysdig/releases/tag/0.26.5

dnwe commented 4 years ago

@ig0rsky still no .deb at http://download.draios.com/stable/deb/stable-amd64/Packages yet

nathan-b commented 4 years ago

I don't know what's going on with the 0.26.5 release. I'll try to chase it down for you guys. Sorry for the long delay.

jfreeland commented 4 years ago

I was just trying to play with kubectl-capture on a generic gke node and I'm running into the same issue. It looks like https://hub.docker.com/r/sysdig/sysdig could use a new version too.

ig0rsky commented 4 years ago

I recently tried out the kubectl-capture tool, ran into the same issue on an Ubuntu 16.04 node on AWS. @nathan-b could you check again for a release where this is fixed?

nathan-b commented 4 years ago

OK, so apparently the release got halfway done and then stalled because an internal Jenkins process that's used to actually push the release doesn't work anymore. It stalled for an honestly ridiculously long amount of time for reasons I don't quite understand.

I've volunteered to take over the release, including fixing the busted process so future releases will go smoothly again. It's not the only thing I'm working on, but it's now one of the most important things on my plate, and I promise you that you'll see movement.

Once again, apologies for the delay.

nathan-b commented 4 years ago

Q: How many engineers does it take to fix one build script? A: I don't know, but apparently more than three.

I'm still working on this. Some previous engineer had meticulously set up a very precise environment in which this script would run, and now that environment is no longer there. I'm working on containerizing this script so it can run on any machine with docker installed, so be patient a little longer and we'll have a release cranked out. Thanks!

ig0rsky commented 4 years ago

@nathan-b a bit late to containerize your pipelines :), you would think that this has already been done long ago. :)

nathan-b commented 4 years ago

Most of them are...I'm surprised this one isn't. According to git, this script was written Fri Aug 1 15:08:34 2014 -0700 and has received little more than targeted fixes since then.

nathan-b commented 4 years ago

Still working on this...believe me; I want this done every bit as much as you want this done :)

ig0rsky commented 4 years ago

I needed to showcase this to our developers on a Kubernetes cluster (kubectl-capture plugin), and I'm currently looking at other solutions, like SystemTap or perf :). What sucks is that we're on the v4.4 LTS kernel and thus eBPF doesn't work, so I couldn't even test the --bpf flag.

Otherwise, alerts-based Sysdig capture for later analysis using Sysdig Inspect would be an excellent overall open-source solution, which is what I wanted to set up and showcase :)

ig0rsky commented 4 years ago

@nathan-b 0.26.6 is released 🎉 :D Could you also make sure that the binaries are distributed onto Ubuntu/Debian as part of the pipeline?

gnosek commented 4 years ago

@ig0rsky I'm working on it, the binaries should be available today in our repositories. If you mean upstream Debian/Ubuntu, that's out of our control :)

ig0rsky commented 4 years ago

@gnosek nope, I meant the binaries for Ubuntu and Debian that aren't upstream :) Closing this as the issue will be resolved with the new binaries.

nathan-b commented 4 years ago

Finally we have the answer to the question I posed earlier: How many engineers does it take to fix one build script? One, so long as it's @gnosek :)

davidreuss commented 4 years ago

root@27ff10a1558b:/# apt update
[...]
Hit:4 http://download.draios.com/stable/deb stable-amd64/ InRelease
[...]

root@27ff10a1558b:/# apt-cache policy sysdig
[...]
sysdig:
  Candidate: 0.26.4
[...]

this still means we're waiting on the binaries to hit your repositories, right? πŸ˜ƒ

gnosek commented 4 years ago

Yes, releasing things is hard ;) Sorry about that, I want the release out at least as much as you do

gnosek commented 4 years ago

The eagle has landed 🎉

@krishnan-ramkumar's invaluable help bumps the final answer to "how many engineers does it take to release Sysdig" to at least four :)

dnwe commented 4 years ago

🎉 now showing up at http://download.draios.com/stable/deb/stable-amd64/Packages

Will roll this out. Thanks for getting this out the door! 🏅

davidreuss commented 4 years ago

still seeing,

sysdig:
  Installed: 0.26.4
  Candidate: 0.26.5

Is that expected when the latest release is 0.26.6?

edit: Tried 0.26.5 on our boxes -- no crash, so that's amazing 🎈
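For anyone checking several machines, the `Installed` and `Candidate` fields in the output above can be pulled out with a few lines of Python. This is only a sketch that works on pasted `apt-cache policy` text; `policy_versions` is a hypothetical helper name, and it assumes the English field labels shown in this thread.

```python
import re


def policy_versions(policy_output: str) -> dict:
    """Extract the 'Installed' and 'Candidate' versions from apt-cache policy text.

    Hypothetical helper: it only looks for the two English field labels
    that appear in the output pasted in this thread.
    """
    versions = {}
    for line in policy_output.splitlines():
        match = re.match(r"\s*(Installed|Candidate):\s*(\S+)", line)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


# Sample copied from the comment above.
sample = """\
sysdig:
  Installed: 0.26.4
  Candidate: 0.26.5
"""

print(policy_versions(sample))
```

Lines elided as `[...]` or unrelated to those two fields are simply skipped, so a full paste of the command output should work as well.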

gnosek commented 4 years ago

@davidreuss, you made me realize I released it under the wrong version number :(

I'll probably pretend 0.26.6 never happened and 0.26.5 just got severely delayed. In any case, anything newer than 0.26.4 contains the fix.

nathan-b commented 4 years ago

It's like a well-oiled machine here at Sysdig release HQ

gnosek commented 4 years ago

Wouldn't guess oil was the substance in question ;)

shakibamoshiri commented 2 years ago

Has this issue been solved? We still get it.

Ubuntu 20

sysdig -c echo_fds proc.name=java
Segmentation fault (core dumped)

sysdig --version
sysdig version 0.26.4

nathan-b commented 2 years ago

@shakibamoshiri I will refer you back to @gnosek's comment, which says anything newer than 0.26.4 contains the fix. As the most recent version is 0.29.3, it seems that the version in Ubuntu 20's package manager is just no longer being updated. You can always build from source if you want to stay on 20.
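The "anything newer than 0.26.4" rule can be turned into a quick check. The Python below is a rough sketch under stated assumptions: `upstream_version`, `LAST_AFFECTED` and `contains_fix` are hypothetical names, versions are assumed to be dotted integers with an optional Debian-style revision after a hyphen, and a distribution backporting the fix under the same upstream version would not be detected.

```python
def upstream_version(package_version: str) -> tuple:
    """Turn '0.26.4' or '0.26.4-1ubuntu0.3' into (0, 26, 4)."""
    # Assumption: anything after the first "-" is a packaging revision.
    upstream = package_version.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


# Newest release reported as still crashing in this thread.
LAST_AFFECTED = (0, 26, 4)


def contains_fix(package_version: str) -> bool:
    """True when the upstream version is newer than the last affected release."""
    return upstream_version(package_version) > LAST_AFFECTED


for version in ("0.26.4", "0.26.4-1ubuntu0.3", "0.26.5", "0.29.3"):
    print(version, contains_fix(version))
```

Comparing integer tuples instead of strings avoids mistakes such as treating a hypothetical "0.9.0" as newer than "0.26.4".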

shakibamoshiri commented 2 years ago

@nathan-b Thanks. Building from source on Debian 10 caused lots of library misconfiguration issues, so I preferred not to compile it. One more thing: I installed it on two Ubuntu 20 machines, and we get the errors on just one of them.

nathan-b commented 2 years ago

That is a strange issue indeed. You can try running it from a container instead -- no guarantee it will work any better, but it might be worth a shot :)

shakibamoshiri commented 2 years ago

@nathan-b it is a production server, so I did not compile it. I just downloaded the latest deb package and updated it, and it worked fine. Thank you so much.

lhzw commented 2 years ago

The 0.29.3 deb works fine on 20.04 with Docker containers. 0.26.4 from the apt source works on a clean machine without Docker containers, but with containers it hits a segmentation fault, with output like this:

23 16:29:11.933545000 0 container:dae6a493a2a9 (-1) > container json={"container":{"Mounts":[{"Destination"

dpkg -l | grep sysdig
ii  sysdig                               0.26.4-1ubuntu0.3                 amd64        system-level exploration and troubleshooting tool
ii  sysdig-dkms                          0.26.4-1ubuntu0.3                 all          system-level exploration and troubleshooting tool - kernel source