Open jellehierck opened 5 months ago
Thanks for the detailed report. We'll take a look once we get a bit of time.
What platform are you trying this on? Arm64 or x86? What's the procedure you used to reproduce this? I can't seem to reproduce this problem on my setup. IIRC the trace data is not stored on the stack, so I'm not sure why increasing the stack size would help you.
I am running Ubuntu 20.04, kernel version 5.15.129-rt67, on x86.
To reproduce, I clone the project, build it in Release mode, and run the tracing example:

```shell
make release
./build/release/examples/tracing_example/rt_tracing_example
```
The program crashes after 5 seconds, which is when the first trace is written to a file.
When I apply the "fix" I mentioned above, I instead get a `terminate called after throwing an instance of 'St9bad_alloc'` error after 15 seconds, i.e. when the second trace is written.
This is interesting, and indeed shows that the increased stack size does not fix the problem. I have updated the original question to reflect this.
I will try to do some more testing on my end. If you want me to run specific tests or benchmarks, let me know.
I also repeated this on another machine running Ubuntu 20.04, kernel version 5.15.137-rt71, on x86.
On this machine, leaving the stack size unchanged resulted in a crash after 5 seconds. Adjusting the stack size let the program finish without problems. Could the kernel version be the culprit?
I will try this on a third machine running another version of PREEMPT_RT next week.
Interesting. I don't have a 20.04 and 5.15 kernel to test. If you can get a core dump that might also be helpful.
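For anyone following along, one way to capture such a core dump (a sketch; this assumes the kernel's default `core_pattern` rather than systemd-coredump, and uses the debug-build path from this thread):

```shell
# Allow unlimited-size core files in the current shell, then reproduce
# the crash; the kernel writes a core file on abnormal termination.
ulimit -c unlimited
./build/debug/examples/tracing_example/rt_tracing_example

# Inspect the resulting dump with gdb afterwards:
# gdb ./build/debug/examples/tracing_example/rt_tracing_example core
```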
Coredump of the tracing example built in Debug, running on my Ubuntu 20.04, kernel 5.15.137-rt71 machine:

```shell
git clone https://github.com/cactusdynamics/cactus-rt.git
cd cactus-rt
make debug
./build/debug/examples/tracing_example/rt_tracing_example
```
I just noticed that you merged https://github.com/cactusdynamics/cactus-rt/pull/70. The coredump in my comment above was created before I pulled these changes (i.e. at commit `cce75120bab481f9d67a2f938633da0286030dba`).
The same bad alloc error occurs after pulling the newest changes. Here is a coredump when running the latest changes.
I see in your latest coredump there is a line:

```
#10 0x00007f8fe46da418 _ZN6google8protobuf8internal20RepeatedPtrFieldBase14InternalExtendEi (libprotobuf.so.17 + 0x10b418)
```
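As a side note, that frame demangles with `c++filt` (from binutils):

```shell
# Demangle the mangled symbol from frame #10 of the backtrace
c++filt _ZN6google8protobuf8internal20RepeatedPtrFieldBase14InternalExtendEi
# prints: google::protobuf::internal::RepeatedPtrFieldBase::InternalExtend(int)
```

So the crash is inside protobuf's repeated-field growth path, which fits the allocation-failure symptom.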
I'm wondering if this is a protobuf bug somewhere, because 20.04 is quite old and cactus-rt currently links against the system-level protobuf. My version:

```shell
$ apt list --installed | grep libprotobuf

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libprotobuf-dev/jammy-updates,jammy-security,now 3.12.4-1ubuntu7.22.04.1 amd64 [installed,automatic]
libprotobuf-lite23/jammy-updates,jammy-security,now 3.12.4-1ubuntu7.22.04.1 amd64 [installed,automatic]
libprotobuf23/jammy-updates,jammy-security,now 3.12.4-1ubuntu7.22.04.1 amd64 [installed,automatic]
```
On my 20.04 machine:

```shell
$ apt list --installed | grep libprotobuf

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libprotobuf-dev/focal-security,focal-updates,now 3.6.1.3-2ubuntu5.2 amd64 [installed,automatic]
libprotobuf-lite17/focal-security,focal-updates,now 3.6.1.3-2ubuntu5.2 amd64 [installed,automatic]
libprotobuf17/focal-security,focal-updates,now 3.6.1.3-2ubuntu5.2 amd64 [installed,automatic]
```
So yes, it is indeed a few versions behind.
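Side note: `dpkg-query` gives the same information without apt's CLI-stability warning, and the installed headers can be cross-checked directly (the header path below is Ubuntu's default and may differ on other layouts):

```shell
# List installed protobuf packages with stable, script-friendly output
dpkg-query -W 'libprotobuf*'

# Cross-check the minimum library version the installed headers expect
grep GOOGLE_PROTOBUF_MIN_LIBRARY_VERSION /usr/include/google/protobuf/stubs/common.h
```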
Hard to say if that is the problem. I've also created a PR that checks the header version the code was compiled against, against the actually installed library (a mismatch there could cause segfaults): https://github.com/cactusdynamics/cactus-rt/pull/75. Maybe you can try it to make sure nothing wrong is happening there?
Adding `GOOGLE_PROTOBUF_VERIFY_VERSION;` to `app.cc` and running the `rt_tracing_example` again did not cause the program to abort, so the header and library versions seem to match at least.

I also checked `/usr/include/google/protobuf/stubs/common.h` (where `GOOGLE_PROTOBUF_VERIFY_VERSION` is defined) and saw that the library version is indeed 3.6.1 (`#define GOOGLE_PROTOBUF_MIN_LIBRARY_VERSION 3006001`), so the headers match the libraries.
When running the `tracing_example`, my program crashes when a trace session is stopped. The crash happens on line 87:

```cpp
app.StopTraceSession();
```

When using the debugger, I found that the program crashes when the trace aggregator thread is joined in `app.cc:218`.

After the crash, a `data1.perfetto` file is created, but it only contains ~30 loops for me. This made me suspect that the crash might be due to a small stack size, as you mentioned in your blog (part 4). Note that this crash also occurs when I implement tracing in my own code, i.e. not just in the `tracing_example` program.

**Fix: increase thread stack size**

Edit: this "fix" might make the problem occur less often, but it does not make it disappear (see this comment).

The default stack size set in `ThreadConfig` is 8 MB. Increasing the stack size to 16 MB fixed the crash for me; I did this by adding a line to the `thread_config` section in `tracing_example/main.cc:64`. After this fix, the program no longer crashes for me and a correct `data1.perfetto` file is created with the entire trace.
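A quick no-recompile way to probe the same hypothesis: glibc sizes new pthread stacks from the soft stack rlimit when the program does not request an explicit size, so raising the limit before launch changes the default thread stack size. Caveat (assumption): this has no effect on threads whose stack size is set explicitly via `pthread_attr_setstacksize`, which a `ThreadConfig` stack-size option likely does under the hood.

```shell
# glibc uses the soft stack rlimit as the default pthread stack size,
# so raising it is a quick experiment without rebuilding.
ulimit -s            # typically prints 8192 (KiB), i.e. the 8 MB default
ulimit -s 16384      # raise to 16 MiB for this shell and its children
./build/release/examples/tracing_example/rt_tracing_example
```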