Open jellehierck opened 5 months ago
Thanks for the detailed report. We'll take a look once we get a bit of time.
What platform are you trying this on? Arm64 or x86? What's the procedure you used to reproduce this? I can't seem to reproduce this problem on my setup. IIRC the trace data is not stored on the stack, so I'm not sure why increasing the stack size would help you.
I am running Ubuntu 20.04, kernel version 5.15.129-rt67, on x86.
To reproduce, I clone the project, build it in Release mode, and run the tracing example:

```shell
make release
./build/release/examples/tracing_example/rt_tracing_example
```
The program crashes after 5 seconds, which is when the first trace is written to a file.
When I apply the "fix" I mentioned above, I instead get a `terminate called after throwing an instance of 'St9bad_alloc'` error after 15 seconds, i.e. when the second trace is written.
This is interesting, and indeed shows that the increased stack size does not fix the problem. I have updated the original question to reflect this.
I will try to do some more testing on my end. If you want me to run specific tests or benchmarks, let me know.
I also repeated this on another machine running Ubuntu 20.04, kernel version 5.15.137-rt71, on x86.
On this machine, leaving the stack size unchanged resulted in a crash after 5 seconds. Adjusting the stack size let the program finish without problems. Could the kernel version be the culprit?
I will try this on a third machine running another version of PREEMPT_RT next week.
Interesting. I don't have a 20.04 and 5.15 kernel to test. If you can get a core dump that might also be helpful.
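For anyone following along, one way to capture such a core dump (a sketch; this assumes the kernel's default `core_pattern` rather than systemd-coredump, and uses the debug-build path from this thread):

```shell
# Allow unlimited-size core files in the current shell, then reproduce
# the crash; the kernel writes a core file on abnormal termination.
ulimit -c unlimited
./build/debug/examples/tracing_example/rt_tracing_example

# Inspect the resulting dump with gdb afterwards:
# gdb ./build/debug/examples/tracing_example/rt_tracing_example core
```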
Coredump of the tracing example built in Debug, running on my Ubuntu 20.04, kernel 5.15.137-rt71 machine:

```shell
git clone https://github.com/cactusdynamics/cactus-rt.git
cd cactus-rt
make debug
./build/debug/examples/tracing_example/rt_tracing_example
```
I just noticed that you merged https://github.com/cactusdynamics/cactus-rt/pull/70. The coredump in my comment above was created before I pulled these changes (i.e. at commit `cce75120bab481f9d67a2f938633da0286030dba`).
The same bad alloc error occurs after pulling the newest changes. Here is a coredump when running the latest changes.
I see in your latest coredump there is a line:

```
#10 0x00007f8fe46da418 _ZN6google8protobuf8internal20RepeatedPtrFieldBase14InternalExtendEi (libprotobuf.so.17 + 0x10b418)
```
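As a side note, that frame demangles with `c++filt` (from binutils):

```shell
# Demangle the mangled symbol from frame #10 of the backtrace
c++filt _ZN6google8protobuf8internal20RepeatedPtrFieldBase14InternalExtendEi
# prints: google::protobuf::internal::RepeatedPtrFieldBase::InternalExtend(int)
```

So the crash is inside protobuf's repeated-field growth path, which fits the allocation-failure symptom.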
I'm wondering if this is a protobuf bug somewhere, because 20.04 is quite old and cactus-rt currently links against the system-level protobuf. My version:

```shell
$ apt list --installed | grep libprotobuf

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libprotobuf-dev/jammy-updates,jammy-security,now 3.12.4-1ubuntu7.22.04.1 amd64 [installed,automatic]
libprotobuf-lite23/jammy-updates,jammy-security,now 3.12.4-1ubuntu7.22.04.1 amd64 [installed,automatic]
libprotobuf23/jammy-updates,jammy-security,now 3.12.4-1ubuntu7.22.04.1 amd64 [installed,automatic]
```
On my 20.04 machine:

```shell
$ apt list --installed | grep libprotobuf

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libprotobuf-dev/focal-security,focal-updates,now 3.6.1.3-2ubuntu5.2 amd64 [installed,automatic]
libprotobuf-lite17/focal-security,focal-updates,now 3.6.1.3-2ubuntu5.2 amd64 [installed,automatic]
libprotobuf17/focal-security,focal-updates,now 3.6.1.3-2ubuntu5.2 amd64 [installed,automatic]
```
So yes, it is indeed a few versions behind.
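Side note: `dpkg-query` gives the same information without apt's CLI-stability warning, and the installed headers can be cross-checked directly (the header path below is Ubuntu's default and may differ on other layouts):

```shell
# List installed protobuf packages with stable, script-friendly output
dpkg-query -W 'libprotobuf*'

# Cross-check the minimum library version the installed headers expect
grep GOOGLE_PROTOBUF_MIN_LIBRARY_VERSION /usr/include/google/protobuf/stubs/common.h
```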
Hard to say if that is the problem. I've also created a PR that checks the header version the code was compiled against, against the actually installed library (a mismatch there could cause segfaults): https://github.com/cactusdynamics/cactus-rt/pull/75. Maybe you can try it to make sure nothing wrong is happening there?
Adding `GOOGLE_PROTOBUF_VERIFY_VERSION;` to `app.cc` and running the `rt_tracing_example` again did not cause the program to abort, so the header and library versions seem to match at least.

I also checked `/usr/include/google/protobuf/stubs/common.h` (where `GOOGLE_PROTOBUF_VERIFY_VERSION` is defined) and saw that the library version is indeed 3.6.1 (`#define GOOGLE_PROTOBUF_MIN_LIBRARY_VERSION 3006001`), so the headers match the libraries.
When running the `tracing_example`, my program crashes when a trace session is stopped. The crash happens on line 87:

```cpp
app.StopTraceSession();
```

When using the debugger, I found that the program crashes when the trace aggregator thread is joined in `app.cc:218`.

After the crash, a `data1.perfetto` file is created, but it only contains ~30 loops for me. This made me suspect that the crash might be due to a small stack size, as you mentioned in your blog (part 4). Note that this crash also occurs when I implement tracing in my own code, i.e. not just in the `tracing_example` program.

**Fix: increase thread stack size**

Edit: this "fix" might make the problem occur less often, but it does not make it disappear (see this comment).

The default stack size set in `ThreadConfig` is 8 MB. Increasing the stack size to 16 MB fixed the crash for me; I did this by adding a line to the `thread_config` section in `tracing_example/main.cc:64`. After this fix, the program no longer crashes for me and a correct `data1.perfetto` file is created with the entire trace.
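A quick no-recompile way to probe the same hypothesis: glibc sizes new pthread stacks from the soft stack rlimit when the program does not request an explicit size, so raising the limit before launch changes the default thread stack size. Caveat (assumption): this has no effect on threads whose stack size is set explicitly via `pthread_attr_setstacksize`, which a `ThreadConfig` stack-size option likely does under the hood.

```shell
# glibc uses the soft stack rlimit as the default pthread stack size,
# so raising it is a quick experiment without rebuilding.
ulimit -s            # typically prints 8192 (KiB), i.e. the 8 MB default
ulimit -s 16384      # raise to 16 MiB for this shell and its children
./build/release/examples/tracing_example/rt_tracing_example
```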