ArduPilot / apm_planner

APM Planner Ground Control Station (Qt)
https://ardupilot.org
Other
510 stars 464 forks source link

QLOG causes segfault #941

Closed AndKe closed 8 years ago

AndKe commented 8 years ago

Please see attached .zip (some .tlog from forums) (reproduced both on current master)

20160217151228.log.zip

it ends like:

ERROR 2016-04-30T15:11:27.471 Corrupt data read: Time is not increasing! Last valid time stamp: "622764"  actual read time stamp is: "553687" 
ERROR 2016-04-30T15:11:27.472 Corrupt data read: Time is not increasing! Last valid time stamp: "622764"  actual read time stamp is: "553687" 
ERROR 2016-04-30T15:11:27.472 "Unable to exec getMinTime query: No query Unable to fetch row" 
 INFO 2016-04-30T15:11:27.472 Plot Log loading took 0.517 seconds - 1554368 of 1554368 bytes used 
[Thread 0x7fff8affd700 (LWP 20609) exited]

Thread 1 "apmplanner2" received signal SIGSEGV, Segmentation fault.
Arne-W commented 8 years ago

Hi guys - it took a little longer to do some tests :grimacing: To make a long story short - Still no solution found. :pensive: Here is what I did:

First of all I stepped through the QsLog calls to check if there could be some issues and I didn't find anything. If the thread is disabled the QsLog call is complete synchronous (could be tracked down to the fprintf() call). As far as I could see there is nothing where you could get any mem corruptions. Just to be sure I changed some calls from by reference to by value to create a copy of the string before its printed but it did not change anything. Works on 14.04 and 15.10 but crashes on 16.04. Next I did some valgrind runs. As you know valgrind runs are very slow (about 10 times slower) and interestingly sometimes the tlog could be parsed completely during those runs - it is definitely some sort of race which is influenced by the execution speed. This is a part of a valgrind log on a run with a crash:

ERROR 2016-05-13T16:14:17.497 Corrupt data read: Time is not increasing! Last valid time stamp: "1726795"  actual read time stamp is: "1726630" 
ERROR 2016-05-13T16:14:17.565 Corrupt data read: Time is not increasing! Last valid time stamp: "1727297"  actual read time stamp is: "1727232"  
==3754== Invalid read of size 4
==3754==    at 0x6EE2EE1: QTextEngine::shapeTextWithHarfbuzzNG(QScriptItem const&, unsigned short const*, int, QFontEngine*, QVector const&, bool) const (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6EE6220: QTextEngine::shapeText(int) const (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6EE6DEE: QTextEngine::shape(int) const (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6EEC6C3: QTextEngine::shapeLine(QScriptLine const&) (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6EFBB78: QTextLine::cursorToX(int*, QTextLine::Edge) const (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6F47AC3: QTextCursorPrivate::setX() (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6F4F8AF: QTextCursor::setPosition(int, QTextCursor::MoveMode) (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x65EB750: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.5.1)
==3754==    by 0x65F31AA: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.5.1)
==3754==    by 0x7D2CEA0: QObject::event(QEvent*) (in /usr/lib/x86_64-linux-gnu/libQt5Core.so.5.5.1)
==3754==    by 0x641F05B: QApplicationPrivate::notify_helper(QObject*, QEvent*) (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.5.1)
==3754==    by 0x6424515: QApplication::notify(QObject*, QEvent*) (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.5.1)
==3754==  Address 0x50 is not stack'd, malloc'd or (recently) free'd
==3754== 
==3754== 
==3754== Process terminating with default action of signal 11 (SIGSEGV)
==3754==  Access not within mapped region at address 0x50
==3754==    at 0x6EE2EE1: QTextEngine::shapeTextWithHarfbuzzNG(QScriptItem const&, unsigned short const*, int, QFontEngine*, QVector const&, bool) const (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6EE6220: QTextEngine::shapeText(int) const (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6EE6DEE: QTextEngine::shape(int) const (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6EEC6C3: QTextEngine::shapeLine(QScriptLine const&) (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6EFBB78: QTextLine::cursorToX(int*, QTextLine::Edge) const (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6F47AC3: QTextCursorPrivate::setX() (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x6F4F8AF: QTextCursor::setPosition(int, QTextCursor::MoveMode) (in /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.5.1)
==3754==    by 0x65EB750: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.5.1)
==3754==    by 0x65F31AA: ??? (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.5.1)
==3754==    by 0x7D2CEA0: QObject::event(QEvent*) (in /usr/lib/x86_64-linux-gnu/libQt5Core.so.5.5.1)
==3754==    by 0x641F05B: QApplicationPrivate::notify_helper(QObject*, QEvent*) (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.5.1)
==3754==    by 0x6424515: QApplication::notify(QObject*, QEvent*) (in /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5.5.1)
==3754==  If you believe this happened as a result of a stack
==3754==  overflow in your program's main thread (unlikely but
==3754==  possible), you can try to increase the size of the
==3754==  main thread stack using the --main-stacksize= flag.
==3754==  The main thread stack size used in this run was 8388608.
==3754== 
==3754== HEAP SUMMARY:
==3754==     in use at exit: 107,227,887 bytes in 556,025 blocks
==3754==   total heap usage: 20,073,970 allocs, 19,517,945 frees, 1,963,833,237 bytes allocated

I am not an valgrind expert but an access to 0x50 should not happen. The crash is always within the Qt librarys so meanwhile I assume that we will not find it. Perhaps we should think about a workaround - but replacing all QsLog calls will be a lot of work. I don't have any idea what to do next...

dcarpy commented 8 years ago

@Arne-W Very interesting. Well, I doubt it helps much but here's what I noticed/did on Windows: I replaced QsLog with the latest and greatest, 2.1.0. I played around with MaxSizeBytes and MaxOldLogCount. My test was loading a tlog with lots of errors in it. If I increase MaxSizeBytes or increase MaxOldLogCount enough so that all the logs don't roll then it does not core. Sometimes I can get it to core if I load tlogs back to back and making all the logs to roll. I guess what I'm saying is I agree with Arne as some kind of race condition. Do you guys think it would help if I build with Qt 5.6.0 and see what that does?

AndKe commented 8 years ago

@dcarpy I tried to build on 5.6 , but there's something 5.6 does not like , see here https://github.com/ArduPilot/apm_planner/issues/941#issuecomment-216059614

dcarpy commented 8 years ago

@AndKe Thanks AndKe--didn't see that. Well, maybe I should start building commits to see where it was introduced? Seems like a witch hunt but... :-)

billbonney commented 8 years ago

As somebody tried building Qt5.6 from source on Linux I'm not seeing this crashing on OSX at all If there is a compiler issue, that would probably resolve the problem.

AndKe commented 8 years ago

@dcarpy As to "when it was introduced" - I really believe it always were there, but got more prominent on ubuntu 16.04 (and/or windows ?) We know that fast log output (recently added as a part of .tlog loading) crashes it - the same goes for opening a "tuning" view, - clicking "extended tuning" lists fetched parameters, and crashes often in the process. This is kind of something I remember to have seen very occasionally in the past too maybe once a month, not often enough to do something about it. but recently it became highly reproducible. So my guess would be that even older versions will crash as much, if they use logging much.

To narrow it down it might be an idea to skip 99% of the application, just make a loop that QLOG's a text line 1000 times right in the beginning.

dcarpy commented 8 years ago

@Andre Thanks. @billbonney Hmm...maybe we are back to compiler versions. What version of gcc are you using on OSX? For Windows, I'm using MSVC2013. I built Qt 5.6.0 using MSVC2015...and so far I cannot get it to crash. Well, scratch that: Qt 5.6.0 using MSVC2015 crashed too. Dang it.

Arne-W commented 8 years ago

@dcarpy are those crashes on windows related to the file logger or have you switched it of? If not switch it of and try again. I suppose the crash on win is file logger related. According to the compiler versions I would like to mention that we have already made cross tests:

dcarpy commented 8 years ago

@Arne-W Ok, I'll turn file logger off and test. @billbonney I'm compiling Qt 5.6.0 now. It looks like we have tested all combos on 16.04 except 5.6.0 maybe...

AndKe commented 8 years ago

@Arne-W your finding about where it's compiled vs executed, matches my https://github.com/ArduPilot/apm_planner/issues/941#issuecomment-216287126

@dcarpy fingers crossed for 5.6 , but as I wrote in the comment above, it seems like something with the execution environment, rather than compilation tools.

Arne-W commented 8 years ago

I have created a version that emits LOTS of error messages. You just have to open a graph and right before loading a 2 second while loop emits messages as fast as possible. If you want to test here is my branch https://github.com/Arne-W/apm_planner/tree/QsLogTest On my 15.10 it works - no problems at all. But it is very interesting to look at the memory usage of apmplanner. With the mass logging version you can see that it needs much more memory compared to the normal version and its growing as long as the logging creates output. Furthermore you can see the latency - console is too slow. If you change this line from while (QDateTime::currentMSecsSinceEpoch() < msecs + 2000) to while (QDateTime::currentMSecsSinceEpoch() < msecs + 10000) increasing the time to 10 seconds the process grows extraordinary. Perhaps this growing leads to the crash. If it is the mass logging version should crash immediately after the log file to load was opened. Perhaps someone can test it - I do not have a 16.04 at the moment.

By the way I made a pull request with some valgrind related fixes - perhaps some can test this too. I do not think it fixes this issue but perhaps there are some sideeffects. @AndKe I found a race with the threadDone method and the loadprogress - perhaps this is the real problem we worked around in #940. We should test it someday...

AndKe commented 8 years ago

@Arne-W your QsLogTest on 16.04: (after 50-100 lines)

ERROR 2016-05-15T17:33:08.010 Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. 
Thread 1 "apmplanner2" received signal SIGABRT, Aborted.
0x00007ffff408f418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) 

I can try pr944 in the evening, - need to find out how to pull single PR :)

AndKe commented 8 years ago

@Arne-W got pr944 using git fetch https://github.com/ArduPilot/apm_planner/ refs/pull/944/head && git checkout -b test FETCH_HEAD it seems to be right way to do it... - it crashes same way as before, in libQt5Gui.so.5

Arne-W commented 8 years ago

Okay thanks for testing :+1: So it is definitely related to the fast logging. @AndKe if you still have some time could you test my updated QsLogTest branch? I changed the fprintf to fputs pehaps it changes something.

AndKe commented 8 years ago

@Arne-W just tested current QsLogTest it's still crashing at

Thread 1 "apmplanner2" received signal SIGABRT, Aborted.
0x00007ffff408f418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

and thank you for working on this issue, I know that some may think it's a limited problem , but I also know that a lot of other distributions will soon be based on ubuntu 16.04 (or contain same libraries/kernel/etc) to me AP2 is the only GUI GCS , and I have tried myself to get freelancers (2) to fix the intel graphics+primary flight display bug - and know the hard way how difficult is to get something fixed.

Arne-W commented 8 years ago

@AndKe Thanks again - the last test for today. I removed the starting of apmplanner from main. This version will only create loggers and logs for 2 seconds. Then sleeps for 60 seconds to wait for all log output has been printed. If the log output has not finished before the program terminates it crashes - I know :smiley: It is only to test if the programm crashes if it just logs.

AndKe commented 8 years ago

@Arne-W sorry for the late response, I was painting outside until it started to rain. The current test runs for ~15 sec every time (tested with stopwatch three times) - displaying insane amount of loglines, about ~120 lines of each 10ms (second with same three decimals) both in the beginning and when it stops. (so performance seems about the same across the test) when it stops at about 15 sec, I wait for 60sec, and it ends like (tested three times)

[Thread 0x7fffea4878c0 (LWP 21367) exited]
[Inferior 1 (process 21367) exited normally]
(gdb) 

BTW: just remembered/discovered that all cores minus one for feeding GPU are aligning photos for a map - so the system is highly loaded - possibly reducing performance/affecting race conditions - I'll redo the test as soon it's done.. Will notify you if the outcome is different in any way.

AndKe commented 8 years ago

@Arne-W this post is only to notify you of change to my previous post if you read in by email. Sorry - I forgot about the 60s timeout - please check updated post on github (no crash)

AndKe commented 8 years ago

@Arne-W I also paused the photo processing job - and can confirm that the latest test does not crash with on an "idle" computer. (unaffected by CPU load) I also tested current master in the same heavily loaded environment, - and it crashed just the same. So your test does somehow do a lot of logging just fine.

dcarpy commented 8 years ago

Well, Qt 5.6.0 behaves (crash) the same as 5.5.1. @AndKe Don't feel you have to but if you want to try 5.6.0 on 15.10 here is the installer (built with 5.6.0 static libs): https://www.dropbox.com/sh/8qwxhkp3disjycg/AADp333JU64hayLl70CgY2_Na?dl=0 MD5: dd99f2d4ded8b7310f9c29d2cf2f57b3

Arne-W commented 8 years ago

What a mess. The fact that the "logging only" version does not crash proves that the problem is somehow created by the apm planner code... we will have to find it by ourselves.

billbonney commented 8 years ago

Keep up the good work, this look like an interesting and frustrating problem. What the immediate steps to resolve.

  1. Fix Windows builds buy reverting back to non-threaded logging
  2. OSX build works in any case.
  3. Look at new fix for 16.04. 15.10 works with non-threaded logging
AndKe commented 8 years ago

maybe if you make the logtest a function that can be called, it could be called before/after GUI parts are loaded and lots of other places, just to see where calling for QLOG starts to be a problem ? another thing to have in mind is that more than tlog-related logging that crashes it: -rapid logging when receiving parameters via telemetry/USB -opening "exended tuning" for copter so we must not think it's tightly related to "graph" view.

billbonney commented 8 years ago

I think I rather just redefine the QSLOG macros to use Qt inbuilt logging system and make the build stable again. see http://doc.qt.io/qt-5/debug.html

Qt's inbuilt system doesn't provide file logging AFAIU, unless it has changed and was the original reason APM Planner 2.0 uses QSLOG

AndKe commented 8 years ago

@billbonney - logging to file is less important, log rotation have "always" been broken, and current log is overwritten on AP2 start anyway. Logging to stdout is what we usually look at anyway, and if file logging is desired, I could always use tee command to duplicate output to file.. I guess osx people can do it too. So "yes" , skipping QSLOG is one way around it. There is however one tiny, scary part, - if we assume QSLOG itself is not buggy (https://github.com/ArduPilot/apm_planner/issues/941#issuecomment-219307185) - but something else is corrupting it, -will something else be corrupted if we replace QSLOG ? I do suspect there is something very strange somewhere, had two freelancers to look at https://github.com/ArduPilot/apm_planner/issues/646 - and it's a mystery to them as well.

BTW: Loading AP2 with no widgets does not affect the logging problem.

Arne-W commented 8 years ago

Hi all, I removed the QsLog call here and replaced it by a direct call to fprintf() using the same time stamp creation used in QsLog and it does not crash any more when loading the log. I will have a closer look at QsLog again. If someone wants to verify checkout this branch: https://github.com/Arne-W/apm_planner/commits/Logging_Tests Also replacing QLOG_ERROR() with a qWarning() call at the same place works too.

dcarpy commented 8 years ago

@Arne-W Nice! I'll check it out.

Arne-W commented 8 years ago

@dcarpy if you want to test the fprintf() version you will have to use this commit: Arne-W/apm_planner@2e19fd6e7c31314abdbbabcba8a391fee78d4231 as I will push some more test commits to this branch.

AndKe commented 8 years ago

@Arne-W Logging_Tests

EDIT: the crash was unrelated - the fprintf() works fine. - created a new issue for that .

Arne-W commented 8 years ago

@AndKe very good - as I have only replaced the lines were the Corrupt data read: Time is not increasing..... text is generated the crash afterwards is also triggered through the logging when populating the graph with the mode events. If you want to test some more - I have pushed a new commit to the Logging_Tests branch which changes the the fprintf() call to a slightly modified QsLog helper call which does not print in the DTOR but by an explicit call. Its this one Arne-W/apm-planner@59981c74d38d238f2a622560bf7365b15ca16430 Moreover you can use this branch. I have patched the QsLog macros to use the Qt internal logging system bypassing nearly all QsLog code. This branch should not crash anymore, but logging has no timestamp.

AndKe commented 8 years ago

@Arne-W https://github.com/Arne-W/apm_planner/commits/Replace_QsLog_Test works fine.

Arne-W commented 8 years ago

Seems that we have a simple workaround for this issue - very good - so we have more time to investigate the "real" problem. @billbonney Perfect Idea just to redefine the QsLog Macros. Shall I make a PR for that - I think it should not be such a fast hack like my actual branch. I did some more tests and I am in doubt that it is only apm-planner related. The fact that patching the QsLog macros made the problem go away makes me wonder. I have an uneasy feeling about the logging helper and the logwriting in the DTOR see here. It seems get a bit more stable (crashes too but later, I think) with my hack mentioned above (Arne-W/apm-planner@59981c74d38d238f2a622560bf7365b15ca16430)

Arne-W commented 8 years ago

I extended my hack which removed the logwriting from the DTOR. The buffer and qtDebug members are now placed on the heap. With this change it sometimes loads the log without a crash (loaded 5 times 4 without a crash). Its getting more stable. I am pretty sure the logging helper is the culprit. I have no Idea what exactly happens and why it works on 15.10 and 14.04 - but manipulating this object changes the behaviour. Perhaps we should change the logging framework or invent one by ourselves. BTW: @billbonney QDebug supports file logging too. http://doc.qt.io/qt-5/qdebug.html#QDebug

billbonney commented 8 years ago

@Arne-W re comment https://github.com/ArduPilot/apm_planner/issues/941#issuecomment-219333054

I think we should just move the logging back to using Qt's inbuilt system. QSLOG was added when we stated APM Planner 2.0 with Qt4.8 and it didn't support file logging.

I would say just redefining the QSLOG macros is the best first step. And then we can test it well, then may remove the QSLOG macros all together.

Arne-W commented 8 years ago

Made a new PR #950 mainly for the QsLog workaround. The other ones are minor fixes. Feel free to test but it seems to work :smiley: I added also a time-stamp to the log-lines and an if for trace messages which are disabled by default.

AndKe commented 8 years ago

tested https://github.com/ArduPilot/apm_planner/pull/950 a lot. Unable to crash by logging/reading logs. more than that: it seems to fix https://github.com/ArduPilot/apm_planner/issues/947 as well.

dcarpy commented 8 years ago

@Arne-W @AndKe Outstanding!

Arne-W commented 8 years ago

:tada: :confetti_ball: :tada: :confetti_ball: :tada: :confetti_ball: Nice to hear that the workaround solves this issue but I did not really understand why the usage of QsLog leads to all of this crashes - strange! Nevertheless I'm very happy that we have found a solution that seems to be stable :+1: @AndKe Thanks for all the tests you have done and all the information you gave - it helped a lot! @dcarpy @billbonney Thanks for your support too! Good team - good work!

AndKe commented 8 years ago

Thanks to all, it's fixed now.

Arne-W commented 8 years ago

Hey guys, while coding to re enable the debug output window today (I had disabled while migrating the logging system), I think I found the real culprit for this issue. The DebugOutput widget was the problem. It used a QTextBrowser window object which is NOT re-entrant directly from the write() method. This leads to the problem that an non re-entrant object was used from more than one thread - which caused those random crashes. The moment of the crash relies on the internal scheduling of the different threads accessing the object - from the user view it looks random. I will make a PR in the next days with a decoupled debug output window.

billbonney commented 8 years ago

@Arne-W Awesome, great work

dcarpy commented 8 years ago

@Arne-W Dude! Very cool.

Arne-W commented 8 years ago

Thanks :relaxed:

AndKe commented 8 years ago

Fantastic ! - and very impressive, given how many looked into this problem. Please have a look at https://github.com/ArduPilot/apm_planner/issues/646 - maybe you are the one to solve that one - I even had(/have) a freelancer bounty, but nobody took the challenge after looking into it. - It's sad how long the bug have been open, and how it destroys the runtime of a laptop that can run HD moves for 6 hours on a battery - or AP2 for ~two hours due to the bug.