ArduPilot / apm_planner

APM Planner Ground Control Station (Qt)
https://ardupilot.org

QLOG causes segfault #941

Closed AndKe closed 8 years ago

AndKe commented 8 years ago

Please see attached .zip (some .tlog from forums) (reproduced both on current master)

20160217151228.log.zip

it ends like:

ERROR 2016-04-30T15:11:27.471 Corrupt data read: Time is not increasing! Last valid time stamp: "622764"  actual read time stamp is: "553687" 
ERROR 2016-04-30T15:11:27.472 Corrupt data read: Time is not increasing! Last valid time stamp: "622764"  actual read time stamp is: "553687" 
ERROR 2016-04-30T15:11:27.472 "Unable to exec getMinTime query: No query Unable to fetch row" 
 INFO 2016-04-30T15:11:27.472 Plot Log loading took 0.517 seconds - 1554368 of 1554368 bytes used 
[Thread 0x7fff8affd700 (LWP 20609) exited]

Thread 1 "apmplanner2" received signal SIGSEGV, Segmentation fault.

Arne-W commented 8 years ago

Ahh - perfect already done :+1: I will check that too.

Arne-W commented 8 years ago

Strange - I can't reproduce this either... I tried master, our crashfix branch and the new pull request - always without any issue, and it is an exported log, not a tlog....

AndKe commented 8 years ago

So sorry, wrong file - I need to delete all the logs in Download. I look at a lot of them, then mix them up when the names are similar. It should stop loading at ~20% (always in the same spot): 20160318172628.tlog-tested.zip

Strangely, it varies where and how it crashes, and so it resembles a bit the previous issue fixed yesterday. This is tested on your build with the fix (and today's master).

ERROR 2016-05-01T10:15:01.489 Corrupt data read: Time is not increasing! Last valid time stamp: "568881"  actual read time stamp is: "568815" 
ERROR 2016-05-01T10:15:01.490 Corrupt data read: Time is not increasing! Last valid time stamp: "569362"  actual read time stamp is: "569217" 
ERROR 2016-05-01T10:15:01.493 Corrupt data read: Time is not increasing! Last valid time stamp: "570806"  actual read time stamp is: "570801" 

Thread 1 "apmplanner2" received signal SIGSEGV, Segmentation fault.
0x00007ffff5b77a68 in ?? () from /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5
(gdb) 
ERROR 2016-05-01T10:15:56.640 Corrupt data read: Time is not increasing! Last valid time stamp: "661233"  actual read time stamp is: "661208" 
ERROR 2016-05-01T10:15:56.641 Corrupt data read: Time is not increasing! Last valid time stamp: "661715"  actual read time stamp is: "661629" 
ERROR 2016-05-01T10:15:56.643 Corrupt data read: Time is not increasing! Last valid time stamp: "662679"  actual read time stamp is: "662613" 

Thread 1 "apmplanner2" received signal SIGSEGV, Segmentation fault.
0x00007ffff5b687b7 in QTextEngine::validate() const () from /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5
(gdb) 

Arne-W commented 8 years ago

Also with the new file I am not able to reproduce this. I think we should check your environment; perhaps it is something strange with your Qt version, because your crashes are Qt related and not in the APM code. I am using Kubuntu 15.10 64bit, KDE-Plasma 5.4.2, Qt 5.4.2, gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2). Did you change anything on your system which could lead to this behaviour?

As a workaround you could try to comment out these lines here. As there are a lot of time misalignments in this tlog, those string operations are triggered very often, which seems to be the problem....

AndKe commented 8 years ago

True, you are most likely onto something here, I did upgrade all (3) computers to Ubuntu 16.04 64bit recently : gcc version 5.3.1 20160413 (Ubuntu 5.3.1-14ubuntu2) Qt version 5.5.1

Qt 5.5 was strongly suggested by Bill a long time ago, so I've used some upgraded version for a while. It's very possible that something in 16.04 is causing this. Ironically, I tested lots of essential apps before upgrading (both compiled from source and downloaded), and this is the only hiccup I've seen.

AndKe commented 8 years ago

So I installed a new, fresh Ubuntu in a VM, and tested (qmake qgroundcontrol.pro && make -j8):

ubuntu's QT 5.5.1 = crash reproducible
installed QT 5.5.1 = crash reproducible
installed QT 5.4.2 = crash reproducible
installed QT 5.6.0 = won't compile and looks like:

src/uas/UASManager.cc: In member function ‘bool UASManager::setHomePosition(double, double, double)’:
src/uas/UASManager.cc:76:19: error: ‘isnan’ was not declared in this scope
     if (!isnan(lat) && !isnan(lon) && !isnan(alt)
                   ^
src/uas/UASManager.cc:76:19: note: suggested alternative:

I think we can exclude Qt from being the cause of this bug, as long as we assume your 5.4.2 is the same one that I am testing (installed by "qt-unified-linux-x64-2.0.3-online.run")

So. Is there anything else I could test ? (and how?)
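A side note on the Qt 5.6.0 build failure quoted above: an "‘isnan’ was not declared in this scope" error is commonly resolved by calling `std::isnan` from `<cmath>` instead of the unqualified `isnan`. The following is only a minimal sketch; `isValidHomePosition` and `homePositionExample` are hypothetical names used for illustration, not the actual code in src/uas/UASManager.cc.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the argument check in
// UASManager::setHomePosition(); the real function is not reproduced here.
bool isValidHomePosition(double lat, double lon, double alt)
{
    // std::isnan is declared in <cmath>. The unqualified isnan from <math.h>
    // is not guaranteed to be visible when compiling as C++11 or later.
    return !std::isnan(lat) && !std::isnan(lon) && !std::isnan(alt);
}

// Usage example: a NaN in any coordinate makes the position invalid.
void homePositionExample()
{
    assert(isValidHomePosition(59.9, 10.7, 120.0));
    assert(!isValidHomePosition(std::nan(""), 10.7, 120.0));
}
```

Whether this is the only change needed to build against Qt 5.6.0 is not established in this thread.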

Arne-W commented 8 years ago

Okay - nice that you tried all this, and f*** that nothing works. I will test Ubuntu 16.04 today; perhaps it really is a 16.04 issue. At the moment I have no idea what else you could test. First I have to test 16.04 on my side. I will inform you as soon as I have some results.

Arne-W commented 8 years ago

I have exactly the same issue in my VM using Ubuntu 16.04. It crashes randomly with the logs you provided. I tested the native Ubuntu Qt as it was the easiest to install. So I think it IS Ubuntu 16.04 related. I am really sorry, but there is nothing I can do for you. If you have time and want proof, install 15.10 or 14.04 and try; I am very sure the planner will work. But from my point of view we do not have to verify this issue any further, as there is no other possibility.

@billbonney Perhaps you can do a check if you have already an Ubuntu 16.04 installed. At the moment I would say "Apm planner is not compatible with Ubuntu 16.04"

AndKe commented 8 years ago

Do we know if an apm_planner built on 15.10 would misbehave when executed on 16.04? In other words: are we sure it's the execution environment, not the place where the program is built, that causes the problem?

Arne-W commented 8 years ago

Yes we know - I tested a few seconds ago :smiley: I have an Ubuntu 14.04 with Qt 5.5.1 and an Ubuntu 16.04 with Qt 5.5.1. I built it on 14.04 and everything works fine: no crashes, no other issues (loaded the log 5 times). Then I copied the build into the 16.04 VM and it crashed the first time the log was loaded. So sorry....

AndKe commented 8 years ago

You beat me to it, but I was about to cross-test myself between virtual 15.10 and 16.04:

AP2 compiled on 16.04 and 15.10 works well on 15.10
AP2 compiled on 16.04 and 15.10 crashes during the tlog loading on 16.04

So the conclusion is that there's something fishy with the execution environment, not the build tools. I am afraid this is too vague to report to Canonical, and I have not found any other application that misbehaves.

AndKe commented 8 years ago

In an attempt to narrow down the cause:

I ran ldd apmplanner on both 15.10 and 16.04
copied out all files mentioned in ldd
compared the files

The list below is a list of files that are used and are not binary identical to their 15.10 versions:

#!/bin/bash
cp  libsndfile.so.1 /usr/lib/x86_64-linux-gnu/libsndfile.so.1
cp  libasound.so.2 /usr/lib/x86_64-linux-gnu/libasound.so.2
cp  libcrypto.so.1.0.0 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
cp  libSDL2-2.0.so.0 /usr/lib/x86_64-linux-gnu/libSDL2-2.0.so.0
cp  libflite_cmu_us_kal.so.1 /usr/lib/x86_64-linux-gnu/libflite_cmu_us_kal.so.1
cp  libflite.so.1 /usr/lib/x86_64-linux-gnu/libflite.so.1
cp  libQt5OpenGL.so.5 /usr/lib/x86_64-linux-gnu/libQt5OpenGL.so.5
cp  libQt5Svg.so.5 /usr/lib/x86_64-linux-gnu/libQt5Svg.so.5
cp  libQt5PrintSupport.so.5 /usr/lib/x86_64-linux-gnu/libQt5PrintSupport.so.5
cp  libQt5Widgets.so.5 /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5
cp  libQt5Quick.so.5 /usr/lib/x86_64-linux-gnu/libQt5Quick.so.5
cp  libQt5Gui.so.5 /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5
cp  libQt5Qml.so.5 /usr/lib/x86_64-linux-gnu/libQt5Qml.so.5
cp  libQt5Network.so.5 /usr/lib/x86_64-linux-gnu/libQt5Network.so.5
cp  libQt5Sql.so.5 /usr/lib/x86_64-linux-gnu/libQt5Sql.so.5
cp  libQt5SerialPort.so.5 /usr/lib/x86_64-linux-gnu/libQt5SerialPort.so.5
cp  libQt5Script.so.5 /usr/lib/x86_64-linux-gnu/libQt5Script.so.5
cp  libQt5Test.so.5 /usr/lib/x86_64-linux-gnu/libQt5Test.so.5
cp  libQt5Core.so.5 /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
cp  libstdc++.so.6 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
cp  libm.so.6 /lib/x86_64-linux-gnu/libm.so.6
cp  libgcc_s.so.1 /lib/x86_64-linux-gnu/libgcc_s.so.1
cp  libc.so.6 /lib/x86_64-linux-gnu/libc.so.6
cp  libvorbisenc.so.2 /usr/lib/x86_64-linux-gnu/libvorbisenc.so.2
cp  libdl.so.2 /lib/x86_64-linux-gnu/libdl.so.2
cp  libpthread.so.0 /lib/x86_64-linux-gnu/libpthread.so.0
cp  librt.so.1 /lib/x86_64-linux-gnu/librt.so.1
cp  ld-linux-x86-64.so.2 /lib64/ld-linux-x86-64.so.2
cp  libpulse.so.0 /usr/lib/x86_64-linux-gnu/libpulse.so.0
cp  libsndio.so.6.1 /usr/lib/x86_64-linux-gnu/libsndio.so.6.1
cp  libXi.so.6 /usr/lib/x86_64-linux-gnu/libXi.so.6
cp  libwayland-egl.so.1 /usr/lib/x86_64-linux-gnu/libwayland-egl.so.1
cp  libwayland-client.so.0 /usr/lib/x86_64-linux-gnu/libwayland-client.so.0
cp  libwayland-cursor.so.0 /usr/lib/x86_64-linux-gnu/libwayland-cursor.so.0
cp  libflite_usenglish.so.1 /usr/lib/x86_64-linux-gnu/libflite_usenglish.so.1
cp  libflite_cmulex.so.1 /usr/lib/x86_64-linux-gnu/libflite_cmulex.so.1
cp  libgobject-2.0.so.0 /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
cp  libglib-2.0.so.0 /lib/x86_64-linux-gnu/libglib-2.0.so.0
cp  libpng12.so.0 /lib/x86_64-linux-gnu/libpng12.so.0
cp  libGL.so.1 /usr/lib/nvidia-361/libGL.so.1
cp  libproxy.so.1 /usr/lib/x86_64-linux-gnu/libproxy.so.1
cp  libudev.so.1 /lib/x86_64-linux-gnu/libudev.so.1
cp  libicui18n.so.55 /usr/lib/x86_64-linux-gnu/libicui18n.so.55
cp  libicuuc.so.55 /usr/lib/x86_64-linux-gnu/libicuuc.so.55
cp  libpcre16.so.3 /usr/lib/x86_64-linux-gnu/libpcre16.so.3
cp  libvorbis.so.0 /usr/lib/x86_64-linux-gnu/libvorbis.so.0
cp  libpulsecommon-8.0.so /usr/lib/x86_64-linux-gnu/pulseaudio/libpulsecommon-8.0.so
cp  libdbus-1.so.3 /lib/x86_64-linux-gnu/libdbus-1.so.3
cp  libbsd.so.0 /lib/x86_64-linux-gnu/libbsd.so.0
cp  libxcb.so.1 /usr/lib/x86_64-linux-gnu/libxcb.so.1
cp  libffi.so.6 /usr/lib/x86_64-linux-gnu/libffi.so.6
cp  libpcre.so.3 /lib/x86_64-linux-gnu/libpcre.so.3
cp  libfreetype.so.6 /usr/lib/x86_64-linux-gnu/libfreetype.so.6
cp  libgraphite2.so.3 /usr/lib/x86_64-linux-gnu/libgraphite2.so.3
cp  libGLX.so.0 /usr/lib/nvidia-361/libGLX.so.0
cp  libGLdispatch.so.0 /usr/lib/nvidia-361/libGLdispatch.so.0
cp  libicudata.so.55 /usr/lib/x86_64-linux-gnu/libicudata.so.55
cp  libsystemd.so.0 /lib/x86_64-linux-gnu/libsystemd.so.0
cp  libasyncns.so.0 /usr/lib/x86_64-linux-gnu/libasyncns.so.0
cp  libXdmcp.so.6 /usr/lib/x86_64-linux-gnu/libXdmcp.so.6
cp  libselinux.so.1 /lib/x86_64-linux-gnu/libselinux.so.1
cp  libgcrypt.so.20 /lib/x86_64-linux-gnu/libgcrypt.so.20
cp  libnsl.so.1 /lib/x86_64-linux-gnu/libnsl.so.1
cp  libresolv.so.2 /lib/x86_64-linux-gnu/libresolv.so.2
cp  libgpg-error.so.0 /lib/x86_64-linux-gnu/libgpg-error.so.0

Then I attempted to restore the files in the script above onto a 16.04 system, just to see if it helped.
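For anyone who wants to repeat the comparison step, a byte-for-byte check of two copies of a library could be sketched roughly as below. This assumes C++14 or later (for the four-iterator `std::equal`); `filesIdentical` is a hypothetical helper and not the tool that was actually used in this thread.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Returns true only when both files can be opened and have identical contents.
bool filesIdentical(const std::string& pathA, const std::string& pathB)
{
    std::ifstream a(pathA, std::ios::binary);
    std::ifstream b(pathB, std::ios::binary);
    if (!a || !b) {
        return false;  // a missing or unreadable file counts as "different"
    }
    std::istreambuf_iterator<char> beginA(a), beginB(b), end;
    // The four-iterator overload also reports a mismatch when one file is
    // a strict prefix of the other.
    return std::equal(beginA, end, beginB, end);
}
```

Calling it with the 15.10 and 16.04 copies of the same .so file, one pair at a time, would reproduce the kind of list shown above.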

billbonney commented 8 years ago

Is it SDL2.0 bug again? Turn off audio.

AndKe commented 8 years ago

Nope - in hope, I tried to disable audio too, but it did not help. If you read above, you'll see this is a strange problem, similar to https://github.com/ArduPilot/apm_planner/issues/940

The thing that scares me the most is that no other application seems to have these issues. Normally I'd believe that it is something new with Ubuntu 16.04 that will be fixed, but it's very hard to report to Canonical (not knowing where the problem is), and it is not even related to the Qt version...

Arne-W commented 8 years ago

Googling around I found an issue that looks the same, but there is no solution so far: https://bugs.launchpad.net/ubuntu/+source/calibre/+bug/1545921

I also made some backtraces; perhaps someone has an idea. 5 runs, 5 crashes, 5 different backtraces: BackTraces.zip

And here are the call stacks: StackTraces.zip

AndKe commented 8 years ago

@Arne-W The bug you found was "a duplicate of non-existing bug". I found a very similar bug (maybe the one that should be referred to): https://bugs.launchpad.net/ubuntu/+source/calibre/+bug/1027371 and it was apparently fixed in Calibre, not in the OS.

Arne-W commented 8 years ago

Yes, I have seen that - it's a mess. At the moment I do not have any idea what to do next. We spent a lot of time analysing this bug and did not find any hint on how to solve it. I'm at my wit's end.

Arne-W commented 8 years ago

Looking at stack trace 2 I saw this line:

3 malloc_printerr /usr/lib/debug/lib/x86_64-linux-gnu/libc-2.23.so 5007 0x7ffff40dbf01

@AndKe have you ever watched the syslog or kernel log when the APM planner crashes? I did not! :confused: At the moment I have no VM with 16.04, so I can't check that myself - perhaps you could have a quick look?

AndKe commented 8 years ago

@Arne-W sure - glad to be of any help if possible - unfortunately I don't see any great help in syslog /dmesg/

andre@loke:~/prog$ dmesg |tail
[236421.650782] usb 5-1.3: USB disconnect, device number 11
[236423.384209] usb 5-1.3: new full-speed USB device number 12 using xhci_hcd
[236423.662219] usb 5-1.3: New USB device found, idVendor=26ac, idProduct=0011
[236423.662224] usb 5-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[236423.662227] usb 5-1.3: Product: PX4 FMU v2.x
[236423.662229] usb 5-1.3: Manufacturer: 3D Robotics
[236423.662231] usb 5-1.3: SerialNumber: 0
[236423.662458] usb 5-1.3: ep 0x81 - rounding interval to 64 microframes, ep desc says 80 microframes
[236423.664330] cdc_acm 5-1.3:1.0: ttyACM0: USB ACM device
[241087.185928] usb 5-1.3: USB disconnect, device number 12
andre@loke:~/prog$ tail /var/log/syslog
May  4 20:29:06 loke avahi-daemon[983]: Withdrawing address record for 2a03:8000:3f2:2a00:8a1:c41d:55bc:c959 on eno1.
May  4 20:29:06 loke avahi-daemon[983]: Withdrawing address record for 2a03:8000:3f2:2a00:216d:f4b0:4d44:aa34 on eno1.
May  4 20:29:06 loke avahi-daemon[983]: Withdrawing address record for 2a03:8000:3f2:2a00:3179:9cc7:a6f3:dab0 on eno1.
May  4 20:29:07 loke avahi-daemon[983]: Registering new address record for 2a03:8000:3f2:2a00:216d:f4b0:4d44:aa34 on eno1.*.
May  4 20:29:07 loke avahi-daemon[983]: Registering new address record for 2a03:8000:3f2:2a00:3179:9cc7:a6f3:dab0 on eno1.*.
May  4 20:29:07 loke avahi-daemon[983]: Registering new address record for 2a03:8000:3f2:2a00:8a1:c41d:55bc:c959 on eno1.*.
May  4 20:29:07 loke avahi-daemon[983]: Withdrawing address record for 2a03:8000:3f2:2a00:8a1:c41d:55bc:c959 on eno1.
May  4 20:29:07 loke avahi-daemon[983]: Withdrawing address record for 2a03:8000:3f2:2a00:216d:f4b0:4d44:aa34 on eno1.
May  4 20:29:07 loke avahi-daemon[983]: Withdrawing address record for 2a03:8000:3f2:2a00:3179:9cc7:a6f3:dab0 on eno1.
May  4 20:29:07 loke NetworkManager[1056]: <info>  [1462386547.4950] policy: set 'Wired connection 1' (eno1) as default for IPv6 routing and DNS

##############CRASH 

andre@loke:~/prog$ dmesg |tail
[236423.384209] usb 5-1.3: new full-speed USB device number 12 using xhci_hcd
[236423.662219] usb 5-1.3: New USB device found, idVendor=26ac, idProduct=0011
[236423.662224] usb 5-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[236423.662227] usb 5-1.3: Product: PX4 FMU v2.x
[236423.662229] usb 5-1.3: Manufacturer: 3D Robotics
[236423.662231] usb 5-1.3: SerialNumber: 0
[236423.662458] usb 5-1.3: ep 0x81 - rounding interval to 64 microframes, ep desc says 80 microframes
[236423.664330] cdc_acm 5-1.3:1.0: ttyACM0: USB ACM device
[241087.185928] usb 5-1.3: USB disconnect, device number 12
[241249.712437] apmplanner2[3814]: segfault at 0 ip 00007f13c1e71a28 sp 00007ffeacc34500 error 4 in libQt5Gui.so.5.5.1[7f13c1c97000+527000]
andre@loke:~/prog$ tail /var/log/syslog
May  4 20:30:06 loke avahi-daemon[983]: Withdrawing address record for 2a03:8000:3f2:2a00:216d:f4b0:4d44:aa34 on eno1.
May  4 20:30:06 loke avahi-daemon[983]: Withdrawing address record for 2a03:8000:3f2:2a00:3179:9cc7:a6f3:dab0 on eno1.
May  4 20:30:07 loke avahi-daemon[983]: Registering new address record for 2a03:8000:3f2:2a00:216d:f4b0:4d44:aa34 on eno1.*.
May  4 20:30:07 loke avahi-daemon[983]: Registering new address record for 2a03:8000:3f2:2a00:3179:9cc7:a6f3:dab0 on eno1.*.
May  4 20:30:07 loke avahi-daemon[983]: Registering new address record for 2a03:8000:3f2:2a00:8a1:c41d:55bc:c959 on eno1.*.
May  4 20:30:07 loke avahi-daemon[983]: Withdrawing address record for 2a03:8000:3f2:2a00:8a1:c41d:55bc:c959 on eno1.
May  4 20:30:07 loke avahi-daemon[983]: Withdrawing address record for 2a03:8000:3f2:2a00:216d:f4b0:4d44:aa34 on eno1.
May  4 20:30:07 loke avahi-daemon[983]: Withdrawing address record for 2a03:8000:3f2:2a00:3179:9cc7:a6f3:dab0 on eno1.
May  4 20:30:07 loke NetworkManager[1056]: <info>  [1462386607.5119] policy: set 'Wired connection 1' (eno1) as default for IPv6 routing and DNS
May  4 20:30:12 loke kernel: [241249.712437] apmplanner2[3814]: segfault at 0 ip 00007f13c1e71a28 sp 00007ffeacc34500 error 4 in libQt5Gui.so.5.5.1[7f13c1c97000+527000]
andre@loke:~/prog$ 

kern.log contains same information too, nothing more.

AndKe commented 8 years ago

the crash seems to only occur on tlogs that produce text like:

ERROR 2016-05-04T20:37:38.190 Corrupt data read: Time is not increasing! Last valid time stamp: "69271"  actual read time stamp is: "69110" 
ERROR 2016-05-04T20:37:38.210 Corrupt data read: Time is not increasing! Last valid time stamp: "75531"  actual read time stamp is: "75510" 
ERROR 2016-05-04T20:37:38.211 Corrupt data read: Time is not increasing! Last valid time stamp: "76131"  actual read time stamp is: "76110" 

I am now collecting one huge log with no preflight reboots and no multiple flights. Smaller tests with such "linear time" logs worked fine.
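To check quickly whether a log has such time jumps, the rule implied by the error message (a sample is rejected when its time stamp is lower than the last valid one) could be sketched as follows. This is an assumption about the check, not the code in AP2DataPlotThread.cc; `findTimeRegressions` and `timeRegressionExample` are hypothetical names, and equal time stamps are treated as valid here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the indices of all samples whose time stamp is lower than the
// last valid time stamp seen before them.
std::vector<std::size_t> findTimeRegressions(const std::vector<std::uint64_t>& stamps)
{
    std::vector<std::size_t> bad;
    if (stamps.empty()) {
        return bad;
    }
    std::uint64_t lastValid = stamps[0];
    for (std::size_t i = 1; i < stamps.size(); ++i) {
        if (stamps[i] < lastValid) {
            bad.push_back(i);       // "Time is not increasing!"
        } else {
            lastValid = stamps[i];  // only valid samples move the reference
        }
    }
    return bad;
}

// Usage example with time stamps taken from the log excerpts in this thread.
void timeRegressionExample()
{
    const std::vector<std::uint64_t> stamps = {568881, 568815, 569362, 569217};
    const std::vector<std::size_t> bad = findTimeRegressions(stamps);
    assert(bad.size() == 2 && bad[0] == 1 && bad[1] == 3);
}
```

A log for which this returns an empty list would be a "linear time" log in the sense used above.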

AndKe commented 8 years ago

Eureka ! given all the GUI related crashes, I commented out https://github.com/ArduPilot/apm_planner/blob/master/src/ui/AP2DataPlotThread.cc#L1236-L1238

billbonney commented 8 years ago

I should be paying attention to those commits. tempVal is not a good name for a variable! maybe currentTimestamp would be better ;-)

QLOG is supposed to be thread safe; maybe there is an issue with that. I'd have to look. But it might also be that the corruptTimeRead() method is operating on a structure that another thread, i.e. the UI thread, is reading.

billbonney commented 8 years ago

It looks like QS_LOG is compiled in without thread safety on https://github.com/ArduPilot/apm_planner/blob/master/QsLog/QsLog.pri#L4 Fix that and you fix the problem, would be my guess. I'll try and test later, but if you (@AndKe, @Arne-W) feel like testing, go ahead.

billbonney commented 8 years ago

PS: newer version of QS_LOG here https://bitbucket.org/codeimproved/qslog/src/a30fd8f463ed16dbb11761ccf139bdcd4bd4aad6/QsLog.h?fileviewer=file-view-default#QsLog.h-35

AndKe commented 8 years ago

Can I just replace the file ? I am mostly programming microcontrollers, so securing multi thread execution is not something I know how to do.

Arne-W commented 8 years ago

WTF!?!? I did not realize that QLOG is a third-party tool - I thought it belonged to Qt itself. @AndKe Did you make it?? I guess replacing just the QsLog.h is not enough, but uncommenting the line Bill mentioned above should do the trick for a first test.

AndKe commented 8 years ago

Will uncomment and try later, out with family now.

AndKe commented 8 years ago

Good news! Un-commenting:

DEFINES += QS_LOG_SEPARATE_THREAD # messages are queued and written from a separate thread

makes the problem go away. I guess you have a graceful solution now :)
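For readers unfamiliar with the idea behind the QS_LOG_SEPARATE_THREAD comment ("messages are queued and written from a separate thread"), here is a rough standard-library sketch of that pattern. It is not QsLog's implementation; `QueuedLogger` and all of its members are hypothetical, and the "write" is just an append to a vector so the example stays self-contained.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical queued logger: any thread may call log(), but only the
// single worker thread ever touches the output.
class QueuedLogger {
public:
    QueuedLogger() : worker_([this] { run(); }) {}
    ~QueuedLogger() { stop(); }

    void log(std::string message)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!stopping_ && "log() called after stop()");
            pending_.push(std::move(message));
        }
        wake_.notify_one();
    }

    // Drains the queue and joins the worker; safe to call more than once.
    void stop()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        if (worker_.joinable()) {
            worker_.join();
        }
    }

    // Snapshot of everything the worker has written so far.
    std::vector<std::string> written() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return written_;
    }

private:
    void run()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return stopping_ || !pending_.empty(); });
            while (!pending_.empty()) {
                written_.push_back(std::move(pending_.front()));
                pending_.pop();
            }
            if (stopping_) {
                return;
            }
        }
    }

    mutable std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::string> pending_;
    std::vector<std::string> written_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};
```

The design point is that producers only touch the queue under a mutex, so the output itself is never accessed from two threads at once. Whether that is the reason the define helped here is not established in this thread.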

billbonney commented 8 years ago

I've pushed a fix a2b1ac0521319321934a0bed58019e1a03b1ee39

I think the reason it was disabled is that it would lock on exit. The new version of the lib that I have used fixes that.

see https://github.com/victronenergy/QsLog and this comment https://github.com/victronenergy/QsLog/blob/master/QsLogReadme.txt#L58 in particular

Let me know if you have issues and if not we can close

AndKe commented 8 years ago

Hi, I am unable to build current master due to

src/main.cc: In function ‘int main(int, char**)’:
src/main.cc:82:67: error: ‘QsLogging::LogRotationOption’ is not a class or namespace
                                                        QsLogging::LogRotationOption::EnableLogRotation,
                                                                   ^
Makefile:32212: recipe for target 'build-release/obj/main.o' failed
make: *** [build-release/obj/main.o] Error 1

billbonney commented 8 years ago

fixed with 5e8e82d it was only a warning with clang on OSX

AndKe commented 8 years ago

Current master crashes now like this: (on .tlog with timejump)

ERROR 2016-05-06T11:32:53.949 Corrupt data read: Time is not increasing! Last valid time stamp: "287753"  actual read time stamp is: "287608" 
ERROR 2016-05-06T11:32:53.950 Corrupt data read: Time is not increasing! Last valid time stamp: "288234"  actual read time stamp is: "288208" 
*** Error in `/home/andre/prog/apm_planner/release/apmplanner2': malloc(): smallbin double linked list corrupted: 0x0000000004bc4c30 ***
ERROR 2016-05-06T11:32:53.951 Corrupt data read: Time is not increasing! Last valid time stamp: "288717"  actual read time stamp is: "288611" 
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x77725)[0x7ffff40d1725]
/lib/x86_64-linux-gnu/libc.so.6(+0x81f01)[0x7ffff40dbf01]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7ffff40dd5a4]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_Znwm+0x18)[0x7ffff49cfe78]

full stacktrace here: https://gist.github.com/AndKe/c21aeaae09394c748e9091421b9fd776

UPDATE: It's still able to crash like this

ERROR 2016-05-06T12:58:05.849 Corrupt data read: Time is not increasing! Last valid time stamp: "409684"  actual read time stamp is: "409618" 

Thread 1 "apmplanner2" received signal SIGSEGV, Segmentation fault.
0x00007ffff5b777f1 in ?? () from /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5

and

Thread 1 "apmplanner2" received signal SIGSEGV, Segmentation fault.
0x00007ffff40d84f5 in malloc_consolidate (av=av@entry=0x7ffff441db20 <main_arena>) at malloc.c:4184
4184    malloc.c: No such file or directory.

Which way it crashes seems random.

AndKe commented 8 years ago

Sorry for bombing this thread with info, but I think it's useful: here is a crash during USB connection, proving that not only log-related output is capable of triggering this (on current master):

DEBUG 2016-05-06T13:26:51.286 Param: "SERIAL1_BAUD" :  QVariant(int, 57) 
DEBUG 2016-05-06T13:26:51.410 Param: "TELEM_DELAY" :  QVariant(int, 0) 
DEBUG 2016-05-06T13:26:51.411 Update RTL_ALT 1500 
DEBUG 2016-05-06T13:26:51.411 Param: "RTL_ALT" :  QVariant(int, 1500) 
DEBUG 2016-05-06T13:26:51.411 RNGFND param: "RNGFND_GAIN"  value: QVariant(double, 0.8) 
DEBUG 2016-05-06T13:26:51.411 Param: "RNGFND_GAIN" :  QVariant(double, 0.8) 

Thread 1 "apmplanner2" received signal SIGSEGV, Segmentation fault.
0x00007ffff5b6d2b0 in QTextEngine::shapeText(int) const () from /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5
(gdb) quit
A debugging session is active.

Flashing firmware is also enough to make the QLOG crash it:

 INFO 2016-05-09T11:11:11.080 flashing: 111000 / 986680 
 INFO 2016-05-09T11:11:11.105 flashing: 114000 / 986680 
 INFO 2016-05-09T11:11:11.130 flashing: 117000 / 986680 
Segmentation fault (core dumped)

billbonney commented 8 years ago

I'm seeing the same issue. Trying to understand how I can fix

Arne-W commented 8 years ago

Looking at the backtraces and all the other information collected here, it seems that the crash is somehow related to the logging, but I don't think it is the culprit. The crash happens at several different locations in the AP2DataPlotThread. Only thread 1 is running at the same time, which seems to be the UI thread. Very, very often one of the threads is allocating or deallocating memory when the crash occurs. Just to share my thoughts...

AndKe commented 8 years ago

what must I comment out in QsLog.cpp to disable all console/log output ? I need apmplanner to work reliably today.

Arne-W commented 8 years ago

@AndKe I think uncommenting this line could help https://github.com/ArduPilot/apm_planner/blob/master/QsLog/QsLog.pri#L3

AndKe commented 8 years ago

@Arne-W Thank you, works fine. (I am actually in the field, using mavlink inspector and gui for tuning on new planes, much faster than using MAVproxy - but it was crashing all the time)

Arne-W commented 8 years ago

@AndKe could you please test the following: Enable the logging again and make sure the thread is enabled. Change this line from QsLogging::MaxSizeBytes(0) to QsLogging::MaxSizeBytes(1024000) Hopefully it fixes the issue...

dcarpy commented 8 years ago

@Arne-W Hmm...that seemed to have worked (increasing from 0 to 1024000). Well, on Windows anyways. :+1:

dcarpy commented 8 years ago

Linux too!

AndKe commented 8 years ago

Good morning (testing on Ubuntu 16.04). After increasing MaxSizeBytes I have made some unusual observations. On the first tlog opening (the easiest way to reproduce) it may or may not crash (if a very small/OK tlog is used, then only a few time messages will be shown).

This is after the graph of the first small .tlog file has been opened twice and closed twice:

ERROR 2016-05-11T08:12:21.697 Corrupt data read: Time is not increasing! Last valid time stamp: "199718"  actual read time stamp is: "199638" 
ERROR 2016-05-11T08:12:21.698 Corrupt data read: Time is not increasing! Last valid time stamp: "200259"  actual read time stamp is: "200238" 
QSqlDatabasePrivate::removeDatabase: connection '{3abf9091-646d-4d30-8dca-94ab8c00a182}' is still in use, all queries will cease to work.
QSqlDatabasePrivate::removeDatabase: connection '{ae16b240-8737-47a9-94c1-77843b5b605a}' is still in use, all queries will cease to work.

Now, any subsequent .tlog loading won't crash, just because the terminal is no longer updated with those warnings. The QSqlDatabasePrivate::removeDatabase: connection lines will appear in the terminal after closing logs, but there is no QLOG activity.

So it may seem as if the application works fine, but in reality it has just stopped QLOG'ging.

Messages written to the terminal that do not use QLOG will still work (example):

qml: APMToolBar: CONFIG/TUNING SELECTED
qml: APMToolBar: clear selected buttons
qml: APMToolBar: FLIGHT PLAN unselected

When the application is then "closed" (the main window is gone), the terminal session is not freed; only after Ctrl-C in the terminal will the apmplanner2 process actually end.

Arne-W commented 8 years ago

AHHRG - I hate this bug. Okay, the reason why I thought increasing MaxSizeBytes would solve this issue was that since Bill updated QsLog it crashed on my 15.10 too! The crash looked a bit different, but I thought it was related to the update. Disabling the file logger made the problem go away. Therefore I got the idea about the file size, on the assumption that a file size of 0 would lead to problems when writing into a file that closes as soon as you try to write. This seems to be a real QsLog bug, because a size of 0 should be handled. Most likely this is the bug which led to the Windows crash, as it leads to a crash on my Ubuntu 15.10 too.

@AndKe the problem you are describing seems to be a problem with the logger thread in the QsLog - thanks for the testing :+1: Unfortunately I destroyed my 16.04 VM today so I could not test by myself on a 16.04. So the next thing to do would be to turn off the thread (comment this line) and turn off the file logging (comment this line). And now - Try Again !!!
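The hypothesis about a maximum file size of 0 can be illustrated with a small sketch of size-based rotation. This is not QsLog's code and does not show where the crash comes from; `SizeRotationSketch` is a hypothetical class that only demonstrates why a limit of 0 bytes would make such a policy rotate (close and reopen the file) on every single write.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical size-based rotation bookkeeping, for illustration only.
class SizeRotationSketch {
public:
    explicit SizeRotationSketch(std::int64_t maxSizeBytes)
        : maxSizeBytes_(maxSizeBytes) {}

    // Records one write of messageBytes and rotates first if the write
    // would push the current file past the limit.
    void write(std::int64_t messageBytes)
    {
        assert(messageBytes > 0);
        if (currentSizeBytes_ + messageBytes > maxSizeBytes_) {
            ++rotations_;           // stands in for: close file, open a new one
            currentSizeBytes_ = 0;
        }
        currentSizeBytes_ += messageBytes;
    }

    int rotations() const { return rotations_; }

private:
    std::int64_t maxSizeBytes_;
    std::int64_t currentSizeBytes_ = 0;
    int rotations_ = 0;
};
```

With a limit of 0, five writes of 100 bytes cause five rotations; with the 1024000 limit suggested earlier they cause none.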

dcarpy commented 8 years ago

@Arne-W Keep the file size at 1024000 or put back to 0?

Arne-W commented 8 years ago

That should make no difference as the logger is created but not used.

AndKe commented 8 years ago

@Arne-W based on current master (size back to 0) I disabled file logging (Line88 commented out):

I loaded a few tlogs until it crashed (and also observed the "no more terminal output" behaviour after the first tlog). On the next run it crashed at the first attempt, in the same libQt5Gui.so.5.

AndKe commented 8 years ago

After the test in the previous post (with L88 still disabled), I commented out the "own-thread" line. Now it did something I have never seen before: no crash, just a "stop".

It stopped with the "time not increasing" messages (at 41% of the log), then a few seconds later the threads exited.
The application allowed me to press cancel and then close the graph before it stopped responding.

ERROR 2016-05-11T21:05:45.558 Corrupt data read: Time is not increasing! Last valid time stamp: "487137"  actual read time stamp is: "487021" 
ERROR 2016-05-11T21:05:45.559 Corrupt data read: Time is not increasing! Last valid time stamp: "487754"  actual read time stamp is: "487619" 
[Thread 0x7fffb8d27700 (LWP 12598) exited]
[Thread 0x7fffaa7fc700 (LWP 12602) exited]
[Thread 0x7fffa97fa700 (LWP 12604) exited]
[Thread 0x7fffa9ffb700 (LWP 12603) exited]
[Thread 0x7fffa8ff9700 (LWP 12605) exited]
[Thread 0x7fffab7fe700 (LWP 12600) exited]
[Thread 0x7fffb97fd700 (LWP 12597) exited]
[Thread 0x7fff8bfff700 (LWP 12606) exited]
[Thread 0x7fffabfff700 (LWP 12599) exited]
[Thread 0x7fffaaffd700 (LWP 12601) exited]
[Thread 0x7fff8affd700 (LWP 12609) exited]

Arne-W commented 8 years ago

All right :confused: First of all thanks for testing :+1:
I will do some tests tomorrow - ran out of ideas for now. @dcarpy you said Linux too - which linux?

dcarpy commented 8 years ago

@Arne-W On 16.04. Tough problem, I hope it's not driving u nuts. Feel free to ask me for a Windows or Linux build...anything to try to help.

billbonney commented 8 years ago

I also thought I had fixed it, then it crashed after a period of time. It's driving me nuts. I didn't have enough time to dig deeper.