d-ronin / dRonin

The dRonin flight controller software.
http://dronin.org

Sparky reboots on inputs setup #227

Closed jhitesma closed 8 years ago

jhitesma commented 8 years ago

Using recent builds of next I'm getting reboots on my sparky board when setting up inputs with a PPM RX.

I've tried two different RXs (though admittedly both are homemade FlySky RXs, so I'm not 100% sure there isn't a hardware issue going on) and tried powering the RX both through the sparky and directly off the BEC.

Either way, my board reboots near the end of the inputs wizard (anywhere from the "center the sticks" step to just after the "confirm failsafe" step). If I try manual setup, it reboots almost as soon as I try to change anything.

mlyle commented 8 years ago

@jhitesma and @dustin have both seen various sparky1 anomalies

I don't know why sparky1 would be different, and there's other possible explanations for each of these, but the evidence of badness is adding up.

jhitesma commented 8 years ago

Ok, I can cause this to happen really easily now :) Starting Manual config does it every time with a PPM RX setup and working. If the RX is disconnected or not powered - no crash. RX powered - insta-crash on manual calibration.

Reboot Cause is independent watchdog timer

I watched TaskInfo-Stack->StackRemaining and didn't see anything jump...but it crashed so quick something could have changed before I got an update. ManualControl was at 240...it is at 464 after a boot and before I enable manual calibration. After the crash it stays at 240.

In SystemStats I'm seeing the following right after boot, with no change at or after the crash (only the Flight Time resets): HeapRemaining 14160, CPULoad 42, IRQStackRemaining 512.
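The before/after stack comparison described above can be sketched as a small script. This is only an illustration: the `stack_drops` helper and the `low_water` threshold are hypothetical, and the only real numbers are the two ManualControl readings quoted in this comment (464 after boot, 240 after the crash), in whatever units the GCS reports.

```python
# Hypothetical helper, not part of dRonin or the GCS: compares two snapshots
# of per-task StackRemaining values copied by hand from the object browser.

def stack_drops(before, after, low_water=256):
    """Return a report for every task whose remaining stack went down.

    low_water is an arbitrary illustrative threshold, not a dRonin constant.
    """
    report = {}
    for task, start in before.items():
        end = after.get(task)
        if end is None or end >= start:
            continue  # task missing from second snapshot, or no drop
        report[task] = {
            "before": start,
            "after": end,
            "drop": start - end,
            "low": end < low_water,
        }
    return report


# Values quoted in the comment above.
baseline = {"ManualControl": 464}    # after a boot, before manual calibration
post_crash = {"ManualControl": 240}  # after the crash

for task, info in stack_drops(baseline, post_crash).items():
    flag = " (below low-water mark)" if info["low"] else ""
    print(f"{task}: {info['before']} -> {info['after']}, drop {info['drop']}{flag}")
```

A drop like this shows the task used more stack at some point, but as noted above, the browser may not refresh fast enough to catch the lowest value before the watchdog resets the board.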

mlyle commented 8 years ago

Can you please screenshot stackremaining just at baseline, before manual calibration? (but with RX configured)

mlyle commented 8 years ago

This seems to happen on Sparky1 with PPM and the input calibration wizard. With the event system changes reverted, the functionality works. These are the stacks at the end of that:

So telemetry is very close. But turning up telemetry stack size before didn't seem to help.

jhitesma commented 8 years ago

Just to clarify, it happens with both the wizard and the manual setup. In fact, with the manual setup it happens almost immediately every time. With the wizard it happens at variable points: sometimes as soon as the "center all sticks" page, sometimes not until the "correct backwards channels" page.

My impression is that it happens when the GCS starts reading the stick positions faster.

mlyle commented 8 years ago

Interesting. So it doesn't go bad with channel identification, etc? And that all seems fine?

This points even more towards telemetry.

jhitesma commented 8 years ago

Yeah, it gets through the channel identification in the wizard. It's when it starts reading all the channels faster that all hell breaks loose. An easy way to repro is to just enable manual config: I get 2-3 updates and boom. But stacks still look good :( (Unless the data in the browser just isn't updated fast enough for me to see them go boom.)

jihlein commented 8 years ago

Confirmed same problem on AQ32 with PPM receiver.

tracernz commented 8 years ago

On mine, on next(0b502efed77eaa4a916736673d00383784023548), @jhitesma's repro steps just break the telemetry connection but the board keeps running, same as #98 does for me now.

tracernz commented 8 years ago

I just compiled Sparky with FreeRTOS and it doesn't experience this issue.

tracernz commented 8 years ago

Branch is at https://github.com/tracernz/dronin/tree/sparky-now-with-less-crash if you want to test/confirm the problem is only present with ChibiOS @jhitesma.

jhitesma commented 8 years ago

Already discussed in IRC, but to keep this up to date: confirmed that the issue does not repro on @tracernz's 'sparky-now-with-less-crash' branch.

jihlein commented 8 years ago

Using latest next, and erasing the settings sector first, I can get through the entire setup on AQ32 hardware with no issues. I set up PPM, sync PWM, tail servo at 330 Hz, battery monitor, FrSky sensor hub, logging to openlog, GPS, and external mag. I was able to calibrate level and orient the board (I have it installed facing 180 degrees). To be sure, I erased the settings and repeated the above twice with no issues. This is with a build from Windows. I'm trying to set up an OSX build environment to cross-check things on too.

mluessi commented 8 years ago

I can reproduce this on Brain using the latest next and GCS on Linux. The board usually resets during the step where all sticks are moved to the maximum extents.

mluessi commented 8 years ago

The above test was done on 21e0f0bcbae4f427a69728532300b15f912ea447. Interestingly, it works (doesn't crash) when all settings are erased, but it crashes when using the settings below.

https://gist.github.com/mluessi/8020ef81255fe5321cae

pug398 commented 8 years ago

I have had this problem pretty much all along. If you import settings for one type of RX and try to use the wizard, there were always weird things happening, from crazy stick inputs to resets. I always recommended erasing all settings when changing RX. If it consistently corrupts after applying a settings file, I wonder what results you would get if you deleted just the "manual control settings" object group from the uav file and wrote it back to a clean board?
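The experiment suggested above (dropping just one object from the uav file before importing it) could be done by hand in a text editor, or with a short script along these lines. This is a sketch under assumptions: it assumes the uav file is XML with `<object name="...">` elements, and that the object in question is named `ManualControlSettings`. Neither is confirmed in this thread, and `strip_object` is a hypothetical helper, not a GCS feature.

```python
import xml.etree.ElementTree as ET


def strip_object(uav_xml, object_name="ManualControlSettings"):
    """Return the XML text with every <object name=object_name> removed.

    Assumes (not confirmed here) that settings are stored as
    <object name="..."> elements somewhere under the document root.
    """
    root = ET.fromstring(uav_xml)
    # Materialize the element list first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "object" and child.get("name") == object_name:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Made-up sample document, only to show the call; real files have more content.
sample = (
    "<uavobjects><settings>"
    '<object name="ManualControlSettings"><field name="Example" values="0"/></object>'
    '<object name="StabilizationSettings"/>'
    "</settings></uavobjects>"
)
print(strip_object(sample))
```

Importing the stripped file onto an erased board would then leave that one object at its defaults while restoring everything else, which is the comparison being asked about.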

mlyle commented 8 years ago

@pug398-- it's great that it gives me a consistent repro of this crash.

Just to confirm-- are you saying that you've gotten board resets while in the input wizard "limits" mode on Tau? Because that's very helpful if true.

pug398 commented 8 years ago

Yes, but on a different FC. Tau has done it all along. Applying a settings file and then trying to run the radio wizard has always had issues. That is why I always recommend erasing settings unless you are using the exact same radio setup. I'm not positive it is the same issue, but I have always avoided running the radio wizard after importing a uav file because of crazy results. If it can be repro'd then perhaps it can finally be tracked down. I thought there might be an issue with writing the manual control uavo back.

pug398 commented 8 years ago

If you cut the object out of the uav file, it never shows up in the import. Otherwise the object loads in GCS, but supposedly unchecking it means no data for it is saved to the board after reboot. Although the two should have the same end result, it is possible they do not.

pug398 commented 8 years ago

OK, a fresh sparky1 setup reboots as soon as you get to the max limits page on Preview, so I'm going back further...

pug398 commented 8 years ago

TL AE25 built on 11/30 does not have the issue. Preview 12/2 does. This is something different and very repeatable. Preview flight firmware built on 12/2 connected to AE25 GCS still has the issue.

pug398 commented 8 years ago

SBUS has the issue as well; it just takes longer to go bonkers. Did anybody do anything to affect the timers?

mlyle commented 8 years ago

No changes to timers. I wouldn't worry about bisecting this one, as it's not likely to be helpful.

The event system changes "cause" this. But since it's been a pre-existing problem (and there's other evidence for that)... there's been a race condition in existing Tau code for some time-- we've just made it more frequent. (This is a good thing, so it can be eliminated).

There's a lot of evidence it's a bug in the telemetry system, or closely-related portions of the uav object manager.

pug398 commented 8 years ago

It would seem to point back to the telemetry issue as I have flown on sbus without any control issues on two different platforms.

mlyle commented 8 years ago

Yes, sparky seems to fly fine-- this is an input wizard thing. Note I can only reproduce this with the brain thing above. People seem to gain and lose reproducibility of this bug even when running the same code and using the same procedure-- I used to be able to trigger it on Quanton but can't anymore. Hopefully in the next couple days I can run it down.

In the longer term, the flight side of telemetry needs to be rewritten.

jhitesma commented 8 years ago

Testing this in IRC with @mlyle tonight, it appears to be related to flight modes. With all flight modes set the same, it doesn't repro. With flight mode options set differently (say Acro/Leveling/Horizon instead of Leveling/Leveling/Leveling), it repros.

I also tried with and without outputs configured and confirmed that didn't make any difference.

mlyle commented 8 years ago

Fixed by #273