FIRST-Tech-Challenge / FtcRobotController

BSD 3-Clause Clear License

An exception in a VisionProcessor's onDrawFrame(...) method causes unstable robot behavior #746

Closed: BladeBot closed this issue 5 months ago

BladeBot commented 9 months ago

This has been a difficult issue to untangle, but I think we have finally figured out what is going on. We have experienced several instances of strange and sometimes unsafe robot behavior since the release of version 8.2, and this issue may explain it. Apologies for the issue length; I tend to just keep writing, but I want all of the details documented in case it helps.

If any uncaught exception occurs in a VisionProcessor's onDrawFrame(...) method (or presumably anything else running on that thread), the robot program will partially crash, but parts of the program will continue to run in an unstable state. The resulting behavior varies wildly, sometimes resulting in unsafe situations such as the robot not obeying the driver station's stop button.

Example Project

A minimal example that reproduces this issue can be found here:

https://github.com/BladeBot/onDrawFrameCrashDemo

This example uses SDK 9.0.1, and was tested on a Rev Control Hub using a Rev Driver Hub. Whether an expansion hub is present doesn't seem to matter.

Two motors labeled "Lift" and "Pivot" are present in this example, but are not needed for the issue to happen. They are for demonstrating whether the opmode is still running and controlling motors when it should not be. Lift follows an open loop sine wave, while pivot continuously rotates.

A webcam labeled "Webcam 1" is required.

10 seconds after the opmode is started, the ColorProcessor vision processor will divide by 0 in onDrawFrame, triggering the issue.

The example also has commented out vision code we were working on as an example of how this issue can be encountered naturally. It is a port of EasyOpenCV code, requiring a visualization to be programmed in onDrawFrame for feedback via the VisionPortal. This resulted in sneaky ConcurrentAccessExceptions that could occur with variable timing, presumably depending on how the threads lined up. It could happen immediately after VisionPortal setup, be absent for an entire meeting, or appear when color calibration values entered a certain range. The logging behavior described below made determining what was happening very difficult. (The ConcurrentAccessExceptions are not the issue here, just an example of a potential common trigger. It's an easy trap if you don't realize the two functions are on separate threads!)
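As a rough illustration of the trap described above, here is a minimal sketch in plain Java of one way to hand results from a processing thread to a drawing thread without both touching the same mutable collection. The class and method names (DetectionSnapshot, publish, read) are made up for this example and are not part of the FTC SDK or EasyOpenCV. The assumption is that the processing callback owns a working list and the drawing callback only ever needs the most recent finished copy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper, not an SDK class: shares results between two threads. */
final class DetectionSnapshot {
    // Holds the most recently published, read-only copy of the results.
    private final AtomicReference<List<String>> latest =
            new AtomicReference<>(Collections.<String>emptyList());

    /** Call from the processing thread once a frame's results are complete. */
    void publish(List<String> working) {
        // Copy first, so later changes to 'working' never reach readers.
        latest.set(Collections.unmodifiableList(new ArrayList<>(working)));
    }

    /** Call from the drawing thread; the returned list is never mutated. */
    List<String> read() {
        return latest.get();
    }
}
```

The drawing side iterates over whatever read() returns, while the processing side keeps mutating its own working list and calls publish() once per frame. The cost is one small copy per frame, which is usually acceptable for a handful of detections.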

I haven't seen every possible outcome described below with this minimal example, so it is possible that other code or dependencies in our robot projects were partly responsible for specific behavior (FTCDashboard in particular might be a culprit, though we never use the opmode controls in it). I think it's just down to the randomness of what happens and not having run this example nearly as much as student robot code. We haven't had any further incidents at all since fixing the ConcurrentAccessExceptions, so I am quite confident that the issues all at least started with this bug occurring.

Behavior

When an exception occurs in onDrawFrame(), viewing the robot's "screen" via an HDMI monitor or something like scrcpy will show that the Robot Controller app has stopped along with a prompt to reopen it. No exception is recorded in the robot's logs, and none is shown on the Driver Station. What exactly happens next varies; rebooting the robot or redeploying code (if possible) seems to affect the results, even if the program is unchanged.

The current opmode may or may not continue running despite the "app" having "quit". If it stops, it may stop such that motors freeze at their last set speed for a second before stopping. If it continues running, more strange and inconsistent behavior occurs.

The most concerning effect we have observed is some sort of desync between the driver station and robot program, such that hitting stop on the driver station shows the opmode has stopped when it in fact has not. If the opmode is an autonomous that was waiting for some sort of sensor input, it may take off unexpectedly! Usually the next button press on the driver station app will crash either it or the robot program.

Trying to restart the robot program such as by redeploying code or using the robot's screen can also have unusual effects. For example, the new app may immediately crash on opening, failing to replace the existing one until reboot. My understanding of the exception (which I don't have in front of me to paste here) was that it fails to get some sort of lock that is still held by the remnants of the previous app. (One hypothetical cause we thought of for the DS/Robot 'desync' is that parts of two robot programs end up running simultaneously?)

Expected Behavior / Solution

Regardless of the inconsistent outcomes, we think the correct solution is to make sure that an exception in onDrawFrame successfully halts the rest of the robot program just like any other exception in team written code. Preferably, it would also report the exception in the robot's logs and display it on the Driver Station.

Windwoes commented 9 months ago

Hey there, author of EOCV here....

Yes, exceptions in the frame handler should E-stop the OpMode, but (and we actually already discovered this a little bit ago), the new SDK broke the ABI and so EOCV is trying to call a function that doesn't exist while handling the crash, which causes a different crash.

One difficulty is that an exception on the OpMode thread by its very occurrence terminates the OpMode, but an exception on the vision thread does not terminate the OpMode thread. EOCV attempts to work around this, but it's the workaround code that's crashing as described above.

Aside from that though, the app crashing should trigger a 2500ms no comms timeout from the Control Hub daughter board....
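To illustrate the point about an exception on one thread not stopping another, here is a small plain-Java sketch. It is not how EOCV or the SDK implements its workaround; CrossThreadStop and its methods are hypothetical names for this example. It uses the standard Thread.setUncaughtExceptionHandler hook to record a worker thread's failure and set a flag that a separate loop checks.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical demo class: propagates a worker-thread crash to another loop. */
final class CrossThreadStop {
    final AtomicBoolean stopRequested = new AtomicBoolean(false);
    final AtomicReference<Throwable> failure = new AtomicReference<>();

    /** Runs 'body' on its own thread and waits for that thread to finish. */
    void runWorkerAndWait(Runnable body) {
        Thread worker = new Thread(body, "vision-worker");
        // Invoked on the worker thread if 'body' throws something uncaught.
        worker.setUncaughtExceptionHandler((thread, error) -> {
            failure.set(error);
            stopRequested.set(true);
        });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
        }
    }

    /** Stand-in for a main control loop; returns how many iterations ran. */
    int runMainLoop(int maxIterations) {
        int iterations = 0;
        while (iterations < maxIterations && !stopRequested.get()) {
            iterations++;
        }
        return iterations;
    }
}
```

Without the handler and the flag, the main loop in this sketch would have no way to learn that the worker died, which is the general shape of the difficulty described above.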

Windwoes commented 9 months ago

Side note: an interim workaround would be to manually guard the contents of your onDrawFrame() (and processFrame() too for that matter) functions with a try/catch for any type of exception.
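A minimal sketch of that kind of guard, written in plain Java so it can be run anywhere. DrawStep and GuardedDraw are hypothetical stand-ins invented for this example, not the SDK's VisionProcessor interface; in real robot code the same try/catch would simply wrap the body of your own onDrawFrame() and processFrame() methods.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical demo class: wraps a drawing step so exceptions cannot escape. */
final class GuardedDraw {
    /** Stand-in for a drawing callback; NOT the SDK's method signature. */
    interface DrawStep {
        void draw(int width, int height) throws Exception;
    }

    private final DrawStep inner;
    // Remembers the most recent failure so other code can report it.
    private final AtomicReference<Exception> lastError = new AtomicReference<>();

    GuardedDraw(DrawStep inner) {
        this.inner = inner;
    }

    /** Runs the wrapped step; any exception is recorded instead of thrown. */
    void draw(int width, int height) {
        try {
            inner.draw(width, height);
        } catch (Exception e) {
            lastError.set(e);
        }
    }

    /** Returns the last recorded exception, or null if none has occurred. */
    Exception lastError() {
        return lastError.get();
    }
}
```

Storing the exception rather than silently dropping it matters here: the code that owns the main loop can check lastError() and surface the problem, for example through telemetry, instead of the failure going unnoticed.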

Windwoes commented 7 months ago

This issue has been patched upstream in EOCV https://github.com/OpenFTC/EasyOpenCV/commit/beacd4939f8b588305e29d28c1d1c183f3ea0d68 and will be fixed in the next release.

Windwoes commented 5 months ago

This will be fixed in the next release of the FTC SDK.

Windwoes commented 5 months ago

Fixed in v9.1