DigitalChannel performance is disappointing and points to possible low-level communication oddities

aedancullen commented 7 years ago

Recently, @evolutionftc tested the digital I/O capabilities of the Modern Robotics Core Device Interface when connected to a ZTE Speed phone using ftc_app.

We wanted to evaluate the usefulness of the Device Interface in a situation where a non-I2C-compatible digital peripheral would need to be provided with an input signal quickly (generally, faster than 5ms). This is not a situation as simple as 'read the state of the button, write the state of the LED'; we intend to connect a digital sensor that requires data to be transmitted and received serially at a fast rate. This should not be a difficult requirement for any robotics control system to fulfill, as such non-I2C hardware that requires arbitrary digital communication is quite common. 5ms is actually quite a long time in the context of digital logic, and it would seem to me than any robotics control system should be able to achieve digital write and read operations even faster than this without difficulty.

I have observed that upon a call to DigitalChannel.setState() on a digital output, the Boolean value passed does not appear on another DigitalChannel directly connected to the first as an input until ~20-60ms later. This is really slow in the context of a robotics control system. Even an ATtiny microcontroller, for example, can easily beat this performance by many orders of magnitude.

I know that Android-based architecture of the FTC control system is mostly to blame, and that this might never change. Comparison of such a complex, non-realtime system with a microcontroller is also not necessarily reasonable. However, it would seem to me that with a 480Mbps USB bus, and a real-time microcontroller present within the actual Modern Robotics module, a response time faster than 20ms should be easily achievable.

The hardware configuration that has been used to test this behavior consists of the digital signal pins (the farthest left of the three in each set) on channels 0 and 1 connected directly together. The robot configuration contained no other hardware; channel 0 was named "OUT" and channel 1 "IN". The Core Device Interface was connected directly to the phone's USB bus, without the Power Distribution Module acting as a USB hub.

The following code snippet was used within a LinearOpMode. Assume that DigitalChannel variables out and in have been declared.

out = hardwareMap.digitalChannel.get("OUT");
out.setMode(DigitalChannelController.Mode.OUTPUT);

in = hardwareMap.digitalChannel.get("IN");
in.setMode(DigitalChannelController.Mode.INPUT);

// Guarantee that the output pin state is known before the test begins.
out.setState(false);
sleep(1000);

// Begin timing.
long start = System.nanoTime();

// Set pin state.
out.setState(true);

// Wait for the response.
while (in.getState() != true) {}

// Stop timing.
long elapsed = System.nanoTime() - start;

At this point, elapsed can be written to the system log or to the telemetry output (note that its value is in nanoseconds). Of course, this configuration times both the digital output operation and the digital input operation, and we know that either or both could be the cause of large elapsed times.

Our question to @tomeng70 and @ftctechnh is: Is there any way that digital channel performance can be improved within ftc_app, or is it a limitation of the Modern Robotics firmware or the Modern Robotics device protocol? We also wonder if the new REV Robotics control device (which looks very nice, by the way) will be any faster.

calebsander commented 7 years ago

One thing you could try is adding an idle to the loop. You are currently running a very tight loop that will reduce the amount of time the SDK has to execute its own code.

People on the forum have said that the main time sink is not the OS, but rather the overhead cost of setting up and tearing down the new USB connection that is created for each transaction.

qwertychouskie commented 7 years ago

@aedancullen What version of the SDK are you using? Version 2.62 (beta branch) contains "Changes to enhance Modern Robotics USB protocol robustness" and version 3 (alpha branch) contains "changes to the USB/FTDI layer" and a whole bunch of other improvements. I would recommend testing version 3 and seeing if it makes a difference.

cmacfarl commented 7 years ago

Version 2.61 had a fix for gyro related performance problems. Version 3.0, on the alpha branch currently, has many improvements at the FTDI driver level. I've done some profiling of i2c transactions and work is ongoing there, but I'd be curious to know whether you see any improvement in digital i/o latency.

aedancullen commented 7 years ago

Thanks @calebsander :) The main USB communication routine for a ModernRoboticsUsbDevice seems to take place in ReadWriteRunnableStandard.java's run() method. The run() method calls into a ReadWriteRunnableUsbHandler instance to do the actual reading and writing. It then calls into any instance implementing RobotUsbDevice for communication. One such class is RobotUsbDeviceFtdi. And, finally, we can see that it maintains an instance of FT_Device. FT_Device is an official implementation from com.ftdi....

An FT_Device reference is an instance variable of RobotUsbDeviceFtdi, and is assigned upon instantiation. A RobotUsbDevice reference (which could be a RobotUsbDeviceFtdi) is in turn assigned to an instance variable of ReadWriteRunnableUsbHandler upon instantiation. Then, ReadWriteRunnableUsbHandler is assigned to an instance variable of ReadWriteRunnableStandard. Finally, ReadWriteRunnableStandard becomes the mechanism for ModernRoboticsUsbDevice's write() and read() methods.

Since ModernRoboticsUsbDevice calls EventLoopManager.registerSyncdDevice() on a ReadWriteRunnable instance when its armDevice() is called, I think that the run() method of ReadWriteRunnableStandard is ultimately called immediately when a ModernRoboticsUsbDevice instance is armed. And the process continues down the many layers - ReadWriteRunnableStandard gets a ReadWriteRunnableUsbHandler, it in turn has an instance of RobotUsbDevice, and finally RobotUsbDeviceFtdi comes into the picture. I think that armDevice() is only called once, not every time some operation on the device has to be done, so it seems that there isn't a new connection (a new FT_Device) made every time... I could be wrong, but at first glance the run() loop seems to discover that there's work to be done using the method writeNeeded().

Good idea about my loop! I'll test adding a Thread.sleep() call to that. I have to investigate this more closely though, and take a look at the event-loop-related classes too.

@qwertychouskie and @cmacfarl - I'm already using the alpha branch :)

qwertychouskie commented 7 years ago

I would also try testing it on a phone better than the ZTE speed, which is stuck on Android 4.4, to see if it makes a difference. Any phone should work for testing, the (very short) list of allowed phones is only applicable to actual competition.

aedancullen commented 7 years ago

@qwertychouskie - Good idea! There are a few other Android phones here that I'll try. CPU usage isn't very high even on the Speed, though... I checked Android Monitor in Android Studio, and it doesn't seem like ftc_app is being starved of CPU or I/O time by other processes.

aedancullen commented 7 years ago

It does appear that ReadWriteRunnableStandard's run() is being called in a separate thread from the user-code thread. Maybe profiling the behavior of this run() method would help. In any case, this code is actually really interesting; RobotUsbDeviceFtdi allows the baud rate to be set. I wonder what baud rate it's currently using; for some reason Android Studio can't find any usages of this method. Additionally, the FTDI latency timer can be set... and this has to do with communication dynamics.

I'm also seeing seemingly arbitrary timeouts in ReadWriteRunnableUsbHandler and RobotUsbDeviceFtdi. Hmm... There's much experimentation to be done here, and this appears to be the location where issues like "Problem communicating with USB device" originate (ReadWriteRunnableStandard catches a ton of exceptions from the lower-level FTDI and USB device code in order to provide these messages). Thanks to everybody for the good suggestions.

qwertychouskie commented 7 years ago

I don't expect that the CPU is the bottleneck, but the cheap+outdated hardware and outdated software. (Android 4.4 was released October 31, 2013, the ZTE speed in December 2014.)

aedancullen commented 7 years ago

Upon further investigation, ModernRoboticsReaderWriter contains comments such as "we made this up" on things such as the maximum sequential USB error count. And, the delay after a communcation error has the same comment (the variable MS_COMM_ERROR_WAIT). So if there's a comm error, (even if there's one rarely) everything stops for 100ms, guaranteed. And, a 40ms wait is guaranteed if there's a 'failure' response from the Modern Robotics device. I don't understand why the device might think "we might be about to try to update it with a new version" (from a comment near that line). How often could this condition be occurring? The comments mention that the 'comm error' 100ms wait shouldn't happen often, but still: these are arbitrarily-chosen delays and it's not documented how often they should be expected to trigger.

It also seems that such arbitrary waits could cause the device to fall out of 'synchronization' - for example, if the 100ms wait does happen, possibly it will cause future problems for the readResponse() method, and cause additional time to be spent (up to MS_RESYNCH_TIMEOUT, which is one second) finding synchronization bytes. Again, I could be totally wrong, as I'm just skimming over the code quickly and don't have any protocol documentation from Modern Robotics; this is speculation. At first glance, though, there seems to be a lot of arbitrarily-chosen timing work being done here that could very easily slow the entire system down.

rgatkinson commented 7 years ago

I strongly suggest that y'all peruse the code in the 3.0 in the Alpha branch rather than the older code you've been splunking through above: much has changed in this area (hopefully for the better).

qwertychouskie commented 7 years ago

@rgatkinson see bottom of https://github.com/ftctechnh/ftc_app/issues/341#issuecomment-298133860

rgatkinson commented 7 years ago

Good; my mistake.

rgatkinson commented 7 years ago

Wait: something's up: there is no FT_Device in the 3.0 Alpha branch (it's FtDevice there); perhaps you might wish to double check your installation...

aedancullen commented 7 years ago

Yeah; sorry, I had forgotten to merge from upstream alpha in my fork of this repository. However, I'm still perplexed by the timing routines in ModernRoboticsReaderWriter.java.

@rgatkinson - Why might the Modern Robotics device think that "we might be about to try to update it with a new version"? This is confusing, and why does it require a 40ms delay? Also, the excerpt from the Modern Robotics spec at the top of ModernRoboticsReaderWriter.java mentions the use of timeouts, but not the use of waits; if the waits have been "made up" in order to be "very generous in trying to recover from failures as we read through possibly-old data trying to synchronize", doesn't it seem that this is the opposite of what should happen? If a communication error of any sort has occurred, it seems that this might be counterproductive...

I suppose my general question is: under what conditions will response.isFailure() return true, and why and when might response.isRead() and/or response.isWrite() be incorrectly set?

rgatkinson commented 7 years ago

Short version: in the normal course of events, none of these timeouts are actually ever hit, so they're not the root of the concern here that started this thread.

Slightly longer version: The 40ms delay has to do w/ avoiding inadvertently causing the MR controller into getting into its firmware update mode. Also: USB is a request response protocol from host to slave: there's nothing to synchronize to until the next request is sent out. Finally: error recovery (where the timeouts are relevant), especially in the face of ESD (which is where it mostll occurs) is, um, fun. You've just zapped the system with 20kV; all bets are off. The hardware gets confused, and at the same time we've got multiple staging points where data is buffered or packetized starting from the slave into its FTDI chip over the bus into Android and up into the SDK stack, all of which need to be drained if the protocol goes south from the point of view of the SDK. The 3.0 code does a much better job of dealing with all this than previous versions, though it's still not completely perfect (I suspect the Android USB chip can sometimes get wedged, and I haven't figured out how to reset that yet).

aedancullen commented 7 years ago

I know that the topic of this thread has drifted a bit. However, I'm really interested in this file, because I know that ESD issues have been the cause of much FTC concern over the last two years. I also know that this file is a low-level communication routine, which is critical to the operation of the robot. ftc_app is nicely architected, and I don't mean to suggest that something is definitely wrong in this file. I only want to dig deeper into the reasoning behind the arbitrary logic of this class, since it does have the capability to bring down a team's entire robot during a match (or, less noticeably, cause the entire system to run more slowly than it should. Even if none of these timeouts are actually never hit, one must wonder why they were added to the code in the first place.

The way I look at it, this file should have three purposes:

To read and write from the FTDI device.
To report problems during read and write operations with the FTDI device to the FTC team so that they can understand why their robot might be experiencing problems.
To recover as gracefully as possible from any communication mishaps, so that the robot's operation overall or the general speed of the system is affected only as a last resort. This is crucial in a competition environment.

The first is easy, and is already done.

To consider points 2 and 3, we have to distinguish between error reporting and error recovery, both of which fall under the category of error handling. Point 2 is error reporting, and point 3 is error recovery. Error reporting is important to FTC teams because they want to know what exactly went wrong with their robot when it does fail. Error recovery is important to FTC teams because during a match, even a partially-functioning, slightly-disabled robot (or, even a robot that stops for a while and then recovers before the end of the match) is better than an entirely-stopped robot. This class has the job of deciding how to deal with communication problems that occur on the field, whether ESD-related or not, and it decides when and if teams should receive the dreaded 'problem communicating with USB device' message. This is why it is such a critical part of the system.

I think we also have to distinguish between a timeout and a delay (or wait). A timeout is a valid error detection mechanism, which allows one device to typically decide that 'something is wrong' after an expected event does not occur after a certain period of time. Error detection is also part of the category of error handling. A timeout, however, is not an error recovery mechanism or an error reporting mechanism. A timeout neither fixes the communication problem that it detects nor analyzes it and reports its cause. A delay (or wait) is also neither an error recovery mechanism nor an error reporting mechanism. A delay cannot possibly detect the problem that has occurred during communication and report it, and in most cases cannot magically correct a problem. The only case in which a delay/wait is a robust and reasonable error-recovery mechanism for a communication system is when it's used in conjunction with exponential backoff. Arbitrary delays are not good, and the comment at the top of ModernRoboticsReaderWriter.java mentions that they were put in place to recover from failures in which possibly-old data must be sifted through. Waiting is the opposite of what should happen if there is this type of problem. Either the comment is outdated or miswritten, or there's a larger issue here.

Ultimately, teams want their robot, and even its low-level communication routines, to be reliable. Reliability is the result of good error detection, good error recovery, and good error reporting. This file already uses timeouts properly, meaning that errors are able to be detected well. However, I would like to point out the lack of both good error recovery and error reporting routines in this file, and still wonder why arbitrarily-chosen wait periods were thought to be a valid mechanism to help the system recover from ESD.

Even with the rest of the robot suffering 20kV, I would still be willing to bet that the Android device is able to execute code. Any opportunity to execute code on a system during a time in which a problem is occurring is an opportunity to enact both error recovery and error reporting routines. Thread.sleep() does neither.

In conclusion, the time spent in Thread.sleep() can be better spent, and currently, points 2 and 3 above are not entirely fulfilled by ftc_app. The presence of comments that say "we made this up" in a critical low-level communication routine that has the potential to stop the entire robot during a match worries me.

Going forward, I'm wondering if there is any documentation on the topic of using delays (not timeouts) available from MR, or authors of this code who might be able to explain the reasoning behind these arbitrarily-chosen delays and why they were implemented. Specifically, I'm interested in the handling of ModernRoboticsResponse's flag accessors (notably isFailure()) by the readOnce() and writeOnce() methods of ModernRoboticsReaderWriter. What kind of testing was used to determine that wait periods were needed? Additionally, can anybody explain why a device might accidentally "think" that it is in firmware update mode?

Again, there could be no problem with any of this code; I'm speculating, and just reading the code and forming hypotheses. However, more information about ftc_app's low-level communication with hardware could help teams everywhere better understand what exactly is happening when their robot stops working.

rgatkinson commented 7 years ago

That's a lot of words.

The timeouts are hit, but only in error situations (like ESD events), not in the course of normal processing.

When an ESD hits, our main task is to survive at all and recover back to an operating state, hopefully within a handful of deciseconds. The robot will necessarily not be at its best while that happens, that's true, but hopefully only for a short interval. The use of delays is at times a useful part of implementing such recovery, as it allows bounded-time events to complete before one takes certain corrective actions. If you want to get further into the details, I suggest diving into the (unfortunately patchy) available information on the operation of FTDI USB transceivers and the operation of the USB layer in Android itself. Some of this is linked to in comments in the SDK source.

"we made this up" comments might better be written as "constant has been empirically / experimentally determined"; those values are most certainly not random, just not part of an externally documented system. It's also worth noting that the threads on which interactions with MR devices takes place are worker threads, not threads on which team's code executes: thus, user code is not held up here, though responsiveness to the one MR device is (ie: the robot is not at its best during this recovery).

Documentation on the MR protocols is something you'd have to look to MR to provide.

aedancullen commented 7 years ago

An ESD event is part of normal operation and normal processing in FTC. It should not be considered an exceptional event at which point extreme measures need to be taken to recover. Thus, it's not really reasonable to call this an error situation, since the case that communication is interrupted is perfectly likely during a match.

I think that the FTC SDK should treat ESD events as normal operation, and keep attempting to reasonably communicate with robot hardware rather than 'give up' and sleep the thread. I'm sure many teams would agree. The problem of ESD-induced robot failures might become less of a problem with good software.

Waiting 100ms or 40ms is an extreme measure, and as far as I can tell this is done for no particular reason. Digital communication is not something that should require trial-and-error. We should be able to enact more intelligent error handling than just suspending the thread for an amount of time. (I have worked with FTDI devices before, and am familiar with their operation).

What particular bounded-time events need to be waited for? The FTDI chips and microcontrollers within the MR devices are capable of operating many orders of magnitude faster than the millisecond level; there's nothing to wait for there. I'm sure we can all agree that 20kV is not present anywhere for more than that long. Even 40ms is an eternity for a microcontroller, and an eternity for USB too. The same goes for a handful of deciseconds. That's a really long time from the point of view of the MR hardware. By the time you've finished waiting that long, by all means, buffers may have overflowed, and the MR firmware may be doing who-knows-what. USB transmission latency is <1ms. I cannot see how these delays would be necessary to wait for anything. What are they waiting for? Does MR say that the Android device must wait? And, more importantly, what corrective actions are taken? The only corrective action that I can see is a simple retry of the communication (up to 5 times in 3.0).

In my opinion, any kind of sleep/wait/delay time, unless part of an exponential backoff routine, is bad practice when included in communication software. You won't find such experimentally-determined constants in any communication system. (Take TCP as an example. It uses exponential backoff). TCP would not work very well if it waited for a predetermined, constant period of time before attempting to communicate again.

rgatkinson commented 7 years ago

I have to disagree, on several fronts (the SDK does not 'give up'; intrinsic antagonism against waits is an incorrect, misdirected concern; ESD events are rare compared to normal communication (100's+ of Hz), are chaotic and disruptive, and necessitate significant, necessarily heuristic measures to put humpty dumpty back together again), but I'm not certain how best to proceed further, here, in explaining that, short of walking line-by-line through the code explaining the history and background of each.

Ultimately, the results are what counts, and the measures implemented recently and now deployed in the 3.0 Alpha have, as concretely, experimentally demonstrated, significantly improved the robustness to ESD events.

aedancullen commented 7 years ago

Thank you for the help @rgatkinson; this has been an interesting discussion :)

I have a fork of this repository at https://github.com/aedancullen/ftc_app, and will do some experimentation on the 'reliability' branch there when I can find time. Until then, though, I suppose we can close this issue since this isn't really an easily-fixable or obvious problem. Thanks again to everybody for the input.

qwertychouskie commented 7 years ago

The DigitalChannel latency still has not been solved, so this should probably stay open.

aedancullen commented 7 years ago

That's a good point @qwertychouskie. However, we should probably direct discussion of reliability/resiliency to a new thread if anyone is able to gather new insight about these odd delays.

cmacfarl commented 7 years ago

@aedancullen I think we have to be careful to not lose sight of the forest for all the trees.

The original discussion on latency is relevant. The FTC Technology Team is aware of what one might call non-optimal latency in certain applications and is actively looking to improve the experience for all teams. To that end, if you have a little extra time, would you mind gathering some profiling data over your digital i/o opmode? See: https://developer.android.com/studio/profile/traceview.html If you can attach profiling data to this thread that is viewable by traceview, that would be useful.

While it is indeed the case that bulk data transfers over USB are quite fast, transaction setup times are quite slow. The recently pushed alpha has some changes that may improve this, but work is ongoing in the performance profiling sphere.

I'd encourage teams to give the alpha branch some soak time. And while I don't want to entirely shut down discussions of resiliency I think it's appropriate to allow enough time to get some hard data on team experiences before picking apart recovery timeouts.

aedancullen commented 7 years ago

@cmacfarl Ok. (Note though that it's the delays that are unusual, not the timeouts, and that these are two extremely different things).

ftctechnh / ftc_app

DigitalChannel performance is disappointing and points to possible low-level communication oddities #341