ftctechnh / ftc_app

FTC Android Studio project to create FTC Robot Controller app.

Stuck USB write / Problem communicating with REV Robotics USB Expansion Hub #464

Open aedancullen opened 6 years ago

aedancullen commented 6 years ago

Hi, everyone. We're having another USB connectivity conundrum, this time with the REV Robotics hub.

(By the way, if any of the Qualcomm devs are reading this, we're really happy to see the ABNORMAL_ATTEMPT_REOPEN ShutdownReason and all of the "device health" metrics being added to the codebase after the Modern Robotics USB problems of the past few years. Thank you for all of your hard work, and sorry that we're bringing up another connectivity problem...)

Anyway, we've got a single Expansion Hub connected to four AndyMark DC motors. And version 3.5 of the app.

We can consistently reproduce, after 15-20 minutes of driving our wheelbase on the field, the "Problem communicating" message. It does not say "detached"; it says "problem communicating".

And so we've found this in the logs; here is the relevant section. We have reproduced this problem more than 5 times, and every time the log shows exactly the same messages. To summarize, everything runs normally until "MonitoredUsbDeviceConnection: watchdog: stuck USB write(14 bytes) threadId=337 TID=4312:: serial=DQ15JTWH closing device". It is always 14 bytes, during all of our tests. That's interesting, but since the protocol isn't documented it's hard to dive deeper into what's really failing without reverse-engineering effort... Then the "problem communicating" message is produced. You can see this in the attached log. Shortly afterward come "RobotCore: event loop: device has shutdown abnormally: ABNORMAL" and "EventLoopManager: event loop: detaching device DQ15JTWH".
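
(For anyone unfamiliar with the message: a "write watchdog" generally works roughly like the sketch below. This is a generic illustration only, not the SDK's actual MonitoredUsbDeviceConnection code -- the blocking write runs on a worker thread, and if it hasn't finished by a deadline it is declared "stuck" and the device gets closed.)

```java
// Generic illustration of a "stuck write" watchdog -- NOT ftc_app's actual code.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class WriteWatchdog {
    /** Hypothetical sink standing in for the FTDI/USB layer. */
    public interface UsbWriter {
        void write(byte[] payload);   // may block indefinitely if the bus is wedged
    }

    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Returns true if the write completed in time, false if it was "stuck". */
    public boolean writeWithDeadline(UsbWriter writer, byte[] payload, long deadlineMs) {
        Future<?> pending = worker.submit(() -> writer.write(payload));
        try {
            pending.get(deadlineMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException stuck) {
            pending.cancel(true);     // give up on the write; the caller closes the device
            return false;
        } catch (Exception other) {
            return false;             // interrupted, or the write itself threw
        }
    }
}
```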

I obviously haven't read the entire Qualcomm source, but the origin of the message (for Lynx/REV Expansion devices; Modern Robotics isn't relevant here) is line 755 of LynxUsbDeviceImpl.java (in com.qualcomm.hardware). And, of course, "exception thrown in LynxUsbDevice.transmit" indeed appears in the log.

So, there is a transmission problem, and, to quote LynxUsbDeviceImpl.java:

// For now, at least, we're brutal: we don't quarter ANY usb transmission errors before giving up and shutting things down. In the wake of future experience, it might later be reasonable to reconsider this decision.

So this is a transmission error, and of course when it occurs the software instantly gives up. We can't really investigate any further as to why there's a transmission error and what the Lynx (REV Expansion) module is actually doing, because its firmware is closed-source and it obviously doesn't have a log.

(I know FIRST is committed to this control system, but from reading the code I think it's pretty clear that using a high-level programming language on a non-realtime operating system and throwing data back and forth over a relatively slow and unreliable bus (UART is really slow, and prone to errors!) might not have been the best architecture choice for a robotics control system....)

Apart from that, though, there is certainly something causing the communication to be interrupted on the hardware level. Whether that is ESD, or whatever, I don't know, and nobody ever will. (We have a bunch of 12V power lines close to the REV module -- possibly creating a magnetic field and inducing current in a PCB trace? We have yet to test it with the wires moved.) But what it comes down to is that ftc_app doesn't even try to recover in the case that the disturbance caused a transmission error.

Has anyone else had "problem communicating" with the Lynx (REV Robotics Expansion) module? Developers, do you think it might be good to reconsider whether or not we should try to recover from USB transmission problems?

magneticflux- commented 6 years ago

@aedancullen If you want to test implementing error recovery yourself, you can fork the Modular FTC project's sample and use Gradle's dependency substitution to replace all hardware dependencies with a modified version. You'd just need to extract the sources of the hardware jar into a new Gradle project (like the other modularized dependencies, which you could use as a template) and include it as a Composite Build.

Disclosure: I am affiliated with the Modular FTC project.

aedancullen commented 6 years ago

Thanks @magneticflux-, I'll take a look.

aedancullen commented 6 years ago

By the way, this Expansion Hub is running the 1.6 firmware. Soon we will try it with the new 1.7. Maybe this issue is related to the I2C lockup bug (we are using the internal IMU...)

slylockfox commented 6 years ago

@aedancullen I worked with a team last weekend that experienced a disconnect from the REV hub about 2 seconds into every match, and this "14 bytes" message appears in their logs. Any further ideas about what causes this?

aedancullen commented 6 years ago

We didn't ever find any connection between the REV firmware version and this problem, or any other type of robot hardware configuration. (Which you wouldn't expect anyway, because it's caused by ftc_app decidedly not attempting to recover from communication problems.) The comments in the source really summarize the issue well - the root cause is USB weirdness (whether that's due to ESD, mechanical instability of the connector / lack of strain relief, etc.) and there isn't recovery code in some places where there could be recovery code...

I assume that the 14 bytes might be a sort of header for messages sent on the UART serial protocol that runs over the USB, and thus the 14-byte error has ended up being an indicator of a general communication fault (a message of any type was not communicated successfully.) I may be completely wrong about that though - I don't have any protocol documentation and have not reverse-engineered the protocol or even read all of the ftc_app source.

So for now, we will have to just keep our firmware and ftc_app up-to-date, try our best with USB strain relief and whatnot, and hope that sometime in the future a version of ftc_app will be released that has recovery routines in every possible place that they can be implemented... And thank you to the developers again for the many updates and improvements to ftc_app already in the past years.

I mean, it occurs to me that maybe the developers should really just have a handler on the mainloop of the app that, if any error that currently causes an emergency-stop (complete robot shutdown) arises, simply ignores it, possibly refreshes USB-layer stuff, and continues trying to run the opmode, hoping that the comm layer rediscovers the module eventually? This should be simple.... as I've said before, under no circumstances should ftc_app be shutting down teams' robots and giving up when there's a communication problem, because these oddities are always temporary...
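
As a rough sketch of the kind of recovery I mean -- purely hypothetical, and the Link interface below is just a stand-in, not an existing ftc_app class:

```java
// Hypothetical recovery wrapper -- not existing ftc_app code. Instead of escalating
// one failed transmit into a full robot shutdown, retry a few times, optionally
// refreshing the USB layer between attempts.
import java.io.IOException;

public final class RetryingTransmitter {
    /** Stand-in for the low-level USB/Lynx transmit path (hypothetical interface). */
    public interface Link {
        void transmit(byte[] message) throws IOException;
        void reset() throws IOException;   // e.g. close and reopen the connection
    }

    private final Link link;
    private final int maxAttempts;

    public RetryingTransmitter(Link link, int maxAttempts) {
        this.link = link;
        this.maxAttempts = maxAttempts;
    }

    /** Returns true if the message eventually went through, false after maxAttempts failures. */
    public boolean transmit(byte[] message) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                link.transmit(message);
                return true;
            } catch (IOException e) {
                try {
                    link.reset();          // refresh the USB layer before the next attempt
                } catch (IOException ignored) {
                    // reset also failed; retry anyway and let the loop decide
                }
            }
        }
        return false;                      // only now would the app consider shutting down
    }
}
```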

aedancullen commented 6 years ago

We did mount the REV expansion hub with the USB connector facing straight up on our robot in order to try to avoid stress and connector loosening (whether this actually helps or not we don't know for sure, but the robot has been pretty reliable so far).

gearsincorg commented 6 years ago

Question:

Is there any possibility that you have one of your expansion hubs set with address = 1? I looked at the log, but there wasn't enough in there for me to tell....

aedancullen commented 6 years ago

We do, yes. The one connected to the phone is 1, and the other over RS485 is 2.

gearsincorg commented 6 years ago

We are starting to see a bit of a pattern developing on systems that are reporting this "stuck USB write(14 bytes)" issue. Most systems never see this problem, but a few see it very often. So we're looking for the trigger.

One thing that we are noticing is that systems that exhibit the problem seem to have a first Hub Address of 1. This might seem perfectly normal, except that the current firmware may have different behavior with address 1, since it was originally allocated for use with the new "Control Hub" version of the expansion hubs.

I don't know the specifics, except that address 1 hubs are treated differently. (Good to know, right?) This is why hubs usually ship with address = 2. Not sure why there isn't a big "Do not set to address 1" warning, but maybe it shouldn't matter.

Anyway, is it possible for you to change the address 1 hub to address 3 and "fix" the robot configuration to match? We'd be eager to see if this makes a difference.

Phil.

aedancullen commented 6 years ago

Oh... very interesting! Wow; I will try that.

Do you know if the internal Control Hub microcontroller uses USB to connect to the Snapdragon SoC, or does it use some other bus? Maybe address 1 was only designed for something else. Was it designed for native UART use, maybe? (the internal "hub" microcontroller talks directly to the Snapdragon over a hardware UART rather than over USB?) It would make sense that the designers of the Control Hub might consider that more reliable. Somewhere in the firmware there could be something obscure like "assume problem X does not require recovery if we're on address 1 because there is no USB/FTDI layer"

gearsincorg commented 6 years ago

I don't even know enough to speculate, so let's not put the cart before the horse.

First order of business is to see if it makes a difference. One would expect the problem to go away if it's a real trigger condition. To be conclusive I'd want to be able to induce and/or inhibit the bug based on the use of address 1.

cmcknight commented 6 years ago

Just a thought, but would it be worthwhile to ask Rev Robotics? I asked them whether they were going to make the source to the firmware available a couple of months ago and they said “not at this time,” but they should be able to answer questions about how it was implemented in this case. Also, I saw there was a new firmware update a few weeks ago so that may play into things as well.

NoahAndrews commented 6 years ago

The team I mentor that has experienced this issue a few times has their only REV hub on the robot set to address 2 (according to the config file on their GitHub anyway). I'll verify that it's still address 2 on Friday.

It is true that the Control Hub does not connect to its internal Expansion Hub board via USB. In software, the connection is accessed like an old serial port would be (/dev/ttyHS4).

gearsincorg commented 6 years ago

Noah, make sure we are talking about the same symptoms (stuck 14-byte write)... A log file would be useful.

NoahAndrews commented 6 years ago

Yep, we are. I was also able to verify from the logs that the hub is still on address 2.

These logs can also be found on the FTA forum. RKRUnknownUSBDisconnect1.txt RKRUnknownUSBDisconnect2.txt

aedancullen commented 6 years ago

In @NoahAndrews' logs, there are stuck writes other than 14 bytes. We have not been able to reproduce any other types of stuck writes with the hub on address 1; I don't know if this has any significance. It does appear from those "RKR..." logs, though, that the stuck write problem is not restricted to address 1, and it probably can occur anywhere (even if there's only one REV hub attached...)

It would be interesting to be able to ask the REV firmware developers and the ftc_app low-level developers (of com.qualcomm.robotcore) questions about things like this. I'm interested in why they have chosen to implement separate commands for every individual action ("LynxSetServoPulseWidthCommand", "LynxSetServoEnableCommand", etc.) I'd think it might be more reliable to use a single "frame" which is sent at regular intervals and bundles all control data into one package rather than individual commands and responses.... and if there's a protocol error, you just drop a "frame" and carry on. (And, that's how other low-latency, high-bandwidth applications like video streaming are implemented, of course.) Then there's no waiting for anything on the host side, because "upstream" frames can just come at regular intervals as well. With a protocol designed with all of these individual commands and responses, the bus is unnecessarily taxed by carrying individual responses for every single thing your robot does, and if one command/response fails, like we see here, the whole system can fall apart. But I'm sure there's a reason for why it was done this way...
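
To illustrate the "frame" idea -- a hypothetical sketch only, not the actual Lynx protocol, and the motor/servo counts are made up:

```java
// Hypothetical sketch of the periodic-frame idea -- NOT the actual Lynx protocol.
// All outgoing control data is bundled into one fixed-layout frame sent at a regular
// interval; a corrupted frame is simply dropped and superseded by the next one.
import java.nio.ByteBuffer;

public final class ControlFrame {
    public final double[] motorPowers = new double[4];      // assumed motor count
    public final double[] servoPositions = new double[6];   // assumed servo count

    /** Serializes the frame into a fixed-size packet with a sequence number. */
    public byte[] encode(int sequenceNumber) {
        // 4 bytes for the sequence number, 8 bytes per double value
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 * (motorPowers.length + servoPositions.length));
        buf.putInt(sequenceNumber);          // lets the receiver notice dropped frames
        for (double power : motorPowers) buf.putDouble(power);
        for (double position : servoPositions) buf.putDouble(position);
        return buf.array();
    }
}
```

The sender would just transmit encoded frames at a fixed rate; a lost or corrupted frame is never retried, because the next frame carries the complete current state anyway.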

When I have a chance I'll experiment with addresses other than 1 and see if they produce stuck writes other than 14 bytes, like in NoahAndrews' logs.

slylockfox commented 6 years ago

@aedancullen any conclusion on avoiding address 1? I remember hearing a theory that, because the FIRST Global REV "control hub" had default address 1, there might be unexpected behavior using 1 for an expansion hub. A believable theory.

aedancullen commented 6 years ago

We competed all day yesterday in a qualifier with no connection issues at all using addresses 2 and 3. Last time we participated in a tournament, with an address set to 1, there were a few communication problems (stuck write of 14 bytes). We didn't make any significant changes to the electronics setup since then (we had the same positioning of the Expansion Hubs on the robot, same motors and sensors for the most part, same USB cable, etc...) I don't know of any better procedures for experimentally testing the reliability other than just running the robot for long periods of time.

It's probably a good idea to avoid address 1 for now until the REV firmware developers can say for sure whether or not there's different behavior when hardware UART is used (as in the Control Hub). That should be as simple as searching the firmware codebase for any conditionals that compare the address setting to 1. We don't need to know what behavior exactly is different, just whether or not there is different behavior.

I don't know who to contact to find out for sure... If this really is the case then the option for address 1 should be removed from ftc_app's address setting menu.

cmcknight commented 6 years ago

I just dropped a note to Rev Robotics technical support (support@revrobotics.com) asking about the UART Address 1 issue. I’d invite others to contact them as well because a large number of requests about the same issue might get it resolved more quickly.

robogreg commented 6 years ago

We have seen this thread and are looking into it. There is some back-end work that needs to be done to determine whether this is actually a self-contained bug or a symptom of another issue or circumstance. If it is an issue, we need to look at whether it is something in our firmware or something in the USB stack in the SDK.

We will keep you posted, but for now if you want to avoid address 1 there is no harm in doing so or in recommending that others do so; we just want to be careful about creating tribal knowledge without a firm justification for it.

PS: we see every support email, read the GitHub comments, read all of r/FTC, the FTC forums, Discord, etc. Trust us, if there is an issue people are talking about, we know about it, but we prefer to take our time and favor accuracy of response/action over instant response. There isn't a need to spam our support email lines, as it won't change our methods or the time it takes to determine whether issues exist.

aedancullen commented 6 years ago

Maybe we can back up a little bit. Is address 1 actually intended to be reserved? We should just answer that question for sure first, because doing so doesn't require any debugging work.

It might be helpful to know whether or not it was/is the design intent to reserve address 1. Because if it was, and Address 1 is intended to have a different behavior, then there's probably nothing to debug, the system is (probably) working as designed, and the right solution is to simply have people not misuse Address 1 for a configuration it wasn't designed for. (Of course, reliability issues not related to address usage are a different thing altogether). If it wasn't the design intent to reserve address 1, then there might be a legitimate issue.

So, before we consider Address 1 a legitimate issue, @robogreg, can you tell us whether or not it is actually supposed to be used only as the internal Lynx module address on Control Hubs? It would just be interesting to know for sure.

roybto commented 6 years ago

We had the "stuck USB write" shutdown twice at the Washington State Championship, although once with 14 bytes and once with 18 bytes. We have our phone connected to hub 2, while the "slave" hub has address 1. We have gone all-out with strain relief, and in fact one of the "disconnects" happened when the robot was slowly driving across a clear mat, with nothing around it. By the way, our robot seems to be better behaved when we're back at our practice facility--we can bounce it around really hard there without any disconnects, but at two tournaments now we've seen disconnects even without collisions. Come to think of it, the problem only started occurring after we added a second hub, which has address 1. Prior to that, no disconnects at all, and that prior experience was with similar or lower-quality strain relief. So we're seeing a problem that appeared only when a hub with address 1 was added (though in the "slave" position).

slylockfox commented 6 years ago

It was confirmed on the FTA call this week that REV recommends not using address 1 for an expansion hub. Address 1 is effectively reserved for the Control Hub, if and when it becomes legal for FTC.

aedancullen commented 6 years ago

Maybe REV could decide to use a different firmware binary for Control Hubs: alter the Expansion Hub firmware so that it'll refuse to use address 1, and make the Control Hub firmware always use address 1. That'd be a low-level, robust fix. They are different products, and so each could have a unique firmware which ensures that address configurations are suitable for the hardware (even if the bulk of the code ends up being the same). Or, of course, if REV can detect the hardware platform dynamically within firmware (reading factory-programmed flash, EEPROM, etc.), that'd be another approach if they want to keep a unified firmware binary.

A short-term fix would be to remove the address 1 option from ftc_app as soon as possible. But that is not ideal, because it would still be theoretically possible for someone's hub to be using address 1.

Obviously, communication problems unrelated to this address 1 thing might still be another issue.

So it seems to me that simple things like the address setting shouldn't be that hard to fix in a firmware update... The Expansion Hub firmware probably just needs a condition added that checks whether the address is 1, and sets the address to 2 if that's the case. (And then, of course, whatever code used to run when the address was 1 can just be removed.) I can't imagine that it is much more complicated than that, but if it is, please correct me.
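
In Java-style pseudocode (illustrative only -- the real firmware is closed-source and certainly not written in Java), the check could be as simple as:

```java
// Purely illustrative pseudocode for the proposal above -- NOT actual REV firmware.
// The idea: refuse to run an Expansion Hub on the reserved Control Hub address and
// fall back to the usual default instead.
public final class AddressGuard {
    static final int RESERVED_CONTROL_HUB_ADDRESS = 1;   // hypothetical constant names
    static final int DEFAULT_EXPANSION_HUB_ADDRESS = 2;

    /** Returns a safe module address for an Expansion Hub. */
    public static int sanitize(int configuredAddress) {
        if (configuredAddress == RESERVED_CONTROL_HUB_ADDRESS) {
            return DEFAULT_EXPANSION_HUB_ADDRESS;         // silently remap, as suggested above
        }
        return configuredAddress;
    }
}
```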

roybto commented 6 years ago

It seems like there should be a formal recommendation on this published to the FTC Forum. Phil (assuming you are still following this thread) it seems like this should be done "officially" so it's more noticeable.

gearsincorg commented 6 years ago

Agreed. Requested.

@tomeng70 ?

roybto commented 6 years ago

One more observation we've had is that there wasn't much of a problem with this at our practice facility, which is in a private home in a residential neighborhood. But at two tournaments with about 35 teams competing, we had a lot of trouble. While this is apparently a phone-to-Expansion-Hub problem, is it possible that the problem is more likely when there is a lot of Wi-Fi Direct interference?

gearsincorg commented 6 years ago

When I see this happen, it's typically tied to a specific robot. This points more to a robot-specific hardware/wiring/sensor/configuration/ESD issue. (Competition arenas offer very different ESD characteristics.)

I don't see any cause to go down the "network traffic" road. Phil.

aedancullen commented 6 years ago

@roybto I agree with @gearsincorg; that'd be a separate issue, since the address-1 thing isn't related to Wi-Fi at all. Disconnections at the Wi-Fi layer (or high latencies) are also an entirely different thing from disconnections / problems communicating with the REV hub over USB/UART, and the two shouldn't be related.

Take a look at your logs though - if you see "stuck USB write" even when you aren't using address 1, then that would be a valuable finding.

roybto commented 6 years ago

Our team (FTC 9915, now Washington State Winning Alliance Captain) has changed to addresses 2,3 instead of 1,2. We are really hoping this will solve our disconnection problems. Luckily we survived our State tournament with the help of some extremely competitive alliance partners. The problem is, we haven't seen this issue much in our own facility, but had it frequently at tournaments, which is why we've suggested that there may be a confounding factor related to having many teams participating simultaneously. We will be looking for this issue carefully as we prepare for the West Super Regional. There's still no official announcement on the FTC Forum yet, but there's only one day before this weekend's qualifying tournaments. Can we get an official announcement on the issue, and also have the inspectors at the weekend's qualifying tournaments flag it? It only takes a few minutes to change a hub address, and yet it might save some otherwise very competent teams from losing the opportunity to advance.

aedancullen commented 6 years ago

@roybto can you post logs where the disconnections happen? Are they still happening on addresses 2 and 3? If so, we'd be really interested in seeing the logs, because it could shed more light on that "stuck USB write" message and whether its occurrence is actually limited to the use of address 1 or not.

roybto commented 6 years ago

I have uploaded a logcat excerpt with two stuck USB write errors in it. One is at time 15:37:09.004, the other at 17:16:05.275. These were observed with the master hub at address 2 and the slave at address 1. We are using two Moto G4 phones, SDK 3.5, and both hubs updated to firmware 1.7.2. The first was our fifth and last qualifying match at the Washington State Championship, while the second was the first elimination match we ran. In the first match we died in teleop driving across the mat without any contact. In the second we died as we were mounting the balancing stone at the end of the match, and ended up rolling off the opposite side of the stone due to the disconnect.

(https://github.com/ftctechnh/ftc_app/files/1731808/logcatStuckUSBWrite.txt)

ftctechnh commented 6 years ago

@roybto - I am sorry that your team is having issues with your control system.

You mentioned that the issue started happening when you added a second Expansion Hub to your robot. I have personally seen two instances here in NH where the Robot Controller had issues communicating with a downstream/daisy-chained device.

After careful testing, in both cases it appeared that the problem was caused by a faulty XT30 power cable connection that was used to power the second Expansion Hub. In both instances (with two different robots), if we pushed slightly down on the XT30 cable that was feeding into the second Expansion Hub so that the red and black wires that fed into the XT30 connector bowed slightly (i.e., pushed the wires slightly down and away from each other with our finger), the second Expansion Hub would lose power and cause comm errors with the Robot Controller.

The problem was difficult to isolate, because it was intermittent / difficult to replicate. Note that after we demonstrated the problem several times in a row, we disconnected the power cable, then reconnected it, and weren't able to reproduce the problem. I don't know if the issue was with the cable and/or the connectors.

For your robot, have you checked the XT30 and RS485 cables that are used to daisy-chain the second Hub to the first, and made sure that you're not potentially experiencing intermittent disconnects? Replacing the suspect cable, and securing and strain relieving the new XT30 extension cable, seemed to help.

Note that I do not know if this is the same issue that your team is experiencing, but it might be worthwhile to inspect your connections and see if they appear to be loose (can you get the power to cut by pressing or jiggling the wires or connectors?).

robogreg commented 6 years ago

Everyone, I also wanted to let you know that we have been working with the FTC dev team looking at this. There is nothing in the Expansion Hub firmware that requires address 1 not be used. There are some provisions in the FTC SDK for the Control Hub that assume all Control Hubs are on address 1, as they don't communicate via USB. Testing has been done, and we have been unable to replicate the issue of an Expansion Hub having problems on address 1 with phones connected via USB.

We generally feel that there is likely another cause of the issues (like Tom mentioned above about wires & connectors), and it is likely a coincidence that changing addresses (which requires you to connect individually to each hub, thereby adjusting your physical connections to the hubs) brought relief from some of your issues.

While we don't see a reason that address 1 should cause any issues whatsoever, there are also no issues with your hubs being on any other pair of addresses; if changing addresses makes your team more comfortable, you are welcome to do so.

aedancullen commented 6 years ago

Ok, that's really good to know. So there is no "Address 1 Bug".

But @ftctechnh and @robogreg, what do you think about stuck USB writes? Power issues would indeed cause disconnections, but when we opened this issue we were focusing on the "stuck USB write" message, and we still haven't figured out its cause. (It's the source of @roybto's issues, I've been able to reproduce it, and @NoahAndrews has seen it with hubs that aren't set to address 1). Are "stuck USB write" messages really produced because of an RS485 comm problem? The message specifically says "USB".

If "stuck USB write" cannot be caused by RS485 communication problems, then issues caused by stuck writes can't possibly have anything to do with RS485 or the second hub losing power. I could be wrong, but it didn't seem likely that the error message would be totally misnamed...

In the log that roybto shared, all of the "stuck USB write" problems result in the disconnection of hub DQ15TECQ. roybto, which hub is that? Is it the one directly over USB, or the daisy-chained one?

This stuff is really interesting, and there are still a lot of things to figure out. Thanks robogreg for clearing up the Address 1 thing :)

roybto commented 6 years ago

We are aware that the XT30 connector wires supplied by REV are inadequately strain relieved, but that doesn't appear to be our problem. The RS485 cable looks OK, but we will change it. We've tried to reproduce the problem by doing very aggressive driving, bumping, and collisions, but haven't seen that cause the issue. The first disconnect in our posted log happened driving across the mat without even touching a glyph. Because this issue has been with us for a couple of months, we've gone all out to strain relieve our USB connections. We are considering replacing our phone and REV hub just in case there's a weakness in their USB connectors. Thank you all for your help with this issue. We've had the bad luck of suffering this problem, but had some good luck and great alliance partners. So we are advancing to Supers nevertheless. But we REALLY want to resolve the issue before we get there! (We are team 9915.)

ftctechnh commented 6 years ago

roybto - please keep us in the loop on how things go, and if the problem resurfaces, please grab the log files of the Robot Controller (and the Driver Station if it's not a lot of extra work) and note the date and approximate time of the incident.

Unfortunately, this issue has been difficult for us to reproduce, so it's hard to definitively determine the cause of the USB write failures.

Congrats on advancing! Best of luck for a FUN season!

robogreg commented 6 years ago

@roybto If you want to swap hubs if you think your USB is loose, please send an email to support@revrobotics.com and reference this post and we will work with you to get yours back and send you replacements. This will also give us the opportunity to open yours and see if anything else is going on that might cause it.

Just for curiosity's sake, have you tried swapping which hub the phone is connected to? If so, did you see the error persist across this change?

roybto commented 6 years ago

Greg, the most frustrating thing about these failures is they have tended to happen when we are at a tournament (first at Interleague, then at State). Assuming they were related to mechanically induced disconnects between the phone and the hub, we've done a lot of connector wiggling and jiggling, and also driven our robot very roughly, but have not been able to induce the disconnects. But then, driving across the mat without contacting another robot or field elements, we see disconnects (but only when it's a tournament, not in practice on our own field). And the equipment we're using is identical: same phones, batteries, cables, and so on. That's why I was conjecturing that the problem might have something to do with WiFi congestion, because that's one characteristic that's different at tournaments than on our practice field. We've even built a robot with AndyMark Stealth Wheels, because we know they tend to promote electrostatic charge due to the materials being triboelectrically different from the FTC mats. Running that "charged" robot into our competition robot doesn't induce these failures, either.

We have four of your hubs, so we would just swap in one or two new ones and use the "suspect" ones for development and test bots. We've held off buying more, figuring we will be buying four Control Hubs as soon as they are FTC legal. On the other hand, if you think it would help to get them back to REV for testing, we could do that. The mini-USB on our hub doesn't feel particularly loose, and we've gone through a few different cables to the phone and seen the problem persist. I know the Control Hub is a big part of the solution to phone disconnect problems. (I wish the connection to the hub were an ordinary USB connector; they seem to be much more reliable than mini-USB connectors.)

If we swap our hub for another one we have and the problem goes away, then I expect we will ask to return the "suspect" one to you. We are going to have a week hiatus and then continue with daily meetings up to the West SuperRegional.

Thanks so much for working hard on these problems. I know how frustrating it can be to try to solve problems that may not even be real. Yours truly, Roy Mead FTC 9915 Coach

aedancullen commented 6 years ago

I wonder if the Wi-Fi Direct "robocol" code is in the same mainloop as the USB stuff within ftc_app? If ftc_app isn't using separate threads, then it might be possible (if a bit unusual) that the time spent on one communication could affect the reliability of the other. That's the only thing I can think of that might connect high ping times (which you'd see in congested Wi-Fi areas, where there are more retransmitted packets that take longer to arrive) to USB problems.

I can try to look further into what causes the "stuck USB write" in the ftc_app code... if it does end up being related to Wi-Fi, then that'd easily explain its inconsistency (for some environments and robots it occurs often, for others it doesn't).

roybto commented 6 years ago

If anyone can recommend additional logging that we can do to help isolate the problem, please tell us and we will add it to our code. For example, is there a way to log the 12V supply voltage for each hub? And is there a set of commands that would do so specifically when an error condition is encountered (like a stuck USB write)? Is there a way to write the ping time (latency) to the Driver Station phone into the log, in case that does have something to do with the issue?
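
For the 12V question, one thing we may try is logging the SDK's VoltageSensor readings each loop -- something like the sketch below (untested; the OpMode name is arbitrary and the sensor names come from the robot configuration):

```java
// Sketch of per-loop supply-voltage logging using the SDK's VoltageSensor interface.
// Untested; adjust names and the logging interval to taste.
import com.qualcomm.robotcore.eventloop.opmode.LinearOpMode;
import com.qualcomm.robotcore.eventloop.opmode.TeleOp;
import com.qualcomm.robotcore.hardware.VoltageSensor;
import com.qualcomm.robotcore.util.RobotLog;

@TeleOp(name = "VoltageLoggingTest")
public class VoltageLoggingTest extends LinearOpMode {
    @Override
    public void runOpMode() throws InterruptedException {
        waitForStart();
        while (opModeIsActive()) {
            // Each hub typically shows up here as a VoltageSensor reading its main power input.
            for (VoltageSensor sensor : hardwareMap.voltageSensor) {
                RobotLog.i(String.format("12V supply %s = %.2f V",
                        sensor.getDeviceName(), sensor.getVoltage()));
            }
            sleep(500);   // log twice per second
        }
    }
}
```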

ftctechnh commented 6 years ago

Hi Folks,

FIRST and the FTC Technology Team have been testing this issue.

The information that I am about to provide is preliminary, and I would be careful not to jump to any immediate conclusions. However, in our testing, we have been able to reproduce this error by electrostatically shocking the robot/Expansion Hub. This is one way that we have been able to reproduce the error, but we don't know if it is the only way.

I would also like to add that in my own personal testing of the Expansion Hub, it has been difficult to reproduce this issue, even when I very aggressively shocked the Expansion Hub in a manner that would not commonly occur in a typical match. In my own tests (long-term and recent short-term), although the Expansion Hub is vulnerable to some ESD-related disruption (as all electronics are), its overall ESD resiliency is very good.

In my testing, I charged the frame of the robot to a voltage > 25kV and then discharged the excess charge through very sensitive areas of the Expansion Hub (not through the frame or motors/servos/sensors, but actually on the exposed pins/connectors of the Expansion Hub and USB cables). This is a fairly aggressive way to shock the Expansion Hub and it did very well. I had difficulty even forcing the error a few times in this manner.

If your team is encountering these "stuck USB write" issues and you've been careful to verify that the wiring is secure (and not shaking or being jolted loose), it might be beneficial to take some steps to try and mitigate the effects of an ESD event on your robot.

There are some simple steps that teams can implement that help mitigate the effects of an ESD event:

  1. For example, mounting your control system electronics on a piece of dry plywood (or some other material that has a high dielectric coefficient like PVC type A) does appear to help.
    • FIRST did a series of tests in the past that compared the effects of ESD events for a system mounted on a thin (1/8") piece of plywood vs the same system mounted on thin polycarbonate or directly to the frame of the robot.
    • We used a Van de Graaff generator to charge the frames of the robot to a voltage > 25kV (which was the limit of our electrostatic voltmeter).
    • Using the plywood dramatically reduced the frequency and severity of the disruptions.
  2. Similarly, if possible, keeping your wires/cables off of the frame of the robot seems to help as well. Rather than secure the wires directly to the frame of the robot, if you can use some non-conductive material with a high dielectric coefficient (such as dry wood or PVC type A) to keep the wires off of the surface of the robot frame, it seems to help reduce the frequency and severity of ESD-related events.
  3. Also, it seems simple, but one strategy employed by some teams to mitigate ESD effects is to clad the parts of their robot that are prone to touching other metallic objects (like the field perimeter, or the edge of the balance stone) with an insulating material (like electrical tape). Electrically insulating these parts when possible also seems to noticeably reduce the frequency and severity of ESD-related events.

We will continue to investigate the cause of this issue. However, if you have explored the other options (verifying the wiring is secure, making sure power or serial data are not being cut to the device) but are still having these problems then it might be worthwhile to see if taking steps to mitigate ESD effects helps.

Below are photos from a team robot that ran through this season's NH events (and scrimmages) without any evidence of ESD-related issues. The electronics were simply mounted to a thin sheet of plywood. In spite of the dry interior climate here in NH, this bot did not seem to experience any ESD-induced failures during its matches or in practice.

[Photos: control electronics mounted on a thin plywood sheet]

roybto commented 6 years ago

First, we really want to thank everyone who has been working this issue, as it's been very frustrating for our team and probably several others.

We have an observation that suggests the ESD explanation for the "stuck write" problem may account for our (Team 9915's) shutdowns. We've noticed that sometimes when we touch the claw we use to pick up relics, the servos driving the claw, and sometimes the other servos on the robot, will twitch. Our robot utilizes the Servocity/Actobotics "Cascading x-rail slide kit" to extend to grab the relic, and a coiled cord to connect to the servos. In examining the cascaded slide, we've realized that due to the construction of the slide, each of the rails is electrically isolated from the others, and, of course, the claw holding the servos is electrically isolated from the frame of the robot. (The interstage wheels and sliders are all made of insulating material.) So one now has a substantial chunk of metal disconnected from the frame of the robot, at the end of about 8 feet of 22 AWG servo wire. It seems like a bad design if you want to avoid sending ESD discharges into the electronic control system. We actually connect the relic claw servos, and most other servos, through a REV Servo Power Module.

On the other hand, the shutdowns we have seen have usually NOT been when we have been manipulating this x-rail relic arm system. But since it's obviously not a good design to have big chunks of electrically isolated metal on the robot, we will make some changes. We will be meeting again on Sunday March 25, and can test to see if electrically connecting the stages of the slide kit to one another and to the robot frame prevents the "twitching" we are seeing. We will post here as soon as we have any results. Since quite a few teams are using the Actobotics kit, there may be many teams who will want to modify their systems.

AlecHub commented 6 years ago

The Expansion Hub has been out of stock several times during the season. Thus, I presume there have been several production runs of the Expansion Hub over the course of the season.

The quality of the hubs (ESD resiliency) may vary between production runs. I suggest that if teams suspect that they have problematic hubs, FIRST or REV retrieve the problematic hubs for analysis.

How does the [Control Hub <-RS485-> Expansion Hub] configuration compare with the [Phone <-USB-> Expansion Hub] configuration under the same ESD stress tests?

aedancullen commented 6 years ago

If you go down to the internal hardware level and consider that a Control Hub is really just an Expansion Hub (Lynx Module) with a Snapdragon system-on-module included, then the Control Hub configuration is really this: [ftc_app <-UART-> Lynx Module 1 <-RS485-> Lynx Module 2]

while the equivalent Expansion Hub configuration is [ftc_app <-USB CDC/ACM-> FTDI Bridge <-UART-> Lynx Module 1 <-RS485-> Lynx Module 2]

So you'd expect that the thing that might make the Expansion Hub setup less reliable would be the FTDI and USB parts of the control stack (ESD interrupting the USB communication, or ESD affecting the FTDI chip). But if ESD is locking up the Lynx Module microcontrollers, then all of the communication layers don't even matter I suppose.

When we experienced the "stuck USB write" problem, we were never able to get ftc_app to successfully redetect the Expansion Hub without power-cycling the hub entirely. It almost seemed as if the Hub had entirely locked up, causing ftc_app to lose USB communication. This would be consistent with the ESD findings, if ESD can indeed lock up the microcontroller completely. For anyone else who has seen the "stuck USB write" problem, is it true that the Expansion Hubs will never recover until they're completely power-cycled? (A simple "restart robot" won't work, for example.)

robogreg commented 6 years ago

A note to everyone about ESD: there is no such thing as a completely ESD-proof device. It just depends on the amount of charge and the location of the discharge. The hub is highly protected from ESD, but making a device that withstands discharges without being damaged is much easier than making a device which operates undisrupted through those events. There is ongoing work that will be done by all parties involved in the FTC technology stack to see what can be done to mitigate the impacts of an ESD event. Good practices in regard to wiring to mitigate ESD events have always aligned with general good robot-building practices, so we continue to encourage them, and we will try to write a resource guide for teams on this topic this summer.

@AlecHub There have been numerous production runs for Expansion Hubs. Through all of the runs there have been no changes or revisions to either the PCB or the testing/quality-control process. While there is always the chance of manufacturing defects, the impact of production batch on performance is negligible and can basically be ignored when evaluating hubs.

aedancullen commented 6 years ago

@robogreg does the microcontroller used in the Lynx Module have watchdog timer capabilities? Many microcontrollers have them built in nowadays. They can sometimes be helpful in reliability-critical applications as a software failsafe in cases where the hardware protection mechanisms aren't perfect. (The ESD event occurs, the microcontroller software hangs/crashes/fails, and the watchdog resets the system.) That sort of quick recovery does require that the firmware and communication stack can quickly reinitialize on a firmware "cold start", but if that's taken care of, a watchdog might almost guarantee that the system never becomes entirely disabled due to ESD, because it would just restart itself.

robogreg commented 6 years ago

@aedancullen Yes it does.

ftctechnh commented 6 years ago

Folks - I was talking to an FTA today about the ESD issues that can occur and he reminded me about the use of ferrite chokes to help further reduce the risk of an ESD disruption. If it's possible, it's a good idea to use USB cables that have ferrite chokes installed on both ends of the cables. The chokes act like low pass filters and filter out high frequency noise that can be induced by an ESD event.

You can buy snap on chokes and install them on your cables (one near the port on the phone and another near the port on the Expansion Hub or Modern Robotics module).

You can also buy USB cables that have integral ferrite chokes (see attached image below).

[Photo: USB cable with integral ferrite chokes]

The chokes do help filter out some of the noise that might be present on your cables during an ESD event.

Tom

roybto commented 6 years ago

The way ferrite chokes work is to block fast "common mode" signals on the cable running through the choke. Normal phone USB communication, which has current flowing out along a signal line and returning via the "common" wire in the same cable, sees no effect from the ferrite. But if there's suddenly a voltage difference between the hub and the phone, the ferrite will prevent a fast current pulse between them. The ferrite makes sure that any fast current pulses are differential (the same current returning on the common wire as went out on the signal wire). So the ferrites don't slow down any fast USB communication pulses, but they at least force any currents in the cable to be balanced.
We're planning on adding a ferrite to the servo cable running out to the end of our relic arm as well. That way discharges at the end of the arm are less likely to propagate into the control system.