Closed digistump closed 8 years ago
Relevant files as mentioned above:
Generic Node File Server with HTTPS: server.zip
Attempt at releasing chunks of file more slowly: server_with_delay.zip
Firmwares to load after setting IP/Domain and Fingerprint, and source code of those firmware for reference: oakupdate.zip
I think maybe file_chunk_write should wait for the write callback to respond before starting the timer. That way you won't fill up a send buffer if the connection is slow, which I would guess could cause a large burst of data to be sent on a slow connection. And this is nitpicking, but getFileSizeInBytes is kind of redundant: since you just read the entire file, file.length already holds the file size.
Sounds like the TCP stack in the firmware is broken, i.e. it can't request missing packets. I would hook up Wireshark and look carefully at what is going on. If that can be verified (no re-sends are requested), maybe a solution that creates a local network with a local server (e.g. in VirtualBox or on a mobile device) could minimize packet loss.
One way to capture with Wireshark could be to share "internet" over Wifi from a Mac and run Wireshark on the same machine, to make sure you capture as much of the Wifi traffic as possible.
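To illustrate the "wait for the write to finish, then start the timer" idea, here is a minimal Python sketch. It is not the code from server_with_delay.zip; send_throttled, chunk_size and delay are made-up names, and the writer is assumed to behave like an asyncio StreamWriter. Note that drain() only waits for the application-level buffer to empty, so the kernel socket buffer can still hold data.

```python
import asyncio


async def send_throttled(writer, data: bytes, chunk_size: int = 1024,
                         delay: float = 0.05) -> int:
    """Send data in chunks, pausing only after each chunk has been flushed.

    `writer` is assumed to expose write() and an awaitable drain(), like
    asyncio.StreamWriter. Returns the number of bytes handed to the writer.
    """
    sent = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        writer.write(chunk)
        # Wait for the writer's buffer to drain before starting the delay,
        # so a slow connection cannot pile up queued chunks.
        await writer.drain()
        sent += len(chunk)
        # The pause between chunks starts only once the write has completed.
        await asyncio.sleep(delay)
    return sent
```

The same ordering would apply in Node: start the next timer from inside the write callback rather than on a fixed interval.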
Is there some way to get the source and replicate the build environment for the factory image? I realize that the final solution must work with the image already flashed, but for debug it would be helpful to see the code on the Oak and make changes.
The source is the same as attached in the second comment zip file (oakupdate.zip) - just comment out #DEBUG_SETUP to get exactly what is on a factory Oak
You can set up the environment to compile it by downloading the following and unzipping it into Documents/Arduino/Hardware/
https://www.dropbox.com/s/dgb4qf1cooz3oba/oak_fallback.zip?dl=0 (Note still uploading right now, if you go to this link it will tell you when it is done uploading)
Then select Oak Fallback as the board, Single as the rom config, and hit upload (assuming you've already followed the steps above to set your wifi connection and domain/ssl thumbprint)
On Wed, Mar 2, 2016 at 2:50 PM, jldeon notifications@github.com wrote:
Is there some way to get the source and replicate the build environment for the factory image? I realize that the final solution must work with the image already flashed, but for debug it would be helpful to see the code on the Oak and make changes.
— Reply to this email directly or view it on GitHub https://github.com/digistump/OakCore/issues/54#issuecomment-191478476.
OK, working on getting my environment set up.
Should I expect any response to the "set" commands from the factory firmware?
Yes - just added, thanks for asking: The Oak should respond with {"r":0} after each set of three lines (set, length, content) is sent - if you get {"r":-1} back then something is wrong with the input.
Also baud should be 115200
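For anyone scripting this, here is a rough Python sketch of building the three lines (set, length, content) and checking the reply. The function names are made up, and two things are assumptions rather than documented behaviour: that the length line is the byte count of the JSON string (this matches the 43 and 90 in the examples posted in this thread), and that lines end with a plain newline.

```python
import json


def build_set_command(key: str, value: str) -> str:
    """Return the three-line 'set' block for one setting.

    Assumption: the second line is the byte length of the compact JSON
    payload, inferred from the examples in this thread.
    """
    payload = json.dumps({key: value}, separators=(",", ":"))
    length = len(payload.encode("utf-8"))
    # Assumption: each line is terminated by a single newline character.
    return f"set\n{length}\n{payload}\n"


def reply_is_ok(reply_line: str) -> bool:
    """True for {"r":0}, False for {"r":-1} or anything unparseable."""
    try:
        return json.loads(reply_line).get("r") == 0
    except (ValueError, AttributeError):
        return False
```

Sending the resulting string at 115200 baud with a serial library of your choice, then reading one line back and passing it to reply_is_ok, should be enough to automate the setup.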
@jldeon - just added some more notes on how to confirm these changes and about timeouts for sending it all, see just under the "set" codeblocks in the top post
I've got everything set up and running now, and I'm deep in debug land. I think I've got a solid lead on part of what's going wrong, but not everything yet. Will try to keep you posted, assuming someone else doesn't figure it out first :)
Just started to get things set up to have a play with this, but I'm in the horrible (joke) situation where my Oaks update fine a good 80%+ of the time: 9/10 in a row this morning. (The Digistump server is the one I'm talking about.)
I've even tried enabling WPA2 and forcing wifi channel 1, as was advised against. I've also tried hammering my connection's up/down stream while updating. My only other AP is an Alcatel Pixi 4.5 in AP mode, but the Oak just refuses to connect to that, full stop.
I'm happy to provide remote testing of anyone's WiP solution from a location that is seemingly blessed.
I've uncovered one major problem, and I believe the solution is fairly simple. It seems like having one failure (ie, one bad connection to the update server) causes cascading failures for the Oak.
My fix should alleviate this:
HOST LOOKUP OK
NO CLIENT
COULD NOT CONNECT TO UPDATE SERVER
UPDATE FAILED
endless loop.
It seems like the network stack is doing something silly, which is causing the server to become confused. Basically, TCP connections are established based on the client's IP and port, and the server's IP and port. When most web clients connect to a server, they pick a random port as their client port. The library on the Oak picks the same one every time (4097).
If the transmission of the firmware file goes south, the socket is left open on the server. Most web servers have long timeouts on these sockets, because they expect you to make multiple requests (ie, HTML file, and then a dozen images or JS files or whatever).
Now you reboot your Oak and try to connect again. The problem is, the Oak is sending a SYN using the same source IP and port (more than likely, if you're on a home router using NAT). The server looks at that packet, goes "I have a connection already" and drops it.
Meanwhile, the server is still waiting for acknowledgement on the last chunk of the firmware file it sent.
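Here is a toy Python model of the situation described above, purely to show why a fixed client port hurts. It is not a real TCP stack (real stacks handle a duplicate SYN in more nuanced ways), and the class name and the IP addresses are made up for illustration; only port 4097 comes from the observation above.

```python
class ToyServer:
    """Toy connection table keyed by (client IP, client port, server IP, server port)."""

    def __init__(self, ip: str, port: int) -> None:
        self.ip = ip
        self.port = port
        self.open_connections = set()

    def handle_syn(self, client_ip: str, client_port: int) -> bool:
        """Return True if a new connection is created, False if the SYN is dropped."""
        key = (client_ip, client_port, self.ip, self.port)
        if key in self.open_connections:
            # The server still believes this exact connection is open.
            return False
        self.open_connections.add(key)
        return True

    def time_out(self, client_ip: str, client_port: int) -> None:
        """Forget a stale connection, as a server-side timeout would."""
        self.open_connections.discard((client_ip, client_port, self.ip, self.port))


server = ToyServer("198.51.100.7", 443)
first_attempt = server.handle_syn("203.0.113.5", 4097)   # initial download
after_reboot = server.handle_syn("203.0.113.5", 4097)    # same port again
other_port = server.handle_syn("203.0.113.5", 51234)     # what a random port would do
server.time_out("203.0.113.5", 4097)
after_timeout = server.handle_syn("203.0.113.5", 4097)   # works once the server lets go
```

In this model the retry from port 4097 is refused until the stale entry is removed, which is why shortening the server-side timeouts helps.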
We can't fix the silliness on the Oak side, but we can more aggressively close the sockets on the server side. For Apache, try setting:
TimeOut 3
KeepAlive Off
In the VirtualHost section (or similar). This doesn't 100% fix the problem, but it does make it a lot less likely. KeepAlive Off is pretty sensible for the server, since we're not making bulk requests to it. The TimeOut parameter gives the client 3 seconds to acknowledge a packet if the send buffer is full (so by that point, you're already behind in terms of transmission).
On Node.JS, the timeout parameter of the HTTP server object appears to serve a similar purpose.
Ideally, we'd set something like TCP_USER_TIMEOUT on the socket, but that would require either recompiling an existing server, or rolling our own. I'm not sure that the security and performance impact is worth the trouble, though.
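For a roll-your-own server, setting that option is only a couple of lines. A minimal Python sketch, assuming a Linux host; set_user_timeout is a made-up helper name, and the fallback value 18 is the option number from the Linux headers, used only if the Python build does not expose the constant.

```python
import socket

# Linux-only socket option. Fall back to the numeric value from <linux/tcp.h>
# if this Python build does not define the constant (assumption: Linux host).
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_user_timeout(sock: socket.socket, timeout_ms: int) -> None:
    """Limit how long sent data may stay unacknowledged before the kernel
    gives up on the connection. The value is in milliseconds."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
```

A server would typically call this on each accepted connection, e.g. set_user_timeout(conn, 3000), to mirror the 3 second Apache TimeOut used above.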
Another option is setting /proc/sys/net/ipv4/tcp_retries2 to some lower value. This would affect everything OS-wide, though, so it's kind of the nuclear option. I believe the default at this layer is the value 15, which corresponds to more than 10 minutes.
Before these changes, if I interrupt the download from my local server (pull power while the 123123123 is scrolling), I'll get "NO CLIENT" errors over and over and over again. With these changes, I can pretty reliably flash my Oak on my local network.
Another way we can get stuck in this "NO CLIENT" state is if the web server tries to close the socket, but can't get the Oak to reply to its attempts to do so. /proc/sys/net/ipv4/tcp_orphan_retries controls the number of times that a socket in the FIN-WAIT-1 state (ie, attempting to close) will try to tell the other side that it's closing the socket before just giving up. The default is 8, and I'd suggest that a much lower number could be used here.
I'm testing with 2 on my machine, and getting much better results.
@jldeon - some great discoveries, thanks! - just wanted to note that we have been setting /proc/sys/net/ipv4/tcp_retries2 to 4 which seems to be a good balance between closing when it shouldn't and allowing frequent retries - this is noted at the top of the server.js file, but this was a shot in the dark - you certainly figured out why this is necessary. In general I don't mind if any of the settings we need to change are OS level - this update server will run isolated on its own cloud/virtual server
@digistump Ah, should have looked at the node.js code :) I figured you guys weren't going to do anything else on this box, which is why I suggested some of those OS-level changes. I'm not sure how the kernel does math with that tcp_retries2 value, so I don't know what 4 means in the context of how long the socket will persist. I'd suggest trying some of the other settings, as those helped immensely.
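On the question of what a tcp_retries2 value means in seconds: as far as I understand the Linux kernel documentation, the value is turned into a time limit by assuming the retransmission timeout starts at 200 ms, doubles on each retry, and is capped at 120 s. A small Python sketch of that arithmetic follows; the function name is made up, and the result is only a lower bound, since the real timeout is derived from the measured round-trip time.

```python
def retries2_lower_bound_ms(retries: int, rto_min_ms: int = 200,
                            rto_max_ms: int = 120_000) -> int:
    """Hypothetical minimum time (ms) before a connection is dropped.

    Sums an exponentially doubling retransmission timeout, starting at
    rto_min_ms and capped at rto_max_ms, over the initial send plus
    `retries` retransmissions.
    """
    return sum(min(rto_min_ms << i, rto_max_ms) for i in range(retries + 1))


default_ms = retries2_lower_bound_ms(15)  # 924600 ms, about 15.4 minutes
lowered_ms = retries2_lower_bound_ms(4)   # 6200 ms, about 6 seconds
```

So under these assumptions the default of 15 works out to roughly 15 minutes, and a value of 4 to a bit over 6 seconds on a low-latency link.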
If you want to see if a lot of people are stuck in the FIN-WAIT-1 state and would benefit from the tcp_orphan_retries change, try running the ss command on the server. The State column should show FIN-WAIT-1 if there are a lot of pending socket closures.
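If eyeballing the ss output gets tedious, something like the following Python sketch could tally the State column. The function names are made up, and it assumes the output of "ss -tan" with one header line followed by one socket per line, with the state in the first column.

```python
import subprocess
from collections import Counter


def count_tcp_states(ss_output: str) -> Counter:
    """Count sockets per TCP state from the text output of `ss -tan`."""
    counts = Counter()
    # Skip the header line; the first column of each row is the state.
    for line in ss_output.splitlines()[1:]:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def read_ss() -> str:
    """Run `ss -tan` (all TCP sockets, numeric addresses) and return its output."""
    return subprocess.run(["ss", "-tan"], capture_output=True,
                          text=True, check=True).stdout
```

On the server, count_tcp_states(read_ss())["FIN-WAIT-1"] would then give the number of sockets still waiting to close.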
For the record, I'm doing no throttling whatsoever on bandwidth and not having any issues flashing the Oak over and over again. The update server and the oak are on the same high-speed LAN.
@jldeon before you implemented those changes, were you getting SOCKET READ TIMEOUT? That is the main issue that seems to crop up for many.
What server are you using to serve it? Apache (version, etc?) or Node or something else?
@jldeon Do you have any issues receiving updates from the official server though? My local server works fine as well, but I have no issues with the official one, so I can't really debug.
@digistump I get the occasional SOCKET READ TIMEOUT (10% of the time, or so?). Looking at the tcpdump when it occurs, the server is retransmitting the packet but it's not getting to the Oak (at least from the tcpdump on the server, I see no ACK). I don't know that there's much that can be done about this, though, since we're dealing with wi-fi.
I'm going to keep digging on that, though, now that I've done what I can with this issue.
@DarkLotus Yes, I tried last night and this morning to update with the official server on 2 Oaks, and it failed on every attempt.
Ah at least you can reproduce :) if you need a vps or anything to test from a remote server let me know.
@DarkLotus Thanks for the offer! I think I've probably got it covered. I've got a couple of VPSes, credit to AWS, credit to Azure... probably some other junk if I dug around a bit :)
@digistump Totally flaked on your questions. The box I'm using currently is Ubuntu 14.04.4 LTS, 32-bit, testing with Apache 2.4.7 (latest available in the Ubuntu repos) currently.
@jldeon - when it failed on every attempt last night against the live server, was it due to NO CLIENT or SOCKET READ TIMEOUT?
@digistump I was running the factory build, so there was no debug output. I posted what I had to the forums: http://digistump.com/board/index.php/topic,2034.msg9360.html#msg9360
I can reliably reproduce the "SOCKET READ TIMEOUT" error by pointing my Oak at my VPS, and I've been digging into it for the last hour or so.
It doesn't look like an actual packet timeout, it looks more like some sort of weird conflict or maybe a race condition? I see a lot of retransmitted packets on both sides.
I've got to sleep now, but I'll try and craft an experiment to test this tomorrow if I've got time.
If anyone is around, I've spun up a CentOS 6 server with Apache 2.2, just in case reverting back to Apache 2.2 is the fix.
set
43
{"first-update-domain":"oak.jameskidd.net"}
set
90
{"first-update-fingerprint":"07 99 62 B4 0C 4D 4D 59 22 1F 62 A2 83 04 0B E5 00 94 BE 18"}
At least on my end I can confirm this works 100% of the time, 5/5 thus far.
Will set up an Ubuntu 14.04 Apache 2.4 setup next on the same host, and see if I can get some time-outs reliably happening.
set
44
{"first-update-domain":"oak2.jameskidd.net"}
set
90
{"first-update-fingerprint":"07 99 62 B4 0C 4D 4D 59 22 1F 62 A2 83 04 0B E5 00 94 BE 18"}
This one is Ubuntu 14.04 with Apache 2.4, all stock as well. Still, I can't reproduce socket timeouts reliably. I'm starting to wonder if it's router chipset related or something.
When I see that there are more problems with high-speed networks, it kinda rings a bell for me: MTU! Packet fragmentation might not be handled very well by simplified / old IP stacks. I remember it being a problem in the early days of DSL internet access.
So I don't have time to test it myself currently, but trying to lower the server network interface MTU might help there?
On a different subject, I would be happy to test whether any server you guys put out there is an improvement: I am "lucky" enough to be unable to upgrade any of my 3 Oaks here at home (cable internet), and I have a 3.3V-capable USB/TTL adapter available. Just don't expect lightning-speed replies.
@DeuxVis oak2.jameskidd.net is running with MTU set to 576 if you want to give it a shot. and oak.jameskidd.net is running apache2.2 on CentOS 6 to try and match Erik's original server
I have 100% failure rate with the MTU lowered to 576, Socket read timeouts. So Packet size could definitely be a factor in this. Bed time for me, will look at it again tomorrow.
I will try that later at home. I just successfully upgraded one of my Oaks on the first try at the local hackerspace, so it's not the place for testing. The difference is that here we have a fairly recent DSL "box" (modem), while my home cable modem - where Oaks fail to update - is probably a 10-year-old design.
I've been digging into a dozen or so packet captures to try and figure out the SOCKET READ TIMEOUT issue. It seems like the TCP stack on the ESP is not particularly great at handling packet loss. Every time it fails, I see a pattern of packets dropped and multiple retransmissions of old packets and ACK packets going back and forth. Eventually there hasn't been valid data in long enough that the Oak gives up on the connection.
set
54
{"first-update-domain":"oakotafallback.digistump.com"}
set
90
{"first-update-fingerprint":"98 66 d5 5c 3d 4a 49 24 e3 1b 72 8b 8f 2e 65 2e 32 2a 7b 95"}
Edit: The official version of the Digistump fallback server is live, so this comment now points to that instead of my AWS instance.
It's on an Amazon AWS micro instance, and should be able to handle at least a few concurrent connections at once.
Expect the download to take around 3 minutes or so.
You'll want to do everything you can to ensure a good connection to the internet, as packet loss tends to be fatal. I suggest dropping your router to B only (on the 2.4GHz band) if you can.
If you do end up with a SOCKET TIMEOUT error on the Oak, you're highly likely to get a few NO CLIENT errors until the socket times out on the server side, as per my previous long-winded comment. You might want to just unplug your Oak for 15-30 seconds or so, which ought to make it time out a bit faster. I think the repeated connection attempts cause the socket to live a bit longer.
I went from constant SOCKET TIMEOUT errors (nearly 100%) to only very occasional errors with this setup. If this ends up working, I'll share the custom server code :) Otherwise, it's back to the drawing board for part 3.
My Oak's been flashing from my custom server for a while now, I'm up to 37 successful transfers and 3 failures on the server side. Much better than 0 successes last night :)
The serial logs show 4 SOCKET READ TIMEOUT messages, and a total of 14 failures (the rest being chain NO CLIENT failures)
EDIT: Hit 100 attempted transfers on the server side. 88 successes! Attached is the serial log.
Just tried your setup out, First attempt failed at around 70-80% with socket timeout. Second attempt failed a little earlier around 60% with socket timeout. Third attempt failed a bit further than the second. No "NO CLIENT" errors though. Log output
Seems strange: I have almost 100% success when running off any of my VPSes hosted in Australia, but the setup that works for you fails for me. All I can think is that latency would be much higher for me on yours, as it's on AWS.
Interested to know if oak.jameskidd.net works for you, as I'm at 14/15 successes off that.
@DarkLotus do you have only B enabled on your router? I've been testing with B/G/N and getting around 30-50% failure rate. B only is closer to 90% or so.
Nope, I'm using B/G/N, but from my own tests it made no difference with my or Erik's servers.
Can't disable G/N right now ( Got long running tasks connected over wifi at the moment ) though will be able to in a few hours time.
B-only has been about the change that made the most impact in my testing so far. Can you keep attempting it even with B/G/N? Would be good for just load testing if nothing else...
I can see your connections to the server and the failed transfers. I'm busy at the moment, but I may be able to packet capture your attempts at some point, although I bet they're pretty much the same as mine.
Will do, little busy here as well but will keep pushing some attempts thru, will let you know once I've swapped over to B
Your server's returning internal error 500, btw, which gives Update Success and then soft-bricks the Oak.
@DarkLotus Yeah, I just found and fixed a pretty major bug with concurrent connections. Apologies for any inconvenience. Your attempts to flash helped me find it :)
If one client failed to download, it was aborting all the download threads. My bad for not fully understanding the way the objects were working in python twisted.
I think whatever you did might have solved it, 2/2 worked just now.
Nice! I saw your attempts go through successfully.
I've got to sleep now, but I've set up 2 Oaks on my desk in the "silent" debug firmware loop that are flashing continuously. I've had 1-2 failures over 25 or so attempts. I'll leave everything overnight and check it in the morning.
Feel free to abuse the server in the meantime...
Overnight testing is complete: the server reports 713 total transfers, with 37 failures (so 676 successes, roughly 95%). I don't have serial logs to see the results from the client side, since my computer eventually fell asleep. The Oak failure count is probably at least a bit higher.
I'll continue refining and testing today, but I could still use more testers pointed at the server...
@jldeon I will test your server tonight. Weekend finally arriving! Note that my modem/router doesn't allow disabling B, G or N.
@DarkLotus Is your server with reduced MTU still available? I might test it as well.
I previously tried to use beta 0.9.3 and the update worked pretty well. Now I thought I'd go for 0.9.5, restored my Oak and tried the update a few times without success. Then I read through the comments above, and as it seemed like @jldeon has a server running with some improvements, I thought I would give it a try. Result: Update worked on the first try.
If more testers are needed, I think I could set up an Oak in the endless update loop and let it run for some hours...
@fri-sch sure, if you can test and capture your serial output for a while, that'd be a useful data point. Any info about your router and your connection to the internet would be helpful as well.
I've got 2 oaks running in a continuous loop against my server testing today. Server logs indicate around 400 successes (including fri-sch's attempts) and around 61 aborts before the file was completely transferred.
I'm still refining things as I have time. I tested a bit with my Android phone as a wifi hotspot - data coverage in my area is pretty spotty. I was only successful about 20% of the time. I've captured some traces, but I'm not sure how fixable the issues are going to be from the server side.
@jldeon Great work! I think the hotspot test sounds interesting. If one could get a small server running on the hotspot device (well, not sure that is really within the requirements), that could be a way to minimize packet loss, since the packets would only need to travel over the one "link". Lowering the MTU would increase the number of packets, and could maybe increase the probability of packet loss.
edit: Maybe this could work out of the box? (Pro has https) https://play.google.com/store/apps/details?id=org.xeustechnologies.android.kwspro
@jldeon the update loop is running for some time now and I will leave it on for the night. My router is an AVM FritzBox 7362 and wifi is setup for b+g+n and uses channel 4. Let me know if you need any more info...
@epatel Yeah, I was considering the possibility of deploying a small HTTP server on people's device or local network or something. The problem is SSL. The factory firmware has the domain and thumbprint of the digistump server baked in, and it seems like it can only be changed via serial (which not everyone has).
I'm not an SSL expert, but I believe that means we'd have to distribute copies of the key plus have people override their DNS to point to their own local copies. In addition to being painful, it destroys any security SSL brought to the update process.
@fri-sch Any stats on your success/fail rate? I'm wondering if it's any better or worse than my 2 Oaks. Thanks for helping test!
@jldeon Ah, I thought there was some flexibility in setting the loading params as they described. I thought it might be possible to use self-signed certs for the hotspot device's local address. But then, could it be possible to make a buffering proxy/router? The goal I am thinking of is to eliminate any packet loss from WAN transport.
@jldeon, @DarkLotus - awesome work so far, many thanks for all the effort you are putting into this!
And thanks for all of those helping test as well!
@epatel, @jldeon - The domain, url, and thumbprint can all be changed over http as well, so those can be changed by the config web app if needed. Of course, a remote server is preferred, but a remote server setup that is as good as it possibly can be, plus some sort of local solution for those it still doesn't work for, seems like it could be the best of both worlds.
Skills Required: System Admin, Node, C++, Linux Sockets, ???? Difficulty: Unknown
Challenges/Thoughts:
The technical limitations: The firmware shipped on the Oak is solely there to let you configure your wi-fi and get the first update. Since the Oak does not have a USB interface, the firmware cannot be changed except by this update, so any changes to make this work have to happen on the server side of things.
The technical details: The Oaks performed very well at getting the update in our early testing, before we sent the firmware to the factory. While this was tested in a variety of ways, the main development setup had the update file hosted on an Apache 2.2 server on Amazon EC2, accessed from our machines over a 3mbps DSL connection. We also tested with it locally on our b/g/n wifi network. Our routers were run stock for testing, with b/g/n enabled and channel set to auto. Between the point where we approved the firmware to be burned to the units and when people started to report issues, the following changes occurred (or may have occurred): the hardware went from prototype to production, with likely ever so slightly different parts that should not have affected performance, and the update file was moved to an Apache 2.4 server on Ubuntu 14.04 on Digital Ocean (and then to various server setups - see github for more on this). After this, updates were still working very well for us. Since we didn't think it was an area we needed to worry about, I can't say for sure whether one ever failed, but they never failed often enough to catch my attention as a possible issue before I started shipping the factory-produced units. The final point is that the update still works for me more often than not over DSL, but not always over the local network, which is why I believe connection speed may be involved.
Issue specifics: Specifically, the Oak seems to either disconnect from the server prematurely or fail to get the next packet/chunk of data from the server. This is usually seen on the Oak as a Socket Timeout, or the Oak restarts because the watchdog timer kicks in after it sits in a loop doing nothing for a while. If the Oak makes it to the end of the update, it works. This may have to do with WiFi interference (we don't expect you to magically make it work even if other things are on the same channel as the Oak) - we are just trying to get it to work with a minimal set of rules for the user ("make sure your router is not on channel 1" is acceptable; "make sure your router is set to B only and 1mbps" is probably not). For some rather extreme router settings that seem to work, and that support our idea that this is speed related, please see this post by a fellow beta tester on the forums: http://digistump.com/board/index.php/topic,2046.0.html
Things we've tried: We've tried various server setups including Apache and basic Node.js SSL servers. We've tried using Node.js to make a custom HTTPS server that slowly pushes chunks of the firmware to simulate a slower connection. This seemed 100% reliable at one point in our testing, and then it wasn't any more, no idea why. It is worth noting that the linux sockets buffer this data anyway, so it was unlikely this was actually helping, but we really don't know for sure yet.
How to test:
where yourdomainorip is the domain or ip you are testing with, and the 00s are replaced by the SHA1 thumbprint of your SSL certificate. Your Oak will now expect this certificate, connect to this domain or IP, and expect the firmware file at /firmware/firmware_v1.bin (grab the latest firmware here to place on your server for testing: https://oakota.digistump.com/firmware/firmware_v1.bin)
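If you need to produce the thumbprint string yourself, here is a small Python sketch that computes a SHA1 thumbprint from a certificate and formats it as space-separated hex pairs, like the fingerprints posted in this thread. The function names are made up, and whether the Oak cares about upper versus lower case is not something this sketch knows (both appear in the thread).

```python
import hashlib
import ssl


def sha1_thumbprint(der_bytes: bytes) -> str:
    """SHA1 of the DER-encoded certificate as 20 space-separated hex pairs."""
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    return " ".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def thumbprint_from_pem(pem_text: str) -> str:
    """Same as sha1_thumbprint, starting from a PEM-encoded certificate string."""
    return sha1_thumbprint(ssl.PEM_cert_to_DER_cert(pem_text))
```

Reading your certificate file as text and passing it to thumbprint_from_pem should give a string in the right shape to paste in place of the 00s.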
NOTE You should have these strings ready to send as there is a 30 second timeout between sending the set and sending the json string.
The Oak should respond with {"r":0} after each three lines (set, length, content) are sent - if you get {"r":-1} back then something is wrong with the input.
To confirm you have changed the two parameters send over serial these two lines:
You should get back a JSON response that includes those two settings.
This will cause your Oak to endlessly loop trying to download the update, displaying a log to serial of its progress, and then rebooting and doing it again.
Acceptable solutions: The sky is pretty much the limit here, other than that the firmware cannot be changed. The solution must run on a standard linux server (hey, if you can get it to work with a windows server that'd be fine too), but can change the OS/server/etc in any way; this will run on a standalone cloud server. It does not have to be particularly performant (we can run many servers if we need to), though it must be able to serve the firmware to more than one device at a time. You can mess with linux sockets, you can write it in any language, you can use any existing software or packages - really we're open to anything. I have a fear it could be as easy as setting up Apache 2.2 again without any changes to the default linux socket setup (we've messed with that too many times now probably, without fully understanding - see comment at top of node scripts) - I don't think that's true as I'm sure I've tried that, but really even if it was that simple we would award you the bounty.
Solution testing: Any submitted solution will be tested first by ourselves, and then by a selected group of users who have experienced issues updating and are able and willing to carefully test. If your solution works for most of these cases, we will award the bounty to you. Partial bounties for partial improvements may be granted as well. Solutions that require the user to run something locally on their network may be considered, but preference will be given to server-only solutions (not sure if that would offer any advantage, but thought I'd throw it out there).
Bounty
$1000 cash or $2000 credit or 200 Oaks
Won by @jldeon
Cash or credit is your choice. Cash to be paid via Paypal. Credit has no expiration but can only be applied to a single order and does not cover shipping (because that is how our shopping cart works, not because we want to be limiting). Oaks reward includes shipping. You can also pick a split between any of the options.
You may credit yourself in the files as well, leaving intact existing licenses and credits.
Legal Stuff: We will choose a winner at our sole discretion. The winner will be the first pull request/comment that submits fully working code meeting the above requirements and following good coding practices, based on the timestamp of the pull request. Bounty will be awarded (or in the case of Oaks, sent) within 48 hours of confirming winner. Cash awards will be made in USD. This is not an offer for hire. All work submitted becomes the property of Digistump LLC to be used at our discretion in compliance with any associated licenses. Void where prohibited by law.