Closed cnlohr closed 8 years ago
Definitely in 0d65561 - I had a bit of a hard time since it /sometimes/ does seem to work now, just very, very rarely and fails with
A fatal error occurred: Failed to connect to the flasher (got CF48C1C9)
I have no idea how I would debug such a significant sudden change.
Of course it's 0d65561, affecting many people. When I spotted that, I prepared to write an angry bugreport, but investigating a bit more, I figured that for boards where max connection speed dropped, actual flashing speed in kbit/s didn't really dropped much. So the only thing I could report "you added new sub-tested feature, at the expense of a well-known working one!!!11". But saying such things to IT professionals is getting old, especially to such nice folks as Angus.
Hi @cnlohr,
Thanks for reporting this. It's the first I've heard of problems at higher speeds and I really appreciate you bringing it up. I have some ideas for debugging & resolving it.
Prior to #96, the ESP8266 autobauding feature was used to sync the baud rate to the device. This is often unreliable at higher speeds (in testing I found it was often off by a few percent), so since #96 an explicit baud rate is passed to the stub which then sets that. This reduced my incidence of occasional failures to zero, although it's less good that it raised yours to 100%!
Off the top of my head, the most likely problem I can think of is a race condition when changing rates. After it is executed, the flasher waits 11ms and then sends a greeting ("OHAI") at the new baud rate. This is what is coming over corrupted. At minimum, this delay should probably be higher to allow the host to set the baud rate on the adapter.
(Actually, come to think of it, the greeting should probably go the other way first - because the ESP8266's timing is reliable, but the host's is unreliable. So the ESP8266 should wait until it sees the host using the new baud rate, and then reply. I'll get you a change to test along these lines, ASAP.)
I also don't have any PL2303 hardware here, which might be the deciding factor. I'll try to get some over the next few days. Can you give a few more details about your setup, please?
@pfalcon wrote:
Of course it's 0d65561, affecting many people.
Do you have any basis for "many people"? PR #96 was merged to master 3 months ago, went into a stable release 2 months ago. This is the first bug report regarding it.
I prepared to write an angry bugreport,
Bug reports are very welcome. Angry bug reports are not welcome.
"you added new sub-tested feature, at the expense of a well-known working one!!!11".
I added the feature because what I saw in testing (and what Cesanta saw in testing) was to go from flashing 90% reliably at 460800 or 70% at 960800 to flashing 100% reliably at 1500000.
For the record (and given I feel the need to defend myself), the process around this change was quite careful to minimise the impact of regressions. The PR was open for a month and was updated several times based on testing and discussion.. The v1.0 stable release came out before PR #96 was merged, to minimise the impact on anyone who hit a regression (this timing was not a coincidence). The v1.1 release (which included #96) came out a month later after no bugs were reported in master.
saying such things to IT professionals is getting old,
Bug reports that include facts are welcome. Code review is very welcome. If you want to grumble about other people's professionalism then please find somewhere else to do this.
@cnlohr , sorry to digress - thank you again for the bug report, and I'm sure we can resolve this. I'll be in touch with a change that reverses the greeting direction, I have a suspicion this will solve it given the "sometimes works" nature of the bug. Otherwise I can hopefully reproduce it once I have similar hardware.
Angus
Do you have any basis for "many people"?
I saw it at try no.1. I saw 2-3 reports from esp-open-sdk, and I didn't really look hard. Neither me nor they reported it, so yep, pure projections on how many may be stumped by it.
PR #96 was merged to master 3 months ago, went into a stable release 2 months ago. This is the first bug report regarding it.
I see issues with Apache server which were investigated in 2008 and thought to be fixed then. The software unicorn-land.
the process around this change was quite careful to minimise the impact of regressions.
Sure, I know, great work!
Bug reports that include facts are welcome. Code review is very welcome. If you want to grumble about other people's professionalism then please find somewhere else to do this.
Yeah, I knew it would be hard to stay within bounds of cozy "code of conducts" while discussing this, especially that we aren't medics whose code of conduct for millennia was "do good, don't do harm" - that's why I skipped reporting it. So, all the above doesn't explain why widely (like, really widely, as widely as possible for such thing) tested, well understood, simple things was removed in favor of a complex, merely month or two tested by few people, thing. Actually, "well tested" vs "not well enough tested" can be skipped, simple and well-known vs complex and esoteric should give enough criteria whether it's worth to replace one with the other. YMMV
@cnlohr:
If you get a chance can you please try the fix/bad_hispeed_sync
branch. I don't know if this will make a difference, because I haven't reproduced the bug yet. But it seems like a promising fix. I also added some documentation for the stub build/test process, to try and make it easier if you want to try anything with it.
Although the stub is a change, it's a change in the direction of more documented and debuggable flashing process and it removes the need to workaround bugs in the binary ROM loader. Once this bug is fixed I don't expect anyone to see performance regressions as a result of the change, quite the opposite.
Gah!! I am so sorry, I will read everything above and try it soon, but, I wanted to report something. I ended up modifying the code to "keep trying" even after an error, like reconnecting, etc. When I did that, it looks like it always fails the first time and usually succeeds the second or third times. Flashes great after that. I will do the best I can to hit this up tonight.
Hi @cnlohr, glad to hear you're up and running again at an OK speed. Sounds to me more and more like a race condition when the adapter changes speed (it's receiving some bytes at the old baud rate, or at least before it settles at the new rate.)
Very curious to hear if the branch I linked above makes a difference.
I also ordered some px2303 adapters from taobao. When they get here I can try to reproduce.
Angus
PS Great project write up in IEEE Spectrum. Nice work!
I am using the $1 adapter from electrodragon. This is my standard adapter and they are like guitar picks for me. I believe I have about 20 or 30 now... So if there really is such a thing as a fake pl2303, then I have it!
The branch does not appear to effect the problem very much. I am still getting around a 50?% failure rate. I have modified my code to keep re-initializing and trying several times. Now, with a single burn command it always succeeds - just have to reset it a bunch. Though this is an ugly solution.
maxtries = 100
for tries in range(0, maxtries):
try:
esp.connect()
operation_func(esp, args)
break
except Exception as e:
print str( e )
if tries == maxtries-1:
raise "Too many attempts."
I've tried a couple things and nothing seems to make it sync any better.
I currently have logic analyzer dumps of the new version working, not working and the old version working. I am going bleary-eyed trying to find the differences, but indeed, I do see a difference. It does seem that when it fails, the chip is auto-adjusting to a lower baud than the specified baud rate. This also appears to happen on the older software, yet somehow is not an issue.
Is it possible the old software wasn't checking the response codes sometimes and just getting by anyway?
Re: The IEEE Article: Thanks! I didn't realize they were going to publish it so soon. I am going to try to make a video explaining the ins and outs of ColorChord.
I'm also really hoping to get more people on board with some of the ESP8266 stuff I do. I.e. high speed ADC access, my brand of webserver. Etc... We'll see though.
Interesting results. It is entirely possible that v1.0.1 wasn't checking some interim result, but I still suspect it's related to something lower in the stack.
My PL2303s turned up and they seem to be the design as the ones you have, only the PCB colour is different from the Electrodragon photo.
I only had time to do a quick test so far, but on my Linux system the one I tested works. With v1.1 I can reliably flash up to 2.5Mbit serial rate (a new personal best actually!). With older v1.0.1 I got 80% success rate at 1Mbit but at higher speeds it fails to sync (which is similar to the performance on other adapters, that limit appears to be at the ESP8266 auto-baud-sync end).
A log of the session, although there's not much extra detail there: https://gist.github.com/projectgus/f589448bed6cccb70d5cdc5ca6065360
Some photos of my setup - TX/RX/GND from PL2303 to a NodeMCU board, NodeMCU connected to USB power only so its onboard USB/serial adapter won't enumerate. Manually resetting each time via GPIO0/RST. (Please excuse black wire used for serial signal, running out of jumper leads at my desk!)
If nothing in the hardware setup grabs you as significantly different (power availability, maybe?), then I suspect drivers... What host OS are you using?
I'm also really hoping to get more people on board with some of the ESP8266 stuff I do. I.e. high speed ADC access, my brand of webserver. Etc... We'll see though.
I've been meaning to take a close look through your repos for a while. The details of the analog-TV-over-I2S hack in particular was something I'm keen to get my head around.
I just realised from the code you posted that your failure may be happening earlier than when I expected it to be. So this may be before the stub is even running!
Can you please post a log of the output from a failed session, with unmodified v1.1 (or current master, which is basically v1.1)?
Just flashed many times with the old code (pre Cesanta), everything went off without a hitch. New code: Failing a lot.
Just wondering how are you setting your baud rates? I always use STTY and it doesn't seem to let me select anything other than multiples of 115200. I could imagine 500k multiple bauds might happen to work much better with these adapters.
I am not sure what you mean by logs, though. It literally says:
esptool.py v1.2-dev
Connecting...
Running Cesanta flasher stub...
Failed to ping the running flasher stub (got CF48C1C9). Try lower baud rate?
I could attach a comparison of good/bad Saleae logic analyser dumps if you needed, though it's free, I'm not sure if you would want to deal with getting the software to read them.
GOOD GRIEF. 1 Megabit does work extremely well!!! I hate to just ignore the bug, considering it "feels" like it should work, but setting baud rate to 1,000,000 seems to work just about every time and I do <3 the added speed. Ok, maybe like 60% of the time, but I don't have any of the CF84 failures. It feels about as reliable as the old version.
OK, I'm glad you've got it working but that's weird that it seems baud rate dependent.
The esptool stuff uses pyserial to set baud, which is platform-specific from there on down. On Linux there are two different ioctls, an older one for "standard" POSIX baud rates and a newer one that can handle arbitrary rates. IIRC pyserial chooses the ioctl based on the baud rate. On Windows from what I remember it's an arbitrary rate in the API call. I don't know about OS X.
What OS are you using?
I am Linux.
Also, someday I might try to debug the not even getting to the point of getting an error error, where it just waits. This happens frequently (maybe ~40% of the time). But at least now it's a lot less inconvenient.
Interesting.. Weird I can't reproduce, I just re-ran some flashes at high POSIX baud rates and I still can't get any failures. Maybe my PL2303 is different, or I got lucky with it somehow...
Anything unusual about your hardware? (is it Raspberry Pi, or something else slower than a modern PC?) What kernel version (uname -a)?
If you have time, could you please send me some dumps of the output of "strace -tt -e read=3 -e write=3 esptool.py write_flash"? Ideally a couple where it succeeds and a couple where it fails, maybe at 1Mbit and also at a reliably bad baud rate?
I'd also be interested to see your Salae dumps if you're able to post them somewhere.
Thanks!
No, everything is standard desktop PC.
uname -a
Linux aspire 3.19.0-59-lowlatency #66~14.04.1-Ubuntu SMP PREEMPT Fri May 13 18:44:44 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
I will try to get the dumps this weekend.
P.S. I am now starting to want to solve the "doesn't connect" thing I have been seeing since forever. MAN it would be cool to have the thing program every time.
I AM A TERRIBLE PERSON.
An old vestige of my build systems was to send the 'reboot' command at 115,200. The ESP sometimes locked onto this and thereby prevented programming. I literally spent an hour or so getting all my captures and dumps together before I realized this. Now everything works every time at every baud rate.
ALL of the problems are gone.
Never mind. I spoke too soon. The 1M baud bug is still there sometimes. Collecting new dumps now.
Thanks for posting those captures, @cnlohr . Glad things aren't as broken as we feared, although something really odd is still going on here.
That capture is with an unmodified copy of current git master, yes?
This is not really unexpected, this happened a lot in v1.0 and is the reason why we should only "sync" (using the autobaud) at max 115200bps from now on. If a higher baud rate is specified with -b, the rate is set by the flasher stub in a more predictable way before flashing starts. This way we get both more reliable syncing and higher throughput.
But... I don't actually understand how the code from current git can be opening the serial port at a higher speed. Did you modify something in the code? What command line are you passing in?
OH! I am sorry, I should have clarified. I did modify the baud rate constant at the top of the current code. Should I not be doing that? I had no idea they were different!
Ah! Try using the -b
parameter or setting the ESPBAUD environment variable.
With added complexity comes reduced quick-hackability ;)
Sorry, I should have responded sooner.
So, using it that way does seem to increase its reliability, however, on a crash/reboot with GPIO0 low (Without depower, or hitting reset line) it seems to say "Loading Cesanta ..." but not get any further, ever. I suppose I need to just give up and wire in the reset line instead of operating in a perpetual GPIO low dev case. We can close the ticket now, and if I ever get any further, I will re-open it, but for now reliability is up.
For any other souls seeing this thread: tl;dr: Always keep the baud rate in the .py at 115200, but use -b on the command-line to increase baud rate.
The v1.0 stable release came out before PR #96 was merged, to minimise the impact on anyone who hit a regression (this timing was not a coincidence). The v1.1 release (which included #96) came out a month later after no bugs were reported in master.
Speaking of this, I don't see v1.0 on PyPI, so there's no easy fallback for people who hit issues with v1.1+. Can 1.0 be released there?
@pfalcon: The actual v1.0 release was v1.0.1, because I some metadata was missing in setup.py for the v1.0 upload.
https://pypi.python.org/pypi/esptool/1.0.1 https://github.com/themadinventor/esptool/releases/
Thanks!
I will try to binary search a little later if someone doesn't beat me to an explanation, but up until I updated my esptool about 3 weeks ago, I could easily burn at 460,800 with my pl2303. Now, I can't flash anything unless I set my speed down to 115,200. I can go back and use the older esptool and it works. Does anyone know what could have changed in the last few months?