digistump / OakCore

Arduino/Platformio Core for Oak including Particle library
GNU Lesser General Public License v2.1

Fatal Exception 28 #16

Closed DarkLotus closed 8 years ago

DarkLotus commented 8 years ago

Ran WiFi config, connected to the cloud, received the update. Confirmed the update with /system-version.

I wiped out my 0.9 install in Arduino and installed 0.9.1.

Built the Blink sample with Particle.delay, flashed it over the cloud, and then got:

Fatal exception (28): epc1=0x40001800, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00637ff0, depc=0x00000000

I grabbed the boot output as well which is as follows:

ets Jan 8 2013,rst cause:2, boot mode:(3,7)

load 0x40100000, len 3632, room 16
tail 0
chksum 0xc0
load 0x3ffe8000, len 352, room 8
tail 8
chksum 0x82
csum 0x82

OakBoot v1 - N,BP,0

Tried grounding pin 10, but it doesn't appear to change anything.
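
For anyone reading the serial log, here is a minimal sketch of pulling the cause number and registers out of a line like the one above. The cause names are the standard Xtensa EXCCAUSE names (28 is LoadProhibited, i.e. a load from an address that does not permit loads); the function name and the small lookup table are illustrative and not part of OakCore.

```python
import re

# Subset of the Xtensa EXCCAUSE codes commonly seen on the ESP8266.
EXCCAUSE_NAMES = {
    0: "IllegalInstruction",
    3: "LoadStoreError",
    9: "LoadStoreAlignment",
    28: "LoadProhibited",
    29: "StoreProhibited",
}

LINE_RE = re.compile(r"Fatal exception \((\d+)\):\s*(.*)")


def parse_fatal_exception(line):
    """Return (cause, cause_name, registers) for one 'Fatal exception' line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("not a fatal exception line")
    cause = int(match.group(1))
    registers = {}
    for field in match.group(2).split(","):
        name, _, value = field.strip().partition("=")
        registers[name] = int(value, 16)
    return cause, EXCCAUSE_NAMES.get(cause, "unknown"), registers


line = ("Fatal exception (28): epc1=0x40001800, epc2=0x00000000, "
        "epc3=0x00000000, excvaddr=0x00637ff0, depc=0x00000000")
cause, name, regs = parse_fatal_exception(line)
print(cause, name, hex(regs["epc1"]), hex(regs["excvaddr"]))
```

Here epc1 is the address of the instruction that faulted and excvaddr is the address it tried to access, which is usually the first thing to look up when disassembling the ELF.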

DarkLotus commented 8 years ago

Okay, I read the OakBoot source.

With pin 10 grounded it boots with N,BU,0 instead of N,BP,0, which I hadn't noticed before.

But then it spams Fatal 28 the same as above, since it's loading the same ROM, I guess.

digistump commented 8 years ago

Anyone getting a Fatal Exception error after successfully connecting to Particle and trying to upload a sketch, if you can please do the following, the more people who can do this the better I can diagnose the issue (as I haven't been able to reproduce yet):

1) Make sure you have Python installed and in your path, and a serial adapter to connect your Oak.
2) Grab and unzip this script: https://github.com/digistump/OakCore/files/102522/esptool.zip
3) Connect pin 2 to GND and power up your device. This puts the hardware bootloader in serial mode.
4) From a command line, with your Oak attached via the serial adapter, run "esptool.py --baud 115200 --port COMX read_flash 0x000000 0x300000 flash_dump.bin"
5) Email the flash_dump.bin file to support@digistump.com. (Don't upload it, as it may be possible to get your network password from this file. Don't send it if you aren't comfortable sending that to me; I will of course not use or keep it.)
6) Comment as to specifically what error you are seeing - uploading or pasting the serial log is even better.

File for dumping flash: esptool.zip
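
If you would rather script step 4, the following is a rough sketch that only assembles the same esptool.py command line and checks the size of the resulting dump. The helper names are made up for illustration; the only esptool arguments used are the ones already shown in the steps above.

```python
import os


def read_flash_command(port, start, length, out_file, baud=115200):
    """Build the esptool.py argument list for dumping a flash range."""
    return [
        "esptool.py", "--baud", str(baud), "--port", port,
        "read_flash", f"0x{start:06x}", f"0x{length:06x}", out_file,
    ]


def dump_is_complete(path, expected_length):
    """True if the dump file on disk has exactly the requested size."""
    return os.path.getsize(path) == expected_length


cmd = read_flash_command("COMX", 0x000000, 0x300000, "flash_dump.bin")
print(" ".join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) once COMX is replaced with your real port. A dump of 0x300000 bytes is 3 MiB, so dump_is_complete("flash_dump.bin", 0x300000) is a quick sanity check before emailing the file.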

digistump commented 8 years ago

Also if anyone seeing this error wants to try this - unzip the file attached to this and put it in the same folder as esptool.py from the post above this.

Follow 1-3 above then run the command "esptool.py --baud 115200 --port COMX write_flash -fs 32m 0x3fc000 blank.bin 0x3fd000 blank.bin 0x3fe000 blank.bin 0x3ff000 blank.bin"

Please do a flash dump per above first, then try that, and let me know if your unit can then boot - again serial log is best.

blank.zip

nkildal commented 8 years ago

Hi Erik

My Oak is spamming Fatal 28 exceptions when I power it up (also with pin 10 grounded). Trying to dump from it, but I get the following:

MacBook-Pro-2:Oak_debug nicolai$ ./esptool.py --baud 115200 --port /dev/tty.SLAB_USBtoUART read_flash 0x000000 0x300000 flash_dump.bin
Connecting...
A fatal error occurred: Failed to connect to ESP8266

I also tried grounding pin 10 before applying power to the board, but with the same result. Also tried using a baud rate of 74880 - also without success.

Do I need to do something else to get the flash image dumped?

/Nicolai

digistump commented 8 years ago

@nkildal Sorry - I missed an important part in my instructions - please connect Pin 2 to GND and then power up and dump the flash - same for writing files to flash - this puts the device in the serial bootloader mode - you'll need to power cycle it after a write or dump to put it back in the serial bootloader mode again.

nkildal commented 8 years ago

No problem - it worked! I've attached my dump file.

Removed for security reasons.

digistump commented 8 years ago

Thanks @nkildal - if you don't mind trying the possible fix (it's a bit of a shot in the dark, but won't disable the Oak any further) - I'd love to hear if it works for you, I'll take a deep look at the dump file later today or first thing tomorrow, but at a glance nothing obvious yet.

nkildal commented 8 years ago

I tried writing the blank.bin file:

MacBook-Pro-2:Oak_debug nicolai$ ./esptool.py --baud 115200 --port /dev/tty.SLAB_USBtoUART write_flash -fs 32m 0x3fc000 blank.bin 0x3fd000 blank.bin 0x3fe000 blank.bin 0x3ff000 blank.bin
Connecting...
Erasing flash...
Wrote 4096 bytes at 0x003fc000 in 0.4 seconds (85.1 kbit/s)...
Erasing flash...
Wrote 4096 bytes at 0x003fd000 in 0.4 seconds (85.1 kbit/s)...
Erasing flash...
Wrote 4096 bytes at 0x003fe000 in 0.4 seconds (85.1 kbit/s)...
Erasing flash...

A fatal error occurred: Failed to enter Flash download mode (result "0x1, 0x6")

Powering down and up again, gives me repeating fatal (28) exceptions on serial...

digistump commented 8 years ago

Looks like the final one failed @nkildal - can you try just this:

esptool.py --baud 115200 --port COMX write_flash -fs 32m 0x3ff000 blank.bin

nkildal commented 8 years ago

Here goes:

MacBook-Pro-2:Oak_debug nicolai$ ./esptool.py --baud 115200 --port /dev/tty.SLAB_USBtoUART write_flash -fs 32m 0x3ff000 blank.bin
Connecting...
Erasing flash...

A fatal error occurred: Failed to enter Flash download mode (result "0x1, 0x6")

digistump commented 8 years ago

And a little more background - what we're trying here is blanking out where the Oak stores wifi config data - not the SSID and passcode which is saved in the particle config area and CRCed on save, but the data it uses to reconnect quickly to a known good network - it is the one area I have no control over as it is written inside the ESP8266 SDK, which is why I suspect it might be at fault.
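
To make the addresses in the write_flash command concrete, here is a small sketch that computes the four 4 KB sectors being blanked and builds a blank sector image. Two things are assumptions: that blank.bin is 4096 bytes of 0xFF (the erased state of flash; the thread only confirms the 4096-byte size), and that the flash is 4 MiB, which is what "-fs 32m" (32 Mbit) implies.

```python
SECTOR_SIZE = 0x1000          # 4096 bytes, matches "Wrote 4096 bytes" in the logs
SDK_CONFIG_START = 0x3FC000   # first address in the write_flash command
SDK_CONFIG_SECTORS = 4
FLASH_SIZE = 0x400000         # assumption: 4 MiB part, from "-fs 32m"


def sdk_config_sectors():
    """Start addresses of the sectors the SDK uses for its own config."""
    return [SDK_CONFIG_START + i * SECTOR_SIZE for i in range(SDK_CONFIG_SECTORS)]


def make_blank_sector(fill=0xFF):
    """One sector of filler bytes (0xFF is an assumption about blank.bin)."""
    return bytes([fill]) * SECTOR_SIZE


def write_blank(path="blank.bin"):
    """Write a blank sector image to disk."""
    with open(path, "wb") as handle:
        handle.write(make_blank_sector())


for address in sdk_config_sectors():
    print(hex(address))
```

The last sector ends exactly at the end of the 4 MiB flash, which is consistent with the SDK keeping its data in the final four sectors.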

digistump commented 8 years ago

Hmm @nkildal - how about this file (blank2.zip) with

esptool.py --baud 115200 --port COMX write_flash -fs 32m 0x3fe000 blank2.bin

If that doesn't work I'll hold off on any more hunches until I successfully reproduce the issue so I can test locally.

If that doesn't work, would you mind doing a full dump (the one before only covered the most important 3/4 of the flash)? Then maybe if I load your full ROM into one of mine I can reproduce the issue:

esptool.py --baud 115200 --port COMX read_flash 0x000000 0x400000 flash_dump_full.bin

Probably best to email the result, as I realized it could present some security issues: it is possible to deconstruct the file and get someone's WiFi password. Of course, don't send it if you are not comfortable sending that to me (I won't use it or even care to look at it, but I want to be clear about what is being sent). I removed the one you uploaded earlier for the same reason.

Thanks for all your help.

nkildal commented 8 years ago

Ok - tried blank2.bin:

MacBook-Pro-2:Oak_debug nicolai$ ./esptool.py --baud 115200 --port /dev/tty.SLAB_USBtoUART write_flash -fs 32m 0x3fe000 blank2.bin
Connecting...
Erasing flash...
Wrote 8192 bytes at 0x003fe000 in 0.8 seconds (85.2 kbit/s)...

Leaving...

Disconnecting power, removing GND on pin2 and powering up again, still gives me a steady green light - serial console still shows scrolling fatal exception 28 :-(

I'll go ahead and produce a full dump, and send it to you in an email - I have no issues sending you the data - just happy to help out :-)

digistump commented 8 years ago

Awesome @nkildal - thanks again - next chance I get I will try loading it into an oak and see if I can reproduce

DarkLotus commented 8 years ago

Here is a dump of mine as well, wifi passwords are for suckers, so nothing personal in here.

The AP it "should" be connected to is called "KKLOLK". Another AP within range is called "belkin.743c", and the third one in range is "Speedweb" something.

I noticed at offsets 0x100B1D and 0x200B1D there is a string "KKLOLK.743c", which seems strange?

flash_dump_full.zip

Could I dump one of my working Oaks and flash it? Or would that give me a world of pain and conflicting device IDs lol

Bit off topic, but if I power my Oak from a dedicated wall PSU rather than my USB hub, I would then need a ground line for serial, yes? I assume I'm getting away with TX/RX only since my ST-Link and the Oak are currently powered via the same USB hub?

If that's the case, just a jumper between GND on the Oak and GND on the ST-Link would be all that's needed?

Using an ST-Link v2 from the top of an STM Nucleo development board.

digistump commented 8 years ago

Awesome - thanks! @DarkLotus

I wouldn't advise moving a dump from one to the other - you could dump a good one from 0x0 to 0x201000 and then replace the device IDs with the ones from this dump - you'll find them just past 0x100000 and 0x200000, but YMMV

DarkLotus commented 8 years ago

Think I'll wait. I dumped a good firmware and there are quite a few differences apart from the user ROM sections.

digistump commented 8 years ago

@DarkLotus Would you mind compiling and uploading the bin file produced by the same Blink example you flashed with the Arduino IDE? Hit "Verify" instead of "Compile" and then grab the path to the bin file from the output at the bottom of the window. (You may need to turn on verbose compiling in the preferences.) I've got a way to fix the bricked units, and some ways to keep this from bricking units, but still no idea why this occurred unless the compiled bin file was actually bad to start with.

DarkLotus commented 8 years ago

sketch_jan22a.cpp.zip

Here you go. Built with the same Arduino / oaktools etc. install as what killed mine.

DarkLotus commented 8 years ago

Comparing my bad bin to the sketch bin, it's all there byte for byte, so my compiled sketch must be bad?

digistump commented 8 years ago

@DarkLotus yes that's exactly what I'm thinking - what OS was this compiled on?

Mind also uploading the exact sketch and the .elf file from the same folder as the bin? I'll compile the sketch and compare, and the ELF can be used to disassemble and see where the error occurs.

digistump commented 8 years ago

@nkildal - What OS were you using? Any chance you can upload the sketch, bin and elf as requested of @DarkLotus above?

Thank you both for helping unravel this mystery!

nkildal commented 8 years ago

@Erik: I'd love to, but it has to wait until I get home from work in about 8 hours - 7:15AM here in Denmark :-) If you've not fixed the issue by then, I'll upload the files as soon as I get home. My OS is Mac OSX El Capitan 10.11.

The sketch was the blink example with the two Particle.delay() lines..

digistump commented 8 years ago

@nkildal - thanks! that is actually all I need to know, there is a very definite pattern developing here - if @DarkLotus says he is on OSX and even more so if he was on El Cap - then 100% of people who have written to me with this issue are on El Cap

DarkLotus commented 8 years ago

Compiled on Linux x64, Arduino 1.6.5

sketch_jan22a.elf.zip

void setup() {
  // initialize the digital pin as an output.
  pinMode(1, OUTPUT); // LED on Model A
}

// the loop routine runs over and over again forever:
void loop() {
  digitalWrite(1, HIGH); // turn the LED on (HIGH is the voltage level)
  Particle.delay(4000);  // wait for four seconds
  digitalWrite(1, LOW);  // turn the LED off by making the voltage LOW
  Particle.delay(1000);  // wait for a second
}

I can do a build on OSX 10.10 or Windows as well if needed

ripred commented 8 years ago

@Erik: Can confirm. Once Nicolai Kildal and I found some issues with oak.js via the KS comment section on OS X (me on El Capitan 10.11.3) and I initiated the upload of my first sketch, it was bricked after that. My serial adapter is still somewhere in a box from my move from California, so I haven't been able to contribute to this thread. But yes, El Cap here. Firmware upgrade seemed to go fine, registered with Particle fine, good to go. First upload via oak.js (via Node.js and corrected code) seemed to brick it. Powered via USB on a Mac Pro desktop. For what clues it might add, I'm an iOS dev with probably the latest updates to gcc/LLVM etc., if that matters at all.

digistump commented 8 years ago

@ripred - Thanks for the info!

@DarkLotus, @nkildal

Here is what I've figured out so far, in case any of you are interested in the details:

1) Compiling on Windows vs OSX and Linux 64 produces a very different bin file. (Unconfirmed, but given that I developed on Linux32 as well, Linux32 may produce something similar to Windows.)
2) The differences are from compilation - not related to esptool2 or oakcli.
3) This has nothing to do with the OTA upload itself, as that gets CRC checked 3 different ways (one on a packet-by-packet basis during transfer, one at end of transfer, one on boot).
4) This exposed a bug in the ESP8266 SDK (aka we can't fix it) where the ESP8266 goes into a fatal exception loop without rebooting like it is supposed to - this prevents our bootloader from kicking into safe mode (aka config mode).
5) This also exposed a bug in our code that prevented the GPIO Pin10 recovery mode from working (a flag wasn't set to enable it).

To deal with #4 I've added a bunch more logic around when to consider a new rom file as good, namely instead of just booting it needs to actually run, and also some extra logic for when to jump into this mode. I've also fixed the Pin10 entry into config mode.

I've booted one of my computers into OSX to address the main issue, but haven't got too far in terms of results yet - things/thoughts on my list:

1) Is it 32 vs 64 bit related? As in, was the compiler for 64 bit machines compiled differently to cause this error.
2) Decompile both ELF files and compare.
3) Decompile the bad ELF file and find where the exception occurs.
4) Use the bad bin file to test the new logic for when to enter safe mode.
5) Try using the latest files from ESP8266 core, see if that fixes anything, look at commits for any hints.

Thank you all again for your help! More tomorrow!

digistump commented 8 years ago

I thought I'd add a bounty on this - in case someone wants to figure it out before I get back to it tomorrow.

TASK: Figure out why sketches compiled on OSX/Linux64 are failing with an exception loop and implement a fix.

HOW TO TEST:

Manually upload the bin file created when you compile the sketch, using the following command (blank.bin is attached: blank.zip ):

esptool.py --baud 115200 --port /dev/tty/YOURPORT write_flash -fs 32m 0x1000 blank.bin 0x101000 blank.bin 0x2000 YOURSKETCH.bin

Confirm that it fails with a fatal exception loop on the serial monitor with the current install. Then confirm that it succeeds (by watching via serial monitor and making sure the blinking executes) with your fix.

NOTE: blank is just being used to overwrite the bootloader config (and backup copy of the bootloader config) to force it to boot into rom 0, where the bin is being written

BOUNTY: $150 cash or $225 store credit or 25 Oaks (your choice) - subject to terms of the other bounties listed on here. Claimed by DarkLotus

Bootloader layout for those curious why blank is being written to those locations:

0x000000 boot
0x001000 boot config
0x002000 rom0
0x100000 particle config 
0x101000 boot config backup
0x102000 rom1
0x200000 particle config backup
0x201000 boot key sector - direct write area
0x202000 rom2
0x300000 1MB filesystem
0x3fb000 EEPROM sector
0x3fc000 sdk config (4 sectors)
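
As a quick cross-check of the layout above, this sketch maps any flash address to the region it falls in. The region names and start addresses are copied from the table; the 4 MiB end address is an assumption based on the "-fs 32m" flag used throughout the thread.

```python
import bisect

# (start address, region name), copied from the layout table above.
FLASH_LAYOUT = [
    (0x000000, "boot"),
    (0x001000, "boot config"),
    (0x002000, "rom0"),
    (0x100000, "particle config"),
    (0x101000, "boot config backup"),
    (0x102000, "rom1"),
    (0x200000, "particle config backup"),
    (0x201000, "boot key sector - direct write area"),
    (0x202000, "rom2"),
    (0x300000, "1MB filesystem"),
    (0x3FB000, "EEPROM sector"),
    (0x3FC000, "sdk config (4 sectors)"),
]
FLASH_END = 0x400000  # assumption: 4 MiB flash ("-fs 32m")
STARTS = [start for start, _ in FLASH_LAYOUT]


def region_for(address):
    """Return the name of the region that contains the given flash address."""
    if not 0 <= address < FLASH_END:
        raise ValueError("address outside flash")
    index = bisect.bisect_right(STARTS, address) - 1
    return FLASH_LAYOUT[index][1]


for address in (0x1000, 0x101000, 0x2000, 0x100B1D, 0x200B1D):
    print(hex(address), "->", region_for(address))
```

This shows the test command writing blank.bin to the boot config and its backup and the sketch to rom0. It also places the two offsets where the odd "KKLOLK.743c" string was spotted (0x100B1D and 0x200B1D) inside the particle config and its backup.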

Sketch for testing:

void setup() {
  // initialize the digital pin as an output.
  pinMode(1, OUTPUT); // LED on Model A
}

// the loop routine runs over and over again forever:
void loop() {
  digitalWrite(1, HIGH); // turn the LED on (HIGH is the voltage level)
  Particle.delay(4000);  // wait for four seconds
  digitalWrite(1, LOW);  // turn the LED off by making the voltage LOW
  Particle.delay(1000);  // wait for a second
}

DarkLotus commented 8 years ago

Just to let you know you're on the right track: I flashed the same sketch from Windows and my dead Oak is back to life. You missed the blank.bin, but if anyone wants to follow the above, you want a 4 KB blank.bin:

blank.zip

DarkLotus commented 8 years ago

Played around with it for a few hours, even tried using the .o files from Windows and just doing the .o > ELF > bin steps on Linux. Still no go. Sleep time here. Good luck to whoever works it out while I sleep :p

digistump commented 8 years ago

@DarkLotus - when you used the o's from windows did you grab all of the o's and build the arduino.ar on OSX or did you grab the arduino.ar from windows as well and just run the final linker command on OSX?

DarkLotus commented 8 years ago

So I just woke up, and I have a working firmware built on Linux x64. Just moving the .o files didn't work, but I just tried with all the .o files and arduino.ar and it works!

I moved everything from a Windows build except the .d and .cpp files, then re-ran compile in Arduino, and I get a size matching the Windows build and it works after flashing it.

digistump commented 8 years ago

Interesting - so either the issue is introduced by xtensa-lx106-elf-ar or it is one of the o files and by not bringing over the arduino.ar, arduino was using the arduino.ar it made locally from a previous compile

digistump commented 8 years ago

I guess the way to test this would be to grab all the xtensa-lx106-elf-ar commands and the final gcc command, empty the folder and only put the windows o's in it and then run those commands to make the arduino.ar and then .elf from command line to ensure Arduino doesn't rebuild/replace anything

DarkLotus commented 8 years ago

Which works. It's somewhere before linking, I think? The best example is that void ParticleProcess(void) is being stripped when built on Linux, but is in there when using the Windows .o files.
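
One way to check for a stripped function without opening a disassembler is to list the ELF's symbols with nm and look for the name. The sketch below assumes the toolchain ships xtensa-lx106-elf-nm alongside the xtensa-lx106-elf-ar mentioned above; the sample listing and its addresses are invented purely to exercise the parser.

```python
import subprocess


def defined_symbols(nm_output):
    """Names of symbols that nm reports with an address (i.e. defined ones)."""
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols print as "address type name"; undefined ones
        # have no address column and so only two fields.
        if len(parts) == 3:
            symbols.add(parts[2])
    return symbols


def elf_has_symbol(elf_path, symbol, nm="xtensa-lx106-elf-nm"):
    """Run nm on an ELF file and report whether the symbol is defined in it."""
    result = subprocess.run([nm, elf_path], capture_output=True, text=True, check=True)
    return symbol in defined_symbols(result.stdout)


# Invented sample output, only to show the parsing.
SAMPLE = """\
40201000 T setup
40201020 T loop
40201040 T ParticleProcess
         U some_undefined_symbol
"""
print("ParticleProcess" in defined_symbols(SAMPLE))
```

Running elf_has_symbol on the working and non-working .elf files for ParticleProcess should show the difference described above, provided the function is emitted as a normal global symbol rather than inlined.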

DarkLotus commented 8 years ago

Replacing -Os with -O in the Oak platform.txt appears to work! Just flashed and she's working! ParticleProcess is no longer optimized away. Will test more in case I'm jumping the gun here, but maybe it's just one of the flags enabled by -Os.

digistump commented 8 years ago

hmm - probably not coincidentally, ParticleProcess is declared as a C function instead of C++

ParticleProcess isn't actually necessary any more - in core_esp8266_main.cpp it could be replaced by Particle.process(); and then removed from the top of OakParticle.cpp and bottom of OakParticle.h - I wonder if that was the whole cause and removing it would allow -Os to work, or if it was just an indicator

digistump commented 8 years ago

was mid comment when I saw yours - Awesome! do you mind trying my last suggestion? I will (I've been testing in OSX which seems to mirror the behavior), but I'm on the shipping line right now and not near my mac

DarkLotus commented 8 years ago

Didn't solve it :( Works with -O2 though (with or without the Particle.process() change)

Not sure where to go from here, so I'm going to try building a newer version of the xtensa toolchain.

Probably more to discover via IDA, but I don't know any Xtensa assembly (the ESP8266 is an Xtensa core, not ARM), and I can barely do math in x86 assembly -_-

ripred commented 8 years ago

@digistump, @DarkLotus As a certified nerd and code junkie I find this conversation immensely enjoyable from a forensics/debugging perspective :-). Q: Would you feel comfortable putting up a publicly testable version with -O or -O2 in place and then work backwards offline to make it more optimized? Also curious if this is the same issue a few are facing w/windows 10. Or is this just a matter of changing my existing local Arduino15/... platform.txt and trying with a fresh Oak? Or is this disabled as far as the beta f/w being stopped?

DarkLotus commented 8 years ago

To test it yourself you just need to edit the /packages/digistump/hardware/oak/0.9.1/platform.txt file and change the compiler flags in there; there are three occurrences to change. Then just verify / compile in the Arduino IDE like normal to create your bin.

linux Os O2.zip bins

elfs.zip elfs

Attached is compiled with -Os(not working) and -O2(works)
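
For anyone who prefers not to hand-edit, here is a rough sketch of the same change as a text substitution. The sample lines are illustrative only: the key names follow the usual Arduino platform.txt conventions, but the actual contents of the Oak 0.9.1 file are not quoted in this thread, so check that the replacement count is the three occurrences mentioned above.

```python
import re

# Match -Os only as a standalone flag, not as part of a longer option.
OS_FLAG = re.compile(r"(?<![\w-])-Os(?![\w-])")


def swap_optimization(text, new_flag="-O2"):
    """Return (new_text, number_of_replacements)."""
    return OS_FLAG.subn(new_flag, text)


def patch_file(path, new_flag="-O2"):
    """Rewrite a platform.txt in place and return how many flags changed."""
    with open(path, "r", encoding="utf-8") as handle:
        original = handle.read()
    patched, count = swap_optimization(original, new_flag)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(patched)
    return count


# Illustrative sample, not the real Oak platform.txt.
SAMPLE = (
    "compiler.c.flags=-c -Os -g\n"
    "compiler.c.elf.flags=-g -Os -nostdlib\n"
    "compiler.cpp.flags=-c -Os -g\n"
)
patched, count = swap_optimization(SAMPLE)
print(count)
print(patched)
```

Keep a copy of the original platform.txt before calling patch_file, since a board package update or reinstall is the only other way back.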

ripred commented 8 years ago

@DarkLotus: Excellent thanks for the info and the zip. About to attempt the change (measuring 5 times before I cut :-) ) and then trying with a fresh Oak.

Thanks again

DarkLotus commented 8 years ago

Built the xtensa tool chain from https://github.com/jcmvbkbc/crosstool-NG/tree/feb1fb829d5cb4f8739e488cbd2f0b72304705bb

-O2 still works, so I assume I built it correctly. -Os still does not work: no fatal exception 28, just bootloader loops. The bins are different sizes compared to the default xtensa toolchain as well, at either optimization level.

I'm out of ideas for now, I think :(

Maybe losing -Os wouldn't be the end of the world. We lose 4 KB of ROM and 200 bytes of RAM; the difference isn't too bad since we have room to play with.

ripred commented 8 years ago

I can't test with a fresh Oak as the Particle registration and/or firmware download seems to be disabled at the moment. Looking hopeful though from the comments!

DarkLotus commented 8 years ago

Yeah, probably best to wait a little longer till it's sorted out. And yes, firmware update is disabled at the moment, since if you flash a bad ROM you currently need to reflash via serial, not the Particle OTA way.

Since I'm out of ideas, I'm trying the xtensa toolchain from master rather than commit #feb1fb8 (as used in the esp-open-sdk), which I tried earlier.

Still no go: using the latest xtensa #g1fbfd12, the Oak just boot loops with "E,BU,0" if you use -Os.

digistump commented 8 years ago

E,BU,0 (Exception, Booting Updater, Rom 0) would mean it is encountering an exception, probably very early in the binary if it doesn't get far enough in to throw a proper exception message.

Maybe a different approach would be to build -Os but disable optimizations that -O2 disables or the other way around build -O2 and enable optimizations -Os enables until the one that causes the fault is found - could be an excellent starting point to find the actual issue in the code (if there is one and its not just some compiler crazy bug).

@DarkLotus did you try -O2 on windows? With the above method it might mean -O2 could be used with several other optimizations enabled too, resulting in a binary very close to the -Os size.

In fact I'd probably err on the side of using -O2 if -Os can create issues like this, since stability > size
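
GCC can print which optimizer flags a given level turns on via "-Q --help=optimizers" (for example "xtensa-lx106-elf-gcc -Q --help=optimizers -Os"), which gives a starting list for the approach above. The sketch below diffs two such listings; the sample text is an abbreviated, illustrative listing rather than output captured from the Oak toolchain, and it assumes that toolchain's GCC supports the option.

```python
def parse_optimizers(listing):
    """Map each -f flag to True/False from a '-Q --help=optimizers' listing."""
    flags = {}
    for line in listing.splitlines():
        parts = line.split()
        if (len(parts) == 2 and parts[0].startswith("-f")
                and parts[1] in ("[enabled]", "[disabled]")):
            flags[parts[0]] = parts[1] == "[enabled]"
    return flags


def flag_differences(listing_a, listing_b):
    """Flags whose state differs, as {flag: (state_in_a, state_in_b)}."""
    a = parse_optimizers(listing_a)
    b = parse_optimizers(listing_b)
    return {name: (a[name], b[name])
            for name in sorted(a.keys() & b.keys())
            if a[name] != b[name]}


# Abbreviated, illustrative listings (not real toolchain output).
SAMPLE_OS = """\
  -falign-functions           [disabled]
  -falign-loops               [disabled]
  -fcrossjumping              [enabled]
"""
SAMPLE_O2 = """\
  -falign-functions           [enabled]
  -falign-loops               [enabled]
  -fcrossjumping              [enabled]
"""
for flag, (in_os, in_o2) in flag_differences(SAMPLE_OS, SAMPLE_O2).items():
    print(flag, "Os:", in_os, "O2:", in_o2)
```

Note that -Os also changes some size-related tuning that is not exposed as individual -f flags, so a flag-by-flag diff may not account for the whole difference between the two binaries.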

DarkLotus commented 8 years ago

Yeah, -O2 works on Windows. I have not tried -O1, but I assume it will work on either as it's less intensive than -O2.

I've been trying to track down which optimizations are on / off for -Os and -O2, but haven't been able to work out a definitive list of what's on / off.

I'm about to try the gcc 5.2 branch of xtensa and see if that makes any difference, then try optimizations one at a time maybe -_-

Edit: Can't get 5.2 to work. Gotta drop this for an hour or two to do some real work, then I'll try optimizations one by one to narrow down what's causing this.
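
If the list of candidate flags is long, halving it is quicker than going one by one. The sketch below is a plain binary search and only works under the assumption that a single flag is responsible; the flag names and the fails() callback are hypothetical stand-ins for "build with these flags, flash, and see whether the Oak exception-loops".

```python
def find_culprit(flags, fails):
    """Binary-search for the one flag that makes fails(subset) return True.

    Returns None if the full set does not fail. Assumes exactly one flag
    is responsible, which may not hold for a real compiler issue.
    """
    if not fails(flags):
        return None
    candidates = list(flags)
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        candidates = half if fails(half) else candidates[len(half):]
    return candidates[0]


# Hypothetical flag names and a simulated failure, for illustration only.
FLAGS = ["-fflag-a", "-fflag-b", "-fflag-c", "-fflag-bad", "-fflag-e"]


def simulated_fails(subset):
    return "-fflag-bad" in subset


print(find_culprit(FLAGS, simulated_fails))
```

With N candidate flags this needs roughly log2(N) build-and-flash cycles instead of N. If the fault only appears when several flags are combined, the search can point at the wrong flag, so a result is worth confirming by toggling that flag alone.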

digistump commented 8 years ago

@DarkLotus - thanks for all your efforts on this, and certainly don't let it keep you from real work!

I think you've definitely won the bounty as using -O2 "fixes" it - but I appreciate any further you can narrow it down.

Send me an email at support@digistump.com whenever you'd like to claim it, and let me know which prize form you'd like.

Thanks again!

digistump commented 8 years ago

oh one thought - have you tried OSX? If not someone better try that before we call it fixed - I can tomorrow if no one beats me to it.

DarkLotus commented 8 years ago

Can confirm -O2 works on OSX 10.10, Windows 10 x64 and Linux x64. I don't have a GUI install of a 32-bit Linux to test on, though.

Been playing with disabling different optimizations, no luck.