Pinoccio / hardware-pinoccio

The schematics, board layout, and datasheets for the Pinoccio board
Other
55 stars 26 forks source link

Upload via bootloader times out for bigger sketches #9

Open matthijskooijman opened 9 years ago

matthijskooijman commented 9 years ago

From @jingman on November 12, 2014 1:19

Copied from original issue: Pinoccio/server-hq#191

matthijskooijman commented 9 years ago

From @quartzjer on November 12, 2014 1:43

FYI, I've seen the exact same behavior when flashing via the IDE and there's an stk500 timeout, sometimes it just blinks green, sometimes it's solid green, I think it just means that the flash was incomplete/corrupted and the bootloader is looping restart or stuck trying to boot bad code?

matthijskooijman commented 9 years ago

From @erictj on November 12, 2014 1:44

Regarding the stuck-green-light-while-flashing issue that seems to be cropping up--some observations:

Sounds like we have a couple of moving targets with regards to stable flashing via chrome: one is the ongoing automatic updates of the chrome browser at random times, and the other is the increasing size of bootstraps with most subsequent updates.

Perhaps a good approach for regression testing would be to make a custom bootstrap (maybe with a huge PROGMEM string initialized) to get the hex size close to, say 255k, and add that to part of the integration test. On every update of Chrome, we auto-run a 255k flash via chrome 10 times and ensure it works every time.

matthijskooijman commented 9 years ago

I think it just means that the flash was incomplete/corrupted and the bootloader is looping restart or stuck trying to boot bad code?

The bootloader checks address 0, if it is 0xffff it doesn't boot the sketch and restarts itself, which results in the blinking you get when you just burn a bootloader after erasing the chip (the bootloader does one blink each time it is run). When a flash somehow fails, it seems unlikely that address 0 is 0xffff, but perhaps it runs the sketch, which halfway does something weird and ends up doing a jump 0 or something.

Before the bootloader boots the sketch, it turns the leds off (that is - I think it accidentally leaves the pullups on, so they might light up very dimly, but they should certainly nog light up fully). This means that when the led lights up solid green, it never leaves the bootloader for some reason (one alternative explanation could be that the sketch jumps back to the bootloader before you can see the led is off, except that I think the led turns off halfway through the bootloader already).

As for the solid green led - it seems I can reproduce this often when I first plug in a scout in the morning. Replugging, or using the on/off switch is then sufficient to solve this, so this is likely a separate problem.

matthijskooijman commented 9 years ago

From @quartzjer on November 14, 2014 21:3

if it helps, I'm pretty sure I watched my scout (re)boot in the middle of a flash, which of course caused the stk500 timeout... so maybe the bootloader or the 16u2 is triggering a reset?

matthijskooijman commented 9 years ago

I dove into the timeout with avrdude aspect of this issue. It somehow looks like this is more likely to occur on one scout than another, but that might be dumb coincidence, or perhaps the two scouts I tested are running slightly different 16u2 or bootloader code, not sure (the "better" one was one of my first scouts from before we were officially shipping).

Anyway, I hung a logic analyzer on the TX/RX pins, and here's what happens:

So far facts, now speculation. I think that the bootloader misses a byte, or a checksum error occurs:

The value of this bootloader timeout is a bit ill-defined. There is a loop that checks the UART status which is executed F_CPU times, so the actual timeout depends on the instructions generated (but I'd expect 2-3 seconds, certainly > 1 second, which is inconsistent with my observations above).

The avrdude source is a bit vague on the actual timeout, but when no data is being received, it should timeout after 5 seconds. AVR068, the STK500V2 spec, says:

The total timeout period is from a command is sent to the answer must be completely received. The total timeout period is 200 ms for the CMD_SIGN_ON command, 5 seconds for the CMD_READ/PROGRAM_FLASH/EEPROM commands, and 1 second for all other commands.

Meaning our bootloader does not comply with the spec.

So, I'm not really sure what is the root cause of our problem, but a flipped bit or slightly off bitrate (I'd have to capture at > 1Mhz samplerate to check this) doesn't seem entirely unlikely here. In any case, it seems like mismatches between timeouts (in particular, avrdude doesn't retry fast enough) is the reason it doesn't recover from this. I'll see if I can dive into the bootloader to see what is the underlying cause here.

For HQ, we might be able to fiddle with retry timeouts and fix this there, for avrdude / Arduino IDE this might be more tricky.

matthijskooijman commented 9 years ago

I had a quick look at the chrome app code and whipped up Pinoccio/chrome-app-pinoccio#31. It's rough and untested, but it might just work. If it does, we'll probably need to polish it a bit more before merging, though.

matthijskooijman commented 9 years ago

I did a bit more digging today. It turns out my previous suspicions about missing a byte or failing a checksum weren't correct. The last datapacket (before the timeouts) is correctly received and writting to flash, but then the bootloader locks in boot_spm_busy_wait. I think it never finishes that loop, but I want to doublecheck that once more. I tried expanding that macro into the loop it contains and add a pin toggle inside, which seemed to decrease the chance of the timeout occurring. Also, the pin toggling shows that the loop actually runs, for just as long as when the timeout doesn't occur, which suggests that the problem might actually occur after the loop (still have to doublecheck that).

I'm not entirely sure how it gets from this wait loop to actually starting the sketch. I'm sure it doesn't just continue with the code as you'd expect. Sometimes it seems it passes the start of the main() function again (based on some pin toggles I put there), but sometimes it doesn't. Based on the MCUSR register, it is not regular reset (none of the bits are set), which probably means it's a stray jump to somewhere in the latter part of the flash (where it skips 0xff's until it comes to the start of the bootloader again).

Looking at the disassembly doesn't show anything weird though. I can't really explain what happens there though - at worst it could infinitely loop, but the pin toggle I added suggests that's not what happens (and the "reset" also disproves that).

Out of time for today. Summarizing, it seems something weird is going on. I can't really explain what I'm seeing based on the code. So either I'm missing some strange corner case, or there is some hardware problem (e.g. causing the CPU to behave out of spec)...

matthijskooijman commented 9 years ago

From @gvdw on December 9, 2014 11:18

If it's blocking in boot_spm_busy_wait (a pretty low level and simple avr-libc function) then something is really wrong!! Matthijs may be onto something with the timeouts. Perhaps what's happening is that a block is flashed, the AVR attempts to reply but for whatever reason the reply gets lost (USB glitch or 16U2?) and then avrdude doesn't retry again fast enough (we may not be waiting long enough) and the bootloader times out and boots the sketch. Otherwise, maybe Chrome is taking too long to send the next packet and the bootloader is timing out and trying to boot the sketch?

Could be useful to use a serial port monitor that timestamps the serial traffic so we can try and see where it goes pear shaped.

Also, making a sketch that manually sets an obvious RGB torch colour and while(1) as early in the sketch as possible (ie, in the setup function before Scout.Setup()) would help indicate if the bootloader has been left. Scout.setup() and Scout.loop() can be left in to chew up flash space, this sounds like a timing issue or lost message that is becoming more apparent/likely to occur with bigger sketch sizes.

matthijskooijman commented 9 years ago

The AVR attempts to reply but for whatever reason the reply gets lost (USB glitch or 16U2?) and then avrdude doesn't retry again fast enough

That's what I initially though, but it actually seems that the bootloader (running in the 256rfr2) stalls or locks up or otherwise breaks with the expected program flow, so it doesn't actually send any reply at all.

Also, making a sketch that manually sets an obvious RGB torch colour and while(1) as early in the sketch as possible (ie, in the setup function before Scout.Setup()) would help indicate if the bootloader has been left.

I'm debugging with a logic analyzer, so I can see the 'Hello from pinoccio' prompt show up on the serial lines when the sketch was booted, so that's already covered.

matthijskooijman commented 9 years ago

One more test run I did today shows that at some point during the flashing, it runs the boot_spm_busy_wait loop, but then never executes the pin toggle directly after the loop. Then it actually sends a STATUS_CMD_FAILED response, which shouldn't really be possible with the current code (it can only send that reply when the parameters are wrong, but then it wouldn't end up in that wait loop at all). It really just seems that the CPU just randomly jumps around after flashing finishes...

matthijskooijman commented 9 years ago

When I define a timer interrupt that triggers after every instruction and writes the current program counter to the second serial port, the problem stops occuring. When I set the timer to let the program to run for 16 us between interrupts (which themselves also take around 16us) the timeout still happens. The program counters that are output show that it is happily running around in a loop, waiting for flashing to finish and then the next interrupt the counter is 0xb7 (word address, 0x16e in bytes). This suggests that the program counter got reset to 0x0 at some point.

I thought that the problem might be that an interrupt got accidentally enabled and triggered, but the fact that my timer is still triggered disproves this (since interrupts are disabled when an interrupt triggers). To doublecheck, I globally disabled interrupts and the timeout still happens.

matthijskooijman commented 9 years ago

This suggests that the program counter got reset to 0x0 at some point.

Looking at the contents of the flash afterwards, it seems 0x0 contains a jmp 0xfaf2 instruction, so jumping to 0x0 does not explain this. In fact, the entire interrupt vector contains distant jumps, so you can only and up at 0x16e if you jump to somewhere > 0x130.

matthijskooijman commented 9 years ago

One more datapoint: Bypassing the 16u2 using a sparkfun FTDI breakout didn't help either, I got the timeout once and the command failed once. It's a bit tricky to test, because (presumably because of the extra resistor, and/or the capacitor already on the reset pin, I couldn't get autoreset to work.

matthijskooijman commented 9 years ago

From @gvdw on December 11, 2014 12:18

it only happens in the second half of the flash right? could it be some sort of near vs far issue?

when I was testing the bootloader originally I tested with sketches right up to APP_END, and I don't remember having any issues. so I wonder what's different about the current situation.

are you testing via chrome still or via direct serial/avrdude/arduino IDE?

matthijskooijman commented 9 years ago

it only happens in the second half of the flash right? could it be some sort of near vs far issue?

You'd think so, but I can't really tell what or how. Also, I'm not sure if the cutoff point is really 128k, I just have never seen it happen "soon" in the upload process, only near the end.

when I was testing the bootloader originally I tested with sketches right up to APP_END, and I don't remember having any issues. so I wonder what's different about the current situation.

There seems to be a great deal of randomness involved. The timeout doesn't always happen, never happens at the same point in the upload (but always later rather than sooner). I've had runs where the upload works a dozen times or more in a row, and a bit later it breaks every other attempt. I'm wondering if there might be a temperature-related factor in there (this morning when I started, it wouldn't break, making me think I changed something somehow. After a bit of fiddling, it timed out again).

are you testing via chrome still or via direct serial/avrdude/arduino IDE?

Using avrdude directly (copied the commandline from the IDE).

I did some measuring of VCC as well. I modified the bootloader to drive the BKPKVCC-EN pin high, which should connect the 3V3 pin on the header to VCC. This shows a bit of ripple on VCC seemingly related to the serial activity or the flashing. The ripple ranges from 3.248V - 3.402V (idle voltage is 3.310V - 3.330V). I also tried measuring the EVDD and DEVDD pins on the rfr2 directly, which shows the same ripple (give or take a few mv).

When I provide extra load on the 3V3 pin (by adding 3x100ohm resistance for about 100mA load), the peaks in the ripple are reduced a bit, it is now 3.248V - 3.303V. J3.248V - 3.402V on 3V3 header, EVDD and DEVDD pin

I also measured the AVDD and DVDD pins, which connect to internally regulated voltages on the rfr2. AVDD was 0V, presumably because the RF hardware is turned off. DVDD showed a bit of rio pple (1.767V - 1.781V) but that seemed like random noise, without any repetitive pattern.

Here's an annoted capture of the serial pins, two debugging pins and the 3V3 pin.

upload timeout

As you can see the RX pin shows an address command, followed by data, both of which are acked by the bootloader. While it is processing a data packet, it drives PE2 high, and while it is waiting for the SPMEN bit to clear it toggles PE3.

At the end of the trace, you can see it receives a data packet, enters the busywait loop, but then PE2 is never cleared and no reply is sent.

In this trace it seems like the ripple is more caused by the start/end of serial transfer than the actual flashing process, so I made another dump of the initial handshake (which involves serial, but not flashing). As you can see below, it seems that just serial also creates similar ripples (the peak is 15mV lower, though). untitled

matthijskooijman commented 9 years ago

And this morning I tried to reproduce the problem on a Arduino Mega 2560 (running our bootloader minus RF stuff), but it ran for dozens of uploads without any timeouts.

matthijskooijman commented 9 years ago

And one more test: Disabling the RF stuff (wibo and uracoli) and the LED code (as needed on the mega2560) doesn't fix the problem, I can still trigger the timeout then.

amcjen commented 9 years ago

@matthijskooijman Do you have a frequency counter? If so, what are you seeing on the 16U2 crystal output? Is it a solid 16MHz with a low ppm? (I'll check to see if we have one here at R/GA)

amcjen commented 9 years ago

(BTW, you must be using the new Saleae Logic+Scope product? Amazing product!)

matthijskooijman commented 9 years ago

I can probably measure the frequency with the Logic8 I have. However, I'm not sure what you're looking for - it's the bootloader that is misbehaving, the 16u2 just seems to be an innocent bystander here.

eeintech commented 9 years ago

@matthijskooijman It seems that there are quiet a lot of hickups on VCC when TX/RX traffic starts. Have you tried bypassing the TI 3.3V regulator with a clean and stable 3.3V rail? Another way to fix it might be to add ~1mF cap at the output of the regulator.

matthijskooijman commented 9 years ago

Eric tested on the scout he has with a more stable regulator, that didn't help at all. I think I tried running with a big cap on the 3V3 line, but I'm not sure, so I'll try again. Putting an actual external 3.3V on that line is a bit tricky, because the line isn't available on any header pin (ICSP pads are probably the best access) and I'd need to bypass USB power into the regulator...

matthijskooijman commented 9 years ago

In earlier testing, I found that flashing a hex file with random data would usually (or even always?) work, but then a subsequent upload attempt would fail right away (avrdude never got any reply at all). I suspected this was caused by the bootloader overwriting itself and that the cause for this was the same as for the timeout, but it turns out neither are true - reading back the flash shows the bootloader is untouched. This problem was caused by my simplification of the bootloader, I removed some code that wasn't essential to the timeout (e.g. led handling and unused stk500 commands). It turns out that the bootloader timeout depends on the led handling.

The timeout is 10,000 iterations of a loop that contains a 1us delay, which would result in 10ms of timeout (avrude seems to typically need 200-300ms for its first command). However, the led code contains if ((boot_timer % _BLINK_LOOP_COUNT_) == 0) and since _BLINK_LOOP_COUNT_ is not a power of 2, that division takes a long time (just under 40 us). This results in a timeout of 10,000 * 40us = 400ms, which works. So, when I removed the LED code, I also broke the timeout.

The reason the first upload did work, is that, when the first flash word is 0xffff, the bootloader restarts instead of starting the application, so it kept listening on serial instead of actually timing out.

I had also found that flashing all zeroes would always make avrude recover after the timeout. When the timeout happens because of a random jump into the program, I think the zeroes (together with any 0xff's left from erasing the flash when burning the bootloader) are just nop instructions, or at least instructions that do not cause any looping. This means that after a short while, all of the flash has been executed, and the bootloader starts again. If the timing is right, this allows avrdude to resync and continue where it left off. This doesn't work when writing a real sketch, since that sketch would then start running, and never run all the way up to the bootloader again.

So, neither of these things seem to be actually related to the timeout, unfortunately. They do point out that the timeout handling in the bootloader could do with a bit of a cleanup...

matthijskooijman commented 9 years ago

This suggests that the program counter got reset to 0x0 at some point.

Looking at the contents of the flash afterwards, it seems 0x0 contains a jmp 0xfaf2 instruction, so jumping to 0x0 does not explain this. In fact, the entire interrupt vector contains distant jumps, so you can only and up at 0x16e if you jump to somewhere > 0x130.

I have one more datapoint that contradicts this. When filling the flash with rjmp .-2 instructions (e.g. infinite loops), after the timeout happens I can see the PC is (and stays) 0x0. I've observed this using my JTAGICE3 debugger, I haven't tried wiring up the timer interrupt again to compare.

For reference, this is what I used to generate the rjmp instructions:

bytes=$((248 * 1024)); for ((i=0; i<$((bytes/2)); i++)); do echo -ne '\xff\xcf'; done > jmp.bin                                                                 
matthijskooijman commented 9 years ago

Very useful new info: If I modify the bootloader to map writes to the first half of the flash into the second half, the timeout happens a lot sooner (saw 4%, 6% and 7% just now). This rules out the "timeout happens after some time due to charge buildup / drift / whatever" and confirms that the problem is really related to the flash location (probably to the second half of the flash, but that's not entirely confirmed here).

matthijskooijman commented 9 years ago

More importantly, it reduces the time of each test run by 30 seconds or so :-p

matthijskooijman commented 9 years ago

And one more interesting observation: If I modify a register between case CMD_PROGRAM_FLASH_ISP: and boot_page_erase(address);, the problem stops occurring altogether. It's erratic though - If I put it just before the erase, but comment out the santity check before it, the timeout happens again (but a bit later, it seems).

It doesn't seem the register itself matters - I saw this with the PORTB, PORTE and GPIOR0 register. The latter suggests that it's not any (power) effects of modifying the register, it seems to be just the act of accessing the register file. The actual operation doesn't seem to matter, =, |= and ^= have all worked (generating varying instruction sequences, ranging from a single sbi to a in, ldi, eor, out sequence).

It doesn't seem to be timing related - inserting nops equivalent to the register-modifying instruction in the same place does not seem to work.

Removing the else branch of the address sanity check if seems to also prevent the problem, presumably because the structure of the code changes. Looking at the dissassembly indeed shows significant changes. Surprisingly, there is also an unexpected amount of change from just inserting a PORTE |= 1; line, which suggests that it's not the register access itself that influences the behaviour, but the code layout changes resulting from it (perhaps resulting from a volatile variable access?).

For reference, here is the huge assembly diff from adding that one line: https://gist.github.com/matthijskooijman/887c7a328f6cb3b7a225

matthijskooijman commented 9 years ago

A bit of closer inspection shows that, because I removed the EEPROM command from the switch, the if ( msgBuffer[0] == CMD_PROGRAM_FLASH_ISP ) is no longer needed and optimized away. However, when the (volatile) register write is present, this check is no longer optimized away, resulting in massive changes. It's mostly changes in register allocation (and some related instructions, e.g. sbi + sbc instead of sbiw since sbiw doesn't work on all registers) and some code layout changes. None of the changes seem to suggest any problem, though...

matthijskooijman commented 9 years ago

Sidestepping from the previous "it stops breaking when I press here" observation, I further reduced the bootloader code to a minimal program that just flashes zeroes into the upper half of the application over and over again. No serial is involved, so it's not really a bootloader anymore, but it's enough to reproduce the problem. Only erasing didn't show the problem, so it seems that both erasing and programming is needed.

I've added some debug output to port B (active low leds, blue on PB4, red on PB5, green on PB6) and PORTE (for my logic analyzer). When you run it, it blinks the red and blue led in turn. This should continue indefinitely, but after a while (usually within 10-15 seconds), it stops blinking and the green led lights up (usually together with the red led, though I think I've also seen it with the blue led lit). This indicates the bootloader has restarted, without any reset reason listed in MCUSR.

The code is here, in the bootloader-debug branch: https://github.com/Pinoccio/hardware-pinoccio/blob/e201a7af3f16578c76e23e28ac86e361229c986d/firmware/bootloader/src/main.c

There is a Makefile in the parent directory, but for reference, these are the compiler options used:

avr-gcc -DF_CPU=16000000UL -D_PINOCCIO_256RFR2_=1 -Dpinoccio -D_SW_VERSION_=5 -DWIBO_FLAVOUR_BOOTLUP=1  -Os -fdata-sections -fpack-struct -fshort-enums -g3 -Wall -pedantic -mmcu=atmega256rfr2 -std=gnu99 -fno-strict-aliasing -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -mrelax -fno-jump-tables -MMD -MP -c -o src/main.o src/main.c
avr-gcc -DF_CPU=16000000UL -D_PINOCCIO_256RFR2_=1 -Dpinoccio -D_SW_VERSION_=5 -DWIBO_FLAVOUR_BOOTLUP=1  -Os -fdata-sections -fpack-struct -fshort-enums -g3 -Wall -pedantic -mmcu=atmega256rfr2 -std=gnu99 -fno-strict-aliasing -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -mrelax -fno-jump-tables -Wl,-Map,bootloader.map -Wl,--gc-sections -Wl,--section-start=.text=0x3E000 -Wl,--section-start=.bootlup=0x3FD00 -mmcu=atmega256rfr2 -Wl,--relax -o bootloader.elf src/main.o 

Using this code, I made a logic analyzer trace.

This first image shows the start of the trace, showing the PE2 assertion at startup and then alternating between pulling PE4 low during page erase and pulling PE5 low during page write. In between (visible when you zoom in closer), both pins are high while filling up the page buffer. This continues exactly like this, until the next trace.

screenshot from 2014-12-31 17 34 03

This shows the end of the trace - during the page write (PE5 is low) execution somehow never passes the point where it sets PE5 high again and instead ends up at the start of the bootloader (presumably it ends up at 0x0 or somewhere else in the application section and then runs up to the bootloader again, but I did not verify this for this particular test). As expected from the code, the trace doesn't change anymore after this.

screenshot from 2014-12-31 17 33 52

Still suspecting the powersupply, I tried connecting a 100uF cap to VCC, which didn't help. It did seem to soften the voltage changes a bit - A bit more curvy instead of spiky (changes are not as pronounced as the traces I posted previously, min/max values measured were in the 3.15 - 3.32V range). However, I manually pushed the cap's pin against the ICSP VCC pad, so perhaps actually soldering something to the pad, or even directly to the AVR VCC pins might yield better results...

matthijskooijman commented 9 years ago

For reference, here's the assembly generated for the code: https://gist.github.com/matthijskooijman/e0268c9d222aa72001f0

matthijskooijman commented 9 years ago

I finally managed to try this on a Zigbit module, and the problem also occurs there. This should rule out any hardware problems with the Pinoccio board and indicate that it's really a problem in the 256rfr2 chip. I used the "ATmega256RFR2 ZigBit Xplained Pro Extension" board (ATZB-256RFR2-XPRO-ND).

The setup is simple: I connected power to J4 and my logic analyzer to pins 1-4 of J3 (which map to PF0 - PF3). I also connected the logic analyzer ground to a spare ground pin:

sam_4214 sam_4215

For power, I tried both the 3.3V pin of a Pinoccio board, as well as an Arduino Mega (to rule out problems with the Pinoccio's power supply).

Here's trace of the output, just like above:

screenshot-resized

This trace was generated using this code: https://raw.githubusercontent.com/Pinoccio/hardware-pinoccio/20f636963d38e7681adf2ec3dc081486412139e4/firmware/bootloader/src/main.c

To get the code onto the board, here's the steps to take:

wget https://raw.githubusercontent.com/Pinoccio/hardware-pinoccio/20f636963d38e7681adf2ec3dc081486412139e4/firmware/bootloader/src/main.c
avr-gcc -Wl,--section-start=.text=0x3E000 -Os -mmcu=atmega256rfr2 main.c -o test.elf
avr-objcopy test.elf -O ihex test.hex
avrdude -p m256rfr2 -c jtag3 -U hfuse:w:0x10:m -U lfuse:w:0xDE:m -U efuse:w:0xFE:m
avrdude -p m256rfr2 -c jtag3 -U flash:w:test.hex

Note that -Os is required, without it the problem does not seem to trigger (presumably because timing is essential).

When running this code, the yellow and green leds should blink alternatingly (to fast to notice). If it breaks (that is, main is re-executed a second time, without MCUSR being set), the led led will turn on and typically the yellow led remains on, indicating that it went rampant during the writing of the page.

For completeness, here's the dissassembly of the code, which confirms that there is no way that execution should be able to restart:

Disassembly of section .text:

0003e000 <__vectors>:
   3e000:   0d 94 9a f0     jmp 0x3e134 ; 0x3e134 <__ctors_end>
   3e004:   0d 94 a6 f0     jmp 0x3e14c ; 0x3e14c <__bad_interrupt>
   [snip more __bad_interrupt jumps]
   3e130:   0d 94 a6 f0     jmp 0x3e14c ; 0x3e14c <__bad_interrupt>

0003e134 <__ctors_end>:
   3e134:   11 24           eor r1, r1
   3e136:   1f be           out 0x3f, r1    ; 63
   3e138:   cf ef           ldi r28, 0xFF   ; 255
   3e13a:   d1 e8           ldi r29, 0x81   ; 129
   3e13c:   de bf           out 0x3e, r29   ; 62
   3e13e:   cd bf           out 0x3d, r28   ; 61
   3e140:   01 e0           ldi r16, 0x01   ; 1
   3e142:   0c bf           out 0x3c, r16   ; 60
   3e144:   0f 94 a8 f0     call    0x3e150 ; 0x3e150 <main>
   3e148:   0d 94 03 f1     jmp 0x3e206 ; 0x3e206 <_exit>

0003e14c <__bad_interrupt>:
   3e14c:   0d 94 00 f0     jmp 0x3e000 ; 0x3e000 <__vectors>

0003e150 <main>:
   3e150:   80 b3           in  r24, 0x10   ; 16
   3e152:   8f 60           ori r24, 0x0F   ; 15
   3e154:   80 bb           out 0x10, r24   ; 16
   3e156:   56 9a           sbi 0x0a, 6 ; 10
   3e158:   9a 9a           sbi 0x13, 2 ; 19
   3e15a:   6a 9a           sbi 0x0d, 2 ; 13
   3e15c:   84 b7           in  r24, 0x34   ; 52
   3e15e:   81 11           cpse    r24, r1
   3e160:   03 c0           rjmp    .+6         ; 0x3e168 <main+0x18>
   3e162:   5e 98           cbi 0x0b, 6 ; 11
   3e164:   89 9a           sbi 0x11, 1 ; 17
   3e166:   ff cf           rjmp    .-2         ; 0x3e166 <main+0x16>
   3e168:   5e 9a           sbi 0x0b, 6 ; 11
   3e16a:   a2 9a           sbi 0x14, 2 ; 20
   3e16c:   72 9a           sbi 0x0e, 2 ; 14
   3e16e:   88 9a           sbi 0x11, 0 ; 17
   3e170:   14 be           out 0x34, r1    ; 52
   3e172:   80 e0           ldi r24, 0x00   ; 0
   3e174:   90 e0           ldi r25, 0x00   ; 0
   3e176:   a2 e0           ldi r26, 0x02   ; 2
   3e178:   b0 e0           ldi r27, 0x00   ; 0
   3e17a:   13 e0           ldi r17, 0x03   ; 3
   3e17c:   01 e0           ldi r16, 0x01   ; 1
   3e17e:   20 e0           ldi r18, 0x00   ; 0
   3e180:   30 e0           ldi r19, 0x00   ; 0
   3e182:   45 e0           ldi r20, 0x05   ; 5
   3e184:   f4 2e           mov r15, r20
   3e186:   a2 98           cbi 0x14, 2 ; 20
   3e188:   8a 98           cbi 0x11, 2 ; 17
   3e18a:   fc 01           movw    r30, r24
   3e18c:   a0 93 5b 00     sts 0x005B, r26
   3e190:   10 93 57 00     sts 0x0057, r17
   3e194:   e8 95           spm
   3e196:   07 b6           in  r0, 0x37    ; 55
   3e198:   00 fc           sbrc    r0, 0
   3e19a:   fd cf           rjmp    .-6         ; 0x3e196 <main+0x46>
   3e19c:   8a 9a           sbi 0x11, 2 ; 17
   3e19e:   a2 9a           sbi 0x14, 2 ; 20
   3e1a0:   ac 01           movw    r20, r24
   3e1a2:   bd 01           movw    r22, r26
   3e1a4:   c0 e0           ldi r28, 0x00   ; 0
   3e1a6:   d1 e0           ldi r29, 0x01   ; 1
   3e1a8:   09 01           movw    r0, r18
   3e1aa:   fa 01           movw    r30, r20
   3e1ac:   60 93 5b 00     sts 0x005B, r22
   3e1b0:   00 93 57 00     sts 0x0057, r16
   3e1b4:   e8 95           spm
   3e1b6:   11 24           eor r1, r1
   3e1b8:   4e 5f           subi    r20, 0xFE   ; 254
   3e1ba:   5f 4f           sbci    r21, 0xFF   ; 255
   3e1bc:   6f 4f           sbci    r22, 0xFF   ; 255
   3e1be:   7f 4f           sbci    r23, 0xFF   ; 255
   3e1c0:   22 97           sbiw    r28, 0x02   ; 2
   3e1c2:   20 97           sbiw    r28, 0x00   ; 0
   3e1c4:   89 f7           brne    .-30        ; 0x3e1a8 <main+0x58>
   3e1c6:   8b 98           cbi 0x11, 3 ; 17
   3e1c8:   72 98           cbi 0x0e, 2 ; 14
   3e1ca:   fc 01           movw    r30, r24
   3e1cc:   a0 93 5b 00     sts 0x005B, r26
   3e1d0:   f0 92 57 00     sts 0x0057, r15
   3e1d4:   e8 95           spm
   3e1d6:   07 b6           in  r0, 0x37    ; 55
   3e1d8:   00 fc           sbrc    r0, 0
   3e1da:   fd cf           rjmp    .-6         ; 0x3e1d6 <main+0x86>
   3e1dc:   8b 9a           sbi 0x11, 3 ; 17
   3e1de:   72 9a           sbi 0x0e, 2 ; 14
   3e1e0:   ac 01           movw    r20, r24
   3e1e2:   bd 01           movw    r22, r26
   3e1e4:   5e 5f           subi    r21, 0xFE   ; 254
   3e1e6:   6f 4f           sbci    r22, 0xFF   ; 255
   3e1e8:   7f 4f           sbci    r23, 0xFF   ; 255
   3e1ea:   42 30           cpi r20, 0x02   ; 2
   3e1ec:   50 4e           sbci    r21, 0xE0   ; 224
   3e1ee:   63 40           sbci    r22, 0x03   ; 3
   3e1f0:   71 05           cpc r23, r1
   3e1f2:   20 f4           brcc    .+8         ; 0x3e1fc <main+0xac>
   3e1f4:   9f 5f           subi    r25, 0xFF   ; 255
   3e1f6:   af 4f           sbci    r26, 0xFF   ; 255
   3e1f8:   bf 4f           sbci    r27, 0xFF   ; 255
   3e1fa:   c5 cf           rjmp    .-118       ; 0x3e186 <main+0x36>
   3e1fc:   80 e0           ldi r24, 0x00   ; 0
   3e1fe:   90 e0           ldi r25, 0x00   ; 0
   3e200:   a2 e0           ldi r26, 0x02   ; 2
   3e202:   b0 e0           ldi r27, 0x00   ; 0
   3e204:   c0 cf           rjmp    .-128       ; 0x3e186 <main+0x36>

0003e206 <_exit>:
   3e206:   f8 94           cli

0003e208 <__stop_program>:
   3e208:   ff cf           rjmp    .-2         ; 0x3e208 <__stop_program>
matthijskooijman commented 9 years ago

Daniel Wax from Atmel has confirmed that this is actually a bug in the hardware. He writes:

This is actually a recently discovered issue that has not yet been documented.

There is a possibility of problems when reading from bootloader section while writing to flash.

So from a high level the workaround is to force the mcu to not read while programming the flash. After issuing the write, put MCU to sleep with only wake up source as SPM_READY

This only effects ATMEGA64/128/256 RFR2 and 4RFR2 version. Just this one family

Unfortunately, there is nothing that can be done from outside to help this. Essentially the entire MCU flash (including the boot section) should be treated as inaccessible after the write operation is initiated. This does not depend on any external timings.

gvdw commented 9 years ago

Great work Matthijs, this has been a real weird issue but it's good to get confirmation of where the problem lies and a way to mitigate. Unfortunately, it means existing bootloaders are susceptible and will need to be updated if people experience issues.

I've been testing Daniel's fix and so far have flashed a full sketch over 100 times without any issues. We have seen variation in the past where the chip will work OK for 10+ flashes but hitting the hundred mark makes me quite optimistic.

Have made a new branch for testing changes, really nothing to it at this stage but helps keep track: https://github.com/Pinoccio/core-pinoccio/tree/spm-bugfix

matthijskooijman commented 9 years ago

Nice, thanks for confirming! As for committing the fix - I was planning to move the bootloader into this repo instead of core-pinoccio. I still have to commit my work to master properly (this issue has sidetracked me), but the code is already in the debug branch. Could you commit your fix on top of https://github.com/Pinoccio/hardware-pinoccio/commit/85ecda8d14969d6ea84ced622676f2a2cc9f57b7 in this repo and push that to a branch? That commit includes a cleaned up file layout, with a single Makefile for compilation.

matthijskooijman commented 9 years ago

Quickly looking at your code, it looks good. One remark though - I think the sei() should be exactly before the sleep instruction, to prevent a race condition (given that we can be certain that the spm instruction will take a bit of time, I don't think it's really going to happen here, but in general sei just before sleep is the recommended practice. This works becuase sei() guarantees that the next instruction will be executed before any interrupts fire, guaranteeing the SPM interrupt will wake us up, even if it is already pending when you go to sleep).

gvdw commented 9 years ago

your 85ecda8 commit has a file "datasheets\RF\2450FB15L0001 Balun:Filter.pdf" - the colon is an illegal character under windows so it is forcing me to delete the file if I commit. And for some reason I am getting 403 denied trying to push it under Debian. must be a sign I should go to bed, I'll try again tomorrow (but I'm happy for you to put the changes in the right place if you're keen). Good call on sei() too, let's put it just before the sleep.

matthijskooijman commented 9 years ago

@gvdw, I added your fix to https://github.com/Pinoccio/hardware-pinoccio/commits/spm-fix. I moved the sei() and also shuffled some other code around in a second commit, does that look ok to you? I haven't actually tested this code yet, though, I'll do that ASAP.

I've also removed the colon from that filename, hope that works better for you :-)

gvdw commented 9 years ago

thanks Matthijs :) your changes look good at a quick glance, I won't get a chance to test it for a couple of weeks but will revisit then.