Badge "stuck" - Githubissues

OrenLederman commented 6 years ago

Hi Michael, one of the badges with the new firmware just got "stuck" after two days, It's the first time I see something like that with the new firmware. I restarted by hub (just to make sure it's not hogging the connection or something), but it doesn't seem like it.

The badge doesn't accept connections, and doesn't show on scans, which is a bit weird. I checked with a multi-meter, and the battery is ok, and the MCU receives power.

Anything I can do besides turning it off and on?

Thoughts on what it could be: A fun memory leak New fun bug in the clock setting function EEPROM/Flash issue What do you think?

OrenLederman commented 6 years ago

Hi Oren,

Ok, that is not good. .

That it doesn’t shows up on scans actually means, that it the chip is completely stuck (not just a while-true loop because BLE-Stack should be independent of this).

So it shouldn’t be an issue while communicating with EEPROM, because this is independent of the BLE-Stack advertising.

I actually don’t think that it has sth to do with the new clock setting function, because such a behavior should only happen if we divide by zero or sth like that (and in the new clock-setting I actually check that this couldn’t happen).

We should also have enough RAM for all functions to be called, so I also don’t think that RAM is a problem.

What I think the reasons could be:

It has sth to do with the too fast transmission of the data (the 5ms timer period). Can you probably check when the badge died? Was it during a connection, or during normal run? (Check by having the logs when the hub connects to the badge, and correlate it with the scan data of the surrounding badges if the died badge appears in it?) Problem when writing to flash. When this is true, and you restart the badge, it should actually try to store again to the same flash-address and would get stuck again.

Currently I don’t really have an idea what to do beside turning it off and on again.

Michael

OrenLederman commented 6 years ago

yup, not good. thus- oh oh :)

Based on the logs, the lat time the hub communicated with it, it connected, pulled data, and successfully disconnected. I'll try to restart and see what happens.

Another though - what do you do with "Exceptions" or uncaught errors? I think that the old firmware restarted itself (I remember having discussions with andrew or angus about it). What does the new firmeware do?

OrenLederman commented 6 years ago

ok, I restarted the wrong badge first, and then our badge :P

But before resetting, I found something interesting - when checking the voltage on the MCU VCC pin, I noticed that every 10 seconds the voltage goes down (2.7v->2.6v) for a second or two, and then goes up. I don't see this behavior it in the other badges. Voltage drop with tehse batteries means high current draw, and the only thing that could possibly pull so much current would be EEPROM/Flash.

So I restarted the badge... and things got even more weird. I can still see this behavior (at first I thought it only happens when I connect to pull data, but that's not it). Every several seconds the voltage drops for a couple of seconds and go back up. You can see the badge in scans, the hub successfully communicates with the badge. But... no data is received (no audio, no proximity). In fact, the hub "times out" while waiting for data, which means that we didn't get a "no data" response (empty header). It doesn't reset by itself though between attempts (I can see that the clock isn't reset).

so... weird. but here are some thoughts: High current draw might indicate an intense scan of the eeprom/flash (I wish I knew how to use the logic analyser and record what's going on there) This means either a bug in the implementation that causes it to scan/erase the entire thing (looking for a start header? losing it's way? erasing a bunch of pages?) Or a hardware failure Thoughts of next steps: Take a look at your code and see if you can identify a loop or conditions that may cause it Load the code with debug on (but won't it overwrite some of the data because of the larger code?) Run the tester code? (but again, will kill the data) Run a specialized code that will dump the flash and eeprom data to a file so you can try and analyse it Mailing you the badge :) What do you think?

OrenLederman commented 6 years ago

OK this is interesting..

I think you should flash with debug, and check what is going on in the DEBUG commands.

Because EEPROM is the first storage and flash the second one, the data for scan and mic shouldn’t be overwritten because of their partition size. (You are using the settings for the partitions that I made, right?)

So we should actually see this behavior again.

OrenLederman commented 6 years ago

But don’t use the tester, because it will erase the header of the partitions, and then we have to wait very long…

OrenLederman commented 6 years ago

Ah! I think I found part of the bug. I used the terminal to connect to the badge, and asked for audio and proximity data from the last couple of minutes - that worked. So I tried to ask for older and older data. So it means that the badge IS recording data (hooray!)

The proximity data was more forgiving, but not perfect - even when I asked for data from 16000 seconds ago it worked, and just sent me data form the last ~4.4hours. But when i asked for data since the last 25000 seconds, it kicked me out. And when I asked for data from 30000 seconds ago it sent me some data and then some malformed data.

The audio data was worse. Asking for data 2500 seconds ago already kicked me out (disconnected).

The big surprise - I get the same behavior from other badges with the new firmware.

So... there might be several bugs hiding here:

I have no idea why the badge got stuck
Asking for data from a long time ago pisses off the new firmware. It might explain so of the long current pulls (not all.. since i've seen this when no one was communicating with the badge). But what's happening here?
- bug when data that is very old (doesn't exist in the badge?) is requested
- maybe the connection timeout is too short and it kicks me out
- perhaps it draws so much current in a short period of time that it resets something/fails the eeprom/flash
I still don't know why I see a drop in voltage every now and then for this badge

OrenLederman commented 6 years ago

As for flushing it with teh debug code - I'm using your code. If you set the params there, that's what I'm using :) But, won't the header get overwritten, and thus all the data? (since we move the first page of the data to be somewhere else)

MichaelHopfengaertner commented 6 years ago

One more question: The “kicking out” happened when you used the terminal.py for communication, right? Because I remember when I used the terminal, the time to type the pull-request into the terminal took some time, and so the badge hadn’t enough time to go back, because currently I reset the timeout only when I send something, not when I receive sth.

OrenLederman commented 6 years ago

Right, in the terminal. As I said, it might be a timeout issue. (the older badges don't seem to hit this timeout when I ask for old data. Just saying ;) )

If it is because of the timeout, it's fine to increase the value

MichaelHopfengaertner commented 6 years ago

I found that this should be a timeout issue (see https://github.com/MichaelHopfengaertner/openbadge/issues/2).

The remaining question is, why the badge got stuck, and why there is this strange current-consumption. My guess is the following: The flash cells are defect, and we can't store to them. The filesystem writes an header for each element, and verifies if that header was written successfully (by reading and comparing it again). When this is not true the filesystem will return an error. The processing-module that is responsible for the storing, retries it by rescheduling the storage process again (currently this is done infenitely, but was on my TODOs to have a counter for retries). I can imagine, but I'm not sure, that this periodical flash erase/writing could also influence the BLE, because the flash-handling as well as the BLE-stuff is done by the SoftDevice, and probably it gets somehow confused when we permanently try to interact with the flash...

To check if this is true, we need to somehow know what is going on in there.. The problem with the debug-software, as you already mentioned, is that it is larger than the no-debug firmware. And probably the "dead" flash cells aren't anymore in the partitions, and then we couldn't observe the problem. So what I'm going to do is the following: I'll implement a version, where the behaviour of the storage is signalised by the LEDs (and not by Serial debug commands) so that the code-size is not increased so much and we hopefully get the "dead" flash cells again.

MichaelHopfengaertner commented 6 years ago

Hi Oren, I created a new branch (https://github.com/MichaelHopfengaertner/openbadge/tree/features/issue1) for this issue. Please use this version with PROTOCOL_VERSION=1 and debug on, to get some more information about the behavior of the badge.

If possible please log the output to a file that you can send me.

Thanks!

MichaelHopfengaertner / openbadge

Badge "stuck" #1