EEPROM.commit() not working if core1 is in use (sometimes?)

JonnyHaystack commented 1 year ago

I'm having a very similar (perhaps identical) issue to #1561 and #1479. I'm using the EEPROM library exactly as documented and I'm not doing anything weird like manually idling core1 or disabling interrupts. I dug deeper into where it gets stuck and it looks like core1 isn't going into its fifo._irq(), so core0 gets stuck in this loop.

I am not manually calling noInterrupts() or rp2040.idleOtherCore() at any point in my program. I did test manually calling rp2040.idleOtherCore() at another point in the program though, and it didn't hang. Like @minisbett, I was not able to reproduce the issue in a completely minimal program. It seems that possibly the more work I do on core0, the more likely the issue is to occur. This is very confusing, because surely doing more work on core0 should not cause core1 to enter a stalled state, where its IRQ handler doesn't run...? I'm also seeing nanopb inexplicably failing to both encode and decode messages only when core1 is active, despite the fact that there isn't even anything timing-based going on there. There is no concurrent access going on either. My setup1 and loop1 literally look like this:

void setup1() {
    busy_wait_us(1);
}

void loop1() {
    busy_wait_us(1);
}

I've done so much experimentation over more than 12 hours and so far it defies explanation... Only thing I can guess at is maybe too much allocated on the stack or something (with all these whole-EEPROM-sized (4096 bytes) buffers being used), causing everything to fall apart, but I have no idea how to check that. I don't see how it can be any kind of deadlock either, because as I have shown, core1 is not trying to access any resources. It seems to just stall on its own...

At one point I've even tried wrapping a pb_decode() in rp2040.idleOtherCore() and rp2040.resumeOtherCore(), and that stops the decode from failing, but then the rp2040.idleOtherCore() called inside EEPROM.commit() right after that causes it to hang 😢

    rp2040.idleOtherCore();
    Config config = Config_init_zero;
    pb_istream_t istream = pb_istream_from_buffer(&buffer[1], len - 1);

    // Works if core1 is idled or if setup1/loop1 are undefined.
    if (!pb_decode(&istream, &Config_msg, &config)) {
        char errmsg[] = "Failed to decode config";
        WritePacket(CMD_ERROR, (uint8_t *)errmsg, sizeof(errmsg));
        return false;
    }
    rp2040.resumeOtherCore();

    // save_config calls EEPROM.commit, which hangs on rp2040.idleOtherCore(), despite that function not hanging just above
    if (!persistence::save_config(config)) {
        char errmsg[] = "Failed to save config to memory";
        WritePacket(CMD_ERROR, (uint8_t *)errmsg, sizeof(errmsg));
        return false;
    }

    WritePacket(CMD_SUCCESS, nullptr, 0);
    return true;

Even as I'm writing this I tested again and it's changed back so calling rp2040.idleOtherCore() here hangs as well and removing it the pb_decode() works again. Edit: actually it seems to fail the decode the first time it receives the message after boot, but on a subsequent run of this function it decodes correctly, but hangs in the EEPROM bit instead. To be clear though, neither of these problems ever happen if setup1() and loop1() are removed, which should absolutely not have any effect on this.

This bug is highly unpredictable and can seemingly affect just about everything. I don't think this can be attributed to mere user error (would love if that was the case though), and it seems to have been experienced by multiple users now (#1561, #1479) without any proper resolution/explanation being found.

It seems @myklemykle had managed to write a minimal program to reproduce their issue, but they deleted the github repo so I can't see what that was.

earlephilhower commented 1 year ago

Sorry, without an MCVE there's nothing to be done here.

Some points that may help you in your debug:

The 2 cores share 8K of stack with no guards between them
- If Core0 goes > 4K on-stack in your code then Core1 will be corrupted
- If you have some way of causing a fault immediately when Core0 > 4K stack, a PR would be welcome!
EEPROM data is on the heap, not the stack, so that's not material
If you manually call the idleOtherCore/resumeOtherCore you better disable IRQs in the proper order, too, around it.
- Deadlock is easy to see if you hook up a PicoProbe

JonnyHaystack commented 1 year ago

Thanks Earle for the quick reply! 😃

Some points that may help you in your debug:
* The 2 cores share 8K of stack with no guards between them

  * If Core0 goes > 4K on-stack in your code then Core1 will be corrupted

* EEPROM data is on the heap, not the stack, so that's not material
This is very useful info and seems like a likely culprit! By 4096-byte buffers, I meant the buffers that I am using in a few functions to read/write data, which I know will be at most 4096 bytes. Is there any general best practice/rule of thumb you would suggest to avoid this? I was allocating these on the stack because my habit is to limit variables to the scope in which they are needed (in order to save RAM etc), but I hadn't considered that I could overflow the stack so easily. Now that I know that's most likely the problem here it should be easy to fix, but I'd like to make sure I'm following all the best practices and avoid this kind of thing happening in the future.

Also, is this info about the stack documented anywhere? It seems likely to me that it may have been the culprit in #1561 and #1479 as well, and I'd say it's an important thing to be aware of. Maybe a note in here would be good? https://arduino-pico.readthedocs.io/en/latest/multicore.html

  * If you have some way of causing a fault immediately when Core0 > 4K stack, a PR would be welcome!
I certainly don't at this point as I didn't know this was the problem, but that sounds like it would be a great feature to have. I'm guessing if you overrun the entire 8K it already causes a fault?
* If you manually call the `idleOtherCore`/`resumeOtherCore` you better disable IRQs in the proper order, too, around it.
Just curious, if this is always necessary when using these functions, is there any particular reason not to have noInterrupts() and interrupts() called within idleOtherCore() and resumeOtherCore()?
  * Deadlock is easy to see if you hook up a PicoProbe
I do have a Picoprobe setup. How would I check for deadlock specifically? Though that doesn't seem to be my problem anyway.

earlephilhower commented 1 year ago

The stack is the same as the Pico SDK setup, so I guess I assumed it was kind of obvious. My rule of thumb is anything larger than 128 bytes should probably be on the heap. There are certain GCC options you can add to warn when you have stack allocations above your chosen size.

WRT noInterrupts that's really an app setup thing. It's not specifically needed, but if there is a possibility your IRQ could go a long time (and the other core is doing something that can't be frozen for the entirety of it), or could try and get, say, a mutex held by the frozen other core, then you obviously need it.

In general there's very little reason for a multicore app to actually need to freeze the other core (we do it here because to write to flash you need to run both cores out of RAM). Mutexes and semaphores are multicore safe and all that's needed in the vast majority of cases.

For deadlock detection, you can just look at where thread 0 and thread 1 are at in GDB. If they're both stuck in a mutex wait or the idleOtherCore bit it's pretty obvious.

JonnyHaystack commented 1 year ago

Okay that's perfect, thanks again. Moving the large buffers to heap solved all the issues I was having so will close this now.

minisbett commented 1 year ago

Okay that's perfect, thanks again. Moving the large buffers to heap solved all the issues I was having so will close this now.

Thanks a lot for bringing this up again with a bit more experience in this topic! I hope to find success with this in mind as well.

earlephilhower / arduino-pico

EEPROM.commit() not working if core1 is in use (sometimes?) #1745