Moddable-OpenSource / moddable

Tools for developers to create truly open IoT products using standard JavaScript on low cost microcontrollers.
http://www.moddable.com

ESP32 Deadlock when posting messages Worker <-> App #1046

Closed beckerzito closed 1 year ago

beckerzito commented 1 year ago

Build environment: macOS, Windows, or Linux
Target device: NodeMCU ESP32

**Description**

Currently, the ESP32 implementation uses a FreeRTOS queue for the message service. When a machine's queue is full, the machine posting the message has its thread suspended until a slot is available to enqueue the message. If the Application fills the Worker's queue and the callback from the Worker fills the Application's queue, the software deadlocks and the application is stuck until a power reset.

**Steps to Reproduce**

  1. Based on the Worker example, implement the following code:

    **main.js**

        import Worker from "worker";

        trace("hello\n");

        function start() {
            // note: allocation and stackCount are very small - most real workers
            // will require a larger allocation and stackCount
            let aWorker = new Worker("simpleworker", {allocation: 6 * 1024, stackCount: 64, slotCount: 32});

            for (let i = 0; i < 11; i++) {
                aWorker.postMessage({index: i});
            }

            aWorker.onmessage = function(message) {
                trace(`Main received msg ${message.index}\n`);
            }
        }

        start();

    **simpleworker.js**

        self.onmessage = function(msg) {
            trace(`Worker received msg ${msg.index}\n`);

            for (let i = 0; i < 11; i++) {
                self.postMessage({index: i});
            }
        }

  2. Compile and run on the ESP32 platform

**Expected behavior**
It is expected that the Application and the Worker print 10 messages each, but instead the debugger and the application get stuck as soon as the first message is sent.

**Images**
The image below shows the creation of the message queue with a hardcoded size of 10 messages.
![image](https://user-images.githubusercontent.com/26304432/221037859-657a1a0a-8d20-40d8-9780-7098c83cdb0a.png)

Below, the `portMAX_DELAY` passed to `xQueueSendToBack` during a post ensures that the task is suspended indefinitely, waiting for space to free up in the queue.
![image](https://user-images.githubusercontent.com/26304432/221038139-5b3d08c9-0a1c-4dee-ad6c-a38715176ae3.png)

**Other information**
A few points to highlight:
1. When the Task Watchdog is enabled, instead of getting stuck the board resets, because neither the App nor the Worker thread feeds the watchdog.
2. I think the Watchdog is only a palliative to avoid getting stuck; the architecture deserves a better way of handling the situation.
3. A better solution would be either to trigger an exception, letting the developer know that there's no space available and that consequences might happen (losing the message, getting reset, getting stuck ...), or to develop an approach where Moddable lets the application decide what to do in this situation (maybe there are use cases where we want to wait for the queue to free up; in other situations we have to deal with it differently).
4. The application should be able to define the queue size, or we could even use a linked list so the queue can grow as needed.
phoddie commented 1 year ago

It sounds like everything is working as expected. The task messaging is based on FreeRTOS message queues which are efficient, well-designed, and portable. It would take a strong motivation to side-step that.

Deadlocks are a real risk in any multi-tasking environment. Either send fewer messages or increase the queue size. We can add a way to configure the queue size from the manifest, if you'd like.

...letting the developer know that there's no space available and that consequences might happen (losing the message...

This kind of non-determinism is a source of very subtle bugs. You would be trading off a hard-failure for something likely far worse.

tve commented 1 year ago

Hmm, I understand where you're coming from, but aren't you being a bit on the harsh side? postMessage is akin to write and ECMA-419 has various ways to check whether one can write or get an exception if "the device cannot accept the data because its buffers are full". So by your argument the definition of write should be revisited? [ducking now... ;-) ]

postMessage is a standard method of Worker, so there's not a ton of wiggle room around how it's defined. Tricky. I've been contemplating running a worker to manage a packet radio. The main task would post packets to TX and the worker would post ACK/NACK as well as incoming messages back. Sounds like this can run into the deadlock. Maybe it's the wrong mechanism for the purpose?

phoddie commented 1 year ago

@tve – This is IPC and IPC is very difficult to get right. It is used by very low-level parts of the SDK, such as native interrupt handlers. There should be a strong motivation for any changes. If that sounds harsh, so be it. ;)

The ECMA-419 analogies, while interesting, aren't relevant as workers aren't ECMA-419, nor are the native Moddable SDK APIs used to implement IPC.

There is a valid argument that the Worker postMessage method should have a safer behavior here, since it would never deadlock on the Web, where memory and processing power are nearly infinite. Addressing that would take a different direction than my understanding of what @beckerzito proposed, however, which focuses on the low-level IPC. As you note, sorting that out while remaining as true as practical to the spec of Workers is likely going to take some careful spec reading. Unlike the Web, resources are limited, so eventually something is going to fail -- out of memory, out of message queue space, etc.

The main task would post packets to TX and the worker would post ACK/NACK as well as incoming messages back. Sounds like this can run into the deadlock. Maybe it's the wrong mechanism for the purpose?

Perhaps. It would be easy enough to avoid by having the worker limit the number of messages it has outstanding posts for. That would also put you in control of deciding what to do if there are more messages than expected, which seems preferable to overwhelming the message queue.
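That approach can be sketched in plain JavaScript (a model only: the `post` callback stands in for `Worker.postMessage`, and `ThrottledSender`, `MAX_OUTSTANDING`, and `acknowledge` are hypothetical names, not Moddable API):

```javascript
// Sketch: the sender caps how many messages it has outstanding, so it can
// never fill the peer's queue and block. A plain callback stands in for the
// real (potentially blocking) post; all names here are illustrative.
const MAX_OUTSTANDING = 2;      // stay well under the queue length (10)

class ThrottledSender {
    constructor(post) {
        this.post = post;       // underlying (potentially blocking) post
        this.outstanding = 0;   // messages sent but not yet acknowledged
        this.pending = [];      // local overflow buffer, under our control
    }
    send(message) {
        if (this.outstanding < MAX_OUTSTANDING) {
            this.outstanding += 1;
            this.post(message);
        }
        else
            this.pending.push(message);  // policy: buffer locally (could also drop)
    }
    acknowledge() {             // call when the peer confirms receipt
        this.outstanding -= 1;
        if (this.pending.length)
            this.send(this.pending.shift());
    }
}

// With a cap of 2, posting 11 messages puts only 2 in flight; the rest
// wait locally and drain one at a time as the peer acknowledges.
const delivered = [];
const sender = new ThrottledSender(msg => delivered.push(msg));
for (let i = 0; i < 11; i++)
    sender.send({index: i});
```

Because the overflow lives in the sender's own buffer rather than the peer's fixed queue, the application keeps control of what happens when there are more messages than expected.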

tve commented 1 year ago

The ECMA-419 analogies, while interesting, aren't relevant as workers aren't ECMA-419, nor are the native Moddable SDK APIs used to implement IPC.

We can agree to disagree :-). Whether I'm writing to a UDP socket or posting to a packet radio worker, it's the same thing in my app WRT your argument about non-determinism. It just happens that the two ways to send packets are implemented differently.

It would be easy enough to avoid by having the worker limit the number of messages it has outstanding posts for.

Yup. From a pragmatic point of view, I would likely code to have at most one or two messages outstanding. Just enough to keep the pipeline busy. So with a buffer of 10 I would be covered. Still, it would be nice to have something like canPost() or a queueLength constant so one doesn't have to hard-code maxQueue=10 in the app. Maybe there's also an opportunity to add a note to the docs?

phoddie commented 1 year ago

We can agree to disagree :-)

Indeed.

Whether I'm writing to a UDP socket or posting to a packet radio worker, it's the same thing in my app...

You can choose to see them as the same thing in your app. But the APIs are defined by two different entities working with two different goals and constraints. They are different for valid reasons.

Still, it would be nice to have something like canPost() or a queueLength constant so one doesn't have to hard-code maxQueue=10 in the app.

Sure. Let's explore that a bit.

canPost() can only give a valid reply at the instant it is called. By the time it returns, the answer might have changed because the receiving task's message queue is not owned exclusively by the sending worker.

A safe semantic is postMessageNonblocking(), which would only post if doing so wouldn't block. That matches the capabilities of xQueueSendToBack, the FreeRTOS call that implements the queue.

Another perspective is that postMessage on the Web is already non-blocking. Taking that route, postMessage could choose to fail if it would block. To signal that, it would need to throw (using the return value is looking for trouble with Web compatibility). But... I don't think it makes sense to automatically opt-in code to that behavior: blocking in postMessage might be unexpected, but throwing because it would block is even more unexpected.
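The two semantics under discussion can be modeled over a plain array standing in for the FreeRTOS queue (a sketch; `BoundedQueue` and its method names are illustrative, not Moddable or FreeRTOS API):

```javascript
// Plain bounded queue standing in for the 10-entry FreeRTOS queue.
const QUEUE_LENGTH = 10;

class BoundedQueue {
    constructor() {
        this.items = [];
    }
    // non-blocking semantic: returns true only if the message was enqueued,
    // mirroring xQueueSendToBack called with a timeout of 0
    postNonblocking(message) {
        if (this.items.length >= QUEUE_LENGTH)
            return false;           // full: the caller decides the policy
        this.items.push(message);
        return true;
    }
    // throwing semantic: keeps the Web postMessage shape (no meaningful
    // return value) by signaling a full queue with an exception instead
    postOrThrow(message) {
        if (!this.postNonblocking(message))
            throw new Error("message queue full");
    }
}

// fill the queue: the 11th post fails instead of blocking the sender
const q = new BoundedQueue();
for (let i = 0; i < 10; i++)
    q.postNonblocking({index: i});
```

Either shape keeps the sender runnable; they differ only in how failure is reported to the caller.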

tve commented 1 year ago

Those options don't look thrilling... Here are 3 from my vantage point:

If it were me, I'd implement the third option...

phoddie commented 1 year ago

Those options don't look thrilling

Why not? Seriously, it really helps if you explain why you do or don't like something. Leaving the rest of us to guess doesn't really move the conversation forward.

postMessageNonblocking() best represents the reality of the underlying host constraint with regard to queue space and blocking. We could expand that to encompass available memory (also a very real concern when posting messages, since some copying is involved) and name it postMessageIf() - which would give the post permission to fail non-fatally for any reason.

Your first two options make the assumption that knowing the queue size is sufficient to always get a safe result. That's not the case because several workers (or other modules, such as a Digital input or Serial) can be posting messages to the same task.

Your third option is easy for me -- just document the constraint. The proposed text is probably usually correct but not strictly always true. I can easily imagine someone (for example, you!) being displeased when running into the reality.

beckerzito commented 1 year ago

Hey guys, great discussion here.

@phoddie I'm not sure I'm aligned with your thoughts (correct me if I'm wrong; maybe I didn't get all the details), and I understand that these interactions are very complex and non-deterministic, thus hard to prevent and hard to handle.

Because of that, I think it would be hard to implement a way of handling the full-queue situation, generic enough for all use cases, to be provided by the SDK. But I think at least two actions would help a lot:

  1. Let the appliance configure the size: as you proposed, the manifest is enough. And note that I'm not talking specifically about the Worker; my concern also applies to the Main machine (so I understand it should be a per-machine configuration).
  2. Provide a mechanism to detect that the queue has filled. Reason: this is a very hard situation to detect and debug (there are no logs today since, as you stated, until now this wasn't an expected situation). Having a critical error (e.g. throwing an exception, or at least a warning) would force the developer to pay attention to this interaction and maybe rethink the design.

I like the idea of postMessageIf() to provide a non-blocking interface. But if I understood correctly, that would just skip posting the message rather than suspending the task when the queue is full, right? Note that in my use cases I need to ensure the message is posted for that specific event (I cannot lose the message). In that case, I would prefer to receive an exception so I can either increase the queue size or redesign the module.

I see that many Moddable modules already use the postMessage interface, mainly the ones related to connectivity. So my understanding is that I don't have full control over the queue (I cannot ensure a safe operation just by knowing the currently available space), because there are other modules posting without checking (and that's fine).

I hope I've been clear. I think you're the best person to suggest a solution here; I just wanted to make my use cases clear and note that the two points highlighted above would be enough from my point of view.

Finally, since you mentioned the FreeRTOS implementation and the resource constraints, I'd like to understand how this situation is managed on other platforms.

Taking a very quick look at the code, I had the feeling it uses a kind of linked list instead of a static allocation. Is that right?

Thanks

tve commented 1 year ago

Why not? Seriously, it really helps if you explain why you do or don't like something.

Good point, my bad. I had interpreted, evidently incorrectly, that you didn't like them either. postMessageIf is obviously non-standard so now you're kind'a moving from a std interface to a non-standard one. Code in the SDK and library modules has to choose whether to use postMessage vs. the non-blocking version. Using the non-blocking version may not really be an option due to the semantics of these library modules, i.e., they may not have anything reasonable to do if they get an error return. If an application needs to use the non-blocking version to avoid a possible deadlock, the author needs to inspect any modules used to understand whether they use the blocking postMessage and whether this poses a deadlock risk. Not insurmountable, and perhaps not really common.

The reason I believe that knowing the queue size can be sufficient is that it allows me to write a wrapper around Worker that implements whatever semantics I want for a postMessage method the wrapper exposes. That doesn't solve the issue with library modules, however (unless I'm using fancy import remapping stuff?).
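Such a wrapper might look like the following sketch, under the assumption that the queue length is known (a plain-JS model: `CountingWorker` and its methods are hypothetical names, the callback stands in for the real worker, and, as noted above, it only accounts for the wrapper's own posts, not other modules posting to the same queue):

```javascript
// Model of a Worker wrapper that uses a known queue length to fail fast
// instead of blocking. All names are illustrative, not Moddable API.
const QUEUE_LENGTH = 10;

class CountingWorker {
    constructor(post) {
        this.post = post;       // underlying (potentially blocking) post
        this.inFlight = 0;      // our own posts not yet consumed by the peer
    }
    postMessage(message) {
        // leave headroom: other modules may post to the same queue, so
        // tracking only our own posts is not a complete guarantee
        if (this.inFlight >= QUEUE_LENGTH - 2)
            throw new Error("would risk overfilling peer queue");
        this.inFlight += 1;
        this.post(message);
    }
    acknowledged() {            // peer signaled it consumed one message
        this.inFlight -= 1;
    }
}
```

The wrapper can implement whatever overflow policy the application wants (throw, buffer, drop) without touching the underlying IPC.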

The proposed text is probably usually correct but not strictly always true. I can easily imagine someone (for example, you!) being displeased when running into the reality.

I assume it is not strictly true because of running out of memory to copy the message? WRT "displeased" the key to me is "not intended to be used as pending-work queue": I may be displeased at the design goal, but I won't have a basis to complain about the implementation.

Speaking of design goal: what is the design goal of postMessage? I do not have any perspective on how Worker is really used in Moddable. Is it intended to be a pending-work queue or is it more for a signal of the form "wake-up and look at your queue which is stored in shared memory", e.g. more the way most DMA engines or devices work where you're managing a queue in memory and set the "run" bit?

Overall I would say that having the queue length silently hard-coded in the implementation, with users running into difficult-to-troubleshoot deadlocks, is something worth fixing. I believe that most/all of what has been proposed here can fix the situation, with varying tradeoffs.

phoddie commented 1 year ago

@beckerzito - thank you for the notes. A few responses:

...I think it would be hard to implement a way of handling the full-queue situation, generic enough for all use cases, to be provided by the SDK

The issue here is not that a queue is full. That can happen and clear itself: the sender briefly blocks until the receiver catches up. A problem occurs only when both the sender and receiver block attempting to write to full queues.

The only reliable mitigation is to not block on send when a queue is full. That keeps the sender unblocked, so there's no deadlock. The question is what to do when the queue is full and there is a message to send. That is a policy question. The Moddable SDK cannot make that decision because the right answer depends on the application. In some applications, the right answer might be to try resending later. In another, the right answer might be to drop the message entirely (late data isn't always useful data). In another application, it might be unexpected that the message cannot be sent, so the right answer is to log a diagnostic and restart.

I like the idea of postMessageIf() to provide a non-blocking interface. But if I understood correctly, that would just skip posting the message rather than suspending the task when the queue is full, right?

The return value would indicate if the message had successfully been posted.

Let the appliance configure the size

I have no objection to making the queue size configurable. A size of 10 has been more than enough for a long time, but there are valid reasons to make it bigger (and perhaps even cases where smaller is fine and would save some memory!). That's easy to do. But, it isn't sufficient for all situations.

Taking a very quick look at the code, I had the feeling it uses a kind of linked list instead of a static allocation. Is that right?

There are two implementations for Espressif hardware. You may have looked at the ESP8266 version, which indeed uses a linked list (because there is no RTOS at all). This is not optimal, but Web Workers are all-but-impractical on ESP8266 so it doesn't matter much. The ESP32 version uses a statically allocated FreeRTOS queue of 10 elements.

Provide a mechanism to detect that the queue has filled. Reason: this is a very hard situation to detect and debug (there are no logs today since, as you stated, until now this wasn't an expected situation). Having a critical error (e.g. throwing an exception, or at least a warning) would force the developer to pay attention to this interaction and maybe rethink the design.

There's no efficient, reliable way to detect the deadlock since it depends on the state of two free running tasks. The best we could probably do is set a long timeout in debug builds (5 or 10 seconds) after which an error is logged and/or thrown. That would tell you that the sender is blocked for an unexpectedly long time, which might be a potential deadlock.

phoddie commented 1 year ago

@tve, thanks for the additional thoughts.

postMessageIf is obviously non-standard so now you're kind'a moving from a std interface to a non-standard one.

It is non-standard. There's no provision in the standard API to address this situation fully, at least as far as I can see. Sticking with a standard API, the only answer is to increase the queue size. That's easy and perhaps sufficient. Embedded development has plenty of levers to configure, so this doesn't fundamentally change the programming model.

I assume it is not strictly true because of running out of memory to copy the message?

That's one. But any test of the queue length has the potential to fail:

  1. Task C: posts a bunch of messages to Task A, filling Task A's queue and getting itself blocked (until Task A dequeues a message)
  2. Task A: Checks queue free space for Task C. Finds there is 1 free entry in its receive queue.
  3. Task A: Decides to call postMessage to Task C
  4. Task B: Before Task A can call postMessage, Task B calls postMessage to Task C.
  5. Task C: the receive queue is full
  6. Task A: Calls postMessage to Task C and deadlocks

The proposed solution here is a single atomic call that posts a message if possible and otherwise fails (e.g. postMessageIf). That eliminates the possibility of Task B sneaking in at Step 4.
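The race can be replayed in a plain JavaScript model (illustrative only: the array stands in for Task C's receive queue, and `canPost`/`tryPost` are hypothetical names, not Moddable API):

```javascript
// Replay of the six steps above with a plain array as Task C's receive queue.
const QUEUE_LENGTH = 10;
const queue = [];
for (let i = 0; i < QUEUE_LENGTH - 1; i++)
    queue.push({from: "C"});          // queue has exactly one free slot

const canPost = () => queue.length < QUEUE_LENGTH;
// atomic check-and-enqueue: posts if possible, otherwise reports failure
const tryPost = message => canPost() ? (queue.push(message), true) : false;

// Step 2: Task A checks and sees one free entry
const checked = canPost();            // true, but only at this instant

// Step 4: Task B posts before Task A acts on its check
tryPost({from: "B"});

// Step 6: Task A's earlier answer is now stale; a blocking post here would
// deadlock, while the atomic tryPost fails cleanly and A picks a policy
const posted = tryPost({from: "A"});  // false
```

The key property is that `tryPost` combines the check and the enqueue in one step, so no other task can invalidate the check before the post happens.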

Speaking of design goal: what is the design goal of postMessage? I do not have any perspective on how Worker is really used in Moddable.

Use it how you like. But, like everything in embedded development, assuming infinite resources has the potential to lead to disappointment.

Using the non-blocking version may not really be an option due to the semantics of these library modules, i.e., they may not have anything reasonable to do if they get an error return.

Agreed: code written for resource-rich environments is unlikely to be well positioned to benefit from tools (APIs) created for resource-constrained environments. There's nothing that can help there in all cases. What can be done is to provide tools to allow code written for resource-constrained environment to make optimal use of available resources.

beckerzito commented 1 year ago

Hey @phoddie, I agree with the approach of the configurable queue size plus postMessageIf (it's clearer to me now). Once it returns false when the message wasn't posted, I can implement the necessary logs and exceptions on the application side based on the use case.

How can we manage to have it implemented? I'm not sure how the repository policies work here. Are you the one who implements it, or should I do that?

Thanks

phoddie commented 1 year ago

@beckerzito – see the contributing doc for info on PRs. You are welcome to give it a go.

I'm backing off on the need for postMessageIf a bit. It is a bad idea to push the queue length to the limit by overfilling it. Adding an API to make that less dangerous only encourages developers to try.

The reality is that postMessage can already throw. The Moddable SDK worker implementation will throw in some (admittedly obscure) circumstances today. I assume that's the case, even on the web -- since there are circumstances, like posting to a terminated worker and posting a message that cannot be serialized, that must fail.

If we assume that it is a bad idea to be pushing the queue length to the limit by overfilling it, then this is mostly a configuration problem.

If the caller doesn't handle the exception, the device resets like any unhandled exception. That's more-or-less like a deadlock getting reset by the watchdog, but easier to debug.

None of this precludes postMessageIf later, if there's a real need.

phoddie commented 1 year ago

Coming back to this topic.... I've implemented what is described above. It works nicely. I want to give it a little time to settle with internal testing before pushing to public, as changes to IPC have potentially far-reaching impact. Here's a summary of what to expect.

When postMessage fails, it throws an exception. That was always possible, but rare. Now it throws when a message send times out because the queue is full for too long. It can also throw if marshalling (serializing) the message runs out of memory. That should have always been the case, but a mistake in the implementation suppressed the exception.

To help detect deadlocks, the default timeout for message sends is set to 1000 milliseconds. If all is working normally, sending should be instantaneous. If the system is really busy, it might take a few milliseconds. If it takes 1000 milliseconds, something is probably wrong. By having this default, an exception is generated to signal a potential problem. This seems preferable to a deadlock. ;) In instrumented and release builds, the default timeout is infinite (same as today).

Both the timeout and message queue may be configured in the project manifest:

"defines": {
    "task": {
        "queueLength": 20,
        "queueWait": 100
    }
}

The worker.md documentation has been updated to describe these changes.

As part of this work, I've changed the handling of debugger IPC messages which previously shared the IPC message queue. To do that, it was necessary to reserve an entry in the queue for the debugger to prevent the debugger messages from ever blocking. This mostly worked, but was never 100% reliable. It also significantly complicates allowing finite message timeouts (queueWait). I've split out the debugger messages into a separate queue, which is only present in debug builds. This was possible using the FreeRTOS queue set feature (which I don't recall being available in the ESP-IDF when the original work on this was done, but perhaps I overlooked it). So a side benefit of this adventure is that JavaScript debugging with workers is more robust.
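With this behavior, application code owns the policy when a post fails. A sketch of the catch-and-decide pattern, using a stub `post` that throws on a full queue to stand in for the new timeout behavior (the stub and `QUEUE_LENGTH` are illustrative, not the actual SDK implementation):

```javascript
// Stub modeling the new behavior: the post throws when the receiver's
// queue stays full past queueWait. Names and sizes are illustrative.
const QUEUE_LENGTH = 10;
const queue = [];

function post(message) {
    if (queue.length >= QUEUE_LENGTH)
        throw new Error("timeout: message queue full");
    queue.push(message);
}

// The application decides what a failed post means: here, drop the
// message and count it; another app might retry later or restart.
let dropped = 0;
for (let i = 0; i < 12; i++) {
    try {
        post({index: i});
    }
    catch (e) {
        dropped += 1;       // policy choice: late data isn't useful data
    }
}
```

An uncaught exception, as noted earlier in the thread, resets the device like any other unhandled exception, which is easier to debug than a silent deadlock.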

phoddie commented 1 year ago

These changes are now live in the repository. If you have a chance to try them out, I'm interested to hear how that goes. Thanks again to everyone for their input on this topic. This is a nice refinement on Worker and IPC support. The majority of developers who benefit from it don't need to be aware of the details, which is always great.