Moddable-OpenSource / moddable

Tools for developers to create truly open IoT products using standard JavaScript on low cost microcontrollers.
http://www.moddable.com
1.32k stars 236 forks source link

Socket.write throws an uncatchable exception on ESP32 (maybe a bug, maybe not...) #1410

Open cmidgley opened 6 days ago

cmidgley commented 6 days ago

Build environment: Windows Moddable SDK version: 4.9.5 Target device: ESP32

Description Socket (original, not io) triggers an uncatchable exception and reboots, during a socket.write call (even if enclosed in try/promise.catch).

I'm unsure this is a real bug. The use case works on Win simulator, butcrashes on ESP32. The condition is where there is a listener and a client connected, the server sends a single message, which the client reads and then force closes the server-side client socket handle (this is the questionable part!). The following attempt to write to that socket throws an exception that is not catchable and it reboots.

This is not a condition that a real network would experience (as a client would never be able to close a server-side client socket). I've encountered it only in my automated test suite which has a wide range of edge case tests to look for appropriate handling of error conditions. I close the server-side client socket to emulate a server connection dropping to ensure the client side would behave correctly - sketchy, for sure.

My concern is that if this, or any other network condition, occurs inside the code (which is inside some promise and not catchable) that might be able to happen with real network usage.

The uncatchable exception occurs in lwip/modSocket.c (line 615 in 4.9.5) on a xsUnknownError("suspended"); call when I call socket.write on the now-closed socket.

My ask is if somebody might kindly check out my small repo that immediately reproduces this failure and do a quick review of the Socket implementation to determine if this condition (or any other when inside an uncontrolled promise) could ever occur with a real network. If not, great - I'll just skip this automated test!

p.s. There may be another bug I found, again not sure ... it's commented in the sample repo. When you get a Socket.readable event but do not read the data, and then the server-side socket is closed, you never receive the Socket.disconnected event. This isn't an issue for me, and likely not for others, as you would normally read whatever is pending on the socket. Just thought I'd mention it in case it raises any concerns on your side. This behavior is the same on Win and ESP32.

Steps to Reproduce

  1. Clone this repo
  2. Run mcconfig -m -i -p esp32 ssid=xxx password=xxx
  3. See uncaught exception error

Running this with mcconfig -m -d (win simulator) works fine.

phoddie commented 6 days ago

Line 615 of lwip/modSocket.c is the second of these two:

    if (xss->suspended)
        xsUnknownError("suspended");

But xss->suspended should be false in your test. The only way to set it true is to call suspend(), which your test doesn't do.

xsUnknown() throws a normal JavaScript exception, one that can be caught.

When I run the test, I don't see an exception. I do see the trace at line 622:

xsTrace("write on closed socket\n");

The uncatchable exception occurs in lwip/modSocket.c (line 615 in 4.9.5) on a xsUnknownError("suspended"); call when I call socket.write on the now-closed socket.

I think what may be happening is you the call to write() after close() ends up using the disposed native state. In your case, perhaps, the block is reused and so xss->suspend is true. To test that... at the bottom of xs_socket_close add:

xsmcSetHostData(xsThis, NULL);

And change the top of xs_socket_write:

void xs_socket_write(xsMachine *the)
{
    xsSocket xss = xsmcGetHostData(xsThis);
    if (!xss)
        xsUnknownError("write after close");
    ...

FWIW – the IO Socket doesn't have this problem as it does a brand-check on all native APIs and the brand is invalidated on close.

cmidgley commented 6 days ago

Those changes appear to resolve this issue - thank you!

If I understand them correctly, the failure was due to accessing a disposed object (induced by my unusual use case) and therefore random in behavior. Does this potentially explain why the thrown error would become an uncaught exception rather than hitting my try/catch (as the updated version does)?

phoddie commented 6 days ago

Those changes appear to resolve this issue - thank you!

Good. The right way to solve that would be to use xsmcGetHostDataValidate() to do the brand check, instead of xsmcGetHostData() That should be done across all APIs, not just write. (close is a special case, because it can be called more than once... see io/spi/esp/spi.c for the details).

If I understand them correctly, the failure was due to accessing a disposed object

Correct.

Does this potentially explain why the thrown error would become an uncaught exception rather than hitting my try/catch (as the updated version does)?

Maybe? If some code wrote to memory it no longer owned, some state could be corrupted. There's no such thing as an "uncatchable exception" in XS, so clearly something is in an invalid state.

cmidgley commented 6 days ago

Got it. So where do we go from here? Is this something that might get updated in a future release? With additional guidance, I might be able to make a PR for it, but I'm at all steeped in XS to C (I've managed to implement a few small things, but I remain mostly lost when looking at the code). Case in point - xsmcGetHostDataValidate - I don't (yet) know if that means replacing all xsmcGetHostData calls or if there is some side effect to be aware of, and I don't know which files would need to be updated (especially given the wide range of platform implementations of this code).

phoddie commented 5 days ago

As I rule, I'm trying not to put time into maintaining the original APIs so we can get cut over to 419.

But this is ugly and you have contributed quite a bit, so... I made the changes. That'll be in our next public update. Meanwhile, please do give them a try -- they are here as a gist. There was one unexpectedly tricky detail that I had to fix as part of this, so more testing would be helpful.

I verified that all the network unit tests pass with the changes. There are no unit tests for socket itself though (there are for 419 socket!).

cmidgley commented 5 days ago

Thank you so much. I will give it a try soon.

The only reason I am using Socket is because WebSocket on io does not have a server, so I went with the old version. I've invested a stupid number of hours into the old one - it had tons of issues, but it's nearly fully hardened now (just a few more bugs... hah!). But I now know the WebSocket protocol really well...!

I could migrate this code over to io, but I see little benefit as eventually I want to switch to the implementation on your code base. All my actual use of it is wrapped in another layer that presents a consistent interface across Node and XS, so changing what's underneath doesn't impact my application. Is there by chance a spec (draft?) on a websocket server for 419? If so, maybe I should switch gears now and go implement that (and provide in a PR)?

I remain confused about why the io modules are under examples, like websocket and mqtt (thanks for responding to that discussion). Are they still considered supported / official modules, and perhaps just in example because of the lack of final 419 ratification or something?

What do you recommend?

phoddie commented 5 days ago

The only reason I am using Socket is because WebSocket on io does not have a server, so I went with the old version

Don't tell that to this example of creating a WebSocket server using 419 APIs.

The structure is different -- instead of a standalone WebSocket server, the WebSocket server is a byproduct of the HTTP server architecture. This matches the approach taken in many other architectures. It has some nice effects, like fully supporting paths for WebSocket endpoints.

I remain confused about why the io modules are under examples...

That's where they started. They need to move. Sorting out exactly where is a little less obvious but we'll get there. Suggestions?

... perhaps just in example because of the lack of final 419 ratification or something?

The HTTP and WebSocket stuff is all in 2nd Edition which has been an official Standard for well over a year.

cmidgley commented 5 days ago

Doh! How did I miss that??

I need to do a deep dive into the implementation, but it looks promising. I didn't see any typings for them... or are those part of some 419 repo somewhere?

phoddie commented 5 days ago

Doh! How did I miss that??

🤷‍♀️

I didn't see any typings for them... or are those part of some 419 repo somewhere?

Not yet. There are some contributed typings for 419 networking, including HTTP client. That may be a helpful starting point. Fortunately, the 419 APIs tend to be concise and re-use a lot a patterns.

cmidgley commented 5 days ago

I tried the gist, but it crashes with what looks like memory corruption. Running the exact same test suite, with these two it is working very well:

... at the bottom of xs_socket_close add:

xsmcSetHostData(xsThis, NULL);

And change the top of xs_socket_write:

void xs_socket_write(xsMachine *the)
{
  xsSocket xss = xsmcGetHostData(xsThis);
  if (!xss)
      xsUnknownError("write after close");
  ...

But with the gist version, I get this almost immediately:

Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

Core  1 register dump:
PC      : 0x400ee531  PS      : 0x00060830  A0      : 0x8019a2e0  A1      : 0x3ffc7350  
0x400ee531: fxMarkValue at c:\moddable\xs\sources/xsMemory.c:1289

A2      : 0x3ffc7b48  A3      : 0xbd72486d  A4      : 0x0000014f  A5      : 0x00000000
A6      : 0xffffff80  A7      : 0x00000000  A8      : 0xffffff93  A9      : 0x00000093  
A10     : 0x0000000b  A11     : 0x00000020  A12     : 0x400ee52c  A13     : 0x00000000
0x400ee52c: fxMarkValue at c:\moddable\xs\sources/xsMemory.c:1284

A14     : 0xffffff80  A15     : 0x3f974794  SAR     : 0x00000020  EXCCAUSE: 0x0000001c  
EXCVADDR: 0xbd724871  LBEG    : 0x4000c46c  LEND    : 0x4000c477  LCOUNT  : 0x00000000
0x4000c46c: memset in ROM
0x4000c477: memset in ROM

Backtrace: 0x400ee52e:0x3ffc7350 0x4019a2dd:0x3ffc7370 0x400ee951:0x3ffc7390 0x400dcb56:0x3ffc73c0 0x40112aa5:0x3ffc73e0 0x400fb539:0x3ffc7400 0x400e8e42:0x3ffc7490 0x400e9d37:0x3ffc7530 0x400fbaa5:0x3ffc7550 0x400e8e42:0x3ffc75e0 0x400e9145:0x3ffc7680 0x400fb42b:0x3ffc76a0 0x400f4222:0x3ffc7730 0x400fb42b:0x3ffc77d0 0x400f6465:0x3ffc7860 0x400d9eb1:0x3ffc78f0 0x400da245:0x3ffc7910 0x400d9bde:0x3ffc7940 0x4008a9bd:0x3ffc7960
0x400ee52e: fxMarkValue at c:\moddable\xs\sources/xsMemory.c:1284
0x4019a2dd: fxMark at c:\moddable\xs\sources/xsMemory.c:761
0x400ee951: fxCollect at c:\moddable\xs\sources/xsMemory.c:308
0x400dcb56: fxCollectGarbage at c:\moddable\xs\sources/xsAPI.c:1812
0x40112aa5: xs_gc at c:\moddable\modules\base\debug/modDebug.c:28
0x400fb539: fxRunID at c:\moddable\xs\sources/xsRun.c:926
0x400e8e42: fxStepAsync at c:\moddable\xs\sources/xsFunction.c:667
0x400e9d37: fxRunAsync at c:\moddable\xs\sources/xsFunction.c:649
0x400fbaa5: fxRunID at c:\moddable\xs\sources/xsRun.c:1090
0x400e8e42: fxStepAsync at c:\moddable\xs\sources/xsFunction.c:667
0x400e9145: fxResolveAwait at c:\moddable\xs\sources/xsFunction.c:632
0x400fb42b: fxRunID at c:\moddable\xs\sources/xsRun.c:868
0x400f4222: fxOnResolvedPromise at c:\moddable\xs\sources/xsPromise.c:532
0x400fb42b: fxRunID at c:\moddable\xs\sources/xsRun.c:868
0x400f6465: fxRunPromiseJobs at c:\moddable\xs\sources/xsPromise.c:1241
0x400d9eb1: doRunPromiseJobs at c:\moddable\xs\platforms\esp/xsHost.c:1324
0x400da245: modMessageService at c:\moddable\xs\platforms\esp/xsHost.c:1078
0x400d9bde: loop_task at C:/moddable/build/tmp/esp32/instrument/websocket/xsProj-esp32/main/main.c:155 (discriminator 1)
0x4008a9bd: vPortTaskWrapper at C:/Espressif/frameworks/esp-idf-v5.3.1/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134

If you are doing this change just for me, then perhaps back burner this for a bit. I have the legacy websocket hardened and compliant, I'm going to look at the io version and see how that goes. If successful I'll no longer be using legacy networking.

Do you want a PR for the legacy websocket? I tried to make it fully backward compatible while adding many improvements (including reading from fragmented packets, write masking, close frames with code/reason, improved error handling and callbacks, typings incl. fix to socket typing, etc). It is a substantial overhaul - same structure as before, but perhaps 10% of the code isn't changed. I'm not sure you want it, as it's legacy and it's something else to maintain/own, but I'm glad to provide it.

phoddie commented 4 days ago

I didn't see the crash in the GC because the GC wasn't running in the repo you posted (there wasn't enough going on). I added code to force GC and the crash happens there too:

import Timer from "timer";
import Debug from "debug";

Timer.repeat(() => {
trace("gc\n");
    Debug.gc();
}, 1000)

Basically, the bug is that I forgot how xsForget works. 🙄 I updated the Gist. Maybe you'll have better luck with that?

cmidgley commented 4 days ago

The latest gist is working - it detects writes to closed sockets, and throws catchable errors. Thanks!

phoddie commented 4 days ago

Excellent. Thanks for confirming. It brings a some more stability to users of the original network stack.

cmidgley commented 2 days ago

I've just discovered that with the latest gist, a listener socket doesn't close (resource leak). The prior gist (with the two simple changes) does close the listener socket. Use examples/network/socketlistener and you can see in Instruments that Network Sockets is at remains at 1 after calling this.close().

phoddie commented 2 days ago

I think it may just the changes to how the instrumentation counter is adjusted. If the socket is being leaked, you would see free RAM decreasing. Anyway.. will take a look.

phoddie commented 2 days ago

Yea, just the instrumentation counter. Updated the Gist.

Note: the listener close is always asynchronous. The socket is disposed on the next turn after close() completes. As a result, the counter doesn't get decremented by close, but shortly thereafter.

cmidgley commented 1 day ago

I applied the latest gist and now instruments show listener count being reflected correctly, wonderful!

There are two more edge cases where things still aren't right (likely incorrect instrumentation count on esp32 and lockup due to no callback on win), but I'm not sure this is worth spending more time on. Your gist is great to avoid crashes/reboots on esp32, but instrumentation being off isn't the end of the world (and win is just a simulator). As my backlog clears out I intend to migrate to io where I can see being more picky about these edge cases.


In case you are nterested in learning more about the conditions I've found, here is a brief writeup.

On esp32, the scenario is as follows:

The only additional info I have is demonstrated in this xsbug output. The first this.socket is right before calling close and the second is directly afterwards. I think this likely means it's another instrumentation counter issue?

> this.socket
{ (host): C data, callback: {…} }
c:\moddable\modules\network\websocket\websocket.js (243) # Break: step!
> this.socket
{ (host): NULL, callback: {…} }

And on win simulator: