espruino / Espruino

The Espruino JavaScript interpreter - Official Repo
http://www.espruino.com/
Other
2.75k stars 740 forks source link

possible deadlock in NRF.setScan callback (or just bad JS code) #2057

Open fanoush opened 2 years ago

fanoush commented 2 years ago

Hello, probably this is my bad code and already found way to avoid it, but here is something that appears to hang and never end

Remote debugging using :3333
0x000240d8 in jshTransmit (data=32 ' ', device=EV_SERIAL_DEVICE_STATE_START) at src/jsdevices.c:228
228         while (txHeadNext==txTail) {
(gdb) bt
#0  0x000240d8 in jshTransmit (data=32 ' ', device=EV_SERIAL_DEVICE_STATE_START) at src/jsdevices.c:228
#1  jshTransmit (device=EV_SERIAL_DEVICE_STATE_START, data=32 ' ') at src/jsdevices.c:175
#2  0x0002413c in jsiConsolePrintChar (data=<optimized out>) at src/jsinteractive.c:219
#3  0x00024160 in jsiConsolePrintString (str=0x200020a5 <buf+1> "") at src/jsinteractive.c:228
#4  0x00020a9c in vcbprintf (user_callback=0x24145 <jsiConsolePrintString>, user_data=0x0, fmt=0x3db46 "%s", fmt@entry=0x24145 <jsiConsolePrintString> "\265\004", argp=..., argp@entry=...)
    at src/jsutils.c:815
#5  0x00020c46 in cbprintf (user_callback=<optimized out>, user_data=<optimized out>, fmt=0x3db44 "%q%s") at src/jsutils.c:850
#6  0x00035336 in jsfGetJSONForObjectItWithCallback (it=it@entry=0x20003710, flags=flags@entry=(JSON_SOME_NEWLINES | JSON_PRETTY | JSON_SHOW_OBJECT_NAMES),
    whitespace=whitespace@entry=0x3dae2 "  ", nflags=nflags@entry=(JSON_SOME_NEWLINES | JSON_PRETTY | JSON_SHOW_OBJECT_NAMES | JSON_INDENT),
    user_callback=user_callback@entry=0x24145 <jsiConsolePrintString>, user_data=user_data@entry=0x0, first=<optimized out>, first@entry=true) at src/jswrap_json.c:280
#7  0x00034e38 in jsfGetJSONWithCallback (var=0x2000247c <jsVars+948>, varName=0x0, flags=<optimized out>, whitespace=0x3dae2 "  ", user_callback=0x24145 <jsiConsolePrintString>, user_data=0x0)
    at src/jswrap_json.c:450
#8  0x0002a6f2 in jsfPrintJSON (flags=(JSON_SOME_NEWLINES | JSON_PRETTY | JSON_SHOW_OBJECT_NAMES), var=0x2000247c <jsVars+948>) at src/jswrap_json.c:500
#9  jswrap_interface_print (v=<optimized out>) at src/jswrap_interactive.c:269
#10 0x00021b04 in jsnCallFunction (function=function@entry=0x2a68d <jswrap_interface_print>, argumentSpecifier=JSWAT_FINISH, thisParam=thisParam@entry=0x28, paramData=0x200037f8,
    paramData@entry=0x26d1b <jspeUnaryExpression+174>, paramCount=paramCount@entry=1) at src/jsnative.c:223
#11 0x000224d6 in jspeFunctionCall (function=<optimized out>, functionName=<optimized out>, thisArg=<optimized out>, isParsing=<optimized out>, argCount=1, argPtr=<optimized out>)
    at src/jsparse.c:605
#12 0x00026b8a in jspeFactorFunctionCall () at src/jsparse.c:1173
#13 0x00026c5e in jspePostfixExpression () at src/jsparse.c:1772
#14 0x00026d1a in jspeUnaryExpression () at src/jsparse.c:1798
#15 0x00027042 in jspeBinaryExpression () at src/jsparse.c:1941
#16 jspeConditionalExpression () at src/jsparse.c:1977
#17 jspeAssignmentExpression () at src/jsparse.c:2036
#18 0x0002753a in jspeExpression () at src/jsparse.c:2042
#19 0x000228fa in jspeFunctionCall (function=0x1bdf7 <jslSingleChar+14>, functionName=0x20002e24 <lex>, thisArg=<optimized out>, isParsing=<optimized out>, argCount=0, argPtr=0x0)
    at src/jsparse.c:780
#20 0x00026b8a in jspeFactorFunctionCall () at src/jsparse.c:1173
#21 0x00026c5e in jspePostfixExpression () at src/jsparse.c:1772
#22 0x00026d1a in jspeUnaryExpression () at src/jsparse.c:1798
#23 0x00027042 in jspeBinaryExpression () at src/jsparse.c:1941
#24 jspeConditionalExpression () at src/jsparse.c:1977
#25 jspeAssignmentExpression () at src/jsparse.c:2036
#26 0x0002753a in jspeExpression () at src/jsparse.c:2042
#27 0x0002662e in jspeStatement () at src/jsparse.c:2752
#28 0x00026762 in jspeBlockNoBrackets () at src/jsparse.c:2070
#29 0x000267ec in jspeBlock () at src/jsparse.c:2100
#30 0x00027916 in jspeBlockOrStatement () at src/jsparse.c:2107
#31 0x00027ebe in jspeStatementIf () at src/jsparse.c:2178
#32 0x0002666c in jspeStatement () at src/jsparse.c:2767
#33 0x00026762 in jspeBlockNoBrackets () at src/jsparse.c:2070
#34 0x000267ec in jspeBlock () at src/jsparse.c:2100
#35 0x00027916 in jspeBlockOrStatement () at src/jsparse.c:2107
#36 0x00027ebe in jspeStatementIf () at src/jsparse.c:2178
#37 0x0002666c in jspeStatement () at src/jsparse.c:2767
#38 0x00026762 in jspeBlockNoBrackets () at src/jsparse.c:2070
#39 0x00022a70 in jspeFunctionCall (function=0x2000220c <jsVars+324>, functionName=0x20003be4, thisArg=<optimized out>, isParsing=<optimized out>, argCount=1, argPtr=0x20003c68)
    at src/jsparse.c:785
#40 0x000239f8 in jspExecuteFunction (func=<optimized out>, thisArg=<optimized out>, argCount=1, argPtr=0x20003c68) at src/jsparse.c:3036
#41 0x00033024 in jswrap_ble_setScan_cb (callback=0x2, filters=0x200029b0 <jsVars+2280>, adv=<optimized out>) at libs/bluetooth/jswrap_bluetooth.c:1724
#42 0x00021b04 in jsnCallFunction (function=function@entry=0x32f2d <jswrap_ble_setScan_cb>, argumentSpecifier=JSWAT_FINISH, thisParam=thisParam@entry=0x20002320 <jsVars+600>,
--Type <RET> for more, q to quit, c to continue without paging--info
    paramData=0x20003d18, paramData@entry=0x27b4b <jsiExecuteEventCallback+154>, paramCount=paramCount@entry=2) at src/jsnative.c:223
#43 0x000224d6 in jspeFunctionCall (function=<optimized out>, functionName=<optimized out>, thisArg=<optimized out>, isParsing=<optimized out>, argCount=2, argPtr=<optimized out>)
    at src/jsparse.c:605
#44 0x000239f8 in jspExecuteFunction (func=func@entry=0x20002710 <jsVars+1608>, thisArg=thisArg@entry=0x200020c8 <jsVars>, argCount=argCount@entry=1, argPtr=argPtr@entry=0x20003e48)
    at src/jsparse.c:3036
#45 0x00027b4a in jsiExecuteEventCallback (thisVar=thisVar@entry=0x200020c8 <jsVars>, callbackVar=callbackVar@entry=0x20002710 <jsVars+1608>, argCount=argCount@entry=1,
    argPtr=argPtr@entry=0x20003e48) at src/jsinteractive.c:1726
#46 0x00027be2 in jsiExecuteEventCallbackArgsArray (thisVar=0x200020c8 <jsVars>, callbackVar=0x20002710 <jsVars+1608>, argsArray=<optimized out>) at src/jsinteractive.c:1705
#47 0x0002efc6 in jsiExecuteEvents () at src/jsinteractive.c:1681
#48 jsiIdle () at src/jsinteractive.c:2204
#49 jsiLoop () at src/jsinteractive.c:2301
#50 main () at targets/nrf5x/main.c:37
(gdb) infor frame
Undefined command: "infor".  Try "help".
(gdb) info frame
Stack level 0, frame at 0x20003588:
 pc = 0x240d8 in jshTransmit (src/jsdevices.c:228); saved pc = 0x2413c
 inlined into frame 1
 source language c.
 Arglist at unknown address.
 Locals at unknown address, Previous frame's sp in sp
(gdb) quit

it was produced by following code

E.setFlags({pretokenise:1})

function findDevices(cb){
var devs = {};
var nonew=0;
function add(x){
if (!devs) return;
if (x.id in devs) {
  if (++nonew>30) {NRF.setScan(undefined);cb(devs);devs=undefined;}
} else {devs[x.id]=x.name;nonew=0;}
}
NRF.setScan(add,{active:true})
}

findDevices((x)=>print(x))

the issue is that I call print to console directly from NRF.setScan callback. deferring callback cb(devs); via setTimeout(()=>{cb(devs);devs=undefined;},0); does not cause this lockup.

Anyway the stack trace is interesting, here it is also in color, maybe it is more readable image the code is here https://github.com/espruino/Espruino/blob/86eba79a209f32a885e08ad34f688d406b04a8be/src/jsdevices.c#L228 it is basically neverending loop calling only jshBusyIdle() which does not call any output flushing or basically anything to get out of the loop. it is in the middle of formatting/printing JSON so has the buffer any chance to be actually flushed at this point?

So is this just me being stupid and trying to print from such callback or should the jshBusyIdle() actually help somehow with flushing the buffer - triggering some write maybe?

fanoush commented 2 years ago

here is how it looks in IDE image

fanoush commented 2 years ago

there is actually not any warning at https://www.espruino.com/Reference#l_NRF_setScan that the callback should be quick or should not to write to console or anything

fanoush commented 2 years ago

hmm this was nrf51, retried now with nrf52832 and it works just fine, tried many times and the print succeeds even when printing from inside NRF.setScan() so maybe it is some memory or buffer size or timing issue. the IDE with nrf51 stays connected for minutes with output frozen, pressing enter does nothing so it does not crash, just hang for long time while still being connected, but after some long time it disconnects and I cannot reconnect. when attaching gdb after such long wait it still loops in this same while loop. with nrf51 it happens basically everytime I try, cannot trigger this with nrf52.

MaBecker commented 2 years ago

could this be the reason for what happens in #2046 ?

MaBecker commented 2 years ago

@fanoush: I like to dive into debugging to nail down 2046. Can you please recommend a useful tutorial?

gfwilliams commented 2 years ago

could this be the reason for what happens in #2046 ?

I doubt it I'm afraid - your device stays responsive I believe? It's just not able to connect to others.

Is this reproducible just by trying to print a bunch of data, with nothing to do with setScan at all? The setScan callback isn't ever called from an IRQ - in a way it shouldn't be any different to any normal function call.

In nRF52 at least, there's a radio event (SWI1?) that occurs just before a connection interval on Bluetooth, and it's that function that pulls data out of the transmit queue and sends it back to the host. So when something in the 'main thread' (outside of IRQs) sends too much data for the buffer, it just waits for the buffer to be emptied by the IRQ (which is what you're seeing).

It's just in this case, that IRQ doesn't seem to be being called....

fanoush commented 2 years ago

OK thanks about that interrup tip, will dig more into it. Meanwhile I made nrf52 build to be as identical as possible - basically same board file, same low number of variables, even tweaked the scripts/build_platform_config.py to give it same small bufferSizeIO/TX/Timer like nrf51 and it still works fine there.

gfwilliams commented 2 years ago

This could be an SDK thing? I forget but maybe the nRF51 is on SDK11 and I think that's when support stopped? And there could be some issue with the radio IRQ?

fanoush commented 2 years ago

After some more debugging the main cause of the freeze is stack corruption. I had to reduce variables to 260 and it was pretty stable then. With nrf52 it was not seen as I did not reduce the stack size, just variables. Tried to paint the stack before running the code via

for (i=process.memory().stackEndAddress,x=0;x<64;i+=4,x++)poke32(i,0xdeadbeef)

and with more variables like 280 it was seen in gdb/openocd after freeze that it is definitely overwritten all over. With 260 it was very near the end - only 148 bytes left

>for (i=process.memory().stackEndAddress,x=0;x<64;i+=4,x++)print(i.toString(16),peek32(i).toString(16))
200031b0 deadbeef
200031b4 deadbeef
200031b8 deadbeef
200031bc deadbeef
200031c0 deadbeef
200031c4 deadbeef
200031c8 deadbeef
200031cc deadbeef
200031d0 deadbeef
200031d4 deadbeef
200031d8 deadbeef
200031dc deadbeef
200031e0 deadbeef
200031e4 deadbeef
200031e8 deadbeef
200031ec deadbeef
200031f0 deadbeef
200031f4 deadbeef
200031f8 deadbeef
200031fc deadbeef
20003200 deadbeef
20003204 deadbeef
20003208 deadbeef
2000320c deadbeef
20003210 deadbeef
20003214 deadbeef
20003218 deadbeef
2000321c deadbeef
20003220 deadbeef
20003224 deadbeef
20003228 deadbeef
2000322c deadbeef
20003230 deadbeef
20003234 deadbeef
20003238 deadbeef
2000323c deadbeef
20003240 deadbeef
20003244 20001050
20003248 deadbeef
2000324c 20000b18
20003250 deadbeef
20003254 20001050
20003258 deadbe07
2000325c 20000a78
20003260 deadbeef
20003264 20001050
20003268 deadbe03
2000326c 20000adc
20003270 deadbeef
20003274 20001050
20003278 deadbe14
2000327c 20000b40
20003280 20001340
20003284 20000cec
20003288 8
2000328c 1a591
20003290 20001340
20003294 14
20003298 0
2000329c 1a591
200032a0 44
200032a4 20001340
200032a8 14
200032ac 0
=undefined

There may be also secondary issue with the interrupts. The sd_nvic_critical_region_enter stuff I added requires proper balancing of the calls with nested flag being local variable. See e.g. https://devzone.nordicsemi.com/f/nordic-q-a/5237/sd_nvic_region_enter-exit or or definition here

With global variable like it is now I can imagine it may stay at 1 and never free the interrupts when the calls are not properly balanced. The sd_nvic_critical_region_exit does not enable the interrupts back when the parameter passed is not zero . I first took the pain and rewritten all jshInterruptOn/Off usage to pass this nesting parameter but to keep current way of not checking the balancing the easiest is to simply call it always with 0, then interrupts are always enabled if they were disabled before. will submit PR.

fanoush commented 2 years ago

I guess this issue can be closed unless you want to make the stack checking better and avoid stack corruption in cases like this

fanoush commented 2 years ago

have simple code

var d;
NRF.setScan(function(x){d=x;print(d);})

running for an hour already and the webide is just scrolling with the output without any disconnect. this one is not memory/stack intensive, however at least bluetooth and console priniting looks stable and it does excercise setting variables too.

EDIT: oh, even this one is stack intensive, did the stack painting and after running this the stack usage was very similar to the complex code

40 20003250 deadbeef
41 20003254 deadbeef
42 20003258 deadbeef
43 2000325c deadbeef
44 20003260 deadbeef
45 20003264 deadbeef
46 20003268 deadbeef
47 2000326c deadbeef
48 20003270 deadbeef
49 20003274 deadbeef
50 20003278 deadbeef
51 2000327c deadbeef
52 20003280 deadbeef
53 20003284 deadbeef
54 20003288 deadbeef
55 2000328c deadbeef
56 20003290 deadbeef
57 20003294 20001050
58 20003298 deadbeef
59 2000329c 20000b2c
60 200032a0 deadbeef
61 200032a4 deadbeef
62 200032a8 deadbe01
63 200032ac deadbeef
=undefined
>proces
process             process
>process.memory()
={ free: 211, usage: 49, total: 260, history: 60,
  gc: 0, gctime: 3.44848632812, blocksize: 12, stackEndAddress: 536883632, flash_start: 0,
  flash_binary_end: 249160, flash_code_start: 256000, flash_length: 262144 }
>process.memory().stackEndAddress.toString(16)
="200031b0"
> 

so only room for 56 32bit values on stack with 260 variables total.

gfwilliams commented 2 years ago

Thanks! I'm actually pretty surprised FunctionCall doesn't check the stack before recursing, but I guess the issue is that while Espruino leaves enough stack for itself, when the IRQ for setScan runs, that uses up stack too? So maybe we could make it leave a bit more stack before it refuses to recurse?

It's not great having that little stack space available though. I guess maybe we ought to reduce variables even more.

It might be possible to reduce stack usage too though. In the past I did do some debugging checking where space was going and trying to get it down, but it's possible it's crept back up a bit recently.