Clozure / ccl

Clozure Common Lisp
http://ccl.clozure.com
Apache License 2.0
841 stars 105 forks source link

'Waiting for recursive lock' crash (CCL1.11, CCL1.12 on ARM32) #428

Open zinadv opened 1 year ago

zinadv commented 1 year ago

Copy this code to /tmp/bug.lisp.

(in-package :cl-user)

(ql:quickload :bordeaux-threads)
(shadow 'terminate)
(ql:quickload :iterate)
(use-package :iterate)

(defun worker-read-byte (&aux file byte)
  ;; sharing can be (:private :lock :external)
  (setf file (open "/tmp/bug.lisp" :element-type '(unsigned-byte 8)))
  (setf byte (read-byte file))
  (close file)
  byte)

(defun main (n)
  (let (threads)    
    (setf threads
          (iter
            (for i :from 1 :to n)
            (for thread = (bt:make-thread #'(lambda () (worker-read-byte)) :name "tracking"))
            (collect thread)))
    (write
     (iter
       (for thread :in threads)
       (collect (bt:join-thread thread))))
    (terpri)))

(iter
  (main 16)
  (sleep 1))

After several iterations crashes:

Unhandled exception 4 at 0x105901f8, context->regs at #x75fe3838
? for help
[30173] Clozure CL kernel debugger:

Works fine on amd64 (but I need ARM32). If remove work with files then works fine, but I need a lot of streaming.

Didn't use WITH-OPEN-FILE to check is it argument :ABORT of function CLOSE. It can be NIL.

Tried every of (:private :lock (:external nil)) values to argument :sharing of function OPEN, because sometimes (with/out explicit JOIN-PROCESS) condition arises:

> Error: Current process #<PROCESS tracking(335) [Active] #x14E7FAF6> does not own lock #<RECURSIVE-LOCK [ptr @ #x12F7E20] #x140CF4D6>
> While executing: CCL::%UNLOCK-RECURSIVE-LOCK-OBJECT, in process tracking(335).

If I specify N different file names -- bug preserves.

If I enclose entire worker into lock -- bug preserves.

xrme commented 1 year ago

This sounds like a bug that we've struggled with on 32-bit ARM for a long time.

xrme commented 1 year ago

[from the mailing list]

A test case is here: https://github.com/jetmonk/openmcl-thread-test

Running (threadtest2) triggers a modified with-lock macro that grabs a lock, checks that the current process in fact owns it, runs body, and again checks for ownership before releasing. After many iterations, an error is thrown when the macro discovers that some other process stole the lock.

I thought it might be a lack of ARM DMB instructions ("ensures that all memory accesses are completed before new memory access is committed.") in atomic routines , which have been discussed as necessary for ARM mutexes elsewhere online, like
https://stackoverflow.com/questions/42698277/arm-cortex-m4-mutex-lock-dmb-instruction
Apparently the issue is different cores not realizing another core has changed memory until (DMB) forces synchronization.

But I tried stuffing (dmb) everywhere in ARM/arm-misc.lisp and it didn't fix the problem.

[From a folllowup post] Your use of "taskset" isn't right. You don't need "sudo", and you do need the "-a" option when you use it on a CCL instance that's already running. Alternatively, you can pin a CCL instance to a single core before it runs, by starting it with a command like "taskset 2 ccl".

On my pine64 device, running CCL 1.12.1 on Debian arm64 "testing", your THREADTEST2 crashes if I run CCL the normal way, but completes successfully if I run CCL on a single core via "taskset 2 ccl".

varjagg commented 1 month ago

This is probably not a synchronization issue as we're experiencing the crashes on a single core arm32 system. Last known good version is 11.5.

xrme commented 1 month ago

@varjagg are you reporting that ccl 1.11.5 works OK, and you see the issue on later versions?

varjagg commented 1 month ago

@xrme yep we've been testing our system on 11.5 for ages but as we tried to move to 1.12.2 it started to melt down. Same for any release between the two. At first we suspected our code but yesterday it fell apart just in swank interaction on a fresh image. But now that I reproduced it again am not even sure it's the same exact issue. The "Unhandled exception 4 at…" part is identical but it blows up on multiplication now:

? Unhandled exception 4 at 0x1a068, context->regs at #x75148258
? for help                          
[4097] Clozure CL kernel debugger: t                                           
Current Thread Context Record (tcr) = 0x7693bf18
Control (C) stack area:  low = 0x74fe6000, high = 0x7514a000
Value (lisp) stack area: low = 0x74eca000, high = 0x74fe4000

[4097] Clozure CL kernel debugger: b
current thread: tcr = 0x7693bf18, native thread ID = 0x12ef, interrupts enabled

(#x75148528) #x10024664 : #<Function MULTIPLY-BIGNUMS #x1400d50e> + 312
(#x75148538) #x108E3D34 : #<Function CURRENT-TIME-IN-NANOSECONDS #x143f0e3e> + 2
64

Then current-time-in-nanoseconds is called from another function compiled with (optimize (speed 3) (space 0) (safety 1)) and inlined elsewhere. Though I believe this code does work most of the time so it's hardly a way to reproduce it.

varjagg commented 1 month ago

Another backtrace. Now from socket code:

Unhandled exception 4 at 0x1a068, context->regs at #x745c3288         [820/1835]
? for help                                                                      
[6539] Clozure CL kernel debugger: t                                            
Current Thread Context Record (tcr) = 0x76505f40                                
Control (C) stack area:  low = 0x74461000, high = 0x745c5000                    
Value (lisp) stack area: low = 0x73905000, high = 0x73a1f000                    
Exception stack pointer = 0x745c3558                                            
[6539] Clozure CL kernel debugger: b                                            current thread: tcr = 0x76505f40, native thread ID = 0x2034, interrupts disabled

(#x745C3558) #x10071BF4 : #<Function %CONS-NHASH-VECTOR #x14021376> + 176       
(#x745C3568) #x100603B0 : #<Function %LOCK-FREE-REHASH #x1401e066> + 420        
(#x745C3578) #x1005FE4C : #<Function LOCK-FREE-REHASH #x1401de6e> + 224         
(#x745C3588) #x10062A28 : #<Function LOCK-FREE-PUTHASH #x1401e8ee> + 980        
(#x745C3598) #x14D40A5C : #<Function ADD-WAITER #x14d40b1e> + 112               
(#x745C35A8) #x14D40EEC : #<Function MAKE-WAIT-LIST #x14d40f1e> + 224           (#x745C35B8) #x14D3ED20 : #<Function WAIT-FOR-INPUT #x14d3f15e> + 340           
(#x745C35C8) #x00018CAC : (subprimitive ret1valn)                               

…however, one of other threads appears to have bad frame:

TCR = 0x19d6d10, cstack area #x19eace0,  native thread ID = 0x2033, interrupts enabled                                                                         

Bad frame! (0x74333600)                                                         
xrme commented 1 month ago

I don't remember what ARM-specific changes took place from 1.11.5 to newer releases. Clearly something must have happened, though.

I wonder if there's some runtime (gc, exception handling) issue.

Signal 4 is SIGILL, by the way.

phoe commented 1 month ago

SIGILL? What instruction is the CPU executing at that moment? Is it possible to disassemble whatever is at the program counter at the time of the signal?

xrme commented 1 month ago

SIGILL is frequently used normally for GC and various other traps, so it's not surprising to get it.

The lisp kernel debugger can print out the registers (with "r"). We'd need to attach a debugger to find out more.

varjagg commented 1 month ago

This is farily reproducible on our target so I'll get the register values tomorrow. Is there anything specific to instrument armcl for debugger?

xrme commented 1 month ago

If you end up in the lisp kernel debugger, you should be able to attach gdb.

https://trac.clozure.com/ccl/wiki/CclUnderGdb might help a little.

I think armcl is already compiled with -g, but if it isn't, edit lisp-kernel/linuxarm/Makefile and rebuild the lisp kernel.

454 is a bug on the ARM where calling (#_getpid) from the top level crashes the lisp hard. But that happens on 1.11.5 too, so that may not be relevant here.

varjagg commented 1 month ago

OK, so I poked around a bit. Not sure how helpful is it but there it goes:

Unhandled exception 4 at 0x1a068, context->regs at #x74ab3288
? for help
[31507] Clozure CL kernel debugger: t
Current Thread Context Record (tcr) = 0x76514c78
Control (C) stack area:  low = 0x74951000, high = 0x74ab5000
Value (lisp) stack area: low = 0x74315000, high = 0x7442f000
Exception stack pointer = 0x74ab3558
[31507] Clozure CL kernel debugger: r
r00 = 0x0001544A    r08 = 0x1401CE3E
r01 = 0x00000002    r09 = 0x00000033
r02 = 0x00000552    r10 = 0x7442EE7C
r03 = 0x76514C78    r11 = 0x00000000
r04 = 0x00000128    r12 = 0xFFFFFFF8
r05 = 0x00000550    r13 = 0x74AB3558
r06 = 0x00000550    r14 = 0x0001A37C
r07 = 0x18120000    r15 = 0x0001A068
[31507] Clozure CL kernel debugger: b                                           
current thread: tcr = 0x76514c78, native thread ID = 0x498, interrupts disabled 

(#x74AB3558) #x10071BF4 : #<Function %CONS-NHASH-VECTOR #x14021376> + 176       
(#x74AB3568) #x100603B0 : #<Function %LOCK-FREE-REHASH #x1401e066> + 420        
(#x74AB3578) #x1005FE4C : #<Function LOCK-FREE-REHASH #x1401de6e> + 224         
(#x74AB3588) #x10062A28 : #<Function LOCK-FREE-PUTHASH #x1401e8ee> + 980
(#x74AB3598) #x14D34E7C : #<Function ADD-WAITER #x14d34f3e> + 112               
(#x74AB35A8) #x14D3530C : #<Function MAKE-WAIT-LIST #x14d3533e> + 224           
(#x74AB35B8) #x14D33140 : #<Function WAIT-FOR-INPUT #x14d3357e> + 340           
(#x74AB35C8) #x00018CAC : (subprimitive ret1valn)

### attaching GDB

(gdb) info threads

…
  ### (0x498 is 1176)
  16   LWP 1176 "Background proc" 0x76df59a4 in read () from /lib/libc.so.6
…

(gdb) thread 16
[Switching to thread 16 (LWP 1176)]
#0  0x76df59a4 in read () from /lib/libc.so.6

(gdb) bt
#0  0x76df59a4 in read () from /lib/libc.so.6
#1  0x76d9df18 in _IO_file_underflow () from /lib/libc.so.6
#2  0x76d9f068 in _IO_default_uflow () from /lib/libc.so.6
#3  0x76d991bc in getc () from /lib/libc.so.6
#4  0x0002a830 in readc () at ../lisp-debug.c:243
#5  0x0002afc8 in lisp_Debugger (xp=xp@entry=0x74ab3268, 
    info=info@entry=0x74ab31e8, why=why@entry=4, 
    in_foreign_code=in_foreign_code@entry=0, 
    message=0x74ab2fc8 "Unhandled exception 4 at 0x1a068, context->regs at #x74ab3288", message@entry=0x74ab2fc0 "h2\253t") at ../lisp-debug.c:1558
#6  0x00026d5c in signal_handler (signum=4, info=0x74ab31e8, 
    context=0x74ab3268) at ../arm-exceptions.c:1461
#7  <signal handler called>
#8  _SPmisc_alloc () at ../arm-spentry.s:2029
#9  0x0001a37c in _SPmisc_alloc_init () at ../arm-spentry.s:2267
#10 0x10071bf4 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) info registers
r0             0x0      0
r1             0x765127f0       1985030128
r2             0x400    1024
r3             0x0      0
r4             0x400    1024
r5             0x765127f0       1985030128
r6             0x0      0
r7             0x3      3
r8             0x2f09c  192668
r9             0x40898  264344
r10            0x0      0
r11            0x4095c  264540
r12            0x0      0
sp             0x74ab2f08       0x74ab2f08
lr             0x0      0
pc             0x76df59a4       0x76df59a4 <read+96>
cpsr           0x800f0010       -2146500592

(gdb) frame 6
#6  0x00026d5c in signal_handler (signum=4, info=0x74ab31e8, 
    context=0x74ab3268) at ../arm-exceptions.c:1461
warning: Source file is more recent than executable.
1461        if (lisp_Debugger(context, info, signum, (old_valence != TCR_STATE_LISP), msg)) {
(gdb) info frame
Stack level 6, frame at 0x74ab31e8:
 pc = 0x26d5c in signal_handler (../arm-exceptions.c:1461); 
    saved pc = 0x76d62d30
 called by frame at 0x74ab31e8, caller of frame at 0x74ab2fb8
 source language c.
 Arglist at 0x74ab2fb8, args: signum=4, info=0x74ab31e8, context=0x74ab3268
 Locals at 0x74ab2fb8, Previous frame's sp is 0x74ab31e8
 Saved registers:
  r4 at 0x74ab31c8, r5 at 0x74ab31cc, r6 at 0x74ab31d0, r7 at 0x74ab31d4,
  r8 at 0x74ab31d8, r9 at 0x74ab31dc, r10 at 0x74ab31e0, lr at 0x74ab31e4

(gdb) frame 8
#8  _SPmisc_alloc () at ../arm-spentry.s:2029
warning: Source file is more recent than executable.
2029            __(Misc_Alloc(arg_z,imm0,imm2))
(gdb) info frame
Stack level 8, frame at 0x74ab3558:
 pc = 0x1a068 in _SPmisc_alloc (../arm-spentry.s:2029); saved pc = 0x1a37c
 called by frame at 0x74ab3568, caller of frame at 0x74ab31e8
 source language asm.
 Arglist at 0x74ab3558, args: 
 Locals at 0x74ab3558, Previous frame's sp is 0x74ab3558
(gdb) x/i  0x1a068            
   0x1a068 <_SPmisc_alloc+120>: sub     r12, r12, r2
(gdb) x/i 0x1a37c
   0x1a37c <_SPmisc_alloc_init+28>:     pop     {r0, r10, r11, lr}

(gdb) frame 9
#9  0x0001a37c in _SPmisc_alloc_init () at ../arm-spentry.s:2267
2267            __(bl _SPmisc_alloc)
(gdb) info frame
Stack level 9, frame at 0x74ab3568:
 pc = 0x1a37c in _SPmisc_alloc_init (../arm-spentry.s:2267); 
    saved pc = 0x10071bf4
 called by frame at 0x74ab3568, caller of frame at 0x74ab3558
 source language asm.
 Arglist at 0x74ab3558, args: 
 Locals at 0x74ab3558, Previous frame's sp is 0x74ab3568
 Saved registers:
  r0 at 0x74ab3558, r10 at 0x74ab355c, r11 at 0x74ab3560, lr at 0x74ab3564
(gdb) x/i 0x10071bf4
   0x10071bf4:  str     r4, [r10, #-4]!

(gdb) frame 10
#10 0x10071bf4 in ?? ()
(gdb) info frame
Stack level 10, frame at 0x74ab3568:
 pc = 0x10071bf4; saved pc = <not saved>
 Outermost frame: previous frame identical to this frame (corrupt stack?)
 caller of frame at 0x74ab3568
 Arglist at 0x74ab3568, args: 
 Locals at 0x74ab3568, Previous frame's sp is 0x74ab3568

(gdb) disas 0x1a37c
Dump of assembler code for function _SPmisc_alloc_init:
   0x0001a360 <+0>:     mov     r0, #19
   0x0001a364 <+4>:     push    {r0, r10, r11, lr}
   0x0001a368 <+8>:     mov     r11, #0
   0x0001a36c <+12>:    mov     r9, r4
   0x0001a370 <+16>:    mov     r4, r5
   0x0001a374 <+20>:    mov     r5, r6
   0x0001a378 <+24>:    bl      0x19ff0 <_SPmisc_alloc>
   0x0001a37c <+28>:    pop     {r0, r10, r11, lr}
   0x0001a380 <+32>:    mov     r5, r9
   0x0001a384 <+36>:    mov     r0, #205        ; 0xcd
   0x0001a388 <+40>:    orr     r0, r0, #512    ; 0x200
   0x0001a38c <+44>:    add     r8, r0, #67108865       ; 0x4000001
   0x0001a390 <+48>:    mov     r2, #8
   0x0001a394 <+52>:    ldr     r9, [r8, #6]
   0x0001a398 <+56>:    ldr     pc, [r9, #-2]
End of assembler dump.

(gdb) disas 0x1a068
Dump of assembler code for function _SPmisc_alloc:                              
   0x00019ff0 <+0>:     tst     r5, #-67108861  ; 0xfc000003                    
   0x00019ff4 <+4>:     bne     0x1a08c <_SPmisc_alloc+156>
   0x00019ff8 <+8>:     asr     r0, r4, #2
   0x00019ffc <+12>:    orr     r0, r0, r5, lsl #6
   0x0001a000 <+16>:    and     r1, r0, #7
   0x0001a004 <+20>:    cmp     r1, #2
   0x0001a008 <+24>:    mov     r2, r5
   0x0001a00c <+28>:    beq     0x1a05c <_SPmisc_alloc+108>
   0x0001a010 <+32>:    asr     r1, r4, #2
   0x0001a014 <+36>:    cmp     r1, #191        ; 0xbf
   0x0001a018 <+40>:    ble     0x1a05c <_SPmisc_alloc+108>
   0x0001a01c <+44>:    lsr     r2, r5, #2
   0x0001a020 <+48>:    cmp     r1, #207        ; 0xcf
   0x0001a024 <+52>:    ble     0x1a05c <_SPmisc_alloc+108>
   0x0001a028 <+56>:    lsr     r2, r5, #1
   0x0001a02c <+60>:    cmp     r1, #223        ; 0xdf
   0x0001a030 <+64>:    ble     0x1a05c <_SPmisc_alloc+108>
   0x0001a034 <+68>:    cmp     r1, #247        ; 0xf7
   0x0001a038 <+72>:    lsleq   r2, r5, #2
   0x0001a03c <+76>:    addeq   r2, r2, #4
   0x0001a040 <+80>:    beq     0x1a05c <_SPmisc_alloc+108> 
   0x0001a044 <+84>:    cmp     r1, #255        ; 0xff
   0x0001a048 <+88>:    addeq   r2, r5, #28
   0x0001a04c <+92>:    lsreq   r2, r2, #5
   0x0001a050 <+96>:    beq     0x1a05c <_SPmisc_alloc+108>
   0x0001a054 <+100>:   lsl     r2, r5, #1
   0x0001a058 <+104>:   add     r2, r2, #4
   0x0001a05c <+108>:   add     r2, r2, #11
   0x0001a060 <+112>:   bic     r2, r2, #7
   0x0001a064 <+116>:   sub     r2, r2, #6
   0x0001a068 <+120>:   sub     r12, r12, r2
   0x0001a06c <+124>:   ldr     r7, [r3, #104]  ; 0x68
   0x0001a070 <+128>:   cmp     r12, r7
   0x0001a074 <+132>:   bhi     0x1a07c <_SPmisc_alloc+140>
   0x0001a078 <+136>:   udf     #0
   0x0001a07c <+140>:   str     r0, [r12, #-6]
   0x0001a080 <+144>:   mov     r4, r12
   0x0001a084 <+148>:   bic     r12, r12, #7
   0x0001a088 <+152>:   bx      lr
   0x0001a08c <+156>:   mov     r6, #308        ; 0x134
   0x0001a090 <+160>:   mov     r2, #12
   0x0001a094 <+164>:   b       0x198a4 <_SPksignalerr>

(gdb) disas 0x10071bd4,+0x30
Dump of assembler code from 0x10071bd4 to 0x10071c04:
   0x10071bd4:  b       0x10071be0
   0x10071bd8:  ldr     lr, [r3, #260]  ; 0x104
   0x10071bdc:  blx     lr
   0x10071be0:  mov     r6, r4
   0x10071be4:  mov     r4, #51 ; 0x33
   0x10071be8:  mov     r5, #296        ; 0x128
   0x10071bec:  ldr     lr, [r3, #652]  ; 0x28c
   0x10071bf0:  blx     lr
   0x10071bf4:  str     r4, [r10, #-4]!
   0x10071bf8:  mov     r5, r4
   0x10071bfc:  ldr     r4, [r10, #8]
   0x10071c00:  mov     r2, #8