cosmos72 / stmx

High performance Transactional Memory for Common Lisp
http://stmx.org/
241 stars 14 forks source link

Creating threads with (sb-thread:make-thread) - part 1 [was: Basic TFifo test fails.] #7

Closed ghost closed 10 years ago

ghost commented 10 years ago

Hi Massimiliano,

following code throws errors in SBCL:

(in-package #:tm-test)

(defvar test-fifo (make-instance 'tfifo))

(defun loop1 () (take test-fifo) (loop1))

(defun loop2 (n) (put test-fifo n) (loop2 (1+ n)))

(defun take-thread () (sb-thread:make-thread #'loop1 :name "BFS"))

(defun put-thread () (sb-thread:make-thread #'loop2 :name "BFS" :arguments '(1)))

There are two types of error messages popping up. The first is the most common one. The second one only showed up once:

I.)

The value NIL is not of type STMX::TLOG. [Condition of type TYPE-ERROR]

Restarts: 0: [ABORT] Abort thread (#<THREAD "BFS" RUNNING {1003DF86E3}>)

Backtrace: 0: (STMX::%RUN-SW-ATOMIC #<CLOSURE (LAMBDA NIL :IN STMX.UTIL:TAKE) {1003FADB4B}>) 1: (TM-TEST::LOOP1) 2: ((FLET #:WITHOUT-INTERRUPTS-BODY-1226 :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE)) 3: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE)) 4: ((FLET #:WITHOUT-INTERRUPTS-BODY-660 :IN SB-THREAD::CALL-WITH-MUTEX)) 5: (SB-THREAD::CALL-WITH-MUTEX #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE) {7FFFF50AEC9B}> #<SB-THREAD:MUTEX "thread result lock" owner: #<SB-THREAD:THR.. 6: (SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE #<SB-THREAD:THREAD "BFS" RUNNING {1003DF86E3}> #S(SB-THREAD:SEMAPHORE :NAME "Thread setup semaphore" :%COUNT 0 :WAITCOUNT 0 :MUTEX #<SB-THREAD:MUTEX (fre.. 7: ("foreign function: call_into_lisp") 8: ("foreign function: new_thread_trampoline")

II.)

BUG! Transaction 4892 called (retry), but no TVARs to wait for changes. This is a bug either in the STMX library or in the application code. Possible reason: some code analogous to (atomic (retry)) was executed. Such code is not allowed, because at least one TVAR or one TOBJ slot must be read before retrying an ATOMIC block. [Condition of type SIMPLE-ERROR]

Restarts: 0: [ABORT] Abort thread (#<THREAD "BFS" RUNNING {10060F0BC3}>)

Backtrace: 0: (STMX::LISTEN-TVARS-OF #<STMX::TLOG v-2 {1002CB4E33}>) 1: (STMX::WAIT-TLOG #<STMX::TLOG v5306 {1002CB4E33}>) 2: (STMX::%RUN-SW-ATOMIC #<CLOSURE (LAMBDA NIL :IN TAKE) {1002DF1C0B}>) 3: (LOOP1) 4: ((FLET #:WITHOUT-INTERRUPTS-BODY-1226 :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE)) 5: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE)) 6: ((FLET #:WITHOUT-INTERRUPTS-BODY-660 :IN SB-THREAD::CALL-WITH-MUTEX)) 7: (SB-THREAD::CALL-WITH-MUTEX #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE) {7FFFF61F6C9B}> #<SB-THREAD:MUTEX "thread result lock" owner: #<SB-THREAD:THR.. 8: (SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE #<SB-THREAD:THREAD "BFS" RUNNING {10060F0BC3}> #S(SB-THREAD:SEMAPHORE :NAME "Thread setup semaphore" :%COUNT 0 :WAITCOUNT 0 :MUTEX #<SB-THREAD:MUTEX (fre.. 9: ("foreign function: call_into_lisp") 10: ("foreign function: new_thread_trampoline")

Any idea what I'm doing wrong?

Kind regards, Hans-J.

cosmos72 commented 10 years ago

Hello Hans,

you found a very subtle bug, either in STMX or in SBCL. I managed to reproduce your first error a few times on SBCL 1.1.15, then it disappeared and I could not reproduce it anymore, not even by cleaning ~/.cache/common-lisp/ and restarting SBCL.

Can you please provide the following information, to help me investigate it further:

Thanks,

Massimiliano

ghost commented 10 years ago

Hi Massimiliano,

here all the relevant information:

I.) STMX version seems to be the latest:

CL-USER> (ql:system-apropos "stmx") <SYSTEM stmx / stmx-stable-ef96af7a-git / quicklisp 2014-05-25> <SYSTEM stmx.test / stmx-stable-ef96af7a-git / quicklisp 2014-05-25>

II.) CPU is NOT Haswell architecture:

vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz stepping : 7 microcode : 0x25 cpu MHz : 1627.750 cache size : 8192 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 6787.70 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:

III.) SBCL version is latest: SBCL 1.1.18

IV.) OS is latest Archlinux 64 Bit: Linux 3.14.5-1-ARCH #1 SMP PREEMPT Sun Jun 1 07:36:23 CEST 2014 x86_64 GNU/Linux

Hope this helps and good luck! :-)

Regards, Hans-J.

cosmos72 commented 10 years ago

Hello Hans-J,

commit 707d3dcf797d99fd14ebcdd79035a52c87b228c5 should fix this bug. The relevant change is:

diff --git a/stmx/main/tlog.lisp b/stmx/main/tlog.lisp
index b215f4d..57c891c 100644
--- a/stmx/main/tlog.lisp
+++ b/stmx/main/tlog.lisp
@@ -54,7 +54,9 @@ return LOG itself."
 (declaim (inline new-tlog))
 (defun new-tlog ()
   "Get a TLOG from pool or create one, and return it."
-  (let1 log (the tlog (nth-value 0 (fast-vector-pop-macro *tlog-pool* (make-tlog))))
+  (let1 log (or (fast-vector-pop *tlog-pool*)
+                (make-tlog))
+    (declare (type tlog log))
     (setf (tlog-read-version log) +invalid-version+)
     log))

Can you check if it fixes the problem also for you?

Regards,

Massimiliano

ghost commented 10 years ago

Hi Massimiliano,

it got a bit better. ;-)

I tried several times and the experience is mixed. A few times this error message showed up when I started take-thread and put-thread:

BUG! Transaction 794 called (retry), but no TVARs to wait for changes. This is a bug either in the STMX library or in the application code. Possible reason: some code analogous to (atomic (retry)) was executed. Such code is not allowed, because at least one TVAR or one TOBJ slot must be read before retrying an ATOMIC block. [Condition of type SIMPLE-ERROR]

Restarts: 0: [ABORT] Abort thread (#<THREAD "BFS" RUNNING {1005D51BF3}>)

Backtrace: 0: (STMX::LISTEN-TVARS-OF #<STMX::TLOG v1460 {1005D2F3B3}>) 1: (STMX::WAIT-TLOG #<STMX::TLOG v1510 {1005D2F3B3}>) 2: (STMX::%RUN-SW-ATOMIC #<CLOSURE (LAMBDA NIL :IN STMX.UTIL:TAKE) {100600FE2B}>) 3: (TM-TEST::LOOP1) 4: ((FLET #:WITHOUT-INTERRUPTS-BODY-1226 :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE)) 5: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE)) 6: ((FLET #:WITHOUT-INTERRUPTS-BODY-660 :IN SB-THREAD::CALL-WITH-MUTEX)) 7: (SB-THREAD::CALL-WITH-MUTEX #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE) {7FFFF50AEC9B}> #<SB-THREAD:MUTEX "thread result lock" owner: #<SB-THREAD:THR.. 8: (SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE #<SB-THREAD:THREAD "BFS" RUNNING {1005D51BF3}> #S(SB-THREAD:SEMAPHORE :NAME "Thread setup semaphore" :%COUNT 0 :WAITCOUNT 0 :MUTEX #<SB-THREAD:MUTEX (fre.. 9: ("foreign function: call_into_lisp")

In other situations it just locked up.

When I only start put-thread it will shutdown after a short while as I can see in htop. When I want to list all the running threads in the REPL with "(sb-thread:list-all-threads)" the command does not return and hangs up the REPL. Peeking or taking test-fifo while put-thread is running doesn't work. This seems to be consistent and could be used as a test case for troubleshooting.

Regards, Hans-J.

cosmos72 commented 10 years ago

Hello Hans-J,

then it means the first error you reported "The value NIL is not of type STMX::TLOG" has disappeared. good. This is what the fix was supposed to do.

About the two remaining issues: 1) When you only start put-thread, you're basically executing something like

(let ((fifo nil))
  (loop for n from 0
    (push n fifo)))

No wonder it shutdowns/hangs after a while, as it keeps allocating until SBCL runs out of memory. Remember that TFIFO is unbounded... If you look, you will probably see SBCL low-level debugger (LDB) reporting an out-of-memory error and waiting for user input - if you use emacs & slime, check the inferior-lisp buffer

2) I could not yet reproduce the error "BUG! Transaction 794 called (retry), but no TVARs to wait for changes." I will leave SBCL running on your example and see if it happens to me as well - I will open a separate bug report for this, as it is a different problem.

ghost commented 10 years ago

Hi Massimiliano,

you are right. I was just looking at htop which updates in certain intervals and it didn't seem there would be a memory issue. But actually heap got exhausted very, very quickly. And I thought STM is slow! ;-)

Thanks a lot for your support, Hans-J.

cosmos72 commented 10 years ago

I just opened issue #8 for the remaining error

cosmos72 commented 10 years ago

Solved the remaining error as well - see issue #8 for details. The short answer is: you must call (bt:make-thread) instead of (sb-thread:make-thread)