Slowdown and segfault in a tail-recursive function

problem

This gherkin program gets progressively slower and then bash dies before the program finishes:

((fn (x)
   (if (eq? x 0)
     'done
     (do (println x)
          (recur (- x 1)))))
 1000)

For me, the segfault happens after 463 is printed. So, there are really two problems, and I suspect they're related:

the slowdown
the segfault
setup
OS: Debian GNU/Linux
kernel: 3.2.0-4-amd64
bash: 4.2.37(1)-release
gherkin: cceee79
observations

gdb

I ran the program with gdb so I could look at a stack trace. To do this, I ran:

gdb --args bash gherkin -e "((fn (x) (if (eq? x 0) 'done (do (println x) (recur (- x 1))))) 1000)" and typed run at the gdb console.

The segfault happened in the same place - after 463 is printed. Investigating with the backtrace command:

(gdb) backtrace
#0  0x0000000000441705 in ?? ()
#1  0x00000000004419f6 in ?? ()
#2  0x0000000000441a7f in ?? ()
#3  0x0000000000441b79 in ?? ()
#4  0x0000000000441bdb in ?? ()
#5  0x0000000000441c29 in ?? ()
#6  0x0000000000441cdb in ?? ()
#7  0x0000000000441d46 in ?? ()
#8  0x0000000000441d86 in ?? ()
#9  0x0000000000441dc6 in ?? ()
#10 0x0000000000441e06 in ?? ()
#11 0x0000000000441e8b in ?? ()
#12 0x0000000000441fe2 in ?? ()
#13 0x0000000000442309 in ?? ()
#14 0x0000000000442537 in ?? ()
#15 0x000000000044314a in evalexp ()
#16 0x0000000000450800 in ?? ()
#17 0x0000000000451b0e in ?? ()
#18 0x0000000000454502 in ?? ()
#19 0x000000000045543a in ?? ()
#20 0x0000000000456b2c in ?? ()
#21 0x0000000000456fe4 in expand_string_assignment ()
#22 0x0000000000457064 in expand_assignment_string_to_string ()

The trace continues for tens of thousands of frames and the function call pattern appears cyclical.

attempt to reduce

This simpler program continues into the negative integers, but also exhibits progressive slowness and eventually segfaults bash after printing -172:

((fn (x)
   (println x)
   (recur (- x 1))))

Running with gdb the top of my backtrace looks like this:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff764923a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ffff764923a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff76b72fb in mbrtowc () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x000000000049117d in mbschr ()
#3  0x0000000000460933 in valid_array_reference ()
#4  0x0000000000457405 in ?? ()
#5  0x000000000045224a in ?? ()
#6  0x0000000000454502 in ?? ()
#7  0x000000000045543a in ?? ()
#8  0x0000000000456b2c in ?? ()
#9  0x0000000000456fe4 in expand_string_assignment ()
#10 0x0000000000457064 in expand_assignment_string_to_string ()
#11 0x0000000000461489 in ?? ()
#12 0x0000000000457424 in ?? ()
#13 0x000000000045224a in ?? ()
#14 0x0000000000454502 in ?? ()
#15 0x000000000045543a in ?? ()
#16 0x000000000045619f in ?? ()
#17 0x0000000000456b2c in ?? ()
#18 0x0000000000456ef5 in cond_expand_word ()
#19 0x000000000043416e in ?? ()
#20 0x0000000000434269 in ?? ()
#21 0x0000000000436c70 in execute_command_internal ()
#22 0x000000000043a2fe in execute_command ()

theory

I think we're blowing bash's function call stack. One big difference between the two programs is that there are fewer recursive calls to lisp_eval in the program without the if and eq?.

To test this theory, I decreased my process stack limit to 1000 with bash's ulimit builtin: ulimit -s 1000. I would expect the program to fail sooner. And it did, after printing 938:

...
938
Segmentation fault

fix suggestion 1 - trampoline eval

I think we need to remove recursive lisp_eval calls from the interpreter. We can achieve this with a loop inside lisp_eval and variable containing the expression to evaluate next. This will be a lot of work, and implies folding everything lisp_eval currently touches inside the loop. See this StackOverflow answer for details around this approach: http://stackoverflow.com/a/6005989

fix suggestion 2 - gherkin=stackVM+compiler

This would involve designing a set of opcodes, a virtual stack machine implementing them, and a compiler targeting the stack machine. See this paper for what's involved: http://dl.acm.org/citation.cfm?id=802810

Probably overkill for now.

alandipert / gherkin