Interlisp / medley

The main repo for the Medley Interlisp project. Wiki, Issues are here. Other repositories include maiko (the VM implementation) and Interlisp.github.io (web site sources)
https://Interlisp.org
MIT License
378 stars 19 forks source link

Stack overflow in loop that returns multiple values #19

Open masinter opened 4 years ago

masinter commented 4 years ago

Here’s a hypothesis:

The bug is in MULTIPLE-VALUE-CALL, it isn’t popping the stack properly when it receives more than a single value.

If this is true, then it is a total accident that I tripped over this:

I used SEE of a UTF8 encoded file to test my conversion interface, SEE ends up calling PFCOPYBYTES to interpret font changes in case it is a Lisp source file (which my file is not). PFCOPYBYTES takes start and end byte positions, because it is typically called from PF not for the whole file (as for SEE) but to pick out the definition of functions at particular byte positions as provided by a filemap.

PFCOPYBYTES is set up to deal with the fact that in the NS world (and other external format worlds) there is not a one-to-one correspondence between byte positions and character positions. That’s not an issue for most applications, but it is crucial for the PF invocation.

The low-level character readers hang off the open macro \NSIN, which takes an argument that says whether the number of bytes read should also be returned with the code. In the NS case, the second value is set to a local variable in the inline compilation. But in the case where there is a call out to a different encoding, the encoding-function is given a flag that says it should return the number of bytes it read, as a second value. That’s what’s happening in this case, and since it is repeated many times in a loop and if the stack isn’t cleaned up properly on each invocation, we would eventually get the overflow.

Putting aside this possible bug, this is an unfortunate implementation of PFCOPYBYTES—and according to the edit dates, it is probably something that I did for NS about 25 years ago, screwing up a simpler implementation that Larry had originally done about 10 years before that.

The problem is that it exports the details of a very low-level macroed interface to a very high-level function. The more direct implementation, given that bytes and characters are not 1-to-1, would have been to test the actual file ptr at each character, instead of the over-eager optimization that tries to second guess the counting with simple arithmetic. O better still, to recognize this as a first-class conceptual issue by adding a flag argument to READC and READCCODE indicating whether the bytes-read should also be returned as a second value.

On Aug 2, 2020, at 7:11 PM, Ron Kaplan ron.kaplan@post.harvard.edu wrote:

Part of the mystery is that the stack overflow doesn’t happen until it has read about 20K characters. So somehow this must be growing over time, even though it doesn’t look like anything should affect that frame from inside the loop in PFCOPYBYTES. The loop calls \NSIN and \OUTCHAR, and in this case the NSIN does an apply* to \JISIN. If you put a break on \JSIN, I wonder if the stack frame is big from the get go, or whether it grows over repeated calls.

On Aug 2, 2020, at 6:41 PM, Ron Kaplan ron.kaplan@post.harvard.edu wrote:

The last one is from the RESETLST inside PFCOPYBYTES? The other ones from the WITH-OPEN-FILEs in the little test routine?

How do you read the length of that frame, is it the difference between 4c22 and f092?

On Aug 2, 2020, at 5:53 PM, Nick Briggs nicholas.h.briggs@gmail.com wrote:

So this is what I see when I use the debug tools to dump the stack -- seems to reflect what URAID says, but, near the end of the stack,

4c22: SI:UNWIND-PROTECT alink: 0x4c10, clink: 0x0000, next: 0xf092

that SI:UNWIND-PROTECT appears to have allocated a huge amount of stack space, for reasons I don't understand.

masinter commented 3 years ago

Discussion in https://github.com/Interlisp/medley/pull/128 has some more context on PFCOPYBYTES

masinter commented 3 years ago

@rmkaplan gave example If you apply USEE to Unicode/Xerox/XCCS-0,41-50,340-344,356-361.TXT, you should get a stack overflow. Not because of any particular recursion, but because the stack is smashed. Or maybe Unicode/Xerox/XCCS-JIS.TXT

This Unicode file has a lot of non-ascii characters, that’s where this was first encountered. I don’t remember if that matters, it may be that running this on a large ascii file will provoke in the same way. There is probably a message thread on this somewhere.

The problem is that there is an external format, so the XCCS flag in the stream is turned off, and there is a closed call to the format’s character-reading function that has to return the number of bytes read as a second value USEE.txt

masinter commented 3 years ago

trying to cut down to a small reproducible problem image gives a stack overflow after 14 screenfuls if compiled but not cl:compile

nbriggs commented 3 years ago

The original USEE code fails on a largish printable ASCII file too, so it might let you rule out some of the paths.

rmkaplan commented 3 years ago

Also, I would be surprised if the linefeed stuff and the \OUTCHAR make a difference, since those don’t have anything to do with multiple values. My bet is that it would fail if you just read the file and throw away the characters.

Quite interesting that the compiler makes a difference.

On May 8, 2021, at 1:40 PM, Larry Masinter @.***> wrote:

trying to cut down to a small reproducible problem https://user-images.githubusercontent.com/1116587/117552857-d0891500-b002-11eb-9241-4a2ea6ac7836.png gives a stack overflow after 14 screenfuls if compiled but not cl:compile

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Interlisp/medley/issues/19#issuecomment-835504289, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQSTUJOZSD37NVM4BPVMHQDTMWOUTANCNFSM4QOJVGOA.

masinter commented 3 years ago

image

rmkaplan commented 3 years ago

On the theory that this doesn’t have to do with what's in the actual INCCODEFN, you might try replacing that with just (CL:VALUES (BIN SRCFIL) 0).

On May 8, 2021, at 2:40 PM, Larry Masinter @.***> wrote:

https://user-images.githubusercontent.com/1116587/117554243-45f8e380-b00b-11eb-9e73-c9533b891514.png — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Interlisp/medley/issues/19#issuecomment-835527221, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQSTUJJUTNTAHLPSKQ4KVP3TMWVURANCNFSM4QOJVGOA.

masinter commented 3 years ago

STACKBUG.txt

image

masinter commented 3 years ago

The multiple-values-return STKBUG1 image

interesting the MISCN opcode ... the problem doesn't occur if you don't compile STKBUG1 or if you CL:COMPILE(STACKBUG)

nbriggs commented 3 years ago

That looks odd. The MISCN sub opcodes are 1 for VALUES and 6 for VALUES_LIST.

masinter commented 3 years ago

What the CL:VALUES opcode is doing is walking up the control stack, looking at what the next opcodes are doing and seeing if they're prepared to accept multiple values. If it sees a [RETURN] it goes up another frame. It traces jumps! It looks for a function call to \MVLIST, but if it sees an UNBIND it calls a C

/*  Simulate the effect of UNBIND on a frame, to back us up     */
/*  to where we ought to be when we return multiple values.     */

If I use IL:COMPILE I see an UNBIND right after the APPLYFN, but if I use CL:COMPILE it moves things around so the UNBIND happens AFTER the \MVLIST call. image now

LispPTR *N_OP_unbind(register LispPTR *stack_pointer) {
  register DLword num;     /* number of unbind sot */
  register LispPTR *ppvar; /* pointer to last PVAR slot. */
  register DLword i;       /* temporary for control */
  register LispPTR value;

#ifdef TRACE
  printPC();
  printf("TRACE: N_OP_unbind()\n");
#endif

  /* now, stack_pointer points the latter part in slot */
  for (; !(*--stack_pointer & 0x80000000);)
    ; /* scan (until MSB == 1) */

  value = *stack_pointer;
  num = (DLword) ~(value >> 16);
  ppvar = (LispPTR *)(PVar + 2 + GetLoWord(value));
  value = 0xffffffff;
  for (i = 0; i < num; i++) { *--ppvar = value; }
  return (stack_pointer);
}

and

void simulate_unbind(FX2 *frame, int unbind_count, FX2 *returner) {
  int unbind;
  LispPTR *stackptr = (LispPTR *)(Stackspace + frame->nextblock);
  for (unbind = 0; unbind < unbind_count; unbind++) {
    register int value;
    register LispPTR *lastpvar;
    int bindnvalues;
    for (; ((int)*--stackptr >= 0);)
      ; /* find the binding mark */
    value = (int)*stackptr;
    lastpvar = (LispPTR *)((DLword *)frame + FRAMESIZE + 2 + GetLoWord(value));
    ;
    bindnvalues = (~value) >> 16;
    for (value = bindnvalues; --value >= 0;) { *--lastpvar = 0xffffffff; }
    /* This line caused \NSMAIL.READ.HEADING to smash memory, */
    /* so I removed it 21 Jul 91 --JDS.  This was the only    */
    /* difference between this function and the UNWIND code   */
    /* in inlineC.h                       */
    /*  MAKEFREEBLOCK(stackptr, (DLword *)stackptr-nextblock); */
  }
  if (returner)
    returner->fast = 0; /* since we've destroyed contiguity */
                        /* in the stack, but that only
                           matters if there's a return. */
}

don't look that similar

masinter commented 3 years ago

why lookee here: https://github.com/Interlisp/medley/issues/101

could this be the same problem?

nbriggs commented 3 years ago

It could certainly be another problem with sloppy code not handling correctly cases where the top bit is set in the 32-bit word. I should be able to look at this tomorrow.

masinter commented 3 years ago

image might be helpful

masinter commented 3 years ago

. WHO CALLS \SIMULATE.UNBIND leads me to image

nbriggs commented 3 years ago

I fixed the mvs.c simulate_unbind() so that it frees the frame extension that it just got rid of -- replacing the nextblock of the frame with a free block made from where the stack pointer ended up, and stretching as far as the old nextblock used to point to.

It made no difference to the way the code failed.

masinter commented 3 years ago

I tried starting with gdb ldex and stop in simulate_unbind and various other places and I got the stack overflow without hitting any bbreakpoints. What am I missing. Do I need to recompile with -O0 ?

nbriggs commented 3 years ago

The compiler does inlining, so yes, probably -O0 would be best. I use lldb rather than gdb, since it was compiled (on the Mac) with clang. lldb knows how to set the breakpoints on inlined code -- there were three places, I think.

The one thing that I think is different between the Lisp code and the C code for simulate_unbind is that the Lisp code sets the slow-return on either the passed in returnee, or the function referenced by this frame's ALINK. The C code doesn't do that in either the actual or simulated unbind, it only sets the slow flag if the returnee was passed.

On May 11, 2021, at 10:28 PM, Larry Masinter @.***> wrote:

I tried starting with gdb ldex and stop in simulate_unbind and various other places and I got the stack overflow without hitting any bbreakpoints. What am I missing. Do I need to recompile with -O0 ?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Interlisp/medley/issues/19#issuecomment-839454165, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6DAWLJ26BCACQAREDRHUTTNIGWHANCNFSM4QOJVGOA.

masinter commented 3 years ago

the mklist in values that creates the list -- what happens when the free list is empty, and it needs to GC? I was stepping through the code looking to see if there was some off-by-1 that would cause some stack use to grow each time, but maybe it isn't every time. Put a

nbriggs commented 3 years ago

It happens every time. I added a printf() to the simulate_unbind() code to show the stack pointer on entry, and it's always 4 bytes more than the previous time. I also changed the STKBUG1 code so that it returns (CL:VALUES (EQ 0 (RAND 0 10)) 42) so that it doesn't run the system out of stack every time you try it.

masinter commented 3 years ago

i'm trying to make sense of this code. This is from OP_miscn arg_count is 2.

int OP_miscn(int misc_index, int arg_count) {
  register LispPTR *stk;
  register int result;
  static LispPTR args[255];

  /* Put the Args into a Vector */

  args[0] = NIL_PTR;
  stk = ((LispPTR *)CurrentStackPTR) + 1;

  {
    register int arg_num = arg_count;
    if (arg_num > 0) {
      *stk++ = (LispPTR)TopOfStack;
      while (arg_num > 0) args[--arg_num] = *--stk;
    }
  }

then leaving out irrelevant bits

  result = values(arg_count, args);
  PC += 3;
  CurrentStackPTR = (DLword *)(stk - 1);
  TopOfStack = (LispPTR)result;
  return (0);

as if it were returning from OP_miscn. But it's not. In our fail case, the values function has also already walked back the stack.

masinter commented 3 years ago

for reproducing problem stackbug.tar.gz I'm putting up the "help wanted" tag to see if we can get a fresh set of eyes, even though there's a lot of context

hjellinek commented 4 days ago

I wrote this to reproduce the problem.

(DEFUN TEST-BOMB (N BOMB-AT)
      (FLET ((MULTI-VALUE-RETURNER (I) (IF (= I BOMB-AT) (IL:ERROR "bomb") (VALUES 1 2 3 4))))
        (DOTIMES (I N) (FORMAT T "~d " I)
             (MULTIPLE-VALUE-BIND (A B C D)
                          (MULTI-VALUE-RETURNER I)))))

Call it by (test-bomb 1024 1023) once. if Medley doesn't hang, call it again. In my experience it will count up to no more than ~40, then Medley will hang.

nbriggs commented 4 days ago

Compiled (which compiler?) or interpreted? I'm running it interpreted -- and I haven't managed to get it to hang, it always counts up to 1023 and then does the IL:ERROR. I even tried (TEST-BOMB 16384 -1) and it ran through the end. I've tried it compiled with CL:COMPILE and IL:COMPILE - always completes, never hangs. It happens to be in a full.sysout built on the rmk36--Tedit-fifth-round-2 branch, maiko is a2f6215 (head of master branch) with some debug mods, but nothing affecting the interpreter.

hjellinek commented 4 days ago

Interesting... I was running it interpreted. I wonder what accounts for the difference in outcomes.

nbriggs commented 3 days ago

I wish I knew - I thought we had an understanding of what was going wrong, but I've also been unable to reproduce the problem with current maiko and medley full sysouts and Larry's STACKBUG test case from further back in this comment thread (May 8, 2021) - either interpreted or compiled.

hjellinek commented 3 days ago

By the way, I attached a macOS crash report to issue #1888.

nbriggs commented 3 days ago

Investigated the macOS crash report - it's a GC problem, possibly only indirectly related to the stack issue. It was trying to DELREF a cons page (at the end of N_OP_Cons), and took a trap in the reference count handling code -- which is a large macro so doesn't show up nicely in the backtrace, just the source line number of where the macro is invoked, not the line within the macro.

It might help to rewrite the macro as a function (which we could inline when not debugging) - but to catch this, I think we're going to need to have it either fail when we've got a debugger attached, or if it has failed into uraid with a fault can attach a debugger (lldb on macOS) after the fact.

hjellinek commented 3 days ago

Since it sounds like my crash is unrelated to this issue (#19), I will reopen issue #1888 and quote your comment above.