Open masinter opened 4 years ago
Discussion in https://github.com/Interlisp/medley/pull/128 has some more context on PFCOPYBYTES
@rmkaplan gave example If you apply USEE to Unicode/Xerox/XCCS-0,41-50,340-344,356-361.TXT, you should get a stack overflow. Not because of any particular recursion, but because the stack is smashed. Or maybe Unicode/Xerox/XCCS-JIS.TXT
This Unicode file has a lot of non-ascii characters, that’s where this was first encountered. I don’t remember if that matters, it may be that running this on a large ascii file will provoke in the same way. There is probably a message thread on this somewhere.
The problem is that there is an external format, so the XCCS flag in the stream is turned off, and there is a closed call to the format’s character-reading function that has to return the number of bytes read as a second value USEE.txt
trying to cut down to a small reproducible problem gives a stack overflow after 14 screenfuls if compiled but not cl:compile
The original USEE code fails on a largish printable ASCII file too, so it might let you rule out some of the paths.
Also, I would be surprised if the linefeed stuff and the \OUTCHAR make a difference, since those don’t have anything to do with multiple values. My bet is that it would fail if you just read the file and throw away the characters.
Quite interesting that the compiler makes a difference.
On May 8, 2021, at 1:40 PM, Larry Masinter @.***> wrote:
trying to cut down to a small reproducible problem https://user-images.githubusercontent.com/1116587/117552857-d0891500-b002-11eb-9241-4a2ea6ac7836.png gives a stack overflow after 14 screenfuls if compiled but not cl:compile
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Interlisp/medley/issues/19#issuecomment-835504289, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQSTUJOZSD37NVM4BPVMHQDTMWOUTANCNFSM4QOJVGOA.
On the theory that this doesn’t have to do with what's in the actual INCCODEFN, you might try replacing that with just (CL:VALUES (BIN SRCFIL) 0).
On May 8, 2021, at 2:40 PM, Larry Masinter @.***> wrote:
https://user-images.githubusercontent.com/1116587/117554243-45f8e380-b00b-11eb-9e73-c9533b891514.png — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Interlisp/medley/issues/19#issuecomment-835527221, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQSTUJJUTNTAHLPSKQ4KVP3TMWVURANCNFSM4QOJVGOA.
The multiple-values-return STKBUG1
interesting the MISCN opcode ... the problem doesn't occur if you don't compile STKBUG1 or if you CL:COMPILE(STACKBUG)
That looks odd. The MISCN sub opcodes are 1 for VALUES and 6 for VALUES_LIST.
What the CL:VALUES opcode is doing is walking up the control stack, looking at what the next opcodes are doing and seeing if they're prepared to accept multiple values. If it sees a [RETURN] it goes up another frame. It traces jumps! It looks for a function call to \MVLIST, but if it sees an UNBIND it calls a C
/* Simulate the effect of UNBIND on a frame, to back us up */
/* to where we ought to be when we return multiple values. */
If I use IL:COMPILE I see an UNBIND right after the APPLYFN, but if I use CL:COMPILE it moves things around so the UNBIND happens AFTER the \MVLIST call. now
LispPTR *N_OP_unbind(register LispPTR *stack_pointer) {
register DLword num; /* number of unbind sot */
register LispPTR *ppvar; /* pointer to last PVAR slot. */
register DLword i; /* temporary for control */
register LispPTR value;
#ifdef TRACE
printPC();
printf("TRACE: N_OP_unbind()\n");
#endif
/* now, stack_pointer points the latter part in slot */
for (; !(*--stack_pointer & 0x80000000);)
; /* scan (until MSB == 1) */
value = *stack_pointer;
num = (DLword) ~(value >> 16);
ppvar = (LispPTR *)(PVar + 2 + GetLoWord(value));
value = 0xffffffff;
for (i = 0; i < num; i++) { *--ppvar = value; }
return (stack_pointer);
}
and
void simulate_unbind(FX2 *frame, int unbind_count, FX2 *returner) {
int unbind;
LispPTR *stackptr = (LispPTR *)(Stackspace + frame->nextblock);
for (unbind = 0; unbind < unbind_count; unbind++) {
register int value;
register LispPTR *lastpvar;
int bindnvalues;
for (; ((int)*--stackptr >= 0);)
; /* find the binding mark */
value = (int)*stackptr;
lastpvar = (LispPTR *)((DLword *)frame + FRAMESIZE + 2 + GetLoWord(value));
;
bindnvalues = (~value) >> 16;
for (value = bindnvalues; --value >= 0;) { *--lastpvar = 0xffffffff; }
/* This line caused \NSMAIL.READ.HEADING to smash memory, */
/* so I removed it 21 Jul 91 --JDS. This was the only */
/* difference between this function and the UNWIND code */
/* in inlineC.h */
/* MAKEFREEBLOCK(stackptr, (DLword *)stackptr-nextblock); */
}
if (returner)
returner->fast = 0; /* since we've destroyed contiguity */
/* in the stack, but that only
matters if there's a return. */
}
don't look that similar
why lookee here: https://github.com/Interlisp/medley/issues/101
could this be the same problem?
It could certainly be another problem with sloppy code not handling correctly cases where the top bit is set in the 32-bit word. I should be able to look at this tomorrow.
might be helpful
. WHO CALLS \SIMULATE.UNBIND leads me to
I fixed the mvs.c simulate_unbind() so that it frees the frame extension that it just got rid of -- replacing the nextblock of the frame with a free block made from where the stack pointer ended up, and stretching as far as the old nextblock used to point to.
It made no difference to the way the code failed.
I tried starting with gdb ldex and stop in simulate_unbind and various other places and I got the stack overflow without hitting any bbreakpoints. What am I missing. Do I need to recompile with -O0 ?
The compiler does inlining, so yes, probably -O0 would be best. I use lldb rather than gdb, since it was compiled (on the Mac) with clang. lldb knows how to set the breakpoints on inlined code -- there were three places, I think.
The one thing that I think is different between the Lisp code and the C code for simulate_unbind is that the Lisp code sets the slow-return on either the passed in returnee, or the function referenced by this frame's ALINK. The C code doesn't do that in either the actual or simulated unbind, it only sets the slow flag if the returnee was passed.
On May 11, 2021, at 10:28 PM, Larry Masinter @.***> wrote:
I tried starting with gdb ldex and stop in simulate_unbind and various other places and I got the stack overflow without hitting any bbreakpoints. What am I missing. Do I need to recompile with -O0 ?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Interlisp/medley/issues/19#issuecomment-839454165, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6DAWLJ26BCACQAREDRHUTTNIGWHANCNFSM4QOJVGOA.
the mklist in values that creates the list -- what happens when the free list is empty, and it needs to GC? I was stepping through the code looking to see if there was some off-by-1 that would cause some stack use to grow each time, but maybe it isn't every time. Put a
It happens every time. I added a printf() to the simulate_unbind() code to show the stack pointer on entry, and it's always 4 bytes more than the previous time. I also changed the STKBUG1 code so that it returns (CL:VALUES (EQ 0 (RAND 0 10)) 42)
so that it doesn't run the system out of stack every time you try it.
i'm trying to make sense of this code. This is from OP_miscn arg_count is 2.
int OP_miscn(int misc_index, int arg_count) {
register LispPTR *stk;
register int result;
static LispPTR args[255];
/* Put the Args into a Vector */
args[0] = NIL_PTR;
stk = ((LispPTR *)CurrentStackPTR) + 1;
{
register int arg_num = arg_count;
if (arg_num > 0) {
*stk++ = (LispPTR)TopOfStack;
while (arg_num > 0) args[--arg_num] = *--stk;
}
}
then leaving out irrelevant bits
result = values(arg_count, args);
PC += 3;
CurrentStackPTR = (DLword *)(stk - 1);
TopOfStack = (LispPTR)result;
return (0);
as if it were returning from OP_miscn. But it's not. In our fail case, the values function has also already walked back the stack.
for reproducing problem stackbug.tar.gz I'm putting up the "help wanted" tag to see if we can get a fresh set of eyes, even though there's a lot of context
I wrote this to reproduce the problem.
(DEFUN TEST-BOMB (N BOMB-AT)
(FLET ((MULTI-VALUE-RETURNER (I) (IF (= I BOMB-AT) (IL:ERROR "bomb") (VALUES 1 2 3 4))))
(DOTIMES (I N) (FORMAT T "~d " I)
(MULTIPLE-VALUE-BIND (A B C D)
(MULTI-VALUE-RETURNER I)))))
Call it by (test-bomb 1024 1023)
once. if Medley doesn't hang, call it again. In my experience it will count up to no more than ~40, then Medley will hang.
Compiled (which compiler?) or interpreted? I'm running it interpreted -- and I haven't managed to get it to hang, it always counts up to 1023 and then does the IL:ERROR
. I even tried (TEST-BOMB 16384 -1)
and it ran through the end. I've tried it compiled with CL:COMPILE
and IL:COMPILE
- always completes, never hangs. It happens to be in a full.sysout built on the rmk36--Tedit-fifth-round-2 branch, maiko is a2f6215 (head of master branch) with some debug mods, but nothing affecting the interpreter.
Interesting... I was running it interpreted. I wonder what accounts for the difference in outcomes.
I wish I knew - I thought we had an understanding of what was going wrong, but I've also been unable to reproduce the problem with current maiko and medley full sysouts and Larry's STACKBUG test case from further back in this comment thread (May 8, 2021) - either interpreted or compiled.
By the way, I attached a macOS crash report to issue #1888.
Investigated the macOS crash report - it's a GC problem, possibly only indirectly related to the stack issue. It was trying to DELREF a cons page (at the end of N_OP_Cons), and took a trap in the reference count handling code -- which is a large macro so doesn't show up nicely in the backtrace, just the source line number of where the macro is invoked, not the line within the macro.
It might help to rewrite the macro as a function (which we could inline when not debugging) - but to catch this, I think we're going to need to have it either fail when we've got a debugger attached, or if it has failed into uraid with a fault can attach a debugger (lldb on macOS) after the fact.
Since it sounds like my crash is unrelated to this issue (#19), I will reopen issue #1888 and quote your comment above.
Here’s a hypothesis:
The bug is in MULTIPLE-VALUE-CALL, it isn’t popping the stack properly when it receives more than a single value.
If this is true, then it is a total accident that I tripped over this:
I used SEE of a UTF8 encoded file to test my conversion interface, SEE ends up calling PFCOPYBYTES to interpret font changes in case it is a Lisp source file (which my file is not). PFCOPYBYTES takes start and end byte positions, because it is typically called from PF not for the whole file (as for SEE) but to pick out the definition of functions at particular byte positions as provided by a filemap.
PFCOPYBYTES is set up to deal with the fact that in the NS world (and other external format worlds) there is not a one-to-one correspondence between byte positions and character positions. That’s not an issue for most applications, but it is crucial for the PF invocation.
The low-level character readers hang off the open macro \NSIN, which takes an argument that says whether the number of bytes read should also be returned with the code. In the NS case, the second value is set to a local variable in the inline compilation. But in the case where there is a call out to a different encoding, the encoding-function is given a flag that says it should return the number of bytes it read, as a second value. That’s what’s happening in this case, and since it is repeated many times in a loop and if the stack isn’t cleaned up properly on each invocation, we would eventually get the overflow.
Putting aside this possible bug, this is an unfortunate implementation of PFCOPYBYTES—and according to the edit dates, it is probably something that I did for NS about 25 years ago, screwing up a simpler implementation that Larry had originally done about 10 years before that.
The problem is that it exports the details of a very low-level macroed interface to a very high-level function. The more direct implementation, given that bytes and characters are not 1-to-1, would have been to test the actual file ptr at each character, instead of the over-eager optimization that tries to second guess the counting with simple arithmetic. O better still, to recognize this as a first-class conceptual issue by adding a flag argument to READC and READCCODE indicating whether the bytes-read should also be returned as a second value.
On Aug 2, 2020, at 7:11 PM, Ron Kaplan ron.kaplan@post.harvard.edu wrote:
Part of the mystery is that the stack overflow doesn’t happen until it has read about 20K characters. So somehow this must be growing over time, even though it doesn’t look like anything should affect that frame from inside the loop in PFCOPYBYTES. The loop calls \NSIN and \OUTCHAR, and in this case the NSIN does an apply* to \JISIN. If you put a break on \JSIN, I wonder if the stack frame is big from the get go, or whether it grows over repeated calls.
On Aug 2, 2020, at 6:41 PM, Ron Kaplan ron.kaplan@post.harvard.edu wrote:
The last one is from the RESETLST inside PFCOPYBYTES? The other ones from the WITH-OPEN-FILEs in the little test routine?
How do you read the length of that frame, is it the difference between 4c22 and f092?
On Aug 2, 2020, at 5:53 PM, Nick Briggs nicholas.h.briggs@gmail.com wrote:
So this is what I see when I use the debug tools to dump the stack -- seems to reflect what URAID says, but, near the end of the stack,
4c22: SI:UNWIND-PROTECT alink: 0x4c10, clink: 0x0000, next: 0xf092
that SI:UNWIND-PROTECT appears to have allocated a huge amount of stack space, for reasons I don't understand.