Interlisp / medley

The main repo for the Medley Interlisp project. Wiki, Issues are here. Other repositories include maiko (the VM implementation) and Interlisp.github.io (web site sources)
https://Interlisp.org
MIT License

Reliable bus error - Uraid #1122

Closed rmkaplan closed 1 year ago

rmkaplan commented 1 year ago

In a release sysout (or any other Full), make or get a fuller.database (run scripts/loadup-db.sh)

Restart the sysout, (LOAD 'fuller.database) from wherever it is.

. SHOW PATHS TO TEDIT (TEDIT probably isn't significant, anything else that is called from somewhere)

You eventually fall into URAID with an unrecoverable bus error. The stack ends in \MAIKO.DORECLAIM. Above it on the stack is a deep recursion of FOREST/BREAK/CYCLES, presumably leading to some interaction of garbage collection and stack overflow.

The recursion may be correct (it is basically trying to figure out some sort of massive graph) or it may be a bad looping algorithm.

Either way, the bug is that you end up in a nonrecoverable state.

nbriggs commented 1 year ago

If I run scripts/loadup-db.sh it fails when loading GITFNS with

In ERROR:
Can't find a clone directory for NOTECARDS
NIL

rmkaplan commented 1 year ago

Sorry about that, an edit got lost.

I put out a PR for an update that suppresses the clone-not-found error in the default case.

But before that you can also just RETFROM(GIT-MAKE-PROJECT).

nbriggs commented 1 year ago

It doesn't drop into URAID for me:

image

But it's not clear that the result is correct, since if I do . SHOW PATHS TO PAGEFULLFN the result is the same as for . SHOW PATHS TO TEDIT

image

nbriggs commented 1 year ago

(or at least pretty close)

masinter commented 1 year ago

I had a similar experience. The simple way to segregate material is to divide up the database into sets: not one per file, but some larger categories (sources, library, lispusers, ...).

I even ran it with -nogreet

rmkaplan commented 1 year ago

It is a separate question whether there should be separate databases for all of lispusers and all of library, in addition to the current fuller.database that includes sources and just the other packages that happen to be loaded into the full sysout.

Independently, whatever is going on should not end up in an unrecoverable URAID. You can see this (or at least I can see it) in the sysout that I stored in my drop box, at the link below. Type
. SHOW PATHS TO TEDIT

https://www.dropbox.com/s/purjpcjk9zq49fj/BAD.SYSOUT?dl=0

masinter commented 1 year ago

It was a feature that when the user changes, it's supposed to offer to undo your GREET and do mine. I don't have a WMEDLEY, my whereis.hash is a different version...

I tried your sysout and SHOW PATHS and I got unrecoverable URAIDs, but for various reasons: bad refcount and an arrayblock error....

It seems you have "SHOW PATHS" set up to call grapher.

I noticed that if I restrict the scope so the tree was relatively finite

. SHOW PATHS FROM SPY.BUTTON AMONG ANY ON SPY

I did get a tree display. I remember turning that off early, but hadn't tried turning it on again in the last two years. I didn't look too hard, but it seems that GRAPHER is going to some effort to avoid GC overflows... I will need to look further, but try turning off the grapher/Masterscope feature for now.

image with a GC error.

masinter commented 1 year ago

Looks like a stack overflow, likely in the middle of something "uninterruptible" (is or should be). Simple stack overflow cases can be error prone. Using grapher to show a call tree should have some limits on the maximum size/depth. There's a variable of functions not to show paths through -- that should be updated.

rmkaplan commented 1 year ago

I didn’t recognize that the reason Nick didn’t see it in the Release sysout was that it wasn’t in the mode of trying to produce the graph.

I don’t know why or where the setting is in my environment that is turning that on. I don’t see anything in my Init, but that’s a separate issue. (The problem with you seeing the WMEDLEY error may have been fixed in the GITFNS update; the BAD sysout was made before the most recent update of GITFNS. But … )

The stack overflow is likely correct, given the complexity of the relationships. Stack overflows sometimes go into Uraid, but usually in a recoverable Hard Reset way. But this is fatal.

I don’t see the Uraid message that we normally see when there is an error in uninterruptible code. So that may not be the issue. The message I see is always about the GC trying to decrement a 0 reference count.
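
As an illustration of the limit suggested above (a cap on the size and depth of a call tree, plus a list of functions not to show paths through), here is a minimal sketch in Python rather than Interlisp. It is not Masterscope or GRAPHER code. The names `paths_to`, `callers`, `max_depth`, `max_paths`, and `skip` are hypothetical, and the caller table is assumed to be an ordinary dictionary. The walk uses an explicit work list instead of recursion, so a very deep or cyclic graph cannot exhaust the control stack.

```python
def paths_to(target, callers, max_depth=6, max_paths=50, skip=frozenset()):
    """Enumerate caller chains that end at `target`, bounded in depth and count.

    `callers` maps a function name to the names of the functions that call it
    (a hypothetical stand-in for a cross-reference database).
    """
    results = []
    # Each work item is (current function, chain from target up to it).
    work = [(target, (target,))]
    while work and len(results) < max_paths:
        node, chain = work.pop()
        # Drop excluded functions and anything already on this chain (cycle break).
        parents = [p for p in callers.get(node, ())
                   if p not in skip and p not in chain]
        if not parents or len(chain) - 1 >= max_depth:
            # Top of the chain, or depth limit reached: record caller-first.
            results.append(chain[::-1])
            continue
        for parent in parents:
            work.append((parent, chain + (parent,)))
    return results
```

With limits like these, an unrestricted query degrades to a truncated result rather than an unbounded traversal; which defaults would suit a real database is an open question.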

masinter commented 1 year ago

I believe just loading Lispusers BROWSER will change what SHOW PATHS does.

I'm not sure of what is happening, but there are several reasons to believe that the handling of resource exhaustion with Medley 3.5 hasn't been tested or debugged.

In the meanwhile,

nbriggs commented 1 year ago

Could you check the SHA checksum of BAD.SYSOUT -- I see

% shasum ~/Downloads/BAD.SYSOUT
ed42c8ebbf1e96b45fb587c475afec99450c48ab  /Users/briggs/Downloads/BAD.SYSOUT

and the sysout won't start -- it puts up the initial screen image (wrong dimensions; garbled) but then never really starts running.
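
For anyone comparing downloads on a machine without `shasum`, the same check can be sketched in Python. With no options `shasum` computes SHA-1, which matches the 40-hex-digit value above. The helper names `sha1_of_file` and `matches_expected` are made up for this example.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large sysout is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's SHA-1 with an expected digest, ignoring case and whitespace."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("BAD.SYSOUT", "ed42c8ebbf1e96b45fb587c475afec99450c48ab")` would report whether a local copy matches the download described here, assuming the file is in the current directory.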

rmkaplan commented 1 year ago

Since Larry determined that you need to have BROWSER.LCOM loaded in order to get the graph, I was able to get a cleaner failure in the release sysout, without whatever extra junk was in the earlier sysout and without the implicit load of BROWSER in my Init file.

So, I ran the release sysout with -NOGREET, loaded library/BROWSER.LCOM from the release file set, then loaded fuller.database, and did the . SHOW PATHS TO TEDIT. And it failed with a bad arrayblock in the garbage collector.

rmkaplan commented 1 year ago

But still it would be good to figure out why the unlimited depth in this situation leads to an unrecoverable crash as opposed to a recoverable stack overflow.

masinter commented 1 year ago

#1119 and #1159 go a long way to fix hardreset. Recovery from stack overflow without RAID needs a bigger cushion of stack space, and/or reducing the stack space needed for a break, and/or not letting other (unnecessary?) processes run when handling an overflow.

masinter commented 1 year ago

#1199 and #1085 are the issues that are left. Closing this one.