ITS sometimes hangs on shutdown and never returns to NTS DDT

eswenson1 commented 1 year ago

I've seen the issue on both a KA (pdp10-ka) and a DB (klh10) ITS systems. When you attempt to shut down ITS with LOCK you may run into similar console output:

DB ITS going down in 4:59
q
:KILL
*$$u
 LOGOUT EJS    0 16:29:58

DB ITS 1651 SYSTEM JOB USING THIS CONSOLE. 16:29:58

DB ITS 1651 GOING DOWN IN 00:04:56.   16:29:58

DB ITS 1651 NOT IN OPERATION AT 16:29:58
PFTHMG DRAGON CHANNA _DRGN_ TIMES  WRITE  16:30:00
PFTHMG DRAGON CHANNA LOGOUT TIMES  DELRNM 16:30:00
IT IS NOW  4:31:16 PM EDT, MONDAY, SEP 18, 2023

ITS does not return to NTS DDT but continues "running" indefinitely. Analysis of the situation shows that ITS is waiting for all TUTs to get written out to disk, but the code that writes out the TUTs is waiting for the QACTTM counter to become positive. QACTTM is set to -1, and then at various times, incremented (with AOS). Even during the "shutdown hang", it can be seen to increment past 0 (I've seen it go to 1 and 2), but always seems to get reset to -1, and possibly incremented to 1, before the TUT writing code checks it.

As a result, the TUTs are never written out, and because there may be one or more TUTs marked as needing to get written out, ITS will not execute the logic that returns to NTS DDT. It is waiting on the TUTs to get written out.

The issue appears to be a timing issue. At first, I noticed the issue on a pdp10-ka (KA) instance, where I had added disks to make the total number of disks 5. I thought it had something to do with the addition of the 5th disk. But then, I ran into the same issue with a newly-built KLH10 DB instance, where there had been no changes made -- the build was a pristine one.

There appears to be a negative interaction between the slow clock (which, when in the "shutdown" mode, is the one that checks to see if the TUTs are all written out before returning to NTS DDT), and the disk interrupts and fast clock used to ensure disk i/o is done.

I added a fix to SYSTEM; DISK >, which guarantees that TUTs will be written out if doing so was skipped over due to QACTTM being non-positive more than 10 times. This allows the system to shut down in the above situation.

eswenson1 commented 1 year ago

Also, if, when in the above situation you force a return to simulator command level (^) and quit the simulator. Upon restart, you may see disk errors (duplicate blocks, TUT errors):

.     ;@      SALV    PACK 2, UNIT #0
SHARES BLOCK WITH SOME OTHER FILE
.     ;@      SALV    PACK 2, UNIT #0
SHARES BLOCK WITH SOME OTHER FILE
SHARED BLOCKS
LOOKING FOR SHARED BLOCKS
.     ;@      XITS   0-764, TUT= 2
.     ;@      XITS   0-765, TUT= 2
.     ;@      SALV   0-764, TUT= 2
.     ;@      SALV   0-765, TUT= 2

*** ERROR *** SYSTEM MAY NOT BE BROUGHT BACK UP

Running the salvager before booting may result in messages like this:

l$ddt
t$salv bin
CHKR$G
SALVAGER.317
ACTIVE UNITS ARE:0123
UNIT #3 ID IS 1 PK #1
UNIT #2 ID IS 0 PK #0
UNIT #1 ID IS 3 PK #3
UNIT #0 ID IS 2 PK #2
USE MFD FROM UNIT:2
READ ALL BLOCKS OF ALL FILES?Y
GET USR DIRS FROM UNIT:2
.TEMP.; NO FILES, USER DIRECTORY DELETED
FILE UNPROTECTED IN OLD TUT - CHANNA;LOGOUT TIMES   PACK 0, UNIT #2
Block 7603
FILE UNPROTECTED IN OLD TUT - DRAGON;DRAGON HOARD   PACK 0, UNIT #2
Block 7547
FILE UNPROTECTED IN OLD TUT - DRAGON;CDATA  12      PACK 0, UNIT #2
Block 7540
TUT #0 27 1_2, PRINT?N
TUT #0 NEEDS UPDATING - WRITE?Y
TUT #2 25 1_2 3 1_0 3 0_1, PRINT?N
TUT #2 NEEDS UPDATING - WRITE?T
(Y OR N) Y
TUT #3 2 1_2, PRINT?N
TUT #3 NEEDS UPDATING - WRITE?Y
WRITE OUT CHANGES IN MFD?Y

DDT

eswenson1 commented 1 year ago

The slow clock runs this code:

SSLCK:  SKIPGE SHUTDN   ;SYSTEM "DEAD" =>
     PUSHJ P,COFFIN ;GO TO DDT AFTER RIGOR MORTIS SETS IN.
    SETCMB A,SCKPHS ;CHECK SYSTEM CHECKSUMS EVERY OTHER SLOW CLOCK TICK
    MOVE T,SUPCOP   ;WHICH IS ONCE PER SECOND.

As you can see, if SHUTDN is set (it is), then we call COFFIN:

COFFIN: MOVE T,NLOOSP       ;MUST SWAP OUT ALL LOOSE PAGES
    ADDB T,SWPOPR
    JUMPN T,CPOPJ       ;SWAPPAGE OUT STILL GOING ON
    MOVEI T,NQCHN+1+NQS-1
    SKIPGE QSGL(T)
     SOJG T,.-1
    SKIPL QSGL(T)       ;NOT OVER IF ANY DISK CHANNELS ARE ACTIVE
     POPJ P,
    MOVSI T,040000
    MOVE TT,QACTB   ;SEE IF DISK STUFF NOT OUT YET
    TDNE TT,QMDRO
     POPJ P,    ;MFD NOT OUT
    MOVSI A,-NQS
COFFI8: TDNE T,QTUTO(A)
     JRST COFFI9    ;TUT NEVER CAME IN
    TDNE TT,QTUTO(A)
     POPJ P,    ;TUT NOT OUT
COFFI9: AOBJN A,COFFI8
    MOVSI A,-QNUD
COFFI1: SKIPN QSNUD(A)
     JRST COFFI2
    TDNE TT,QSNLCN(A)
     POPJ P,    ;UFD NOT WRITTEN
COFFI2: AOBJN A,COFFI1

COFFIN checks to see if pending disk I/o is complete. ;t checks to see if the MFD has been written out (it has in my case), and then checks to see if the “TUT NOT OUT” — in other words that one of the TUTs have not been written out. We find this is the case, and DON’T proceed with the shutdown.

Looking at the TUTs will show that some of them (it varies, how many) need to get written out, preventing shutdown:

QTUTO looks like this:

44056:  020000714000
44057:  010000716000
44060:  004000720000
44061:  002000722000
44062:  001000724000

(Sometimes, I've seen it with only one TUT needing to get written out -- it all depends on what disks were updated during the bootload prior to shutdown).

eswenson1 commented 1 year ago

The reason why I believe this is a timing issue is because sometimes shutdown hangs and sometimes it doesn't. Sometimes shutdown takes a lot longer (2 minutes), sometimes shorter, sometimes within seconds.

If I put a breakpoint here (QTDW):

050550  337000 044214   QTDW:   SKIPG QACTTM
050551  254000 050616        JRST QUDW
050552  205400 777773       MOVSI Q,-NQS
050553  335010 044171   QTDW1:  SKIPGE QACT(Q)
050554  254000 050615        JRST QTDWA
050555  200450 046636       MOVE J,DCHBT(Q)
050556  331010 044056       SKIPL QTUTO(Q)
050557  616450 044056       TDNN J,QTUTO(Q)
050560  254000 050615        JRST QTDWA
050561  205600 240000       MOVSI H,240000
050562  612610 044056       TDNE H,QTUTO(Q)
050563  254000 050614        JRST QTDWA1
050564  200300 200666       MOVE TT,TIME        ;DON'T WRITE TUTS TOO OFTEN
050565  274310 044070       SUB TT,QTWRTM(Q)    ;BECAUSE THE TUT IS LOCKED WHILE IT'S BEING WRITTEN
050566  305300 000144       CAIGE TT,100.

When in the hang situation, we do get here periodically. All of the time QCCTM is either 0 or 777777777777, and we jump to QUDW (wait). We never get to the TUT write logic at 50564. The TUTs stay unwritten.

However, if I stop at breakpoints, continue from breakpoints, get the KA ITS REVIVED message, and repeat this a bunch of times, eventually, we do get to the TUT write logic and after N iterations (where N is not equal to TUT, but some multiple), we eventually write out all the TUTs. Continuing at that point gets to NTS DDT.

However, if I don’t put breakpoints at 50550 (and a few other places in that code fragment, above), we NEVER get to 50564 and the TUTs don’t get written out, and we never shut down.

eswenson1 commented 1 year ago

One time, my KA instance (pdp10-ka) wouldn't shut down. Examining the TUTs showed that just one needed to be written out. In order to confirm my theory that the unwritten TUT was preventing shutdown from proceeding,

Just to test my theory, I forced the TUT to have no flags in the LH, indicating it didn't need to be read in or written out:

sim> d 44062 000000724000
sim> c

And immediately, I got this:

SHUTDOWN COMPLETE
PI LEVEL 7 BUGDDT.  TYPE <ALTMODE>P TO CONTINUE.
YOU ARE NOW IN DDT.
BUGPC/   CAI COFFI4+1   $Q-2/   JRST COFFI7

As soon as I bashed the TUT bits for disk 4 to 0, the system shut down fine.

eswenson1 commented 1 year ago

For the written record, some more evidence:

The documentation on QACTTM is this (in SYSTEM;DISK >):

QACTTM  ; Activity TOO RECENT

044214  777777 777777   QACTTM: -1  ;SETOM ON READ OR WRITE DATA TRANSFER
                                ;AOS AT SSLCK IF + OK TO WRITE DIRS
                                ;IE DON'T TIE UP DISK TO WRITE DIR UNLESS THINGS ARE QUIET

When we are in the hung state, is sometimes 000000000000 or 7777777777777, but never > 0 (when the TUT checking logic runs). As a result, we always skip the disk writing code.

This logic:

050547  253400 050514   QMDWA:  AOBJN Q,QMDW
050550  337000 044214   QTDW:   SKIPG QACTTM
050551  254000 050616        JRST QUDW
050552  205400 777773       MOVSI Q,-NQS
050553  335010 044171   QTDW1:  SKIPGE QACT(Q)
050554  254000 050615        JRST QTDWA
050555  200450 046636       MOVE J,DCHBT(Q)
050556  331010 044056       SKIPL QTUTO(Q)
050557  616450 044056       TDNN J,QTUTO(Q)
050560  254000 050615        JRST QTDWA
050561  205600 240000       MOVSI H,240000
050562  612610 044056       TDNE H,QTUTO(Q)
050563  254000 050614        JRST QTDWA1
050564  200300 200666       MOVE TT,TIME        ;DON'T WRITE TUTS TOO OFTEN
050565  274310 044070       SUB TT,QTWRTM(Q)    ;BECAUSE THE TUT IS LOCKED WHILE IT'S BEING WRITTEN
050566  305300 000144       CAIGE TT,100.
050567  254000 050615        JRST QTDWA
050570  272310 044070       ADDM TT,QTWRTM(Q)
050571  505400 200000       HRLI Q,200000
050572  202400 044052       MOVEM Q,QDWIP
050573  552400 043006       HRRZM Q,DWSKN
050574  201300 000013       MOVEI TT,%QMTTW
050575  552300 042742       HRRZM TT,QSRAC+NQCHN
050576  205300 400000       MOVSI TT,(SETZ)
050577  437310 044056       IORB TT,QTUTO(Q)
050600  200610 044107       MOVE H,QPKNM(Q)     ;VERIFY THAT TUT IS NOT BEING CLOBBERED

At 50550 skips the relevant code when QACTTM is not >0, which when we hang, it never becomes. So this is why we don’t write out the TUTs. When in the hang state, I set this value to 1, and then we shut down properly.

I only see one place where QACTTM is incremented -- in the slow clock code, at offset 002255:

002252  436700 177573       IORM T,SUPCOR   ;INIT THE TTYS
                        SSLCK6:
                        ]
002253  333700 200631       SKIPLE T,AGERTE ;IF IT'S ENABLED, AGE SOME PAGES
002254  260640 002551        PUSHJ P,AGE
                        ;DROPS THROUGH
                                        Page 49
                        ;DROPS IN
                        IFN DL10P,[
                            SKIPN DL10F
                             JRST SSLCKA
                            SETOM DL10UP    ;COULD ALSO CHECK WHETHER PDP11 IS INCREMENTING THIS.
                            DATAI DLC,A
                            TRNE A,20   ;SEE IF PDP11 HALTED.
                             JRST SSLCKA
                            BUG INFO,[I/O PDP11 WENT DOWN]
                            SETZM DL10F
                            MOVEI T,%TYDL
                            MOVSI A,-NCT
                            TDNE T,TTYTYP(A)
                             SETOM TTYOAC(A)
                            AOBJN A,.-2
                        SSLCKA: ]
002255  350000 044214       AOS QACTTM      ;Bump this for benefit of QSK dir writers.

eswenson1 commented 1 year ago

I put a breakpoint at:

                        SSLCKA: ]
002255  350000 044214       AOS QACTTM      ;Bump this for benefit of QSK dir writers.
002256  336000 004740       SKIPN SWPOPR        ;If there is a request to swap out pages
002257  332000 004743        SKIPE SOLNG        ; or if there are pages now going out
002260  254000 002311         JRST SSKQ2        ; do not bother with LMEMFR, warnings.
002261  210040 144723       MOVN A,LMEMFR       ;See how many free low-memory pages.
002262  270040 144724       ADD A,MINCOR
002263  323040 002271       JUMPLE A,SSRCK      ;Okay
002264  350000 201104       AOS NCORRQ      ;Wake up core job to do some shuffling
002265  274040 144722       SUB A,MEMFR
002266  270040 144723       ADD A,LMEMFR        ;Core wanted minus high free
002267  333000 000001       SKIPLE A
002270  272040 004740        ADDM A,SWPOPR      ;Swap out to make room.

And see that the AOS QACTTM does increment from 777777777777 to 000000000000 and then to 000000000001. I’ve even seen it get to 000000000002. However, it always appears to get set to -1 via the above-mentioned SETOM QACTTM before we get the the TUT write loop.

It may be an interaction between interrupts, the slow clock and the fast clock. And it does appear to be a timing issue.

In fact, when I single step through a lot of this code, sometimes, when I “let it rip” (continue), we exit shutdown perfectly fine.

There is only one place where QACTTM gets bashed from its current value to 777777777777:

047643  552101 146327   QRDFN1: HRRZM B,MEMPNT(A)   ;PASS TO MAIN PROGRAM
047644  350003 042514       AOS QSBFS(C)
                        IFE DMDSK,  LDB B,[XWBLK QXWDS]
                             .ALSO  MOVEM B,QSLBLK(C)
                        ;XFER COMPLETE, DEACTIVATE CHANNEL
047645  476000 044214   QINT1B: SETOM QACTTM

Putting a breakpoint there, we see that we get there periodically — and each time, it sets QACTTM to 777777777777.

Then we get to the AOS, which goes from -1 to 0, and maybe some that get as high as 2, but before we get to the code in the TUT write loop, we’ve always hit 047645, and reset it to -1.

eswenson1 commented 1 year ago

I have found a solution, which may be considered a hack or workaround. In the code you see starting at 050550, we skip the TUT writing logic if QACTTM is < 1 (e.g. -1 or 0). When we do, add a counter and count the number of times we skip. When that number gets to some specific value, we reset it to 0, and then DON’T SKIP. That way, we’ll write out TUTs every so often. Picking the value of the threshold, of course, will take some experimentation. But it there are TUTs to write out, we should never skip processing them very long. This should make the file system a bit more reliable in any case, at the cost of doing more disk I/O than we used to.

My hack solution appears to work just fine. I can now shut down my ITS after GUNNER comes up, without having it hang.

Here is my fix:

*:srccom system;disk 1227,system;disk 1228

;COMPARISON OF DSK:SYSTEM;DISK 1227 AND DSK:SYSTEM;DISK 1228
;OPTIONS ARE    /3

**** FILE DSK:SYSTEM;DISK 1227, 47-2 (76081)
**** FILE DSK:SYSTEM;DISK 1228, 46-97 (76078)
QSKIPT: -10     ; number of TUT skips remaining
***************

**** FILE DSK:SYSTEM;DISK 1227, 80-49 (142781)
QMDWA:AOBJN Q,QMDW
QTDW:SKIPG QACTTM
 JRST QUDW
MOVSI Q,-NQS
**** FILE DSK:SYSTEM;DISK 1228, 80-49 (142830)
QTDWC:  AOS QSKIPT
SKIPG QSKIPT
  JRST QUDW
MOVEI Q,-10
MOVEM QQ,SKIPT
        JRST QTDWB
QMDWA:AOBJN Q,QMDW
QTDW:SKIPG QACTTM
 JRST QTDWC
QTDWB:MOVSI Q,-NQS
***************

:KILL
*

larsbrinkhoff commented 1 year ago

Thanks for your detailed analysis! I have read it a few times, but I don't think I have fully grasped what is going on yet. It's also curious only you have reported this. Are you running on some very fast computers, or some particular operating system?

Your solution does seem to be a workaround. Ideally we would come to an understanding exactly what the timing issue is, and close that window. But in lieu of that, let's have your hack in mind as a (temporary?) patch.

eswenson1 commented 1 year ago

I don’t think it is machine speed. I saw the problem first on a very slow Thinkpad laptop (about 10 years old). Then, to ease my debugging, I moved the entire directory to my very fast M2 Mac. I replaced the pdp10-ka executable with one built on the M2 Mac — and the problem occurred there. This was a 5-disk KA SYSTEM.

The fact that I saw this on a fresh build of a DB system still confounds me. It may be that the return to NTS DDT was just very, very slow and ai assumed the problem was the same. I added my patch though and then it shut down quickly.

Maybe the delay we see between shutdown and return to NTS DDT has always been due to this problem, but sometimes the timing works itself out quickly and sometimes not and, in my case, sometimes not at all.

desaster commented 7 months ago

I can confirm that this is randomly occurring for me as well. Running pdp10-ka on Raspberry Pi 5.

:lock
 LOCK.156
_5down
DO YOU REALLY WANT THE SYSTEM TO GO DOWN?
y
PLEASE ENTER A BRIEF MESSAGE TO USERS, ENDED BY ^C
UPI    LOCK   SYS    DOWN   MAIL   WRITE  10:22:46
^C
_
KA ITS going down in 5:00
CULPRIT =  UPI    LOCK   10:22:47
q
:KILL
KA ITS going down in 4:59
*:logoff
SYS: SYS3; TS LOGOFF - FILE NOT FOUND
:logout
 LOGOUT UPI    0 10:23:06

KA ITS 1651 SYSTEM JOB USING THIS CONSOLE. 10:23:06

KA ITS 1651 GOING DOWN IN 00:04:43.   10:23:06

KA ITS 1651 NOT IN OPERATION AT 10:23:06
TARAKA NAMDRG CHANNA _DRGN_ TIMES  WRITE  10:23:06
TARAKA NAMDRG CHANNA LOGOUT TIMES  DELRNM 10:23:06
PFTHMG DRAGON CHANNA _DRGN_ TIMES  WRITE  10:23:06
PFTHMG DRAGON CHANNA LOGOUT TIMES  DELRNM 10:23:06
IT IS NOW 10:23:15 AM EST, WEDNESDAY, FEB 21, 2024

eswenson1 commented 7 months ago

I could prepare a PR with my fix and you could try it.

desaster commented 7 months ago

I could prepare a PR with my fix and you could try it.

I'll definitely try it!

eswenson1 commented 7 months ago

@desaster Please try the above fix. All you need to do is get the version of SYSTEM;DISK 1228 from the PR, move it onto your ITS system, rebuild ITS, and reboot with your new ITS. See if, when you shut down, it now shuts down on its own.

larsbrinkhoff commented 7 months ago

@desaster, thanks for confirming. Since the Raspberry Pi 5 is the recommended board for the PiDP-10, I think it's even more important to get this fixed soon. I will try to get to the bottom as to why QACTTM never goes positive.

desaster commented 7 months ago

The fix seemed to work for me, so now I've been trying to reproduce the original issue. Unfortunately it's proving to be quite difficult, since now I can't get a vanilla installation to hang anymore.

I wrote a simple expect script to boot the the simulation, start ITS and shutdown. So far it's just not hanging anymore. I wonder if there's some specific activity I need to do in ITS to trigger the issue.

Automation script to start and shutdown ITS

``` #!/usr/bin/expect spawn $env(HOME)/ka10-simh-git/BIN/pdp10-ka pdp10.ini set timeout 30 expect_before timeout { send_user "Timeout exceeded!\r\n" send "\x1C" sleep 1 send "q\r" wait exit 1 } expect " DSKDMP" { send "its\r"; sleep 2; send "\033G" } expect "IN OPERATION" { sleep 1; send "\x1A"; sleep 1 } expect "Happy hacking!" { sleep 1; send ":login ROBOT\n" } expect "LOGIN ROBOT" { sleep 1; send ":lock\r" } expect " LOCK.156" { sleep 1; send "5down" } expect "DO YOU REALLY WANT THE SYSTEM TO GO DOWN?" { sleep 1 send "y" sleep 1 send \x03 sleep 1 send "q" sleep 1 } expect ":KILL" { sleep 1; send ":logout\r" } expect "SHUTDOWN COMPLETE" { sleep 1; send "\x1C" } expect "sim> " { sleep 1; send "q\r" } ```

larsbrinkhoff commented 7 months ago

It may well be the case that some activity triggers or aggravates the problem. One thing that springs to mind is disk I/O, in particular something to update the TUTs. That ought to be allocating or removing space for files.

eswenson1 commented 7 months ago

Yes, your script should probably cause more files to be written and perhaps created. The issue has to do with writing out TUTs so you need make sure that you set up ITS so there are some TUTs marked dirty.

of course any boot and shutdown will cause some disk writes, but in my experience, the bug manifests itself more, the more file output there is.

desaster commented 7 months ago

I tried doing a simple file copy, but it alone did not trigger the issue. I tried a different approach of letting the system sit idle for 300 seconds, and that seems to trigger the issue pretty consistently.

I will run more data in a for loop for ensuring the fix affects the issue, but of course the root cause will still be unknown.

Automation script with extra timeout

``` #!/usr/bin/expect spawn $env(HOME)/ka10-simh-git/BIN/pdp10-ka pdp10.ini set timeout 30 log_file vanilla.log expect_before timeout { send_user "Timeout exceeded!\r\n" send_log "SHUTDOWN_FAIL\n" send "\x1C" sleep 1 send "q\r" wait exit 1 } expect " DSKDMP" { send "its\r"; sleep 2; send "\033G" } expect "IN OPERATION" { sleep 1; send "\x1A"; sleep 1 } expect "Happy hacking!" { sleep 1; send ":login ROBOT\n" } expect "LOGIN ROBOT" { send_user "\r\nSleeping for a good while...\r\n" sleep 300 send ":lock\r" } expect " LOCK.156" { sleep 1; send "5down" } expect "DO YOU REALLY WANT THE SYSTEM TO GO DOWN?" { sleep 1 send "y" sleep 1 send \x03 sleep 1 send "q" sleep 1 } expect ":KILL" { sleep 1; send ":logout\r" } expect "SHUTDOWN COMPLETE" { sleep 1; send "\x1C" } expect "sim> " { sleep 1; send "q\r" } send_log "SHUTDOWN_SUCCESS\n" ```

eswenson1 commented 7 months ago

Just a comment on the above post by @desaster: Letting ITS sit for 5 minutes after a boot gives DEMSTR a chance to start up some of the DM Daemons -- such as BATCHN and GUNNER. And starting up both of these causes files to get created and updated. I think DEMSTR waits about 1 minute before starting to launch the configured daemons, and BATCHN and GUNNER have their own time delays for doing some of the work they do. BATCHN, for example, looks at job queues, runs jobs, and updates logs. GUNNER creates a log file and updates it.

Also, waiting 15 minutes will cause accounting updates to be done by DRAGON, but even waiting 5 minutes will cause DRAGON to write out some files.

So it is not surprising that waiting 5 minutes before shutting down will cause more TUTs to need to be written out upon shutdown.

eswenson1 commented 7 months ago

I don't really understand QACTTM, but I think it allows ITS to determine how recently it wrote out buffers/TUTs to disk. If it did so "too recently", it skips doing so. Code like this (in SYSTEM;DISK >):

       SKIPG QACTTM ;LAST ACTIVITY TOO RECENT
        JRST QUDW

basically skips writing buffers/TUTs to disk if the "last [disk] activity [was] too recent". The code in DISK either sets the value to -1 or or tests it for >0. The code in ITS does an "AOS" to increment it in the "slow clock" logic. So the clock will cause this increment from -1 to >0. It can increment past 0 (I've seen it) depending on how many "slow clock" ticks have occurred before it is set back to -1. It is set to -1 on read or write disk i/o.

And the problem, during shutdown is that a) shutdown won't complete if there are TUTs to write out, and b) we don't get to write them out because QACTTM gets incremented above 0 (by the slow clock) before we get to point where shutdown would write out the TUTs. So shutdown loops waiting on TUTs to write out, but due to the slow clock ticking, it never gets to the "write out the TUTs logic" when QACTTM is -1.

drboone commented 7 months ago

Approaching this from the perspective that what I know of the issue is what I read (and perhaps more importantly, the subset of it that I understood), so just tell me to shut up. But:

If the "hack" doesn't dramatically change the i/o pattern, and improves file system resilience, and fixes the seen issue, then I'm not convinced it should be thought of as a hack.
As an alternate solution, if there's a ready way to tell that a shutdown is in progress, perhaps it makes sense to skip the QACTTM "divider" and just flush the TUTs immediately? It doesn't seem like that would constitute "too much disk i/o", since it's important that the TUTs get flushed.

larsbrinkhoff commented 7 months ago

Shut up, and tell me more about option 2.

eswenson1 commented 7 months ago

Actually, I like option 2. I think that should be easy and possible. I’ll take a look.

The reason I did the fix the way I did was to try to ensure that TUTs were written out periodically when the system was running normally. ITS does write them out while the system is running, but the QACTTM value can delay this writing out, perhaps even indefinitely. My fix does make sure it does write them out after 10 skips.

foxkit commented 7 months ago

A unprovable hack that fixes the problem is a heuristic. I think RG taught me that. It may have been Rici.

-- cmm

On Fri, Feb 23, 2024, 3:17 PM Eric Swenson @.***> wrote:

Actually, I like option 2. I think that should be easy and possible. I’ll take a look.

The reason I did the fix the way I did was to try to ensure that TUTs were written out periodically when the system was running normally. ITS does write them out while the system is running, but the QACTTM value can delay this writing out, perhaps even indefinitely. My fix does make sure it does write them out after 10 skips.

— Reply to this email directly, view it on GitHub https://github.com/PDP-10/its/issues/2220#issuecomment-1961933801, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABIAQOZPVDV7WJSQF4AJNI3YVD2NFAVCNFSM6AAAAAA46SN3MKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNRRHEZTGOBQGE . You are receiving this because you are subscribed to this thread.Message ID: @.***>

eswenson1 commented 7 months ago

The option 2 fix, I believe is:

diff disk.1227 disk.1229
5507c5507,5510
< QTDW: SKIPG QACTTM
---
> QTDW:
>     SKIPGE SHUTDN   ; if shutting down, make sure to update disk
>      JRST .+3
>       SKIPG QACTTM

This is smaller, less hacky, and only changes anything during shutdown. I like this much better. Thanks @drboone

I will test this fix and see if I can get it to hang.

eswenson1 commented 7 months ago

Well, preliminary results. I have a loop running a slightly modified expect script than the one @desaster posted. With the old SYSTEM;DISK, it pretty much hangs on shutdown in every iteration. Got to about 40 iterations. Then using the modified DISK, which simply ignores QACTTM when the system is being shut down, it hasn’t hung at all.

Thanks for the suggestion of option 2, @drboone .

larsbrinkhoff commented 7 months ago

Not only less of a hack, I think this could be considered an improvement.

It's still somewhat of a riddle why QACTTM doesn't go positive during shutdown. Makes me wonder, what if it also doesn't go positive in other circumstances? Is it the case that the TUTs aren't written to disk as often as they should be? But these are questions that we can figure out later (if at all).

eswenson1 commented 7 months ago

Well, the reason I originally chose the solution I did was because I was thinking that there was a timing anomaly that was preventing TUT writing — and that this could occur during normal system operations. My hack would have addressed that case, as well as the shutdown case.

But I thought it was a hack because the number 10, I chose for a maximum skip count was arbitrary. It solved the shutdown problem but might have caused the system to write TUTs and buffers to disk too often, or rather more often than the ITS authors intended.

And I like option 2 — my current solution (and @drboone ’s idea) because it has minimal Impact and provides a guarantee that the disk will be updated at shutdown.

eswenson1 commented 7 months ago

As far as I can see, my attempt at implementing option 2 does NOT result in success. In other words, we still hang at shutdown. I may have reversed the sense of a skip or something. I'm investigating.

desaster commented 7 months ago

As far as I can see, my attempt at implementing option 2 does NOT result in success. In other words, we still hang at shutdown. I may have reversed the sense of a skip or something. I'm investigating.

Confirmed, here are my automated results so far over 100 iterations:

vanilla success rate 21%
option 1 success rate 100%
option 2 first implementation success rate: 17%

eswenson1 commented 7 months ago

Yes, I think I needed a SKIPN SHUTDN rather than a SKIPGE SHUTDN. I'll look into this further and test when I get some time.

Update: that is wrong. The SKIPGE is correct. See later comment regarding 113/113 successes, no hang.

eswenson1 commented 7 months ago

I booted with that change, waited until BATCHN and GUNNER logged in, shut down, and it hung. So so far my attempts at an "option 2" fix have failed. Not sure why. My original solution still seems to work fine.

eswenson1 commented 7 months ago

I'm having a weird problem. The original fix isn't working for me either now on the particular test system I'm trying it on. However, I'm also seeing that PFTHMG is getting a TOP LEVEL INTERRUPT on shutdown, and it may well be that the fact that this JOB is still present that is preventing shutdown. It may not be the TUT write issue that is preventing my current shutdowns. I'm seeing the same behavior whether I use my original fix, no fix, or the "option 2" fix.

So I'm going to have to go back to debugging from the pdp10-ka simulator to see what is preventing shutdown now.

eswenson1 commented 7 months ago

I ran a loop of 100 boots with the SKIPN SHUTDN version, and I got 88 shutdowns and 12 hangs. So either I’m running into a different hang or the fix is just plain wrong. I haven’t yet had time to step through and disassemble to see why my hangs are occurring.

eswenson1 commented 7 months ago

I ran a loop of 312 boots with the original fix (hack, with QSKIPT, and all 312 shut down fine in short order.

So, clearly, the original fix works. The alternatives don't. I'll investigate.

eswenson1 commented 7 months ago

Well, I put back the fix that looks like:

QMDWA:  AOBJN Q,QMDW
QTDW:   SKIPGE SHUTDN
                  JRST .+3
                 SKIPG QACTTM
                  JRST QUDW
                 MOVSI Q,-NQS
QTDW1:  SKIPGE QACT(Q)
                  JRST QTDWA

Where only the:

                SKIPGE SHUTDN
                  JRST .+3

are the changed lines. Now that I got rid of the "Top Level Interrupt...Tree detached" error that was causing PFTHMG to go do, this fix seems "correct". I'm running the loop now, and so far, 113/113 successes (no failure/hangs).

I'll keep running this in a loop for some time and report on the results.

@desaster Are you sure you used my updated fix when you had failures? I was having failures for a different reason -- SYSJOB was not able to logout out all users, and therefore, since the user count never dropped to 0, ITS didn't shut down. Please make sure you are using the updated fix. And retry your loop. Please send me the log of your loop'ed running of a bunch of ITS boots/shutdowns so that I can see if there is any reason you might be hanging. By the way, I've put an expect -timeout 120 "SHUTDOWN COMPLETE" { sleep 1; send "\x1C" } because it sometimes really does take over a minute to shut down. I probably could reduce this now, but I bumped it up to make sure that I could tell the difference between a slow shutdown and a hung shutdown (which never completes).

desaster commented 7 months ago

I'm a bit lost with my testing, since now even "option 1" is failing every time, despite adding 120 second timeout.

I'm yet to try your updated fix, but I'll get to it.

eswenson1 commented 7 months ago

You may be having the same "different" problem that I was having. That's why I asked to look at your log. Exactly what messages do you see (if any) after the "ITS NOT IN OPERATION AT ..." message?

eswenson1 commented 7 months ago

I have now 283/283 successes, no hangs. This is with pdp10-ka and the fix to DISK is only adding the SKIPGE SHUTDOWN JRST .+3 fix.

desaster commented 7 months ago

You may be having the same "different" problem that I was having. That's why I asked to look at your log. Exactly what messages do you see (if any) after the "ITS NOT IN OPERATION AT ..." message?

Looks like my failing logs have the following, so there's a TOP LEVEL INTERRUPT just like you said.

KA ITS 1651 NOT IN OPERATION AT 18:14:21
TARAKA NAMDRG CHANNA _DRGN_ TIMES  WRITE  18:14:23
TARAKA NAMDRG CHANNA LOGOUT TIMES  DELRNM 18:14:23
TOP LEVEL INTERRUPT 1,,0 DETACHED JOB # 12, USR:PFTHMG DRAGON 18:14:28
Timeout exceeded!
SHUTDOWN_FAIL

eswenson1 commented 7 months ago

Yes, this is a different issue. Now that I’ve proven (to myself) that my option 2 fix addresses the original hang problem (TUTs not getting written out), I’m going to debug why the top level interrupt is happening and why it causes another shutdown hang. I’ll update this ticket when I learn something.

eswenson1 commented 7 months ago

In my case, I got top level interrupts from TARAKA PAPSAV and COMSAT IV. The TARAPA PAPSAV crash was a .VALUE because DRAGON;CDATA > could not be opened for writing. This was due to the fact that the DRAGON directory was full. Cleaning that out makes TARAKA PAPSAV not die. I'll investigate the COMSAT IV crash next. That is probably due to my not having COMSAT configured correctly (I don't think the KA build configures it correctly for the networking environment that it sets up).

eswenson1 commented 7 months ago

Ok, my COMSAT crash was due to a corrupt .mail. directory database. I fixed that and now it doesn't crash on startup (and on shutdown). I'm not getting clean boots and shutdowns. I'm going back to running a loop to make sure all is well.

eswenson1 commented 7 months ago

It turns out that you can’t simply reboot ITS in a loop without clearing out dragon;cdata * files. Too many iterations will fill up the directory with a new file every iteration. Our loop script should do a :delete dragon; cdata < upon login.

larsbrinkhoff commented 7 months ago

Maybe we should have something to watch DRAGON and ensure it's not overflowing with CDATA and REPORT files.

eswenson1 commented 7 months ago

Sounds like a job for another daemon! :-)

larsbrinkhoff commented 7 months ago

Fixed by #2276.

eswenson1 commented 7 months ago

@larsbrinkhoff The cleaning up of CDATA files may best be done (I've tested this on ES) by using the DM BATCH system. Here is a batch job that does the trick:

[ORIGINATOR     "EJS"
TASK-NAME       "DIR-CLEANUP"
TASK-STATE      "RUNNABLE"
TIME-OF-NEXT-RUN        ((124 3 5) (15 49 18) "EST")
TASK-TYPE       "SHORT"
WHEN-ORIGINATED ((124 3 4) (17 34 0) "EST")
RESCHEDULE-INFO ((0 0 1) (0 0 0))
FILE-TO-RUN     ["SYS" "EJS" "TS" "DIRED"]
INPUT-STREAM    "X .BATCH;SPECS 900XF
"
OUTPUT-FILE     ["DSK" "HUDINI" "DIR-CL" "RUN_#"]
"ELIMITER-STRING        "◊
TIME-LIMIT      5.0
RESTART-SWITCH  T
LAST-STARTING-TIME      ((124 3 4) (15 50 11) "EST")
LAST-COMPLETION-TIME    ((124 3 4) (15 50 12) "EST")
LAST-CPU-TIME-USED      0.97333333E-4
ACCUMULATED-CPU-TIME    60.259607
NUMBER-OF-TIMES-RUN     1
]

It uses an input file, .BATCH;SPECS 900XF (see above) that looks like this:

ANSWER YES
A DRAGON;
ER CDATA >
ER CDATA >
DIR ONLY DUMPED
D CDATA *
Q

That DIRED input file keeps the last 2 DRAGON;CDATA > files, and deletes all the other CDATA files that have been dumped.

The job reschedules itself each day.

PDP-10 / its

ITS sometimes hangs on shutdown and never returns to NTS DDT #2220