Closed eswenson1 closed 9 months ago
Also, if you force a return to the simulator command level (^) in the above situation and quit the simulator, you may see disk errors (duplicate blocks, TUT errors) upon restart:
. ;@ SALV PACK 2, UNIT #0
SHARES BLOCK WITH SOME OTHER FILE
. ;@ SALV PACK 2, UNIT #0
SHARES BLOCK WITH SOME OTHER FILE
SHARED BLOCKS
LOOKING FOR SHARED BLOCKS
. ;@ XITS 0-764, TUT= 2
. ;@ XITS 0-765, TUT= 2
. ;@ SALV 0-764, TUT= 2
. ;@ SALV 0-765, TUT= 2
*** ERROR *** SYSTEM MAY NOT BE BROUGHT BACK UP
Running the salvager before booting may result in messages like this:
l$ddt
t$salv bin
CHKR$G
SALVAGER.317
ACTIVE UNITS ARE:0123
UNIT #3 ID IS 1 PK #1
UNIT #2 ID IS 0 PK #0
UNIT #1 ID IS 3 PK #3
UNIT #0 ID IS 2 PK #2
USE MFD FROM UNIT:2
READ ALL BLOCKS OF ALL FILES?Y
GET USR DIRS FROM UNIT:2
.TEMP.; NO FILES, USER DIRECTORY DELETED
FILE UNPROTECTED IN OLD TUT - CHANNA;LOGOUT TIMES PACK 0, UNIT #2
Block 7603
FILE UNPROTECTED IN OLD TUT - DRAGON;DRAGON HOARD PACK 0, UNIT #2
Block 7547
FILE UNPROTECTED IN OLD TUT - DRAGON;CDATA 12 PACK 0, UNIT #2
Block 7540
TUT #0 27 1_2, PRINT?N
TUT #0 NEEDS UPDATING - WRITE?Y
TUT #2 25 1_2 3 1_0 3 0_1, PRINT?N
TUT #2 NEEDS UPDATING - WRITE?T
(Y OR N) Y
TUT #3 2 1_2, PRINT?N
TUT #3 NEEDS UPDATING - WRITE?Y
WRITE OUT CHANGES IN MFD?Y
DDT
The slow clock runs this code:
SSLCK: SKIPGE SHUTDN ;SYSTEM "DEAD" =>
PUSHJ P,COFFIN ;GO TO DDT AFTER RIGOR MORTIS SETS IN.
SETCMB A,SCKPHS ;CHECK SYSTEM CHECKSUMS EVERY OTHER SLOW CLOCK TICK
MOVE T,SUPCOP ;WHICH IS ONCE PER SECOND.
As you can see, if SHUTDN is set (it is), then we call COFFIN:
COFFIN: MOVE T,NLOOSP ;MUST SWAP OUT ALL LOOSE PAGES
ADDB T,SWPOPR
JUMPN T,CPOPJ ;SWAPPAGE OUT STILL GOING ON
MOVEI T,NQCHN+1+NQS-1
SKIPGE QSGL(T)
SOJG T,.-1
SKIPL QSGL(T) ;NOT OVER IF ANY DISK CHANNELS ARE ACTIVE
POPJ P,
MOVSI T,040000
MOVE TT,QACTB ;SEE IF DISK STUFF NOT OUT YET
TDNE TT,QMDRO
POPJ P, ;MFD NOT OUT
MOVSI A,-NQS
COFFI8: TDNE T,QTUTO(A)
JRST COFFI9 ;TUT NEVER CAME IN
TDNE TT,QTUTO(A)
POPJ P, ;TUT NOT OUT
COFFI9: AOBJN A,COFFI8
MOVSI A,-QNUD
COFFI1: SKIPN QSNUD(A)
JRST COFFI2
TDNE TT,QSNLCN(A)
POPJ P, ;UFD NOT WRITTEN
COFFI2: AOBJN A,COFFI1
COFFIN checks whether pending disk I/O is complete. It checks whether the MFD has been written out (it has, in my case), and then checks for the "TUT NOT OUT" condition, in other words, whether any of the TUTs has not been written out. We find this is the case, and DON'T proceed with the shutdown.
Looking at the TUTs will show that some of them (how many varies) need to get written out, preventing shutdown:
QTUTO looks like this:
44056: 020000714000
44057: 010000716000
44060: 004000720000
44061: 002000722000
44062: 001000724000
(Sometimes, I've seen it with only one TUT needing to get written out -- it all depends on what disks were updated during the bootload prior to shutdown).
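To make the dump above easier to read, here is a minimal sketch (in Python rather than MIDAS) that splits each 36-bit word into its 18-bit left and right halves and shows the effect of zeroing the left half. The helper names are hypothetical, and the sketch does not try to decode what the individual LH bits mean; it only relies on the observation in this thread that a non-zero LH goes along with a TUT that still has to be written.

```python
def split_halves(word_octal: str) -> tuple[int, int]:
    """Split a 36-bit word, given as an octal string, into (LH, RH) 18-bit halves."""
    word = int(word_octal, 8)
    if not 0 <= word < 1 << 36:
        raise ValueError("not a 36-bit word")
    return word >> 18, word & 0o777777


def clear_left_half(word_octal: str) -> str:
    """Return the word with its left half zeroed, as a 12-digit octal string."""
    _, rh = split_halves(word_octal)
    return f"{rh:012o}"


# QTUTO values as dumped in the hung state (addresses 44056-44062).
qtuto = ["020000714000", "010000716000", "004000720000",
         "002000722000", "001000724000"]
for index, word in enumerate(qtuto):
    lh, rh = split_halves(word)
    print(f"QTUTO+{index}: LH={lh:06o} RH={rh:06o} cleared={clear_left_half(word)}")
```

Every entry in the dump has a non-zero LH, which is consistent with COFFIN refusing to proceed.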
I believe this is a timing issue because sometimes shutdown hangs and sometimes it doesn't. Sometimes shutdown takes a long time (2 minutes), sometimes less, and sometimes it completes within seconds.
If I put a breakpoint here (QTDW):
050550 337000 044214 QTDW: SKIPG QACTTM
050551 254000 050616 JRST QUDW
050552 205400 777773 MOVSI Q,-NQS
050553 335010 044171 QTDW1: SKIPGE QACT(Q)
050554 254000 050615 JRST QTDWA
050555 200450 046636 MOVE J,DCHBT(Q)
050556 331010 044056 SKIPL QTUTO(Q)
050557 616450 044056 TDNN J,QTUTO(Q)
050560 254000 050615 JRST QTDWA
050561 205600 240000 MOVSI H,240000
050562 612610 044056 TDNE H,QTUTO(Q)
050563 254000 050614 JRST QTDWA1
050564 200300 200666 MOVE TT,TIME ;DON'T WRITE TUTS TOO OFTEN
050565 274310 044070 SUB TT,QTWRTM(Q) ;BECAUSE THE TUT IS LOCKED WHILE IT'S BEING WRITTEN
050566 305300 000144 CAIGE TT,100.
When in the hang situation, we do get here periodically. Every time, QACTTM is either 0 or 777777777777, and we jump to QUDW (wait). We never get to the TUT write logic at 50564. The TUTs stay unwritten.
However, if I stop at breakpoints, continue from them, get the KA ITS REVIVED message, and repeat this a bunch of times, we do get to the TUT write logic, and after N iterations (where N is not equal to the number of TUTs, but some multiple of it), we eventually write out all the TUTs. Continuing at that point gets to NTS DDT.
However, if I don’t put breakpoints at 50550 (and a few other places in that code fragment, above), we NEVER get to 50564 and the TUTs don’t get written out, and we never shut down.
One time, my KA instance (pdp10-ka) wouldn't shut down. Examining the TUTs showed that just one needed to be written out. To confirm my theory that the unwritten TUT was preventing shutdown from proceeding, I forced the TUT to have no flags in the LH, indicating it didn't need to be read in or written out:
sim> d 44062 000000724000
sim> c
And immediately, I got this:
SHUTDOWN COMPLETE
PI LEVEL 7 BUGDDT. TYPE <ALTMODE>P TO CONTINUE.
YOU ARE NOW IN DDT.
BUGPC/ CAI COFFI4+1 $Q-2/ JRST COFFI7
As soon as I bashed the TUT bits for disk 4 to 0, the system shut down fine.
For the written record, some more evidence:
The documentation on QACTTM is this (in SYSTEM;DISK >):
QACTTM ; Activity TOO RECENT
044214 777777 777777 QACTTM: -1 ;SETOM ON READ OR WRITE DATA TRANSFER
;AOS AT SSLCK IF + OK TO WRITE DIRS
;IE DON'T TIE UP DISK TO WRITE DIR UNLESS THINGS ARE QUIET
When we are in the hung state, QACTTM is sometimes 000000000000 or 777777777777, but never > 0 (when the TUT checking logic runs). As a result, we always skip the disk writing code.
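For readers less used to 36-bit octal, the following sketch (Python, not MIDAS) shows how the observed QACTTM values read as two's-complement numbers and why neither of them passes a "greater than zero" test like the one SKIPG performs. The function names are made up for illustration.

```python
def to_signed36(word: int) -> int:
    """Interpret a 36-bit word as a two's-complement signed integer."""
    word &= (1 << 36) - 1
    return word - (1 << 36) if word & (1 << 35) else word


def skipg_passes(word: int) -> bool:
    """Model of SKIPG: the skip happens only when the operand is > 0."""
    return to_signed36(word) > 0


for octal in ("777777777777", "000000000000", "000000000001"):
    value = int(octal, 8)
    print(octal, to_signed36(value), skipg_passes(value))
```

Only the third value (1) would let the code fall into the TUT write logic; -1 and 0 both take the JRST QUDW path.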
This logic:
050547 253400 050514 QMDWA: AOBJN Q,QMDW
050550 337000 044214 QTDW: SKIPG QACTTM
050551 254000 050616 JRST QUDW
050552 205400 777773 MOVSI Q,-NQS
050553 335010 044171 QTDW1: SKIPGE QACT(Q)
050554 254000 050615 JRST QTDWA
050555 200450 046636 MOVE J,DCHBT(Q)
050556 331010 044056 SKIPL QTUTO(Q)
050557 616450 044056 TDNN J,QTUTO(Q)
050560 254000 050615 JRST QTDWA
050561 205600 240000 MOVSI H,240000
050562 612610 044056 TDNE H,QTUTO(Q)
050563 254000 050614 JRST QTDWA1
050564 200300 200666 MOVE TT,TIME ;DON'T WRITE TUTS TOO OFTEN
050565 274310 044070 SUB TT,QTWRTM(Q) ;BECAUSE THE TUT IS LOCKED WHILE IT'S BEING WRITTEN
050566 305300 000144 CAIGE TT,100.
050567 254000 050615 JRST QTDWA
050570 272310 044070 ADDM TT,QTWRTM(Q)
050571 505400 200000 HRLI Q,200000
050572 202400 044052 MOVEM Q,QDWIP
050573 552400 043006 HRRZM Q,DWSKN
050574 201300 000013 MOVEI TT,%QMTTW
050575 552300 042742 HRRZM TT,QSRAC+NQCHN
050576 205300 400000 MOVSI TT,(SETZ)
050577 437310 044056 IORB TT,QTUTO(Q)
050600 200610 044107 MOVE H,QPKNM(Q) ;VERIFY THAT TUT IS NOT BEING CLOBBERED
The instruction at 50550 skips the relevant code when QACTTM is not > 0, and when we hang, it never becomes positive at that point. So this is why we don't write out the TUTs. When in the hang state, I set this value to 1, and then we shut down properly.
I only see one place where QACTTM is incremented -- in the slow clock code, at offset 002255:
002252 436700 177573 IORM T,SUPCOR ;INIT THE TTYS
SSLCK6:
]
002253 333700 200631 SKIPLE T,AGERTE ;IF IT'S ENABLED, AGE SOME PAGES
002254 260640 002551 PUSHJ P,AGE
;DROPS THROUGH
Page 49
;DROPS IN
IFN DL10P,[
SKIPN DL10F
JRST SSLCKA
SETOM DL10UP ;COULD ALSO CHECK WHETHER PDP11 IS INCREMENTING THIS.
DATAI DLC,A
TRNE A,20 ;SEE IF PDP11 HALTED.
JRST SSLCKA
BUG INFO,[I/O PDP11 WENT DOWN]
SETZM DL10F
MOVEI T,%TYDL
MOVSI A,-NCT
TDNE T,TTYTYP(A)
SETOM TTYOAC(A)
AOBJN A,.-2
SSLCKA: ]
002255 350000 044214 AOS QACTTM ;Bump this for benefit of QSK dir writers.
I put a breakpoint at:
SSLCKA: ]
002255 350000 044214 AOS QACTTM ;Bump this for benefit of QSK dir writers.
002256 336000 004740 SKIPN SWPOPR ;If there is a request to swap out pages
002257 332000 004743 SKIPE SOLNG ; or if there are pages now going out
002260 254000 002311 JRST SSKQ2 ; do not bother with LMEMFR, warnings.
002261 210040 144723 MOVN A,LMEMFR ;See how many free low-memory pages.
002262 270040 144724 ADD A,MINCOR
002263 323040 002271 JUMPLE A,SSRCK ;Okay
002264 350000 201104 AOS NCORRQ ;Wake up core job to do some shuffling
002265 274040 144722 SUB A,MEMFR
002266 270040 144723 ADD A,LMEMFR ;Core wanted minus high free
002267 333000 000001 SKIPLE A
002270 272040 004740 ADDM A,SWPOPR ;Swap out to make room.
And I see that the AOS QACTTM does increment from 777777777777 to 000000000000 and then to 000000000001. I've even seen it get to 000000000002. However, it always appears to get set to -1 via the above-mentioned SETOM QACTTM before we get to the TUT write loop.
It may be an interaction between interrupts, the slow clock and the fast clock. And it does appear to be a timing issue.
In fact, when I single step through a lot of this code, sometimes, when I “let it rip” (continue), we exit shutdown perfectly fine.
There is only one place where QACTTM gets bashed from its current value to 777777777777:
047643 552101 146327 QRDFN1: HRRZM B,MEMPNT(A) ;PASS TO MAIN PROGRAM
047644 350003 042514 AOS QSBFS(C)
IFE DMDSK, LDB B,[XWBLK QXWDS]
.ALSO MOVEM B,QSLBLK(C)
;XFER COMPLETE, DEACTIVATE CHANNEL
047645 476000 044214 QINT1B: SETOM QACTTM
Putting a breakpoint there, we see that we get there periodically — and each time, it sets QACTTM to 777777777777.
Then we get to the AOS, which goes from -1 to 0, and maybe some that get as high as 2, but before we get to the code in the TUT write loop, we’ve always hit 047645, and reset it to -1.
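The interleaving described here can be replayed with a toy model. This is only a sketch in Python: the event names and the two schedules are hypothetical, chosen to mirror the description (SETOM at 047645, AOS in the slow clock, SKIPG at QTDW), and are not traces from a real system.

```python
def writer_sees_positive(events):
    """Replay a sequence of events against a model of QACTTM.

    "setom" models a disk transfer completing (QACTTM := -1),
    "aos"   models a slow clock tick (QACTTM += 1),
    "check" models the SKIPG QACTTM test at QTDW.
    Returns a list with one boolean per "check" event.
    """
    qacttm = -1
    seen = []
    for event in events:
        if event == "setom":
            qacttm = -1
        elif event == "aos":
            qacttm += 1
        elif event == "check":
            seen.append(qacttm > 0)
        else:
            raise ValueError(f"unknown event: {event}")
    return seen


# Hypothetical interleavings, not traces from a real system.
hung = ["aos", "aos", "setom", "check"] * 3    # transfer lands just before each check
quiet = ["setom", "aos", "aos", "check"] * 3   # two clock ticks pass undisturbed

print(writer_sees_positive(hung))
print(writer_sees_positive(quiet))
```

In the first schedule the check never sees a positive value even though QACTTM does reach 1 in between, which matches what the breakpoints show.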
I have found a solution, which may be considered a hack or workaround. In the code you see starting at 050550, we skip the TUT writing logic if QACTTM is < 1 (e.g. -1 or 0). When we do, add a counter and count the number of times we skip. When that number gets to some specific value, we reset it to 0, and then DON’T SKIP. That way, we’ll write out TUTs every so often. Picking the value of the threshold, of course, will take some experimentation. But if there are TUTs to write out, we should never skip processing them for very long. This should make the file system a bit more reliable in any case, at the cost of doing more disk I/O than we used to.
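The counter idea can be sketched as follows. This is a Python model, not the MIDAS patch itself; the class name is hypothetical, and the default reset value of -8 assumes that the -10 used for QSKIPT is octal, so treat it as illustrative.

```python
class TutWriteGate:
    """Model of the skip counter: force a TUT write pass after enough skips."""

    def __init__(self, reset_value: int = -8):
        self.reset_value = reset_value
        self.skips_left = reset_value

    def should_write(self, qacttm: int) -> bool:
        if qacttm > 0:
            return True                   # normal path: disk has been quiet
        self.skips_left += 1              # models AOS on the counter
        if self.skips_left <= 0:          # models SKIPG failing, so we wait
            return False
        self.skips_left = self.reset_value
        return True                       # forced write pass


gate = TutWriteGate()
results = [gate.should_write(-1) for _ in range(18)]
print(results)
```

With QACTTM stuck at -1, this model forces a write pass on every ninth call, so a dirty TUT cannot be postponed indefinitely.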
My hack solution appears to work just fine. I can now shut down my ITS after GUNNER comes up, without having it hang.
Here is my fix:
*:srccom system;disk 1227,system;disk 1228
;COMPARISON OF DSK:SYSTEM;DISK 1227 AND DSK:SYSTEM;DISK 1228
;OPTIONS ARE /3
**** FILE DSK:SYSTEM;DISK 1227, 47-2 (76081)
**** FILE DSK:SYSTEM;DISK 1228, 46-97 (76078)
QSKIPT: -10 ; number of TUT skips remaining
***************
**** FILE DSK:SYSTEM;DISK 1227, 80-49 (142781)
QMDWA:AOBJN Q,QMDW
QTDW:SKIPG QACTTM
JRST QUDW
MOVSI Q,-NQS
**** FILE DSK:SYSTEM;DISK 1228, 80-49 (142830)
QTDWC: AOS QSKIPT
SKIPG QSKIPT
JRST QUDW
MOVEI Q,-10
MOVEM Q,QSKIPT
JRST QTDWB
QMDWA:AOBJN Q,QMDW
QTDW:SKIPG QACTTM
JRST QTDWC
QTDWB:MOVSI Q,-NQS
***************
:KILL
*
Thanks for your detailed analysis! I have read it a few times, but I don't think I have fully grasped what is going on yet. It's also curious only you have reported this. Are you running on some very fast computers, or some particular operating system?
Your solution does seem to be a workaround. Ideally we would come to an understanding of exactly what the timing issue is, and close that window. But in lieu of that, let's have your hack in mind as a (temporary?) patch.
I don’t think it is machine speed. I saw the problem first on a very slow Thinkpad laptop (about 10 years old). Then, to ease my debugging, I moved the entire directory to my very fast M2 Mac. I replaced the pdp10-ka executable with one built on the M2 Mac — and the problem occurred there. This was a 5-disk KA SYSTEM.
The fact that I saw this on a fresh build of a DB system still confounds me. It may be that the return to NTS DDT was just very, very slow and I assumed the problem was the same. I added my patch though and then it shut down quickly.
Maybe the delay we see between shutdown and return to NTS DDT has always been due to this problem, but sometimes the timing works itself out quickly and sometimes not and, in my case, sometimes not at all.
I can confirm that this is randomly occurring for me as well. Running pdp10-ka on Raspberry Pi 5.
:lock
LOCK.156
_5down
DO YOU REALLY WANT THE SYSTEM TO GO DOWN?
y
PLEASE ENTER A BRIEF MESSAGE TO USERS, ENDED BY ^C
UPI LOCK SYS DOWN MAIL WRITE 10:22:46
^C
_
KA ITS going down in 5:00
CULPRIT = UPI LOCK 10:22:47
q
:KILL
KA ITS going down in 4:59
*:logoff
SYS: SYS3; TS LOGOFF - FILE NOT FOUND
:logout
LOGOUT UPI 0 10:23:06
KA ITS 1651 SYSTEM JOB USING THIS CONSOLE. 10:23:06
KA ITS 1651 GOING DOWN IN 00:04:43. 10:23:06
KA ITS 1651 NOT IN OPERATION AT 10:23:06
TARAKA NAMDRG CHANNA _DRGN_ TIMES WRITE 10:23:06
TARAKA NAMDRG CHANNA LOGOUT TIMES DELRNM 10:23:06
PFTHMG DRAGON CHANNA _DRGN_ TIMES WRITE 10:23:06
PFTHMG DRAGON CHANNA LOGOUT TIMES DELRNM 10:23:06
IT IS NOW 10:23:15 AM EST, WEDNESDAY, FEB 21, 2024
I could prepare a PR with my fix and you could try it.
I'll definitely try it!
@desaster Please try the above fix. All you need to do is get the version of SYSTEM;DISK 1228 from the PR, move it onto your ITS system, rebuild ITS, and reboot with your new ITS. See if, when you shut down, it now shuts down on its own.
@desaster, thanks for confirming. Since the Raspberry Pi 5 is the recommended board for the PiDP-10, I think it's even more important to get this fixed soon. I will try to get to the bottom as to why QACTTM never goes positive.
The fix seemed to work for me, so now I've been trying to reproduce the original issue. Unfortunately it's proving to be quite difficult, since now I can't get a vanilla installation to hang anymore.
I wrote a simple expect script to boot the simulation, start ITS and shut down. So far it's just not hanging anymore. I wonder if there's some specific activity I need to do in ITS to trigger the issue.
It may well be the case that some activity triggers or aggravates the problem. One thing that springs to mind is disk I/O, in particular something to update the TUTs. That ought to be allocating or removing space for files.
Yes, your script should probably cause more files to be written and perhaps created. The issue has to do with writing out TUTs, so you need to make sure that you set up ITS so there are some TUTs marked dirty.
Of course, any boot and shutdown will cause some disk writes, but in my experience, the more file output there is, the more the bug manifests itself.
I tried doing a simple file copy, but it alone did not trigger the issue. I tried a different approach of letting the system sit idle for 300 seconds, and that seems to trigger the issue pretty consistently.
I will run more iterations in a for loop to confirm that the fix affects the issue, but of course the root cause will still be unknown.
Just a comment on the above post by @desaster: Letting ITS sit for 5 minutes after a boot gives DEMSTR a chance to start up some of the DM Daemons -- such as BATCHN and GUNNER. And starting up both of these causes files to get created and updated. I think DEMSTR waits about 1 minute before starting to launch the configured daemons, and BATCHN and GUNNER have their own time delays for doing some of the work they do. BATCHN, for example, looks at job queues, runs jobs, and updates logs. GUNNER creates a log file and updates it.
Also, waiting 15 minutes will cause accounting updates to be done by DRAGON, but even waiting 5 minutes will cause DRAGON to write out some files.
So it is not surprising that waiting 5 minutes before shutting down will cause more TUTs to need to be written out upon shutdown.
I don't really understand QACTTM, but I think it allows ITS to determine how recently it wrote out buffers/TUTs to disk. If it did so "too recently", it skips doing so. Code like this (in SYSTEM;DISK >):
SKIPG QACTTM ;LAST ACTIVITY TOO RECENT
JRST QUDW
basically skips writing buffers/TUTs to disk if the "last [disk] activity [was] too recent". The code in DISK either sets the value to -1 or tests it for >0. The code in ITS does an "AOS" to increment it in the "slow clock" logic. So the clock will cause this increment from -1 to >0. It can increment past 0 (I've seen it) depending on how many "slow clock" ticks have occurred before it is set back to -1. It is set to -1 on read or write disk i/o.
And the problem during shutdown is that a) shutdown won't complete if there are TUTs to write out, and b) we don't get to write them out because QACTTM is never positive when the TUT-writing code checks it: the slow clock increments it, but disk activity sets it back to -1 before we get to the point where the TUTs would be written. So shutdown loops waiting for the TUTs to be written out, but never reaches the "write out the TUTs" logic.
Approaching this from the perspective that what I know of the issue is what I read (and perhaps more importantly, the subset of it that I understood), so just tell me to shut up. But:
Shut up, and tell me more about option 2.
Actually, I like option 2. I think that should be easy and possible. I’ll take a look.
The reason I did the fix the way I did was to try to ensure that TUTs were written out periodically when the system was running normally. ITS does write them out while the system is running, but the QACTTM value can delay this writing out, perhaps even indefinitely. My fix does make sure it does write them out after 10 skips.
An unprovable hack that fixes the problem is a heuristic. I think RG taught me that. It may have been Rici.
-- cmm
The option 2 fix, I believe, is:
diff disk.1227 disk.1229
5507c5507,5510
< QTDW: SKIPG QACTTM
---
> QTDW:
> SKIPGE SHUTDN ; if shutting down, make sure to update disk
> JRST .+3
> SKIPG QACTTM
This is smaller, less hacky, and only changes anything during shutdown. I like this much better. Thanks @drboone
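The intent of the change can be modeled in a few lines. This is a Python sketch of the decision only, with a hypothetical function name; it assumes, based on the SKIPGE SHUTDN test that guards the call to COFFIN in the slow clock code, that a negative SHUTDN means the system is shutting down.

```python
def tut_write_allowed(shutdn: int, qacttm: int) -> bool:
    """Model of the patched QTDW entry test.

    SKIPGE SHUTDN skips the JRST .+3 when SHUTDN >= 0, so in normal
    operation the original SKIPG QACTTM test still decides. When
    SHUTDN is negative the JRST .+3 is taken and the test is bypassed.
    """
    if shutdn < 0:
        return True        # shutting down: always try to write TUTs
    return qacttm > 0      # otherwise: only when the disk has been quiet


for shutdn, qacttm in [(0, -1), (0, 1), (-1, -1), (-1, 0)]:
    print(shutdn, qacttm, tut_write_allowed(shutdn, qacttm))
```

Normal operation is unchanged in this model; only the shutdown case stops depending on QACTTM.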
I will test this fix and see if I can get it to hang.
Well, preliminary results. I have a loop running a slightly modified version of the expect script that @desaster posted. With the old SYSTEM;DISK, it hangs on shutdown in pretty much every iteration. Got to about 40 iterations. Then, using the modified DISK, which simply ignores QACTTM when the system is being shut down, it hasn’t hung at all.
Thanks for the suggestion of option 2, @drboone .
Not only less of a hack, I think this could be considered an improvement.
It's still somewhat of a riddle why QACTTM doesn't go positive during shutdown. Makes me wonder, what if it also doesn't go positive in other circumstances? Is it the case that the TUTs aren't written to disk as often as they should be? But these are questions that we can figure out later (if at all).
Well, the reason I originally chose the solution I did was because I was thinking that there was a timing anomaly that was preventing TUT writing — and that this could occur during normal system operations. My hack would have addressed that case, as well as the shutdown case.
But I thought it was a hack because the number 10 I chose for the maximum skip count was arbitrary. It solved the shutdown problem but might have caused the system to write TUTs and buffers to disk too often, or rather more often than the ITS authors intended.
And I like option 2 (my current solution, and @drboone's idea) because it has minimal impact and provides a guarantee that the disk will be updated at shutdown.
As far as I can see, my attempt at implementing option 2 does NOT result in success. In other words, we still hang at shutdown. I may have reversed the sense of a skip or something. I'm investigating.
Confirmed, here are my automated results so far over 100 iterations:
Yes, I think I needed a SKIPN SHUTDN rather than a SKIPGE SHUTDN. I'll look into this further and test when I get some time.
Update: that is wrong. The SKIPGE is correct. See later comment regarding 113/113 successes, no hang.
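To see why SKIPGE is the right sense and SKIPN is not, it helps to tabulate when the JRST .+3 that follows the skip instruction is actually executed. The sketch below is a Python model with a hypothetical function name; it relies only on the skip conditions of the two instructions and on SHUTDN being negative once the system is "dead", as the SSLCK comment indicates.

```python
def bypasses_qacttm_check(shutdn: int, skip_op: str) -> bool:
    """Return True when the JRST .+3 is executed, i.e. the QACTTM test is bypassed.

    skip_op is "SKIPGE" (skip the JRST when SHUTDN >= 0) or
    "SKIPN" (skip the JRST when SHUTDN != 0).
    """
    if skip_op == "SKIPGE":
        skipped = shutdn >= 0
    elif skip_op == "SKIPN":
        skipped = shutdn != 0
    else:
        raise ValueError(skip_op)
    return not skipped


for shutdn in (-1, 0, 1):
    print(shutdn,
          bypasses_qacttm_check(shutdn, "SKIPGE"),
          bypasses_qacttm_check(shutdn, "SKIPN"))
```

With SKIPN, a negative SHUTDN skips the JRST, so the QACTTM test still applies during shutdown, and the bypass happens only when SHUTDN is zero, which is the reverse of what is wanted.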
I booted with that change, waited until BATCHN and GUNNER logged in, shut down, and it hung. So far, my attempts at an "option 2" fix have failed. Not sure why. My original solution still seems to work fine.
I'm having a weird problem. The original fix isn't working for me either now on the particular test system I'm trying it on. However, I'm also seeing that PFTHMG is getting a TOP LEVEL INTERRUPT on shutdown, and it may well be the fact that this job is still present that is preventing shutdown. It may not be the TUT write issue that is preventing my current shutdowns. I'm seeing the same behavior whether I use my original fix, no fix, or the "option 2" fix.
So I'm going to have to go back to debugging from the pdp10-ka simulator to see what is preventing shutdown now.
I ran a loop of 100 boots with the SKIPN SHUTDN version, and I got 88 shutdowns and 12 hangs. So either I’m running into a different hang or the fix is just plain wrong. I haven’t yet had time to step through and disassemble to see why my hangs are occurring.
I ran a loop of 312 boots with the original fix (the hack, with QSKIPT), and all 312 shut down fine in short order.
So, clearly, the original fix works. The alternatives don't. I'll investigate.
Well, I put back the fix that looks like:
QMDWA: AOBJN Q,QMDW
QTDW: SKIPGE SHUTDN
JRST .+3
SKIPG QACTTM
JRST QUDW
MOVSI Q,-NQS
QTDW1: SKIPGE QACT(Q)
JRST QTDWA
Where only the:
SKIPGE SHUTDN
JRST .+3
are the changed lines. Now that I got rid of the "Top Level Interrupt...Tree detached" error that was causing PFTHMG to go down, this fix seems "correct". I'm running the loop now, and so far, 113/113 successes (no failures/hangs).
I'll keep running this in a loop for some time and report on the results.
@desaster Are you sure you used my updated fix when you had failures? I was having failures for a different reason: SYSJOB was not able to log out all users, and since the user count never dropped to 0, ITS didn't shut down. Please make sure you are using the updated fix, and retry your loop. Please send me the log of your looped run of ITS boots/shutdowns so that I can see if there is any reason you might be hanging. By the way, I've put in an expect -timeout 120 "SHUTDOWN COMPLETE" { sleep 1; send "\x1C" } because it sometimes really does take over a minute to shut down. I probably could reduce this now, but I bumped it up to make sure that I could tell the difference between a slow shutdown and a hung shutdown (which never completes).
I'm a bit lost with my testing, since now even "option 1" is failing every time, despite adding a 120-second timeout.
I have yet to try your updated fix, but I'll get to it.
You may be having the same "different" problem that I was having. That's why I asked to look at your log. Exactly what messages do you see (if any) after the "ITS NOT IN OPERATION AT ..." message?
I now have 283/283 successes, no hangs. This is with pdp10-ka, and the only change to DISK is adding the SKIPGE SHUTDN and JRST .+3 instructions.
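As a rough way to read a result like this: with zero hangs in n runs, the exact 95% upper bound on the per-run hang probability p solves (1 - p)^n = 0.05. The Python sketch below computes that bound; it assumes the runs are independent and behave identically, which a boot/shutdown loop only approximates, and the function name is made up for illustration.

```python
def zero_failure_upper_bound(n_runs: int, confidence: float = 0.95) -> float:
    """Upper bound on the failure probability after n_runs with no failures.

    Solves (1 - p) ** n_runs == 1 - confidence for p.
    """
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_runs)


for n in (113, 283, 312):
    print(n, f"{zero_failure_upper_bound(n):.4f}")
```

So a clean run of a few hundred iterations bounds the remaining hang rate at around one percent; it cannot show that the rate is exactly zero.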
Looks like my failing logs have the following, so there's a TOP LEVEL INTERRUPT just like you said.
KA ITS 1651 NOT IN OPERATION AT 18:14:21
TARAKA NAMDRG CHANNA _DRGN_ TIMES WRITE 18:14:23
TARAKA NAMDRG CHANNA LOGOUT TIMES DELRNM 18:14:23
TOP LEVEL INTERRUPT 1,,0 DETACHED JOB # 12, USR:PFTHMG DRAGON 18:14:28
Timeout exceeded!
SHUTDOWN_FAIL
Yes, this is a different issue. Now that I’ve proven (to myself) that my option 2 fix addresses the original hang problem (TUTs not getting written out), I’m going to debug why the top level interrupt is happening and why it causes another shutdown hang. I’ll update this ticket when I learn something.
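When running boot/shutdown loops, it may help to tell the two kinds of hang apart automatically. The following is a minimal Python sketch; the function name and category labels are hypothetical, and it keys only on message text that appears in the logs in this thread ("SHUTDOWN COMPLETE" and "TOP LEVEL INTERRUPT").

```python
def classify_shutdown_log(log_text: str) -> str:
    """Classify one captured boot/shutdown console log.

    Returns "ok" when shutdown completed, "hang-after-job-crash" when a
    job took a top level interrupt and shutdown never completed, and
    "hang-unknown" otherwise (which would include the TUT write hang).
    """
    if "SHUTDOWN COMPLETE" in log_text:
        return "ok"
    if "TOP LEVEL INTERRUPT" in log_text:
        return "hang-after-job-crash"
    return "hang-unknown"


failing_log = """KA ITS 1651 NOT IN OPERATION AT 18:14:21
TARAKA NAMDRG CHANNA _DRGN_ TIMES WRITE 18:14:23
TARAKA NAMDRG CHANNA LOGOUT TIMES DELRNM 18:14:23
TOP LEVEL INTERRUPT 1,,0 DETACHED JOB # 12, USR:PFTHMG DRAGON 18:14:28
Timeout exceeded!
SHUTDOWN_FAIL"""

print(classify_shutdown_log(failing_log))
```

Counting the categories separately across a loop keeps a job crash from being mistaken for the original TUT problem.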
In my case, I got top level interrupts from TARAKA PAPSAV and COMSAT IV. The TARAKA PAPSAV crash was a .VALUE because DRAGON;CDATA > could not be opened for writing. This was due to the fact that the DRAGON directory was full. Cleaning that out makes TARAKA PAPSAV not die. I'll investigate the COMSAT IV crash next. That is probably due to my not having COMSAT configured correctly (I don't think the KA build configures it correctly for the networking environment that it sets up).
Ok, my COMSAT crash was due to a corrupt .mail. directory database. I fixed that and now it doesn't crash on startup (and on shutdown). I'm now getting clean boots and shutdowns. I'm going back to running a loop to make sure all is well.
It turns out that you can’t simply reboot ITS in a loop without clearing out dragon;cdata * files. Too many iterations will fill up the directory with a new file every iteration. Our loop script should do a :delete dragon; cdata < upon login.
Maybe we should have something to watch DRAGON and ensure it's not overflowing with CDATA and REPORT files.
Sounds like a job for another daemon! :-)
Fixed by #2276.
@larsbrinkhoff The cleaning up of CDATA files may best be done (I've tested this on ES) by using the DM BATCH system. Here is a batch job that does the trick:
[ORIGINATOR "EJS"
TASK-NAME "DIR-CLEANUP"
TASK-STATE "RUNNABLE"
TIME-OF-NEXT-RUN ((124 3 5) (15 49 18) "EST")
TASK-TYPE "SHORT"
WHEN-ORIGINATED ((124 3 4) (17 34 0) "EST")
RESCHEDULE-INFO ((0 0 1) (0 0 0))
FILE-TO-RUN ["SYS" "EJS" "TS" "DIRED"]
INPUT-STREAM "X .BATCH;SPECS 900XF
"
OUTPUT-FILE ["DSK" "HUDINI" "DIR-CL" "RUN_#"]
"ELIMITER-STRING "◊
TIME-LIMIT 5.0
RESTART-SWITCH T
LAST-STARTING-TIME ((124 3 4) (15 50 11) "EST")
LAST-COMPLETION-TIME ((124 3 4) (15 50 12) "EST")
LAST-CPU-TIME-USED 0.97333333E-4
ACCUMULATED-CPU-TIME 60.259607
NUMBER-OF-TIMES-RUN 1
]
It uses an input file, .BATCH;SPECS 900XF (see above) that looks like this:
ANSWER YES
A DRAGON;
ER CDATA >
ER CDATA >
DIR ONLY DUMPED
D CDATA *
Q
That DIRED input file keeps the last 2 DRAGON;CDATA > files, and deletes all the other CDATA files that have been dumped.
The job reschedules itself each day.
I've seen the issue on both KA (pdp10-ka) and DB (klh10) ITS systems. When you attempt to shut down ITS with LOCK, you may run into similar console output:
ITS does not return to NTS DDT but continues "running" indefinitely. Analysis of the situation shows that ITS is waiting for all TUTs to get written out to disk, but the code that writes out the TUTs is waiting for the QACTTM counter to become positive. QACTTM is set to -1, and then, at various times, incremented (with AOS). Even during the "shutdown hang", it can be seen to increment past 0 (I've seen it go to 1 and 2), but it always seems to get reset to -1, and possibly incremented to 0, before the TUT writing code checks it.
As a result, the TUTs are never written out, and because there may be one or more TUTs marked as needing to get written out, ITS will not execute the logic that returns to NTS DDT. It is waiting on the TUTs to get written out.
The issue appears to be a timing issue. At first, I noticed the issue on a pdp10-ka (KA) instance, where I had added disks to make the total number of disks 5. I thought it had something to do with the addition of the 5th disk. But then, I ran into the same issue with a newly-built KLH10 DB instance, where there had been no changes made -- the build was a pristine one.
There appears to be a negative interaction between the slow clock (which, when in the "shutdown" mode, is the one that checks to see if the TUTs are all written out before returning to NTS DDT), and the disk interrupts and fast clock used to ensure disk i/o is done.
I added a fix to SYSTEM; DISK >, which guarantees that TUTs will be written out if doing so was skipped over due to QACTTM being non-positive more than 10 times. This allows the system to shut down in the above situation.