PDP-10 / its

Incompatible Timesharing System

Build SIMH fails with time out #2172

Closed oilcan-productions closed 9 months ago

oilcan-productions commented 1 year ago

I cloned the latest and ran make EMULATOR=simh on Raspbian Bullseye. The build fails with the error below.

*:midas sys; ts rbye_cfs; zork
TRIVIA startup
TRIVIA startup
Constants area inclusive
From    To
1223    1354
Run time = 0.01
1499 Symbols including initial ones (55% used)

:KILL 
*:link sys1;ts zork, sys; ts rbye
*
:print cfs;..new. (udir)
DSK: CFS; ..NEW. (UDIR) - FILE NOT FOUND
:vk
*:xxfile lcf;comp log_lcf;comp xxfile
:PROCED
*

The last command timed out.
make: *** [Makefile:163: out/simh/rp0.dsk] Error 1

eswenson1 commented 1 year ago

We may need to increase the timeout, unless the compilation failed for some reason. The only way to tell would be to look in that lcf;comp log file. The most recent changes didn't touch the zork source or the executables needed to compile it, so I'm not sure what caused this timeout. Could it be that your machine is slow?

oilcan-productions commented 1 year ago

@eswenson1 I have been playing around with this for the last 10 days or so and found that, no matter what timeout I set, the build fails with the timeout error. The one exception is building with EMULATOR=KLH-10. The build hangs at line 164 of the Makefile, when booting from RP0.dsk. The same happens when I try to load that disk manually after the build has failed. I changed the timeout from 100 to 1000 in the setup_timeout function in build/build.tcl, which seems to be the central location controlling it. You mentioned the lcf;comp log file. Where would I find that once the build fails?

oilcan-productions commented 1 year ago

Looking at the lcf directory on the RP0 created before it failed, I see two comp files:

 0   COMP   PREAMB 1 ! 11/30/2016 03:56:40
 0   COMP   XXFILE 1 ! 11/30/2016 03:56:40
*:print comp preamb
<SNAME "LCF">$
<FLOAD "prim nbin">$
<FLOAD "defs nbin">$
<FLOAD "util nbin">$
<FLOAD "tell nbin">$
<FLOAD "makstr nbin">$
<FLOAD "typhak nbin">$
<OVERFLOW <>>$
*:print comp xxfile
^^R
:pcomp
<SNAME "LCF">$
^^J
<FILE-COMPILE "prim >">$
$p
^^J
<FILE-COMPILE "defs >">$
$p
^^J
<FILE-COMPILE "util >">$
$p
^^J
<FILE-COMPILE "makstr >">$
$p
^^J
<FILE-COMPILE "typhak >">$
$p
^^J
<QUIT>$
:assem "lcf;tell >" "lcf;tell nbin"
n
^^J
:pcomp
<FLOAD "lcf;comp preamb">$
^^J
<FILE-COMPILE "rooms >">$
$p
^^J
<FILE-COMPILE "parser >">$
$p
^^J
<QUIT>$
:pcomp
<FLOAD "lcf;comp preamb">$
^^J
<FILE-COMPILE "act1 >">$
$p
^^J
<FILE-COMPILE "act2 >">$
$p
^^J
<QUIT>$
:pcomp
<FLOAD "lcf;comp preamb">$
^^J
<FILE-COMPILE "act3 >">$
$p
^^J
<FILE-COMPILE "act4 >">$
$p
^^J
<QUIT>$
:pcomp
<FLOAD "lcf;comp preamb">$
^^J
<FILE-COMPILE "melee >">$
$p
^^J
<FILE-COMPILE "sr >">$
$p
^^J
<QUIT>$
^^Q

eswenson1 commented 1 year ago

In build/zork.tcl, we have these commands:

respond "*" ":xxfile lcf;comp log_lcf;comp xxfile\r"
expect -timeout 6000 "Job XXFILE interrupted: .VALUE;"
type "\033p"
expect ":KILL"

So lcf;comp log is where XXFILE puts its "console output". When the build fails, it shouldn't delete the disk. So simply boot ITS using that disk, and look in the LCF directory. You should find the COMP LOG file there.

To boot that ITS, just go to the /build directory, and invoke the emulator with the correct command line parameters and start ITS.

For me, the only reliable emulator has been KLH10, so I use it (almost) exclusively. So I'd do:

cd build/klh10
sudo ./kn10-ks-its dskdmp.ini
go
its
$g
^Z
:listf lcf;

oilcan-productions commented 1 year ago

I have done that and posted the content here: https://github.com/PDP-10/its/issues/2172#issuecomment-1470915823

eswenson1 commented 1 year ago

Had you gotten to the xxfile lcf;comp log_lcf;comp xxfile line in the zork.tcl script? Because if you had, you should have found lcf;comp log created.

oilcan-productions commented 1 year ago

Yes, the file is there. I missed it last time. I extended the timeout to 12000 and it still times out. lcf;comp log is 0 bytes, though.

eswenson1 commented 1 year ago

So what is in the file? Did one of the commands die or hang?

oilcan-productions commented 1 year ago

There is nothing in the comp log file. It is zero bytes :(

eswenson1 commented 1 year ago

That is odd. Can you boot the system using that disk and run that XXFILE command manually? You can replace the file name before the “_” in the XXFILE command line with TTY: so that the output of the XXFILE run goes to the terminal rather than the LCF;COMP LOG file.

oilcan-productions commented 1 year ago

*:xxfile TTY:_lcf;xxcomp xxfile
 STYOPN ZZ     11 XXFILE 09:41:45
DB ITS.1651. DDT.1548.
TTY 11
2. Lusers, Fair Share = 101%
ZZ$0U
 LOGIN  ZZ0    11 09:41:46
To see system messages, do ":MSGS<CR>"
:GAG 0
*
:pcomp
You have about 8 cpu minutes to do your thing.
MUDDLE COMPILER NOW READY.
T

ERROR:  Job changed when not allowed.

oilcan-productions commented 1 year ago

And then it sits there, not sure if it continues doing stuff. No more output for 10 minutes

oilcan-productions commented 1 year ago

Ah, I had to set the time and date for it to go through. PDSET to the rescue.

oilcan-productions commented 1 year ago

Ok that ran through. Took about 25 minutes

eswenson1 commented 1 year ago

Cool. Yes, a lot of stuff doesn't work terribly well when the system doesn't know the time.

oilcan-productions commented 1 year ago

I have now extended the timeout to 35000 in zork.tcl. We will see if that helps

oilcan-productions commented 1 year ago

Extending the timeout did not solve the problem. Two things seemed to help and got me past some of the issues:

  1. Increase GPU memory to 128MB using sudo raspi-config
  2. Always build with make check-dirs all EMULATOR=<emulator spec>

oilcan-productions commented 1 year ago

And third: clone into separate folders for each emulator instead of building multiple from one folder.
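A minimal sketch of the "one clone per emulator" workflow described above. It only prints the commands instead of running them. The function name `plan_builds` and the `its-<emulator>` directory layout are made up for illustration, the repository URL is taken from the issue link in this thread, and any extra clone options the repository may need are left out. Paths are assumed to contain no spaces.

```shell
# Hypothetical helper: print (do not run) one clone and one build per emulator.
# usage: plan_builds WORKDIR EMULATOR...
plan_builds() {
    workdir=$1
    shift
    for emu in "$@"; do
        # A separate working tree per emulator, as suggested in this thread.
        echo "git clone https://github.com/PDP-10/its $workdir/its-$emu"
        # The make invocation quoted in this thread, run inside that tree.
        echo "make -C $workdir/its-$emu check-dirs all EMULATOR=$emu"
    done
}
```

Calling `plan_builds "$HOME/its-builds" simh pdp10-ka` lists the commands for review. Piping that output to `sh` would execute them.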

oilcan-productions commented 1 year ago

Documenting progress:

eswenson1 commented 1 year ago

Try this:

diff --git a/build/muddle.tcl b/build/muddle.tcl
index b41b1fee..65ddcf10 100644
--- a/build/muddle.tcl
+++ b/build/muddle.tcl
@@ -121,7 +121,8 @@ respond "T" "<SNAME \".batch\">\033"
 respond "\".batch\"" "<FILE-COMPILE \"templt >\" \"templt nbin\">\033"
 respond "Job ECOMP wants the TTY" "\033p"
 respond "I'm done anyway." "<FILE-COMPILE \"tcheck >\" \"tcheck nbin\">\033"
-respond "Job ECOMP wants the TTY" "\033p"
+expect -timeout 600 "Job ECOMP wants the TTY"
+type "\033p"
 respond "I'm done anyway." "<FILE-COMPILE \"taskm >\" \"taskm nbin\">\033"
 expect -timeout 600 "Job ECOMP wants the TTY"
 type "\033p"

Seems KLH10 is just faster than pdp10-ka. Compiles are taking longer.

oilcan-productions commented 1 year ago

@eswenson1 I believe that building with EMULATOR=pdp10-kl does not use KLH10 but the SIMH-based KL emulator, and that only EMULATOR=klh10 uses the KLH10 emulator. Correct?

oilcan-productions commented 1 year ago

If I want to get to a point where all the machines are networked and can run on the same network at the same time, I need to build each of the supported machines. Or am I reaching too high?

eswenson1 commented 1 year ago

That's correct. My point is that I tested all this on KLH10 on my machine. And the CI builds passed. But on your machine, it would appear that pdp10-ka is timing out. So increase the timeout for that compile (you might have to do it for several, or change the default value that the respond TCL macro uses).

eswenson1 commented 1 year ago

And no, you aren't reaching too high. I have builds of KLH10, pdp10-ka, pdp10-kl, and pdp10-ks ITS all running on a machine. They are all networked via chaosnet.

larsbrinkhoff commented 1 year ago

EMULATOR=klh10 is Ken Harrenstien's KLH10. EMULATOR=simh is Bob Supnik's KS10 in SIMH. All the others, pdp10-ka, pdp10-kl, and pdp10-ks, are Richard Cornwell's SIMH-based emulators.

oilcan-productions commented 1 year ago

@eswenson1 the pattern seems to work. I was able to build all SIMH based emulators now. I will submit a PR later today

eswenson1 commented 1 year ago

@oilcan-productions That’s good news. Thanks. I wonder if we should do a rudimentary build machine speed test as part of our build and then adjust the default timeout based on the findings.
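One way the speed-test idea could look, as a rough sketch only. Nothing like this exists in the ITS build as far as this thread shows. The function names, the loop size, and the choice of a CPU-bound loop as the benchmark are all assumptions, and a disk-bound test might be a better predictor.

```shell
# Hypothetical: time a fixed CPU-bound loop, in whole seconds (coarse but portable).
# The iteration count is an arbitrary assumption.
measure_seconds() {
    start=$(date +%s)
    i=0
    while [ "$i" -lt 200000 ]; do
        i=$((i + 1))
    done
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical: scale a base timeout by measured/reference time.
# The result never drops below the base timeout.
# usage: scale_timeout BASE MEASURED REFERENCE
scale_timeout() {
    base=$1
    measured=$2
    reference=$3
    if [ "$reference" -le 0 ] || [ "$measured" -le "$reference" ]; then
        echo "$base"
    else
        echo $((base * measured / reference))
    fi
}
```

For example, `scale_timeout 600 "$(measure_seconds)" 1` would stretch a 600-second timeout in proportion to how much slower the host is than a reference machine that finishes the loop in one second. The reference value would have to be calibrated on a machine where the build is known to pass.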

oilcan-productions commented 1 year ago

I am looking into at least checking the CPU architecture and disk type to see if we might run into issues. We could also make the timeout an environment variable and put it into user control.
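A minimal sketch of what putting the timeout under user control could look like on the host side. `ITS_BUILD_TIMEOUT` is an assumed variable name that the build does not read today, and both function names are hypothetical. The default of 100 is the value mentioned earlier in this thread for build/build.tcl.

```shell
# Hypothetical: resolve the build timeout from the environment.
# ITS_BUILD_TIMEOUT is an assumed name; non-numeric values fall back to the default.
resolve_timeout() {
    default=100
    value="${ITS_BUILD_TIMEOUT:-$default}"
    case "$value" in
        ''|*[!0-9]*)
            echo "warning: ignoring non-numeric ITS_BUILD_TIMEOUT='$value'" >&2
            value="$default"
            ;;
    esac
    echo "$value"
}

# Hypothetical: report the CPU architecture so slower hosts can be flagged
# before the build starts. Treating ARM as "slow" is an assumption based on
# the Raspberry Pi reports in this thread.
describe_host() {
    arch="$(uname -m)"
    case "$arch" in
        arm*|aarch64) echo "$arch (ARM host: a longer timeout may be needed)" ;;
        *)            echo "$arch" ;;
    esac
}
```

The Tcl build scripts would still need to pick the value up, for instance through Tcl's global `env` array. That part is not shown here and is not something the scripts currently do.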

oilcan-productions commented 1 year ago

So I went and made the changes locally for all emulators. PDP10-KA and PDP10-KL are building consistently with the change to move ^^R down one line in 'comp xxfile' and 'zork xxfile'.

PDP10-KS fails. When I switch it to output to tty: instead of a log, I get this:

:KILL
*:link sys1;ts zork, sys; ts rbye
*:print cfs;..new. (udir)
DSK: CFS; ..NEW. (UDIR) - FILE NOT FOUND
:vk
*:xxfile tty:_lcf;comp xxfile
DB ITS.1651. DDT.1548.
TTY 11
2. Lusers, Fair Share = 133%

<SNAME ?U?
ERROR:  ERROR string found. 

I will keep tinkering with the format and see what I can find

eswenson1 commented 1 year ago

I agree this is frustrating. I've even seen failures on KA and KL when the ^^R is down a line. But I agree, it mostly works. I've also had pdp10-ks work for me too. I'm wondering if it is a timing error that comes from the time it takes XXFILE to handle the ^^R versus the first line of input to DDT, and this is why it is not completely consistent.

Clearly, above, the <SNAME "LCF"> that is supposed to go to MDL is going to DDT. That suggests that the DDT command line that invokes MDL is getting lost. It is clearly in the XXFILE, so what is happening?

HOWEVER, one thing I found is that XXFILE attempts to log in, and the success or failure of the XXFILE run depends on whether or not, post login, you get the prompt for reading mail or any other input the LOGIN might evoke. This shouldn't result in any difference during the builds, but when you are invoking XXFILE on your own account, I've noticed the LOGIN file can make it succeed or fail, depending on what it does.

oilcan-productions commented 1 year ago

Another tidbit. If I take zork.tcl out of the build for KS it times out compiling tcheck.

Job ECOMP wants the TTY
$p
DTTY; 716064>>.CALL 716712 (IOT)    $P
"So you re-owned me, eh?  I'm done anyway."
<FILE-COMPILE "tcheck >" "tcheck nbin">$
Input from DSK:.BATCH;TCHECK UJHM26
Output to DSK:.batch;TCHECK nbin
Bounds checking on.
Default declaration is UNSPECIAL.
Output in fast mode.
Temporary output going to:  _TEMPLT >
Running disowned, with record on DSK:.BATCH;TCHECK RECORD
Toodle-oo.
:PROCED
*
The last command timed out.
make: *** [Makefile:164: out/pdp10-ks/rp0.dsk] Error 1

oilcan-productions commented 1 year ago

Finally figured out why the simh build is timing out while executing xxfile zork xxfile:

<FLOAD "parser nbin">$

FILE-SYSTEM-ERROR
#FALSE ("FILE NOT FOUND" "DSK:LCF;PARSER nbin" 1048576)
FLOAD

The file is missing in the source, as far as I can tell.

eswenson1 commented 1 year ago

That file should have been created by the COMP XXFILE.

oilcan-productions commented 9 months ago

No matter how much I tweaked timeouts, the build kept failing. Then I tried moving the build to an SSD connected to the Raspberry Pi through USB 3.0, no dice. The only thing that made the build work was swapping the SD card for a "brand" name SD card with more space and faster R/W speeds.
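Since the storage turned out to matter here, a rough write-speed check before starting a build might save some time. This is only a sketch: the function name, the temporary file name, and the 64 MiB default are assumptions. Timing is in whole seconds, so the reported figure is capped at COUNT MiB/s on fast media.

```shell
# Hypothetical: write COUNT MiB of zeros into DIR, flush, and print MiB/s.
# Coarse by design: whole-second timing, integer result.
# usage: write_throughput DIR [COUNT]
write_throughput() {
    dir=$1
    count=${2:-64}
    tmp="$dir/its-disk-test.$$"
    start=$(date +%s)
    dd if=/dev/zero of="$tmp" bs=1048576 count="$count" 2>/dev/null
    sync    # make sure the data has actually reached the device
    end=$(date +%s)
    rm -f "$tmp"
    elapsed=$((end - start))
    if [ "$elapsed" -lt 1 ]; then
        elapsed=1    # avoid division by zero on fast disks
    fi
    echo $((count / elapsed))
}
```

Running `write_throughput . 256` in the build directory on both cards gives a crude comparison. What figure counts as "fast enough" for the ITS build is not established anywhere in this thread.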