PDP-10 / its

Incompatible Timesharing System

COMSYS - DM mail demon #1960

Open larsbrinkhoff opened 4 years ago

larsbrinkhoff commented 4 years ago

Communication/mail system by @jh95468.

COMMUD; COMSYS SAVE - Muddle save file.
COMMUD; TS COMSYS - Run Muddle with COMSYS save file.
LIBCOM - Data directory for COMSYS.

According to COMSYS @FILES there seem to be 30-odd source files, many (all?) of which are on the ToTS 7005366 tape.

Uses IRS, the information retrieval system.

jh95468 commented 4 years ago

I think you've reached the limits of my organic memory. I vaguely remember "LIBCOM", and the other names sound plausible. Plus 30 source files sounds about right.

IRS was what today you would call a database. IIRC, Mike Broos (MSB) wrote it. During the process, we surfaced a lot of bugs/features of ITS, and had to make changes to ITS in order to be able to have a reliable database. That included things like a system call to make sure that a file was successfully written to the physical disk ("sync" today in Unixese), and then a call to make sure that the directory blocks for that file were also written to disk. We had lots of database corruption before adding those - one of the few advantages of unreliable hardware being that it flushed out issues that had to be addressed in the operating system.
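For readers more familiar with Unix, the two-step guarantee described above (file contents first, then the directory blocks for that file) can be sketched with POSIX calls. This is a modern analogue only, not the ITS system calls, whose names are not given in this thread; `durable_write` is a hypothetical helper name.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data, then force both the file and its directory entry to disk (POSIX)."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the operating system
        os.fsync(f.fileno())   # step 1: file contents reach the physical disk
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)       # step 2: the directory entry for the file is durable too
    finally:
        os.close(dir_fd)


# Usage: write one record into a scratch directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "record.db")
    durable_write(target, b"message 1\n")
    with open(target, "rb") as f:
        written = f.read()
```

Skipping the second step is the classic way to end up with a file whose data is on disk but whose name is lost after a crash, which matches the kind of corruption described above.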

FWIW, we were not alone with such OS problems. Multics was evolving at about the same time. It had a notion that there was no distinction between different types of memory. Everything from magtape to disk to drum to core was all just memory, and Multics moved data around as it deemed useful. Us programmers didn't have to think about it.... Until one day when Multics crashed, it came back up and people noticed that their entire day's work had disappeared. With only a few users on the system, Multics had never felt the need to move anything from main memory to disk, so all the day's work disappeared on rebooting.

eswenson1 commented 1 year ago

I believe there is a conflict between the contents of mail files (e.g. EJS;EJS MAIL) generated by COMSAT and those generated by COMSYS. I noticed that COMBAT ZONE, when it sends email to users, uses a format that is incompatible with RMAIL and BABYL. Not sure we can have both mailers running on the same ITS.

larsbrinkhoff commented 1 year ago

> Not sure we can have both mailers running on the same ITS.

I wouldn't expect we can. Is it even an option to run COMSYS, you think? I didn't try it.

We would probably want COMSAT for the "DB/generic/frankenstein" build. If there ever is a "DM ITS" build in the future, that would want COMSYS if possible.

eswenson1 commented 1 year ago

Once I get some of the other outstanding functionality working and added, I’ll take a look at COMSYS. But I am not sure I like that the mail files are incompatible, because it prevents one from using RMAIL or BABYL.

eswenson1 commented 1 year ago

I've tried running COMSYS, and it dies with an MPV error. This happens because COMSYS is calling LSRTNS, which is known (currently) to die with the same error. Once we fix LSRTNS, we can try to get COMSYS running.

LSRTNS ticket is #2161.

eswenson1 commented 1 year ago

I've managed to get COMSYS limping on my machine. However, it isn't clear that this daemon was really in use in the later years of DM's lifetime. According to the DEMSTR program, which launches daemons, the entry for COMSYS was replaced with an entry for COMSAT.

DEMSTR 45 has this:

DEMSIG  COMSYS,0,60.            ; MESSAGE DEMON; PDL 10/25/78

And DEMSTR 49 (the latest in ToTS) has this:

        DEMSIG  COMSAT,0,10.            ; MAIL DEMON; SWG 1/26/83

jh95468 commented 1 year ago

Sorry, I've meant to comment about COMSYS but we've gone through several weeks now of storms, 5+ feet of snow, and long outages of power, Internet, etc. Still raining here today but power et al is back.

Re COMSYS:

I wrote COMSYS, as a research project under Prof. Licklider's (aka "Lick") direction. When I left MIT to join BBN in June 1977, COMSYS was still running but no one else knew much about how the code worked. For a while I used to get pleading mail sent by COMSYS itself to COMSYS-MAINTAINER@MIT-DM, which was automatically forwarded to me. Since that research activity had ended anyway, it was an obvious next step to replace it on DM with COMSAT, which had been running on the other ITSes and was actively maintained by KLH.

There's a story about COMSYS and that research work.

Lick had a vision of computers interconnected by a "Galactic Network", with software assisting humans in doing all sorts of things that people do. One consequence of that vision was that you always had your computer working for you even when you were not logged in to it. That was in sharp contrast to the norm of the era where your computer ran your programs only when you were "dialed up" and "logged in" to it.

COMSYS followed Lick's model. In a sense it was much like an operating system. It did work for many users at the same time. It allowed a user to schedule tasks to be performed at specific times. It allowed users to provide their own programs (written in Muddle so able to do pretty much anything), and for such programs to be triggered on occurrence of events. Most common was running a program when email arrived for you, and having it do things like assigning tags to the message, or forwarding it to others, or replying to it, etc. But you could literally have COMSYS do anything. I recall MSB would send his (large) Muddle programs to it as an email, and have COMSYS run the Muddle compiler and reply with the results.

COMSYS treated each message as a data structure, which could change over time. When a message was first created, it was assigned a unique (to MIT-DM) ID, which I managed to lobby to get included in the official email headers for ARPANET (the Message-ID: field). The idea was that the message would live possibly for a long time, and form part of a chain of messages as it was forwarded, or replied to, or archived to the Datacomputer, etc. So, for example, when you replied to a message, the original message would not get copied into your text, but rather a pointer (the Message-ID) would be included. Same with forwarding. The original text of a message would be preserved, but it could be "pointed to" by mechanisms similar to the "anchors" later defined for the Web. That would allow a "message" to be used as a living document, edited, revised, approved, and distributed as is common in office and military non-electronic behavior.

The MME - Military Message Experiment - in the 70s was an example of the linkage between such "research" and the "operational" testbeds of the people funding the whole project.

Anyone who got involved in a chain of messages in the future could access the whole history by using the various Message-IDs. E.g., if someone "forwarded" a message to you, you could actually get at all of the older replies and discussions.
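The pointer idea described above can be sketched in a few lines. This is an illustration only, not COMSYS code: the `Message` class, the `refers_to` field, the `history` function, and the ID format `1@MIT-DM` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Message:
    message_id: str
    text: str
    refers_to: Optional[str] = None  # Message-ID of the message replied to or forwarded


def history(store: Dict[str, Message], message_id: str) -> List[Message]:
    """Follow Message-ID pointers back to the start of the chain, oldest first."""
    chain: List[Message] = []
    seen = set()
    current: Optional[str] = message_id
    while current is not None and current in store and current not in seen:
        seen.add(current)            # guard against accidental pointer cycles
        chain.append(store[current])
        current = store[current].refers_to
    return list(reversed(chain))


# A reply and a forward carry only a pointer, never a copy of the earlier text.
store = {
    "1@MIT-DM": Message("1@MIT-DM", "Original proposal"),
    "2@MIT-DM": Message("2@MIT-DM", "Reply: looks good", refers_to="1@MIT-DM"),
    "3@MIT-DM": Message("3@MIT-DM", "Forwarded for approval", refers_to="2@MIT-DM"),
}
thread = [m.message_id for m in history(store, "3@MIT-DM")]
```

Because each message stores only the ID of its predecessor, someone who receives the forwarded message can recover the whole discussion from the store without any text having been duplicated.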

Other aspects were also on the to-do list. For example, the "forums" popular today would simply be emails linked together using MESSAGE-IDs. Functions like "notarization" and "escrow", important in business and office environments, could be handled by introducing other servers, similar to how the Datacomputer served as an archival service for important email. Computers interacting over Lick's "Galactic Network" would do all the sorts of things that people (and their companies and organizations) do when they interact.

COMSYS was just the "mail daemon" and had no explicit user interface. There were two other pieces - the "COMPOSER" (done by EHB I believe) to create new messages, and the "READER" (MSB? PDL?) to provide the user interface for reading, searching, and otherwise manipulating your messages.

In order for all that to work outside of the single MIT-DM environment, the "mail protocol" had to provide the necessary functionality. That resulted in a long and energetic debate, mostly by email and on the HEADER-PEOPLE mailing list that KLH started. Two camps emerged. Our (Lick's) camp wanted a rich protocol that could transfer message details as data structures. The larger camp wanted something very simple that they could implement easily, since they weren't involved in the research area of Lick's vision.

There's some historical evidence of that debate. For example, RFC713 - see https://www.rfc-editor.org/rfc/rfc713 - defined a way to send data structures over the net. It was of course carefully designed to mimic Muddle's data primitives, so encoding and decoding in Muddle would be almost as simple as just using PRINT and READ.

The larger camp won, and it was decided that a simple mechanism was needed as a first step to get basic email working. The more powerful mechanisms could follow shortly as the next version of the protocol.

That was 50 years ago.

At the time, the technology available was quite different from today's. For example, there were no "relational databases" yet. If I did COMSYS today, every message would be in tables in a database (MySQL or such would be fine), and data transfers could be handled using some current scheme like XML, JSON, etc.

Hope this bit of history is interesting as a background to the story of COMSYS.

eswenson1 commented 1 year ago

That's great and fascinating info. Thanks, Jack.

I have the daemon set up on my machine, ES, but at present, messages created by the COMSYS-compatible mail program, and written to COMSYS;M >, are getting processed, but fail, for some reason. COMSYS writes out a COMSYS;M-LOST > file for each such message. It seems to also get itself into a loop trying to send messages to COMSYS-MAINTAINER, which fails, as well. I'm struggling to figure out why it is failing.

When I run it in my own process, I don't get very far -- I get these errors:

0 ERROR [TYPE-MISMATCH!-ERRORS MAP <OR FALSE <VECTOR [REST FIX STRING]>> #LOSE *000000000000* SET]
1 SET   [MAP #LOSE *000000000000*]
2 EVAL  [<SET MAP <DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-MSG-MAP ,SCRATCH-SPACE>>]
3 COND  [((<SET MAP <DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-MSG-MAP ,SCRATCH-SPACE>>
           <AND ,READ-WRITE? <SCROUT "Read Msg Map">>
           <SET MAP
                <MAPF ,VECTOR
                      <FUNCTION (X!-IM-READ)
                              #DECL ((X!-IM-READ) <OR FIX STRING>)
                              <COND (<TYPE? .X!-IM-READ FIX> <MAPRET .X!-IM-READ>) (ELSE <MAPRET <STRING .X!-IM-READ>>)>>
                      .MAP>>
           <SETG COMSYS-ASYLUM-MAP .MAP>)
          (<ERROR MAP-DISAPPEARED!-ERRORS MSG-LOC .MAP>))]
4 EVAL  [<COND (<SET MAP <DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-MSG-MAP ,SCRATCH-SPACE>>
                <AND ,READ-WRITE? <SCROUT "Read Msg Map">>
                <SET MAP
                     <MAPF ,VECTOR
                           <FUNCTION (X!-IM-READ)
                                   #DECL ((X!-IM-READ) <OR FIX STRING>)
                                   <COND (<TYPE? .X!-IM-READ FIX> <MAPRET .X!-IM-READ>) (ELSE <MAPRET <STRING .X!-IM-READ>>)>>
                           .MAP>>
                <SETG COMSYS-ASYLUM-MAP .MAP>)
               (<ERROR MAP-DISAPPEARED!-ERRORS MSG-LOC .MAP>)>]

It may be that I'm getting this when the daemon (SYS;ATSIGN COMSYS) is running as well, but PEEK only shows me that the job is in a .SLEEP wait. If I :UJOB the job after detaching it, it doesn't seem to fail, but reenters the .SLEEP loop.

I'm not sure why I can't run it in a non-COMSYS/non-DEMON process....

eswenson1 commented 1 year ago

SYSTEM-ASYLUM, COMSYS-MSG-MAP, and SCRATCH-SPACE are:

,SYSTEM-ASYLUM◊
#ASYLUM [#CHANNEL [4 "READ" "SYSTEM" "ASYLUM" "DSK" "COMDAT" "SYSTEM" "ASYLUM" "DSK" "COMDAT" 163 23748404430 <ERROR
END-OF-FILE!-ERRORS> 0 0 0 0 10 ""] 200 201 ![#WORD *000000000000* #WORD *000000000000* #WORD *000000000000* #WORD *000000000000*
#WORD *000000000000* #WORD *000000000000* #WORD *000000000000* #WORD *000000000000*!] ![18 216 38 0!] 0 ![2 203 40 0 -1 204 0 0!]
[202 0]]
,COMSYS-MSG-MAP◊
2
,SCRATCH-SPACE◊

        PGS        HIGH WORD             LAST WORD
#PBLOCK [5 #WORD *000000643777* #WORD *000000000000*]
CURRENT LOCATION = #WORD *000000643614*
LOWEST LOCATION  = #WORD *000000632000*
FVC LOCATION     = #WORD *000000000000*
FREE LIST LENGTH = 0
SPEC = #WORD *200000000000*

eswenson1 commented 1 year ago

If you'd like to help debug, I'd love the help. You have an account on ES.

I used the appropriate function in COMSYS to create the COMDAT;SYSTEM ASYLUM, and the other ASYLUM databases.

The error, above, is obviously due to this:

<DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-MSG-MAP ,SCRATCH-SPACE>◊
#LOSE *000000000000*

But "Why?" is the question.

eswenson1 commented 1 year ago

The code that sets up comdat;system asylum and comdat;msg asylum is as follows:

<DEFINE DB-SETUP ("OPTIONAL" (FIRST 1) "AUX" NAME ASY MSGN S)
    #DECL ((NAME S MSGN) STRING (ASY) ASYLUM (FIRST) FIX)
    <INIT-SPACES>
    <MAKE-DATA-BASE <SET NAME <DATUM "COMSYS-SYSTEM-ASYLUM">> 20>
    <MAKE-DATA-BASE <SET MSGN <DATUM "COMSYS-MSG-ASYLUM">>>
    <SET ASY <OPEN-DATA-FILE .NAME>>
    <SET S <ASTRING ,SCRATCH-SPACE .MSGN>>
    <DATA-APRINT .ASY ,COMSYS-PENDING-QUEUE ,QUEUE-SPACE <ALIST ,QUEUE-SPACE 0>>
    <DATA-APRINT .ASY ,COMSYS-MSG-MAP ,SCRATCH-SPACE <AVECTOR ,SCRATCH-SPACE 1 .S>>
    <DATA-PRINTW .ASY ,COMSYS-MSG-MAP .FIRST>
    <CLOSE-DATA-FILE .ASY>
    T>

Just after a call to DB-SETUP, I get the #LOSE from DATA-AREAD. But you can see that DB-SETUP did this:

<DATA-APRINT .ASY ,COMSYS-MSG-MAP ,SCRATCH-SPACE <AVECTOR ,SCRATCH-SPACE 1 .S>>

which should have allowed the DATA-AREAD of the same asylum database (comdat;system asylum) using COMSYS-MSG-MAP, which is 2. So I'm befuddled.

jh95468 commented 1 year ago

Do you know what vintage that COMSYS is?  It's been 50 years (well, 45 at least) since I touched anything Muddle.  So maybe I just don't remember, but I don't recognize any of COMDAT, or ASYLUM, or MAPs or SCRATCH-SPACE.

Perhaps that COMSYS is from after June 1977 when I left?   I recall that Mike Broos (MSB) was working on something called "IRS" for Information Retrieval System, that was a kind of spinoff from COMSYS to create a more general data storage mechanism not just for COMSYS.  If that COMSYS is from 1978 or later, it may be a version that was changed to use IRS.   We actually ran into a lot of bugs/features trying to create a reliable datastore, and had to add code to ITS so a program could make sure that data was written to physical disk, as well as that the related directory file .FILE. (DIR) was also on disk.

In any case, I do remember COMSYS-MAINTAINER.  While I was there, it was an alias for @.***   But I don't recall exactly how that worked.  There may have been a table of aliases somewhere.   Or it may be that there was a Muddle function set to run on receiving any mail for the maintainer that simply forwarded the message to JFH. That would have been easy to implement, just a line of code.

Most actions in COMSYS were driven by using a field of each message called PROCESSING-NEEDED.   To get something to happen to a message you would store something in PROCESSING-NEEDED.   COMSYS would notice that and do the processing when it got around to it.   The idea was that the human user should never have to wait for some lengthy task to complete, but rather have such tasks always run "in the background".   IIRC, COMSYS would also watch the incoming message directory (the place where M> files went), read them in, create a message data structure and associated MESSAGE-ID, and put something like "SENDING" into that message's PROCESSING-NEEDED.
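The PROCESSING-NEEDED mechanism described above amounts to a work list stored on each message and drained by the daemon in the background. The following is a minimal sketch of that pattern, not the Muddle implementation: the dictionary layout, the `HANDLERS` table, and the function names are assumptions, and only the `SENDING` token comes from this thread.

```python
from typing import Callable, Dict, List


def send(msg: dict) -> None:
    """Hypothetical handler: 'deliver' by recording the recipients on the message."""
    msg["delivered-to"] = list(msg.get("TO", []))


# Hypothetical table mapping a PROCESSING-NEEDED token to the code that performs it.
HANDLERS: Dict[str, Callable[[dict], None]] = {"SENDING": send}


def background_pass(messages: List[dict]) -> int:
    """One daemon pass: perform and clear all pending work; return the number of tasks run."""
    done = 0
    for msg in messages:
        pending = msg.get("PROCESSING-NEEDED", [])
        while pending:
            task = pending.pop(0)   # take the oldest request first
            HANDLERS[task](msg)
            done += 1
    return done


# The user (or the M-file reader) only stores a request; the daemon does the work later.
messages = [{"MESSAGE-ID": "1@MIT-DM", "TO": ["EJS"], "PROCESSING-NEEDED": ["SENDING"]}]
tasks_done = background_pass(messages)
```

The point of the design is that storing a token is instantaneous for the person at the terminal, while the lengthy part runs whenever the daemon gets around to it.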

If you can get at the messages that COMSYS is sending to COMSYS-MAINTAINER on failures, they might give more clues about what's going on. Of course it's likely that nothing works for mail to/from another machine, but COMSYS might be trying to open an NCP connection for incoming mail and failing because it doesn't work?

Jack

jh95468 commented 1 year ago

My guess is that the DATA-AREAD is failing for some unanticipated reason.   If it was expected to occasionally get an error, there would have been code to detect that and schedule COMSYS to try again later.

I don't recall DATA-AREAD at all, so it may be part of MSB's IRS. It may be that some other process (part of IRS) is supposed to be running already, so that COMSYS through DATA-AREAD can "map" its memory space into COMSYS' space.   That would make all of the data structures kept in that other process immediately available to COMSYS.   The "old way" required COMSYS to read in the entire data structure every time it started up, which could take a while. Muddle added the ability to share data between ITS processes, and to corral data structures into distinct "spaces" shared and coordinated between multiple ITS jobs.

I vaguely recall something called MSGIRS.   May be related.....

Jack

eswenson1 commented 1 year ago

COMSYS was the vintage 1978-1982 mail system. The sources and binaries lived in the COMMUD directory. The COMSYS directory was where the input mail files (COMSYS;M >) would be placed by the MAIL program and programs that wanted to send email. The COMDAT directory was where COMSYS kept its databases (system, queues, message maps, etc.).

COMSYS leverages ASYLUM, which leverages MADMAN. These are low-level data-file management routines. For example, the message database is an ASYLUM database, which contains multiple maps. ASYLUM provides some high-level APIs, like DATA-APRINT, DATA-IPRINT, DATA-AREAD, and DATA-IREAD. These are implemented on top of the MADMAN functions AREAD, APRINT, IREAD, and IPRINT.

Queued messages are initially files in COMSYS, which are MUDDLE VECTORS (within the "[" and "]"). An example of one is:

"WHEN-ORIGINATED"
15733534769
"TO"
("COMSYS-MAINTAINER")
"SUBJECT"
"ERROR IN FILE INPUT"
"TEXT"
"FILE: DSK:COMSYS;M 1
REASON: ERROR IN FILE INPUT
MSG #: --none--
ADDRESSEE:

"
"FROM"
"COMMUNICATION-DAEMON"
"SCHEDULE"
("SENDING")

That is one of the "error" messages that COMSYS tried to send after getting a message file like this one:

"WHEN-ORIGINATED"
15733524350
"SENDER"
"EJS"
"FROM"
"EJS"
"TO"
("ejs" )
"ACTION-TO"
("ejs")
"SCHEDULE" ("SENDING")
"SUBJECT" "test"
"TEXT" "test"
"CONSOLE-MINUTES" 19
"CPU-SECONDS" 0.41159999E-2

I don't think the culprit is networking here. The recipients of both of the above messages are local to the machine running COMSYS (ES).
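To inspect message files like the two shown above outside of MDL, a small reader can treat the contents as alternating field names and values. This is a rough sketch under stated assumptions, not the COMSYS reader: it handles only quoted strings, parenthesized lists, and plain decimal numbers, it optionally strips an enclosing "[" and "]", and the names `parse_values` and `parse_message` are made up for the example.

```python
import re

TOKEN = re.compile(r'''
    \s*(?:
        (?P<string>"(?:\\.|[^"\\])*")                      # quoted string, may span lines
      | (?P<open>\()                                       # start of a list
      | (?P<close>\))                                      # end of a list
      | (?P<number>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)    # integer or floating point
    )''', re.VERBOSE | re.DOTALL)


def parse_values(text):
    """Turn the file text into a flat Python list, with nested lists for (...)."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1].rstrip()
    pos, stack = 0, [[]]
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.group("string") is not None:
            body = m.group("string")[1:-1]
            stack[-1].append(re.sub(r'\\(.)', r'\1', body, flags=re.DOTALL))
        elif m.group("open"):
            stack.append([])
        elif m.group("close"):
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            tok = m.group("number")
            stack[-1].append(float(tok) if any(c in tok for c in ".eE") else int(tok))
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def parse_message(text):
    """Pair up alternating names and values into a dict."""
    values = parse_values(text)
    if len(values) % 2:
        raise ValueError("field name without a value")
    fields = {}
    for key, value in zip(values[0::2], values[1::2]):
        if not isinstance(key, str):
            raise ValueError(f"field name is not a string: {key!r}")
        fields[key] = value
    return fields


SAMPLE = '''"WHEN-ORIGINATED"
15733524350
"SENDER"
"EJS"
"FROM"
"EJS"
"TO"
("ejs" )
"ACTION-TO"
("ejs")
"SCHEDULE" ("SENDING")
"SUBJECT" "test"
"TEXT" "test"
"CONSOLE-MINUTES" 19
"CPU-SECONDS" 0.41159999E-2
'''
fields = parse_message(SAMPLE)
```

A reader like this makes it easy to diff the field sets of two message files, for example to see which fields one producer writes that another does not.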

The issue may be (but I really doubt it) that the ASYLUM databases created by DB-CREATE are not compatible with the ones that COMSYS is trying to read. However, the COMSYS I built was compiled from a consistent set of sources that range from the earliest in 1979 to the latest in 1982. These came from a single ITS backup dump tape.

I did find that the FBINs and NBINs on that tape didn't seem to work properly -- the FBINs would give pure-load-failures -- presumably because my pure library database was not from the same timeframe. The NBINs would give some RSUBR bad format error (can't remember the exact message). So for all those, I recompiled the NBINs (and haven't bothered to create FBINs yet -- until I can get it running properly).

There is one oddity that is troubling me. When I run COMSYS interactively (with DEBUG-DEMON? set to T), I get the above error, and COMSYS does NOT write out a COMSYS;M-LOST > file or get itself into a loop of trying to send those and failing, which would produce an ever-growing number of M-LOST files. However, when I run SYS;ATSIGN COMSYS as a DEMSIG-initiated daemon, I don't see any error messages going anywhere, but I do get the loop of reading M > files and writing M-LOST > files. If I don't kill the daemon, the COMSYS directory keeps filling up.

Why, I wonder, is the behavior different when run interactively? It is possible that running interactively causes the error to be signaled and the REPL to enter a recursive READ after displaying the error, whereas running as a daemon (with no TTY) gobbles the error and tries to "recover" by sending a message to COMSYS-MAINTAINER, which also fails -- thus the infinite loop.
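The suspected feedback loop can be modeled in a few lines. This is a toy model of the hypothesis only, not COMSYS's actual control flow: `run_daemon`, `always_fails`, and the `report_errors` switch are invented for the illustration, and `max_steps` exists solely to keep the demonstration finite.

```python
from collections import deque


def run_daemon(inbox, process, max_steps=10, report_errors=True):
    """Process queued files; on failure, record an M-LOST entry and queue an error report."""
    queue = deque(inbox)
    lost = []
    steps = 0
    while queue and steps < max_steps:
        name = queue.popleft()
        steps += 1
        try:
            process(name)
        except Exception:
            lost.append(f"M-LOST for {name}")
            if report_errors:
                # The report is itself a message that goes through the same failing path.
                queue.append(f"report about {name}")
    return lost


def always_fails(name):
    """Stand-in for an input step that rejects every file."""
    raise RuntimeError("ERROR IN FILE INPUT")


looping = run_daemon(["M 1"], always_fails, max_steps=5)
bounded = run_daemon(["M 1"], always_fails, max_steps=5, report_errors=False)
```

With reporting enabled, a single bad input produces one lost file per step until the step limit is hit; with reporting disabled, it produces exactly one. That is consistent with the directory filling up only when the daemon tries to mail its own failures.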

But somehow I doubt that the COMSYS daemon is getting the same error as when I run it interactively to debug. The M-LOST files written suggest an error in the input message (M >). You can see that above. If it was getting the same error as I get interactively, I would have expected the M-LOST file to indicate something similar to what I see interactively.

Thoughts?

jh95468 commented 1 year ago

Curious.  How did you create that M input file?  COMSYS is complaining that the input file is not formatted correctly.   It all looks plausible, except I don't remember anything about CONSOLE-MINUTES or CPU-SECONDS.   Try deleting those.

Jack

eswenson1 commented 1 year ago

I used the SAVE file for the MAIL program (librdr;MAIL87 SAVE). Restored that into a MDL 55 and it started prompting me for fields. I provided them all, and it wrote out COMSYS;M 1.

eswenson1 commented 1 year ago

I removed those fields from my original (MAIL-produced) M file, and restarted the COMSYS daemon. It is in a .SLEEP right now and hasn't done anything yet.

eswenson1 commented 1 year ago

Here is a short writeup related to COMSYS from 1979. Note the reference to the rewrite of COMSYS:

Reader and Comsys

   A new message reader-composer program named "Reader" was written in the past year
(Lebling).  It was intended to be stop-gap effort to fill in until the
DMS reader-composer is modified to deal with the local environment.  It
was also intended to explore some of the DMS ideas in a different
environment with a different kind of users.

   Reader incorporates, almost unmodified, the office model of
DMS.  It operates on messages produced by the "Comsys"
message daemon, and produces messages for that daemon to transmit.
Messages are maintained in a central store accessible to all users and are
copied to the user's environment once they reach a certain age.

   The message composition part of Reader is the "POD" message composer
that has been running on MIT-DM for several years.  Since it is a MDL
package, it was essentially trivial to incorporate it bodily into Reader.
Due to the MDL library system, Reader users share this code with those
using POD only as a composer.  Reader also contains a significant amount
of code from Comsys, which is shared with the daemon itself.

   In addition to modifications to enhance interaction with Reader,
Comsys underwent a major redesign of its data storage mechanisms in the
past year.  The major purpose was to speed up the daemon's operations,
which in the previous implementation consumed a significant fraction of
the system resources.  A new data storage mechanism for Comsys was
designed and implemented, using the MADMAN/ASYLUM object-oriented file system
originally developed for DMS [1].

eswenson1 commented 1 year ago

From what I can tell, COMSYS was being developed on DM as late as October, 1982. The latest sources, NBINs, and FBINs are from that timeframe. The note in a 1983 version of DEMSTR suggests that it was replaced by COMSAT at that time.

jh95468 commented 1 year ago

MIT-DMS SYS.16.00.pdf

Yes, that makes sense. I've been trying to remember exactly how the user interface for mail worked back in the late 70s, but it's been too long. The "MAIL" program that creates the M files uses the "alternate" way of interacting with COMSYS by dropping a file into a well-known directory. The "main" way of interacting was through the Muddle library routines that provided ways to share data between ITS jobs. The "different kinds of users" probably refers to the military users involved in the MME (Military Message Experiment).

So when COMSYS spots an M file, it would read it in and create a message in the database. It would then "send" the message. For local users (mailbox on DM), that meant simply making the message available to that user. That recipient might not see it until s/he fired up the "READER" program to read mail. Or s/he might have put a Muddle function into the database to be run whenever a message arrived, which could do pretty much anything. I'm pretty sure there was some library function that you could use, which would simply append the text and a simple header for each incoming message to a file in the user's home directory - i.e., what other mail systems typically did in that era.

So if COMSYS is not failing when it reads the M file without the extraneous fields, it's probably just sucking in the new message and placing it in its database, waiting for the recipient to use a READER program to see the new mail. If you can find a READER or COMPOSER program from the same era, it might make a complete mail system....

The attached file is an overview of the structure of the whole system and how it worked. It's at an "architectural" level, so doesn't contain much detail about the actual code. But it may help understand what's going on. I think there's a previous upload online somewhere but it lacks all of the diagrams that I scanned and put at the end of the file. Note that between June 1977 when I left and 1982 is a lot of time, so things may have changed quite a bit. What I did was all prior to MADMAN, ASYLUM, etc. PDL may remember more (but I can't remember his github handle...).

eswenson1 commented 1 year ago

After looking at the code, it appears that if the global variable SCRCHN is defined, then when the daemon is running headless (no TTY), it will emit messages to that channel rather than throw them in the bitbucket. I updated COMDAT;COMSYS GOCODE (loaded on startup for patches), and see the following output:

***** 0936 EST *****
Beginning Demon Run, Version: 400, Wednesday, 15 Mar 19123 09:36 EST
>>>> Top Level Error Interrupt, Args -- [DANGEROUS-INTERRUPT-NOT-HANDLED MPV #
WORD *310000270411*] <<<<
ERROR Priority 100
  [FUNCTION]
  ERROR-FCN!-IGC!-GC!-PACKAGE
  L-HANDLER

0  RUNINT       [#FUNCTION(&) #FRAME ERROR & MPV!-INTERRUPTS #WORD?&?]
1  INTERRUPT    [ERROR!-INTERRUPTS #FRAME ERROR & MPV!-INTERRUPTS #WORD?&?]
2  ERROR        [& MPV!-INTERRUPTS #WORD *310000270411*]
3  NXTPNQ       []
4  EVAL         [<NXTPNQ>]
5  EVAL         [<SET NXT <NXTPNQ>>]
6  EVAL         [<NOT <SET NXT <NXTPNQ>>>]
8  EVAL         [<OR ,STOP? <NOT <SET NXT <NXTPNQ>>>>]
10 EVAL         [<COND (<OR ,STOP? <NOT <SET NXT <NXTPNQ>>>> <RETURN T>)>]
12 EVAL         [<REPEAT () <&> <&> <&> <&> <&> <SET OM .M!-IM-DMN> &..>]
13 EVAL         [<BCKDO>]
15 EVAL         [<REPEAT () <BCKDO> <UPDMSG T> <&> <FDO> <&>>]
17 EVAL         [<REPEAT () <&> <SETG PENDING <>> <&>>]
19 EVAL         [<COND (<&> <CRASH-FIXUP> &..) (ELSE <SCROUT "&" .INIT &..>)>]
21 EVAL         [<PROG (&) #DECL(&) <&> <PRINC ,HDR> <&> <&> <&> &..>]
22 EVAL         [<WORK>]
23 EVAL         [<SET RES!-IM-DMN <WORK>>]
25 EVAL         [<COND (&) (ELSE <SET RES!-IM-DMN <WORK>>)>]
27 EVAL         [<REPEAT () <&> <&> <AND ,DEBUG-DEMON? <RETURN>> <&>>]
28 EVAL         [<MAKE-RUN>]
30 EVAL         [<COND (<&> "DONE") (<&> <&> &..) (ELSE <&> <&> <&> &..)>]
31 EVAL         [<SAVE-IT 400>]
32 LISTEN       []

3       NXTPNQ
4       NXTPNQ
13      BCKDO
22      WORK
28      MAKE-RUN
31      SAVE-IT

Setting Flag to Stop Background Scan.
**** Closing Script, Wednesday, 15 Mar 19123 09:36 EST ****

jh95468 commented 1 year ago

MPVs were expected when you ran compiled and optimized code, where most or all of the checks on data formats had been optimized away. So a simple bug could easily generate a reference to non-existent memory. The way to debug such problems was to run the code in fully interpreted form, i.e., using the Muddle source instead of a binary. There was a way to finely control such debugging, e.g., by only loading the interpreted version of a single module or function, setting breakpoints at the useful places, etc. But I don't remember at all exactly how that was done.... but in this case I would have (somehow) loaded the interpreted version of NXTPNQ, set a breakpoint in it, and run it again with all the Muddle data checks (DECLs etc.) in effect. NXTPNQ sounds like the function that runs when some PROCESSING-NEEDED task completes, and tries to get the next task and MPVs when there isn't anything left.

jh95468 commented 1 year ago

This is a comment on COMSYS that I sent out recently to the internet-history forum. I thought it might be relevant here too.

With all the discussion of email headers, I offer a little History. What I remember was probably never written down, so you won't find it in RFCs et al.

Professor Licklider (aka Lick) at MIT was my mentor and boss during the 1970-1977 period. Lick had a vision of using computers to facilitate human activity, which he wrote about in several papers and books, discussing his notion of a "galactic network". He spent time at MIT and ARPA, and was very focussed on how to use the new capabilities made possible by the ARPANET. His vision drove the research we did in the "Dynamic Modelling" group at MIT in the 70s.

Part of that vision involved using computers to facilitate human communications and collaboration, in a very general sense. What we now know as "email" was just a part of that. In Lick's thinking, research was about figuring out how to use computers and networks in support of human interactions. The focus was on using the networked computer system to automate traditional human activities.

So, for example, "email" existed at the time but the mechanisms on the ARPANET were extremely primitive. There were no "headers" at all, no addresses, no formats. To send an email, you opened an FTP connection to the computer where your addressee had an account, as if you were going to transfer a file, but instead typed the command MAIL followed by the addressee's user name. The target computer, if it had that name as one of its users, would reply something like "Type message, ending with a line containing just a period." You then typed your message, and as soon as you input a line containing just a period, the connection was terminated and that message was accessible to that user as a file somewhere.
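The "line containing just a period" convention described above is easy to illustrate. The following is a minimal Python sketch of the receiving side only. It is not the actual ARPANET FTP implementation, and the function name `read_message` is made up for the example.

```python
def read_message(lines):
    """Collect message lines until a line containing just a period."""
    body = []
    for line in lines:
        text = line.rstrip("\r\n")
        if text == ".":
            # Terminator seen: everything before it is the message.
            return "\n".join(body)
        body.append(text)
    raise ValueError("input ended before the terminating '.' line")
```

Note that a line which merely ends with a period is ordinary text. Only a line consisting of the period alone ends the message.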

Lick's vision was considerably richer. A "message" was something that could have a very long lifetime. It would change over time, as it got passed around from person to person. It would become associated with other messages as people replied and discussed whatever the message was about. It might get saved for posterity. It might be "vetted" by some third party so that its existence and contents could later be verified.

The contents of a message were simple text. ASCII. No fancy fonts. No graphics. No images. No videos. Computers and the I/O devices attached to them only did text. But a message might be a short note, or it could be a 50-page document, or anything in between - basically all of the paperwork that you'd find in a typical office environment.

That vision reflected the reality of how humans actually interacted. For example, in corporate and government environments, a "message" was often passed along a chain of people who had to comment, augment, approve, or otherwise act on that message before it was actually transferred to its addressee. A similar process might happen at the addressee's organization, as the message made its way through the "mail room", and was distributed to various departments along its way to the specified addressee. In a corporation or government environment, the addressee was often too busy to open mail; someone else did that, sorted it by priority (from their perspective), and sent the message to whoever should handle it. Computers could help do such stuff, but only if they had the right information to feed their algorithms.

My task in Lick's group at MIT was to create a "Communications Daemon" that would perform such processing. It would be running all the time, interacting with other such servers elsewhere on the ARPANET, and providing an interface for user-interface programs that others in the group were writing to create various kinds of human interfaces to read, compose, reply, archive, forward, and otherwise process such messages.

"Users" could also of course be computer programs themselves, performing some function by passing messages within the corporation where no human judgement was necessary. An example might be a "message" that was a purchase order from some customer, which could be immediately passed to the accounting, inventory, shipping, etc. computers with no human involvement normally necessary. Humans wanted to send messages. Computers did also.

In the context of the ARPA project, "Messaging" included all such communications, ranging from short ephemeral notes (e.g., "Anybody want to go to the sub shop?") to long-lived documents with business, legal, or military purposes. We used ARPANET to submit proposals, reports, and other such documents which in earlier times would have travelled in the ubiquitous manila envelopes.

Lick was persistent in telling us that we were most definitely not creating "electronic mail". We were creating a Messaging system using the ARPANET. One of the reasons for that was that "mail" was legally the monopoly of the Postal System as granted by Congress almost two centuries earlier. Only the Postal Service could carry "mail". That was a political morass that was worth avoiding, since it could cause the project's funding to be cut off.

So we built a Messaging system, while others were building mail programs. That led to The Header Wars.

eswenson1 commented 1 year ago

Thanks, Jack, for the PDF of SYS.16.00. Do you have PDFs for others of these MIT-DMS.SYS.* documents? I have the source for 16.00, 16.01, and 16.03 (all about COMSYS), but I didn't have any PDFs (until you posted 16.00).

So I did add an interpreted version of NXTPNQ to COMDAT;COMSYS GOCODE (this is loaded after the image is restored and before the main daemon loop is started). I then ran the daemon and nothing was emitted to the SCRCHN file that I had also opened up before redefining NXTPNQ. The daemon was waiting in a .SLEEP call. I did try to load COMSYS interactively, then load COMDAT;COMSYS GOCODE, and then run NXTPNQ -- it returned #FALSE () with no errors. However, there is a COMSYS;M 1 file present, so it should have found that one mail file to process. It didn't appear to.

I then evaluated ,NXTPNQ in my interactive session, and see that it is an #RSUBR, which means that the interpreted version didn't "take". I noticed that when I run FLOAD COMDAT;COMSYS GOCODE, that I get an error:

*ERROR*
NXTPNQ
ALREADY-DEFINED-ERRET-NON-FALSE-TO-REDEFINE
LISTENING-AT-LEVEL 2 PROCESS 1

But I did SETG REDEFINE to T and that did appear to get evaluated:

,REDEFINE◊
T

I thought setting REDEFINE to T would prevent these ALREADY-DEFINED-* errors. No?

eswenson1 commented 1 year ago

I guess you have to change the LVAL of REDEFINE, not the GVAL. Changing my code to set the LVAL of REDEFINE prior to the redefine of NXTPNQ allows it to be redefined. However, the interpreted version of NXTPNQ fails in the same way. It seems that the issue is much deeper. What I've managed to find out is that the global PENDING-QUEUE is a LIST of three elements -- at least that is what it is SETGed to by this code:

<SETG PENDING-QUEUE <DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-PENDING-QUEUE ,QUEUE-SPACE>>

However, when I try to evaluate, in the REPL, the GVAL of PENDING-QUEUE, the printer is unable to print it. It gets an error in the APPLY method. I looked at the TYPE of PENDING-QUEUE and it is a LIST. I checked its length and it is 3. The TYPE of the second element is an FSUBR and of the third element is a LOSE (whatever that is). Trying to determine the type of the first element, however, yields:

<TYPE <1 ,ZZZ>>◊

*ERROR*
TYPE-UNDEFINED
TYPE
LISTENING-AT-LEVEL 14 PROCESS 1

This suggests that the value of PENDING-QUEUE, as read from the SYSTEM asylum, is bogus (not a valid MDL object). And this suggests that either ASYLUM or MADMAN is bad. The original FBINs for these yield pure library failure errors, so I deleted these (I don't think there is really anything to do in this case unless I can find the "right" FIX55 and SAV55 files to load into the pure library).

I believe I recompiled ASYLUM and MADMAN because even the NBINs seemed not to work. So I guess it is possible that the COMPILER is generating bad code, or there is a mismatch in versions somewhere. Geez. This is so frustrating.

jh95468 commented 1 year ago

Re PDFs - I got the PDFs by scanning the paper copies I still had after 50 years. There weren't a lot of docs - as the mantra goes "The documentation is the code..." I'll see if I have any more as PDFs.

I'm remembering a little of how to debug Muddle programs. I've never seen anything like this:

LISTENING-AT-LEVEL 14 PROCESS 1

That means you've been hitting errors and then trying again and hitting more errors, and trying again, ... 14 times.

Chances are that all sorts of variables and data structures are inconsistent due to all of the in-progress layers of code.

Not sure this behavior is explained anywhere -- the basic idea of "LISTENING AT LEVEL x" is that when Muddle encounters a problem, it returns control to the console so you can look around and see what the situation is. It's almost like there's a built-in DDT that Muddle escapes to when it detects something wrong. It's usually LEVEL 2. Generally you would look around to see what's happening, and possibly fix it by changing some data structure, and then continue the program where it stopped, by executing $ or whatever it said to do. That would continue the program so you could see what happened a bit later.

Sometimes, while poking around at LEVEL 2, you might happen to trigger another error and go to LEVEL 3. After fixing that problem you would ERRET back to 2, then ERRET back to the original execution stream.

Getting to LEVEL 14 is probably just asking for trouble.

Note that in this error:

*ERROR*
NXTPNQ
ALREADY-DEFINED-ERRET-NON-FALSE-TO-REDEFINE
LISTENING-AT-LEVEL 2 PROCESS 1

You could have typed $ which would tell Muddle to go ahead and redefine NXTPNQ and continue running at LEVEL 1.

Someone (NDR?) even wrote a package that somehow enabled you to run Muddle code backwards and forwards repeatedly, to help figure out what was happening.

I'd suggest trying again but don't let yourself get above LEVEL 2 or maybe 3. You need to back out of each error to be safe unless you really know what the code and system are doing.

jh95468 commented 1 year ago

I obviously don't understand github behavior. All of my angle-bracket ERRET angle-bracket comments in the last post just disappeared.

eswenson1 commented 1 year ago

Hi Jack. I know about listener levels, <ERRET> and <ERRET T>. In this particular debugging case, I wasn't really caring that I was not ERRETing -- I was just trying things out.

And yes, I was using <ERRET T> to let the interpreter redefine the function. However, I needed something to do this automatically in a script FLOADed by COMSYS. I found that I could get this to work by setting the LVAL rather than the GVAL of REDEFINE in the file. In other words, I got past this error.

I have a much better idea of what is going on now. Each time a message is processed from the queue, we need to generate a new message id. This is done by the NEW-MESSAGE-ID function. The first time this is called, all is good, and we get the message id 1. The second time this is called, we signal the error that we couldn't generate a new message ID. The reason for this appears to be a locking issue. This is really getting to the low levels of ASYLUM and MADMAN, with which you said you weren't familiar.

NEW-MESSAGE-ID is defined as follows (in COMMUD;M-DAC >):

<DEFINE NEW-MESSAGE-ID ("AUX" (TT <HIGH-MESSAGE-ID>) R)
        #DECL ((VALUE TT) FIX (R) <OR FALSE MANIAC>)
        <REPEAT ()
                <COND (<SET R
                            <DATA-PRINTW ,SYSTEM-ASYLUM
                                         ,COMSYS-MSG-MAP
                                         <+ .TT 1>>>
                       <RETURN .TT>)
                      (ELSE <ERROR CANT-ALLOCATE-NEW-ID!-ERRORS .R>)>>>

HIGH-MESSAGE-ID just reads the value of <DATA-READW ,SYSTEM-ASYLUM ,COMSYS-MSG-MAP>. That returns the current value of the monotonically-increasing message id. And the code in NEW-MESSAGE-ID tries to update this value in the msg map in the system asylum. The trouble is that the first time that this is called, it works fine (I can get it to update from 1 to 2). But the next time NEW-MESSAGE-ID is called, I can't. It returns the CANT-ALLOCATE-NEW-ID error because the result of DATA-PRINTW is #FALSE (5). Looking at the code for DATA-PRINTW (in ASYLUM), we see this:

<DEFINE DATA-PRINTW (DC ID WD "AUX" NUV)
        #DECL ((ID) <OR STRING MANIAC FALSE FIX> (DC) ASYLUM (WD) <PRIMTYPE WORD>
               (NUV) <UVECTOR [4 WORD]>)
        <COND (<OR <TYPE? .ID MANIAC>
                   <SET ID <DATA-OPEN "PRINTW" .DC .ID>>>
               <SET NUV <DATA-FIND .DC <1 .ID>>>
               <DATA-PUT .DC
                         <1 .ID>
                         <PUT .NUV <+ 1 ,NAMMISC> <CHTYPE .WD WORD>>>
               <DATA-CLOSE .DC .ID>
               .ID)>>

The type of .ID is FIX, not MANIAC (an asylum map), so we open the map. The call to DATA-OPEN returns #FALSE (5), which is returned from DATA-PRINTW to NEW-MESSAGE-ID, which generates the error.

Exploring DATA-OPEN -- this is a long, hairy function, so I'm not quoting it here -- we see that we eventually call:

 <SET RESULT
          <COND (<SET NUV <DATA-FIND .DC <1 .ID> .PMODE?>>
...

PMODE? in this case is T, because we opened the map in PRINTW mode (a writing mode). The result of that call can be seen below (debugging):

<SETG EJS-NUV <DATA-FIND ,SYSTEM-ASYLUM <1 ,EJS-ID> T>>◊
![#WORD *000000000000* #WORD *000000004015* #WORD *000000000030* #WORD *000000000002*!]

EJS-NUV is a global variable I created so as not to disrupt the stack variable NUV.

So that call worked fine. The next thing we do is try to lock the map (for writing). This looks like this:

 (<AND .PMODE? <NOT <DHLOCK <DATA-LOC .DC <1 .ID>>>>>
                        #FALSE (5))

Note the #FALSE (5). That 5 is flagging which of the error exits in DATA-OPEN is happening, and we know that this is what DATA-OPEN is returning to its caller. In any case, the expression:

<DHLOCK <DATA-LOC ,SYSTEM-ASYLUM <1 ,EJS-ID>>>◊

returns:

<DHLOCK <DATA-LOC ,SYSTEM-ASYLUM <1 ,EJS-ID>>>◊
#FALSE ("ALREADY-LOCKED")

Indicating that the map in the asylum is already locked. So this is the problem. The map is not getting unlocked (for some reason). We can update the map once (to allocate the message number 2), but we cannot update it another time (for the next message), because somehow we are not unlocking the map -- we can't write to a locked map.
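To make the symptom concrete, here is a toy Python model of the behaviour described above. It is not a translation of the ASYLUM or MADMAN code. `ToyMap`, `data_open`, `data_close`, and the `unlock_works` switch are hypothetical stand-ins, and `None` plays the role of `#FALSE (5)`.

```python
class ToyMap:
    """Hypothetical stand-in for one map in an asylum: a word plus a hard lock."""
    def __init__(self, value=1):
        self.value = value
        self.locked = False

def data_open(m):
    """Open for writing: fails if the hard lock is already held."""
    if m.locked:
        return None              # stands in for #FALSE (5)
    m.locked = True
    return m

def data_close(m, unlock_works=True):
    """Close the map. With unlock_works=False the lock is leaked."""
    if unlock_works:
        m.locked = False

def new_message_id(m, unlock_works=True):
    """Return the current high ID and store high ID + 1 back in the map."""
    current = m.value
    handle = data_open(m)
    if handle is None:
        raise RuntimeError("CANT-ALLOCATE-NEW-ID")
    handle.value = current + 1
    data_close(handle, unlock_works)
    return current
```

With a working unlock, successive calls return 1, 2, 3 and so on. With the unlock disabled, the first call still returns 1 and stores 2, and the second call fails, which is the same shape as the failure seen with the ToTS SAVE image.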

There is a function called DUNLOCK (by the way, DHLOCK and DUNLOCK are both in MADMAN, not ASYLUM), that is supposed to handle unlocking of either a DHLOCK or a DSLOCK (not sure what the distinction is between the two).

Apparently, it isn't getting called. Now I need to dig into why that is the case.

eswenson1 commented 1 year ago

I obviously don't understand github behavior. All of my angle-bracket ERRET angle-bracket comments in the last post just disappeared.

If you are going to include some funky characters, best to enclose them in backquotes. This will prevent github from "handling" certain characters it considers special. If you want to include a block of code, you can use triple-backquotes to start and end the block.

So for an inline <ERRET T>, I used single backquotes. And for this block:

<DEFINE FOO (X) .X>

I used triple-backquotes.

eswenson1 commented 1 year ago

Unfortunately, the lock/unlock issue is probably going to be more difficult to track down. If we look back at DATA-PRINTW, we see its definition as:

<DEFINE DATA-PRINTW (DC ID WD "AUX" NUV)
        #DECL ((ID) <OR STRING MANIAC FALSE FIX> (DC) ASYLUM (WD) <PRIMTYPE WORD>
               (NUV) <UVECTOR [4 WORD]>)
        <COND (<OR <TYPE? .ID MANIAC>
                   <SET ID <DATA-OPEN "PRINTW" .DC .ID>>>
               <SET NUV <DATA-FIND .DC <1 .ID>>>
               <DATA-PUT .DC
                         <1 .ID>
                         <PUT .NUV <+ 1 ,NAMMISC> <CHTYPE .WD WORD>>>
               <DATA-CLOSE .DC .ID>
               .ID)>>

The DATA-OPEN ends up DHLOCKing the map. But the DATA-CLOSE appears to be calling DUNLOCK, which should unlock it. So this function appears to lock and unlock the map. Why it is NOT getting unlocked is a mystery.
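One generic way to chase an unbalanced lock is to wrap the lock and unlock routines, log every call, and then look for locations that were locked more often than unlocked. The Python sketch below shows the idea only. The lower-case `dhlock` and `dunlock` here are toy stand-ins written for the example, not the MADMAN routines, and `traced` and `unbalanced` are made-up helper names.

```python
import functools

def traced(log, fn):
    """Wrap fn so that each call is appended to log as (name, args, result)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        log.append((fn.__name__, args, result))
        return result
    return wrapper

def unbalanced(log, lock_name, unlock_name):
    """Return the locations that were locked more often than unlocked."""
    held = {}
    for name, args, result in log:
        target = args[0]
        if name == lock_name and result:
            held[target] = held.get(target, 0) + 1
        elif name == unlock_name and result:
            held[target] = held.get(target, 0) - 1
    return sorted(t for t, n in held.items() if n > 0)

# Toy lock table standing in for the real locking primitives.
_locks = set()

def dhlock(loc):
    if loc in _locks:
        return False
    _locks.add(loc)
    return True

def dunlock(loc):
    if loc not in _locks:
        return False
    _locks.discard(loc)
    return True
```

Running the suspect operation with the traced versions and then calling `unbalanced` shows which location never received its unlock, and the log shows the last call made against it.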

eswenson1 commented 1 year ago

I may be muddying the waters a bit here. I'm actually debugging two versions of COMSYS. One is a SAVE image that I got from ToTS (LIBRDR;COMS29 SAVE). That one is the one that has the issue with the message-id.

The one that I built from source, and saved to LIBRDR;COM400 SAVE, is the one with the MPV in NXTPNQ.

So I'm going back to debugging the one I built from source. Running this interactively gives this:

mud55↑K!
MUDDLE 55 IN OPERATION.
LISTENING-AT-LEVEL 1 PROCESS 1
<RESTORE "librdr;com400 save">◊
***** 1554 EST *****
Restored: DSK:LIBRDR;COM400 SAVE
T
<FLOAD "comdat;comsys gocode">◊
"DONE"
<INIT>◊
T
<NXTPNQ>◊

*ERROR*
DANGEROUS-INTERRUPT-NOT-HANDLED
MPV!-INTERRUPTS
#WORD *000000745127*
LISTENING-AT-LEVEL 2 PROCESS 1

NXTPNQ is actually called from BCKDO, which is called from WORK, which is called from the top-level MAKE-RUN. The only apparent requirement before calling NXTPNQ is the calling of INIT, so I did that above.

A stack trace isn't all that helpful here:

0 ERROR [DANGEROUS-INTERRUPT-NOT-HANDLED!-ERRORS MPV!-INTERRUPTS #WORD *000000745127*]
1 EVAL  [<NXTPNQ>]
2 LISTEN        []
TOPLEVEL

This is using my interpreted version of NXTPNQ, as can be seen by this verification:

,NXTPNQ◊
#FUNCTION (("AUX" QEL (PNQ ,PENDING-QUEUE) (MDB ,MDB) (QPC <>) (QP2B <>) (QS ,QUEUE-SPACE) MSG M R) #DECL ((QPC QP2B) <OR LIST
FALSE> (QEL) LIST (PNQ) <LIST ANY> (VALUE) <OR FALSE PNQENTRY> (MSG) <OR FALSE FIX> (M) FIX (R) ANY (QS) SPACE (MDB) MDBVEC) <SET
MSG <COND (<G? <MVMSG .MDB> 0> <MVMSG .MDB>)>> <MAPR <> <FUNCTION (Q QB "AUX" (QE <1 .Q>) QT) #DECL ((Q QB) LIST (QE) PNQENTRY (QT
) <OR FALSE FIX>) <SET QT <QTIM .QE>> <COND (<OR <NOT .QT> <L? .QT <ITIME>>> <COND (<AND .MSG <==? <QMSG .QE> .MSG>> <SET QPC .QB>
) (ELSE <SET QP2B .QB>)> <AND .QPC .QP2B <MAPLEAVE>>) (ELSE <MAPLEAVE>)>> <REST .PNQ> .PNQ> <COND (.QPC <SET QEL <2 .QPC>> <
DEQUEUE .QS .QPC> .QEL) (.QP2B <SET QEL <2 .QP2B>> <COND (<SET R <MREAD <SET M <QMSG .QEL>>>> <SCROUT .M ": Queued message">) (
ELSE <SCROUT ">>>> Error <<<< " .M ": " .R>)> <DEQUEUE .QS .QP2B> .QEL)>)

eswenson1 commented 1 year ago

The expression in that function NXTPNQ that fails is this:

<REST ,PENDING-QUEUE>◊

*ERROR*
DANGEROUS-INTERRUPT-NOT-HANDLED
MPV!-INTERRUPTS
#WORD *000000734576*
LISTENING-AT-LEVEL 3 PROCESS 1

PENDING-QUEUE is a LIST:

<TYPE ,PENDING-QUEUE>◊
LIST

Attempting to get the length of that list yields an MPV:

<LENGTH ,PENDING-QUEUE>◊

*ERROR*
DANGEROUS-INTERRUPT-NOT-HANDLED
MPV!-INTERRUPTS
#WORD *000000733430*

See earlier messages about how PENDING-QUEUE gets set. It seems it is read out of an ASYLUM:

<SETG PENDING-QUEUE <DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-PENDING-QUEUE ,QUEUE-SPACE>>

And the object returned by DATA-AREAD appears to be a "bad" object. DATA-AREAD is part of the ASYLUM package, which I compiled myself. So that calls into question the compiler and/or the sources for ASYLUM (or the underlying MADMAN). I'm going to try to go back to the ToTS binaries for these two packages -- again, I can't use the ToTS FBINs due to pure load failures. But I could generate my own FBINs for these.

jh95468 commented 1 year ago

Can you run everything directly from source -- i.e., all interpreted rather than compiled? The interpreter has lots more error checking and should never encounter MPVs. It would be slower of course, but an emulated PDP-10 is way more powerful than the real PDP-10 all this code used to work on. All of the compiler work was driven by a need to get every last bit of performance out of the CPU and memory, at the cost of making debugging often much more difficult.

Also, the main requirement for a MESSAGE-ID was that it be unique within any particular machine. For testing purposes, you might just temporarily replace the failing get-a-new-ID code with a call to RANDOM, which should guarantee uniqueness good enough for testing. If you're running interpreted you can just edit the relevant function on the fly, change the code to generate a random ID, and continue running.
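As an illustration of the random-ID workaround, here is a small Python sketch rather than Muddle. The name `make_test_id_allocator` and the 30-bit range are arbitrary choices for the example, and the `used` set is an addition that turns "probably unique" into "unique within one run".

```python
import random

def make_test_id_allocator(seed=None, bits=30):
    """Return a function that hands out random, non-repeating positive IDs."""
    rng = random.Random(seed)
    used = set()

    def new_message_id():
        while True:
            candidate = rng.getrandbits(bits) + 1   # +1 keeps IDs positive
            if candidate not in used:               # retry on the rare collision
                used.add(candidate)
                return candidate

    return new_message_id
```

A temporary allocator like this sidesteps the shared counter entirely, so it says nothing about whether the underlying locking works. It only lets the rest of the message path be exercised.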

Jack


eswenson1 commented 1 year ago

I’ll try running it interpreted. I’m guessing that it won’t all fit in memory without loading at least some subset compiled. However, I can try to load the sources that manipulate PENDING-QUEUE and see if I can make progress. The code in ASYLUM and MADMAN is voluminous and complex though and I suspect the issue is there. First, I’m going to try to use some NBINs from ToTS for those two packages rather than those I compiled myself.

While I could hack the message ID generating function, I suspect I’ll just run into another issue with extracting (or inserting) objects from (or into) ASYLUM data files.

First I’d like to see what MUDDLE data structure the PENDING-QUEUE was before it was stored in the ASYLUM and compare it to that which is extracted from the ASYLUM. I’m guessing the issue is some corruption of the ASYLUM due to bad compiled code or peephole optimization.

I should be able to prove that by running the “right” stuff interpreted. But as I said, the COMSYS, ASYLUM, and MADMAN code as a whole, plus all of the runtime required to run it, is very large.

Let me reiterate a point here, because it really adds to the strangeness of it all. As I said earlier, I have tried running a COMSYS SAVE image that I found in ToTS. That one gives the issue with the message ID. The code is all original code — interpreter is from ToTS, SAVE file from ToTS, and pure library from ToTS. If that code ever worked (and I assume it did), it really should still work now — unless there is a bug in the simulator. The message ID code that fails is reading and writing a WORD (FIX) from and to an ASYLUM. It appears to be failing due to the MAP in the ASYLUM’s being locked (and not successfully unlocked).

I have also tried running a COMSYS SAVE image that I created. That one does involve newly-compiled NBINs (but using an original ToTS ECOMP compiler). That image gives the errors with a malformed MDL LIST data structure read out of an ASYLUM. In this case, much of the COMSYS code is newly-compiled, as well as MADMAN and ASYLUM, so there is plenty of opportunity for a bug in the compiler to manifest itself. The compiler I’m using is ECOMP (which was labeled as “experimental”) from DM. (I didn’t build it).

In order to create a new COMSYS SAVE image, I had to recompile the COMSYS code because its FBINs got pure load failures. And the NBINs got some RSUBR format error. So I deleted the offending FBINs and NBINs and re-compiled from the sources. I could well have introduced issues if the compiler was bad.

I could try re-compiling with NPCOMP (also from ToTS). This, presumably, wasn’t an “experimental” compiler. Maybe it will do better.

Final note: I have three (presumably working) compilers: PCOMP, NPCOMP, and ECOMP. I can’t use PCOMP because the SAVE image for it doesn’t include a working package/library system, so attempting to compile a source that invokes USE to load a package fails. ECOMP and NPCOMP don’t have that issue, so those are my available compilers.

jh95468 commented 1 year ago

Debugging, especially ancient stuff, is like Zork. It's a Maze of Twisty Little Passages, All Alike.... Sounds like you're doing the right thing.

The locking behavior does sound like the main obstacle. I don't know anything about MADMAN et al, but perhaps you could write a little code that creates and opens a fresh database, writes a number or list into it, and otherwise does something like what COMSYS does when manipulating MESSAGE-ID or PENDING-QUEUE.

It's just a guess, but I'd expect names like HLOCK or SLOCK refer to "hard" or "soft" locks. A hard lock was used when you expected to change something and didn't want anyone else changing it until you were done. A soft lock was used when you just wanted to read something, and didn't want anyone else to change it until you were done. Only one hard lock could be in place at any time, but there could be any number of soft locks. Perhaps somehow a soft lock got set and never unlocked somewhere inside ASYLUM.
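If that guess about hard and soft locks is right, the semantics would resemble a non-blocking shared/exclusive lock. The Python sketch below models only that guess. `ToyLock` and its method names are invented for the example and say nothing about how MADMAN actually implements DHLOCK, DSLOCK, or DUNLOCK.

```python
class ToyLock:
    """One exclusive ("hard") holder or any number of shared ("soft") holders."""
    def __init__(self):
        self.hard = False
        self.soft = 0

    def slock(self):
        """Take a soft (read) lock; refused while a hard lock is held."""
        if self.hard:
            return False
        self.soft += 1
        return True

    def hlock(self):
        """Take the hard (write) lock; refused while any lock is held."""
        if self.hard or self.soft:
            return False
        self.hard = True
        return True

    def unlock(self):
        """Release whichever lock is held: the hard lock, or one soft lock."""
        if self.hard:
            self.hard = False
            return True
        if self.soft:
            self.soft -= 1
            return True
        return False
```

Under this model a single soft lock that is never released is enough to make every later hard lock attempt fail, which is one way a map could end up permanently unwritable.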

One other observation... "Experimental" could mean a lot of things. Sometimes an experimental compiler, or even Muddle's runtime itself, was just trying out some new clever and often opaque algorithm. But it may also have been a version that tried out some new feature of ITS or some other system component. That happened when, for example, we started doing memory sharing between ITS jobs. So if the experimental compiler (or whatever) doesn't match with the right versions of other stuff, weird behavior is likely.

Very vague memory now, but IIRC Muddle had the ability to "unload" code, at least temporarily, if it was running into memory constraints. I know we did that in CALICO's world of assembly language code, and I think it was also in Muddle. But maybe not. If it is, you wouldn't have to worry about the size of the interpreted code base; it would still run, although more slowly as code gets paged in and out of real memory.

Jack

eswenson1 commented 1 year ago

For the time being, I'm not going to concern myself with the old SAVE file and its issue with NEW-MESSAGE-ID. I'm going to consider it a mystery to be solved at some point in time. I'd rather be able to build COMSYS from sources, create a SAVE file, have that run by SYS;ATSIGN COMSYS, and invokable via DEMSIG -- and, of course, have it work.

I'm going to concentrate on debugging the PENDING-QUEUE issue. While there may be other obstacles, it is clear that resurrecting PENDING-QUEUE from the SYSTEM ASYLUM results in a "bad" object, which causes MPVs when trying to manipulate it. I'm going to find out where PENDING-QUEUE is created (as a presumably valid MDL LIST), and how it is stored in the ASYLUM, and then see if the resurrected version matches the stored version. I'm assuming the bug is in the resurrection (ASYLUM/MADMAN code), but it could already be "bad" when it is stored, and faithfully retrieved in the same "bad" state.
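The store-then-compare check described above amounts to a round-trip test. A minimal Python sketch of the idea follows. `pickle` and a dictionary stand in for the ASYLUM storage layer, and all function names are hypothetical.

```python
import pickle

def first_difference(a, b, path="obj"):
    """Return the path of the first place two nested lists differ, or None."""
    if type(a) is not type(b):
        return path
    if isinstance(a, list):
        if len(a) != len(b):
            return path + ".length"
        for i, (x, y) in enumerate(zip(a, b)):
            diff = first_difference(x, y, f"{path}[{i}]")
            if diff:
                return diff
        return None
    return None if a == b else path

def round_trip(store, key, value, write, read):
    """Write value, read it back, and report where the copy first differs."""
    write(store, key, value)
    return first_difference(value, read(store, key))

def write_ok(store, key, value):
    store[key] = pickle.dumps(value)

def read_ok(store, key):
    return pickle.loads(store[key])

def read_bad(store, key):
    """Simulate a faulty reader that mangles the last element."""
    value = pickle.loads(store[key])
    return value[:-1] + ["LOSE"]
```

Running the same check against both the writer and the reader separately helps decide whether the object is already bad when stored or only goes bad on retrieval.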

I suspect you're right on the DHLOCK and DSLOCK meaning. Again, however, that issue was with an original SAVE file. I have these:

ES   LIBRDR
FREE BLOCKS #0=2089 #1=3327 #2=4618 #3=27416
  2   COM400 SAVE   152   3/14/2023 14:05:44
  3   COMS29 SAVE   135 ! 6/11/1981 13:30:39
  2   COMS30 SAVE   106   10/6/1982 10:44:03
  3   COMS31 SAVE   106 ! 1/27/1983 10:32:32
  2   MAIL87 SAVE   30   3/12/1981 18:08:44
  1   READ80 SAVE   87   3/18/1981 23:06:16

COM400 SAVE is the one I created from recompiled sources -- it gives me the PENDING-QUEUE-related failure. COMS29 SAVE (6/11/1981) is the one that gives me the NEW-MESSAGE-ID (locking/unlocking) failure. Neither COMS30 nor COMS31 run due to pure library incompatibilities. MAIL87 SAVE runs fine -- and is the program to create a piece of mail interactively and have it write to COMSYS;M >. READ80 SAVE is a mail reading program. When you invoke it for the first time, it offers to create an ASYLUM for you (EJS;EJS ASYLUM) in which to store your email. Of course, I have no email in that ASYLUM because I can't get COMSYS to work, so READ80 is not worth thinking about until COMSYS is working again.

I agree with you on the "experimental" front. When I worked on Macsyma and MacLisp, we used to create an "experimental" version that usually ended up being the "next" (new) version after we'd played with it for a while. Sometimes, the experimental version would have issues, and it would get replaced with another experimental version. Eventually, it would become the next version. I only have one ECOMP 55SAVE file and that is dated 5/21/1981. NPCOMP 55SAVE is dated 10/1/1980, and PCOMP 55SAVE is dated 2/27/1980. So PCOMP is the oldest, NPCOMP is next, and ECOMP is the latest we have. But there is no guarantee that ECOMP is any good. That's why I might retry the whole recompilation experiment with NPCOMP if I suspect the compiler's code generation is the issue.

Regarding unloading code, I believe this is the functionality that the pure library provides. The FBIN files have pointers to RSUBRs in the pure library. They are much faster to load than NBINs because there are no RSUBRs in the FBIN itself. The first reference to a function causes the pure mapping code to load the RSUBR in from the pure library. When memory is tight, and the pure code is not executing in any MDL process, it can be thrown away. It will get reloaded from the pure library when called again.
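
A minimal sketch of the load-on-first-reference behaviour described above, written in Python rather than MDL. The names `PureLibrary`, `Stub`, and `reclaim` are hypothetical stand-ins; the real pure-mapping code is not shown in this thread and may work quite differently.

```python
class PureLibrary:
    """Hypothetical stand-in for the pure library: maps names to code bodies."""

    def __init__(self, bodies):
        self._bodies = dict(bodies)   # name -> callable, the "on-disk" copy
        self.loads = 0                # how many times code was mapped in

    def map_in(self, name):
        self.loads += 1
        return self._bodies[name]


class Stub:
    """What an FBIN is assumed to hold: a pointer into the library, not the code."""

    def __init__(self, library, name):
        self.library = library
        self.name = name
        self.code = None      # not resident until the first call
        self.active = 0       # number of calls currently executing

    def __call__(self, *args):
        if self.code is None:                 # first reference, or after reclaim
            self.code = self.library.map_in(self.name)
        self.active += 1
        try:
            return self.code(*args)
        finally:
            self.active -= 1

    def reclaim(self):
        """Drop the resident copy, but only if nothing is executing in it."""
        if self.active == 0 and self.code is not None:
            self.code = None
            return True
        return False


lib = PureLibrary({"double": lambda x: 2 * x})
double = Stub(lib, "double")
first = double(21)       # maps the code in, then runs it
double.reclaim()         # memory is tight: throw the resident copy away
second = double(4)       # transparently mapped in again
```

Here `lib.loads` ends up at 2: one load for the first call and one for the call after the copy was reclaimed.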

Your mention of CALICO leads me to mention that the UI for BATCH is based on CALICO. I have BATCH (the UI) and BATCHN (the daemon) running fine on ES.

eswenson1 commented 1 year ago

I've come up with a simple sequence of steps to reproduce the PENDING-QUEUE problem:

mud55↑K!
MUDDLE 55 IN OPERATION.
LISTENING-AT-LEVEL 1 PROCESS 1
<RESTORE "librdr;com400 save">◊
***** 1339 EST *****
Restored: DSK:LIBRDR;COM400 SAVE
T
<INIT-SPACES>◊
T
<MAKE-DATA-BASE <SET NAME <DATUM "COMSYS-SYSTEM-ASYLUM">>>◊
"DSK:COMDAT;SYSTEM ASYLUM"
<SET ASY <OPEN-DATA-FILE .NAME>>◊
#ASYLUM [#CHANNEL [4 "READ" "SYSTEM" "ASYLUM" "DSK" "COMDAT" "SYSTEM" "ASYLUM" "DSK" "COMDAT" 163 23748404430 <ERROR
END-OF-FILE!-ERRORS> 0 0 0 0 10 ""] 196 197 ![#WORD *000000000000* #WORD *000000000000* #WORD *000000000000* #WORD *000000000000*
#WORD *000000000000* #WORD *000000000000* #WORD *000000000000* #WORD *000000000000*!] ![-1 199 0 0 -1 200 0 0!] 0 ![-1 201 0 0 -1
202 0 0 -1 203 0 0 -1 204 0 0!] [198 -1]]
<SETG DATA-WRITE-WORD <CHTYPE <ITIME> WORD>>◊
#WORD *165164443656*
<DATA-APRINT .ASY ,COMSYS-PENDING-QUEUE ,QUEUE-SPACE <ALIST ,QUEUE-SPACE 0>>◊
#MANIAC 1
<SOPEN>◊
#ASYLUM [#CHANNEL [5 "READ" "SYSTEM" "ASYLUM" "DSK" "COMDAT" "SYSTEM" "ASYLUM" "DSK" "COMDAT" 163 23748404430 <ERROR
END-OF-FILE!-ERRORS> 0 0 0 0 10 ""] 191 192 ![#WORD *000000000000* #WORD *000000000000* #WORD *000000000000* #WORD *000000000000*
#WORD *000000000000* #WORD *000000000000* #WORD *000000000000* #WORD *000000000000*!] ![-1 216 0 0!] 0 ![-1 194 0 0 -1 195 0 0!] [
193 -1]]
<SETG PENDING-QUEUE <DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-PENDING-QUEUE ,QUEUE-SPACE>>◊

*ERROR*
TYPE-MISMATCH
PENDING-QUEUE
<LIST FIX [REST PNQENTRY]>
#LOSE *000000000000*
SETG
LISTENING-AT-LEVEL 2 PROCESS 1
<ERRET>◊

LISTENING-AT-LEVEL 1 PROCESS 1
<DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-PENDING-QUEUE ,QUEUE-SPACE>◊
#LOSE *000000000000*

Most of that is setup. Clearly, the call to DATA-APRINT (to store the queue) succeeds, and returns a MANIAC object. But immediately following, a call to DATA-AREAD (to read back the queue) fails, and returns a #LOSE object. It should return the same thing we stored. We stored a (0) (which is the LIST value of <ALIST ,QUEUE-SPACE 0>).

I can also do this:

<DATA-APRINT ,SYSTEM-ASYLUM ,COMSYS-PENDING-QUEUE ,QUEUE-SPACE 3>◊
#MANIAC 1
<DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-PENDING-QUEUE ,QUEUE-SPACE>◊
#LOSE *000000000000*

So simply storing a MDL FIX (3) in the asylum map and reading it back out returns a #LOSE -- given the above setup. However, with a simpler test case:

mud55↑K!
MUDDLE 55 IN OPERATION.
LISTENING-AT-LEVEL 1 PROCESS 1
<RESTORE "librdr;com400 save">◊
***** 1348 EST *****
Restored: DSK:LIBRDR;COM400 SAVE
T
<INIT-SPACES>◊
T
<MAKE-DATA-BASE <SET NAME <DATUM "COMSYS-SYSTEM-ASYLUM">>>◊
"DSK:COMDAT;SYSTEM ASYLUM"
<SET ASY <OPEN-DATA-FILE .NAME>>◊
#ASYLUM [#CHANNEL [4 "READ" "SYSTEM" "ASYLUM" "DSK" "COMDAT" "SYSTEM" "ASYLUM" "DSK" "COMDAT" 163 23748404430 <ERROR
END-OF-FILE!-ERRORS> 0 0 0 0 10 ""] 196 197 ![#WORD *000000000000* #WORD *000000000000* #WORD *000000000000* #WORD *000000000000*
#WORD *000000000000* #WORD *000000000000* #WORD *000000000000* #WORD *000000000000*!] ![-1 199 0 0 -1 200 0 0!] 0 ![-1 201 0 0 -1
202 0 0 -1 203 0 0 -1 204 0 0!] [198 -1]]
<DATA-APRINT .ASY ,COMSYS-PENDING-QUEUE ,QUEUE-SPACE 42>◊
#MANIAC 1
<DATA-AREAD .ASY ,COMSYS-PENDING-QUEUE ,QUEUE-SPACE>◊
42
,COMSYS-PENDING-QUEUE◊
1
,QUEUE-SPACE◊

        PGS        HIGH WORD             LAST WORD
#PBLOCK [1 #WORD *000000657777* #WORD *000000000000*]
CURRENT LOCATION = #WORD *000000657777*
LOWEST LOCATION  = #WORD *000000656000*
FVC LOCATION     = #WORD *000000000000*
FREE LIST LENGTH = 0
SPEC = #WORD *200000000000*

Note that the FIX (42) is stored and retrieved correctly. I'm not entirely sure what this means but it gives me something to debug.

eswenson1 commented 1 year ago

Aha. I found something:

<DATA-AREAD .ASY 1 ,QUEUE-SPACE>◊
42
<DATA-APRINT .ASY 1 ,QUEUE-SPACE 43>◊
#MANIAC 1
<DATA-AREAD .ASY 1 ,QUEUE-SPACE>◊
#LOSE *000000000000*

After I had successfully written a 42 and read it back, I tried reading it back again (this post), and still got my 42 back. Then, I tried to write out a 43 (which appeared to succeed), but reading it back returned a #LOSE.

I wonder if this is a locking issue: does the second (subsequent) attempt to write fail because the first write never unlocked the space?
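
The write, read, write, read sequence above can be turned into a small repeatable check. The following Python sketch is only a harness shape under stated assumptions: `first_round_trip_failure`, `DictStore`, and `StuckAfterFirstWrite` are hypothetical names, and the broken store merely imitates the 42-then-43 symptom. It is not a model of the real cause.

```python
def first_round_trip_failure(store, key, values):
    """Write each value, then read it back. Return (step, wrote, got) for the
    first mismatch, or None if every round trip succeeds."""
    for step, value in enumerate(values, start=1):
        store.write(key, value)
        got = store.read(key)
        if got != value:
            return (step, value, got)
    return None


class DictStore:
    """Well-behaved in-memory store used as a control."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data[key]


class StuckAfterFirstWrite(DictStore):
    """Deliberately broken store: any rewrite of a key reads back as a zero
    sentinel. It only imitates the observed symptom."""

    LOSE = "#LOSE *000000000000*"

    def __init__(self):
        super().__init__()
        self._written = set()

    def write(self, key, value):
        if key in self._written:
            value = self.LOSE          # second and later writes are lost
        self._written.add(key)
        super().write(key, value)


good = first_round_trip_failure(DictStore(), 1, [42, 43])
bad = first_round_trip_failure(StuckAfterFirstWrite(), 1, [42, 43])
```

With the control store the result is `None`; with the broken store the harness reports the failure at step 2, which is the shape of the report one would want when comparing different builds.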

jh95468 commented 1 year ago

Any way to run just the last test, with the Maniac/Asylum code running interpreted? IIRC, no function call should ever return a LOSE. That means that someone returned a zero instead of an actual error indicator. If you can run it interpreted, it might give more info.
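
To illustrate why a returned zero shows up as a LOSE, here is a Python sketch. It assumes that LOSE is the type with code 0 in MDL's type table and FIX is code 1; neither is confirmed anywhere in this thread. `TYPE_NAMES` and `show` are made-up names.

```python
# Assumed subset of the MDL type table; only "code 0 = LOSE" matters here.
TYPE_NAMES = {0: "LOSE", 1: "FIX"}


def show(type_code, value_word):
    """Render a (type code, 36-bit value word) pair roughly the way MDL prints it."""
    name = TYPE_NAMES.get(type_code, "TYPE-%o" % type_code)
    if name == "FIX":
        # Interpret the word as a signed 36-bit two's complement integer.
        if value_word & (1 << 35):
            value_word -= 1 << 36
        return str(value_word)
    return "#%s *%012o*" % (name, value_word)


zeroed = show(0, 0)        # an all-zero type/value pair
forty_two = show(1, 42)    # a properly typed FIX
```

Under these assumptions an all-zero pair renders as `#LOSE *000000000000*`, the same text seen in the transcripts above.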

eswenson1 commented 1 year ago

Yes, I did manage to load up a COMSYS with ASYLUM and MADMAN packages running interpreted.

It, too, fails in NXTPNQ, but the error thrown by NXTPNQ is:

*ERROR*
TYPE-MISMATCH
PNQ
<LIST ANY>
T
EVAL
LISTENING-AT-LEVEL 2 PROCESS 1

PNQ is a local variable set to the value of ,PENDING-QUEUE. PENDING-QUEUE was initially set to be the LIST (0) -- this presumably means there is no message in the queue. However, when the value is read out of the ASYLUM, it is, NOW, coming back T, rather than (0). T isn't a valid LIST, hence the error.

It may well have been doing that before, but the compiler was treating the value as a LIST of three elements, and failing badly when the actual value was T.

I have no idea why this value is T. I guess I'll have to debug through DATA-APRINT and DATA-AREAD again (no fun because it is complicated). The code is using paging -- reading and writing pages and computing places in the pages to read/write. I have no clue what it is actually doing -- there are lots of levels to it.

eswenson1 commented 1 year ago

Please ignore the previous message. I believe the above resulted because I ran both DB-SETUP and then tried to run the daemon in the same MDL instance.

So I created a SAVE file by loading up MADMAN and ASYLUM interpreted, then loaded up COMSYS.

When I start up my SAVE image, I can run DB-SETUP, which will set up the ASYLUM databases as needed for COMSYS. Then, I need to kill that MDL instance, and load up another one with the SAVE file. Here, I can run:

<FLOAD "comdat;comsys gocode">◊
"DONE"
<INIT>◊
T
<SETG DEBUG-DEMON? T>◊
T
<MAKE-RUN>◊

to start up the main daemon loop. It dies trying to process the first message (COMSYS;M 1), as the old COMS29 SAVE version did. It doesn't like the contents of the file:

<MAKE-RUN>◊

±
************************************************************

***** 1626 EST *****
Beginning Demon Run, Version: -1, Thursday, 16 Mar 19123 16:26 EST
>>>> Error During Request File Input, Args -- [#FRAME ERROR TYPE-MISMATCH MAP <OR FALSE <VECTOR [REST FIX STRING]>> [#LOSE
*000000000000* #LOSE *000000000000*] SET] <<<<
0 RUNINT        [#FUNCTION (("TUPLE" STUFF "AUX" IACT "ACT" ERRACT)
                            #DECL ((STUFF) TUPLE (INPACT IACT ERRACT) ACTIVATION (LERR\ !-INTERRUPTS) <SPECIAL FRAME>)
                            <COND (<AND <ASSIGNED? INPACT> <LEGAL? <SET IACT .INPACT>>>
                                   <SCROUT ">>>> Error During Request File Input, Args -- " .STUFF " <<<<">
                                   <SET LERR\ !-INTERRUPTS <CHTYPE .ERRACT FRAME>>
                                   <FRAMES 20>
                                   <FRATM>
                                   <BUFOUT .OUTCHAN>
                                   <DISMISS #FALSE ("ERROR IN FILE INPUT") .IACT>)
                                  (ELSE T)>)
                 #FRAME ERROR
                 TYPE-MISMATCH!-ERRORS
                 MAP!-IM-READ
                 <OR FALSE <VECTOR [REST FIX STRING]>>
                 [#LOSE *000000000000* #LOSE *000000000000*]
                 SET]
1 INTERRUPT     [ERROR!-INTERRUPTS
                 #FRAME ERROR
                 TYPE-MISMATCH!-ERRORS
                 MAP!-IM-READ
                 <OR FALSE <VECTOR [REST FIX STRING]>>
                 [#LOSE *000000000000* #LOSE *000000000000*]
                 SET]
2 ERROR [TYPE-MISMATCH!-ERRORS MAP!-IM-READ <OR FALSE <VECTOR [REST FIX STRING]>> [#LOSE *000000000000* #LOSE *000000000000*] SET]
3 SET   [MAP!-IM-READ [#LOSE *000000000000* #LOSE *000000000000*]]
4 EVAL  [<SET MAP!-IM-READ <DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-MSG-MAP ,SCRATCH-SPACE>>]
5 COND  [((<SET MAP!-IM-READ <DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-MSG-MAP ,SCRATCH-SPACE>>
           <AND ,READ-WRITE? <SCROUT "Read Msg Map">>
           <SET MAP!-IM-READ
                <MAPF ,VECTOR
                      <FUNCTION (X!-IM-READ)
                              #DECL ((X!-IM-READ) <OR FIX STRING>)
                              <COND (<TYPE? .X!-IM-READ FIX> <MAPRET .X!-IM-READ>) (ELSE <MAPRET <STRING .X!-IM-READ>>)>>
                      .MAP!-IM-READ>>
           <SETG COMSYS-ASYLUM-MAP .MAP!-IM-READ>)
          (<ERROR MAP-DISAPPEARED!-ERRORS MSG-LOC .MAP!-IM-READ>))]

This looks like now we're having an issue reading in the message map. We're calling DATA-AREAD and attempting to set the local variable MAP!-IM-READ, which is declared <OR FALSE <VECTOR [REST FIX STRING]>>, but the value we've gotten back is [#LOSE *000000000000* #LOSE *000000000000*].
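
The declaration check that fails here can be sketched in Python. The modelling choices are assumptions: MDL FALSE becomes Python `False`, a VECTOR becomes a list, FIX is `int`, STRING is `str`, and `[REST FIX STRING]` is taken to mean whole FIX, STRING pairs to the end of the vector. Whether MDL accepts a trailing partial repetition is not settled here; this sketch rejects it. The sample map contents are invented.

```python
def matches_map_decl(value):
    """Check a value against <OR FALSE <VECTOR [REST FIX STRING]>> (as modelled)."""
    if value is False:
        return True
    if not isinstance(value, list) or len(value) % 2 != 0:
        return False
    for i in range(0, len(value), 2):
        fix, string = value[i], value[i + 1]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(fix, bool) or not isinstance(fix, int):
            return False
        if not isinstance(string, str):
            return False
    return True


class Lose:
    """Placeholder for the #LOSE objects seen in the error frame."""


ok = matches_map_decl([1, "first entry", 2, "second entry"])   # invented data
empty = matches_map_decl([])
bad = matches_map_decl([Lose(), Lose()])
```

A two-element vector of placeholder LOSE objects fails the check, mirroring the TYPE-MISMATCH raised by the SET.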

Note that now, DATA-AREAD (part of ASYLUM), is running interpreted.

This call is returning garbage:

<DATA-AREAD ,SYSTEM-ASYLUM ,COMSYS-MSG-MAP ,SCRATCH-SPACE>◊
[#LOSE *000000000000* #LOSE *000000000000*]
,DATA-AREAD◊
#FUNCTION ((DC!-IASYLUM IDX!-IASYLUM SPC "OPTIONAL" (SPD 3) (CHN T) "AUX" ID DAT) #DECL ((DC!-IASYLUM) ASYLUM (ID) <OR FALSE
MANIAC> (SPD) FIX (IDX!-IASYLUM) <OR STRING FIX> (DAT) ANY (SPC) SPACE (CHN) <OR 'T FALSE>) <COND (<SET ID <DATA-OPEN "READ" .
DC!-IASYLUM .IDX!-IASYLUM>> <COND (<SET DAT <DATA-IREAD .DC!-IASYLUM .ID .SPC .SPD .CHN>> <DATA-CLOSE .DC!-IASYLUM .ID> .DAT)>)>)

And if I debug further, I see that the AREAD MADMAN function (also running interpreted) is the culprit:

<AREAD ,SCRATCH-SPACE ,SYSTEM-ASYLUM <CHTYPE <3 .ID> FIX> 3 T>◊
[#LOSE *000000000000* #LOSE *000000000000*]

Unfortunately, this code is hopelessly complicated. Wanna help me debug it on ES?

eswenson1 commented 1 year ago

This intro in MADMAN;MADMAN DOC is really encouraging [NOT]:

MADMAN User Documentation

    This document is, by intention, incomplete in that it provides
the common user with enough information to use the MADMAN package
in its most important functions.  In fact, the new COMSYS demon, which
uses MADMAN and ASYLUM extensively, uses only those primitives within
the first half of this document.  More complete documentation of
internal routines, internal structure, etc. can be found in the file
MADMAN;MADMAN INTERN.  However, being caught reading said document
is grounds for divorce in 38 states and is sufficient for committment [SIC]
to state psychiatric institutions in 45.

jh95468 commented 1 year ago

I can help debug a bit, but all I remember about the memory sharing et al is that it was not only complex but also relied on the particular features of the DM memory/paging hardware. I wonder if there's some problem in the PDP-10 emulator? Perhaps it doesn't quite do what the Muddle code expects? Or doesn't accurately mimic the behavior of DM's hardware? Have you found anything other than COMSYS that also uses MADMAN but seems to work OK?

MADMAN is appropriately named...

eswenson1 commented 1 year ago

I'm doing some experiments with MADMAN and ASYLUM now (completely independently of COMSYS). I've found a failure case simply using ASYLUM (which calls MADMAN), when the "output" or "input" is an ASYLUM. (When I use MADMAN to read and write objects to regular files, everything works fine.) MADMAN can do reads and writes to either files or ASYLUMs. In the latter case, I'm seeing the same failures as COMSYS is. So there is definitely an issue with MADMAN when it reads from and writes to ASYLUMs. I'll post my findings here when I have a sense of what's wrong.

This shows the failure:

<DATA-APRINT .ASY 2 .SPC "foo">◊
#MANIAC 2
<DATA-AREAD .ASY 2 .SPC>◊
T

The value written to the MADMAN space in an ASYLUM is the string "foo". The value retrieved from that same space is T. That's not correct. It also corresponds to what is happening in COMSYS.

Note, the MADMAN;MADMAN DOC and MADMAN;ASYLUM DOC files have very useful information in them.

eswenson1 commented 1 year ago

I compared DM's ITS configuration with that of ES and note the following differences that MAY be relevant (don't know what these mean, so don't KNOW that they are relevant):

DEFOPT SWBLK==0     ;1 => SWAP BLOCKING, 0 => PRIVILEGED USER
DEFOPT PAGPRE==0    ;NO PAGE-IN PREEMPTION

Both of those values are set to 1 on ES. @larsbrinkhoff or @jh95468, do you know what effect these differences would have? Is there any possibility that they must be set to 0 for MADMAN/ASYLUM to work? (That code is definitely paging pages in and out of memory and flushing them.)

larsbrinkhoff commented 1 year ago

I do not know, and I'm pretty sure they haven't been tried at all this millennium. From the looks of it, they tweak the swapping/paging algorithms. It would be an interesting exercise to try to build an ITS as close to DM as possible and see what happens.

jh95468 commented 1 year ago

I don't remember those config variables either, so they may have come after my time at DM. I do remember that we did a lot of tweaking of the memory and paging mechanisms to accommodate large programs with too little memory. This was especially critical with the (big) shared library. I recall you could have a subroutine A that called subroutine B that called subroutine C, and subroutine B could get swapped out while the program was still executing in subroutine C. When subroutine C returned to subroutine B, B would get swapped back in to memory on the fly. Without that reloading, C would get an MPV when it tried to POPJ back to B's address space which didn't exist then.

I don't remember if that had anything to do with "preemption" or "swap blocking", but it might have.

I think the key right now is that little DATA-APRINT followed by DATA-AREAD, which reliably fails. It may be returning T incorrectly. Or perhaps "foo" was somehow stored incorrectly as T for AREAD to then read back. If it's possible, it may be worthwhile to look at the underlying actual data file after the APRINT and see if it contains the string "foo". Must have been easy to do that in DDT, but I've been linuxized for so long that all I can think of is grep.....
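
A grep-like search over 36-bit words can be sketched in Python. This assumes the string is stored as ordinary PDP-10 packed ASCII (five 7-bit characters per word, left-justified, low-order bit zero) and starts on a word boundary; whether MADMAN's binary format actually does this is not established in this thread. `pack_ascii`, `find_words`, and the sample page contents are made up for illustration.

```python
def pack_ascii(text):
    """Pack 7-bit ASCII into 36-bit words, five characters per word,
    left-justified, with the low-order bit left as zero."""
    words = []
    for i in range(0, len(text), 5):
        chunk = text[i:i + 5].ljust(5, "\0")   # pad the last word with NULs
        word = 0
        for ch in chunk:
            word = (word << 7) | (ord(ch) & 0o177)
        words.append(word << 1)
    return words


def find_words(haystack, needle):
    """Return the indices where the word sequence `needle` occurs in `haystack`."""
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1) if haystack[i:i + n] == needle]


foo = pack_ascii("foo")
page = [0, 0o777777000000] + foo + [0]     # invented page contents for the demo
hits = find_words(page, foo)
```

Under that packing, "foo" becomes the single word `633375700000` (octal). A string that is not word-aligned, or that shares its last word with other characters, would not be found by this exact-word match.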

eswenson1 commented 1 year ago

> I do not know, and I'm pretty sure they haven't been tried at all this millennium. From the looks of it, they tweak the swapping/paging algorithms. It would be an interesting exercise to try to build an ITS as close to DM as possible and see what happens.

I will build a copy of ES that has these enabled and see a) if it builds and runs, and b) if it has any impact on the problem at hand. Of course, ES is a KS, and not a KA like DM was, so there may be issues with these options and KS. If it doesn't work, I'll build a KA, try the test case, and assuming it fails, rebuild with those options and see if there is any change.

eswenson1 commented 1 year ago

> I think the key right now is that little DATA-APRINT followed by DATA-AREAD, which reliably fails. It may be returning T incorrectly. Or perhaps "foo" was somehow stored incorrectly as T for AREAD to then read back. If it's possible, it may be worthwhile to look at the underlying actual data file after the APRINT and see if it contains the string "foo". Must have been easy to do that in DDT, but I've been linuxized for so long that all I can think of is grep.....

I'm going to come up with a very simple test case, based on the MADMAN and ASYLUM documentation, and then try it with various versions of MADMAN and ASYLUM. I've been running them both interpreted to rule out compiler issues. There is a primitive, ALOCK, that is written in MDL assembler. I'm unable, of course, to run that interpreted. I'm also unable to reassemble it, so I'm running an old DM binary.

The issue with re-assembling MADMAN;ALOCK 49 is that it uses some types (e.g. ASYLUM) and some globals that I can't figure out how to define prior to the assembly. I only have the command line program, ASSEM, working. When I try to load up a compiler and load in ASSEM (e.g. <USE "ASSEM">), I run into missing libraries, for which I can't find copies in ToTS. If I were able to do a <USE "ASSEM">, of course, then I could set up the pre-conditions required before running FILE-ASSEMBLE to perform the assembly. I would love to get ASSEM running in a MDL (or compiler) image.

I did perform the experiment once of doing a MADMAN APRINT and then examining the output file to see if the string I wrote to it showed up -- it did. But that was using a file as output, rather than an ASYLUM. I'll try the same thing with an ASYLUM (I can create a small, 1 data page ASYLUM and use DDT to search for the string). I'll report on the results.

When I did debugging before, it wasn't the APRINT (or DATA-APRINT) that I thought was the issue; it was the DATA-OPEN and DATA-CLOSE that are automatically done by DATA-AREAD. I think it was the DATA-CLOSE that was failing, causing the DATA-AREAD to fail. But I'll retry all these experiments and report on the results.
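
The control flow of DATA-AREAD, as its #FUNCTION definition was printed earlier in this thread, can be transliterated into Python to make the open/read/close ordering easier to follow. This is a reading of that printed definition only: MDL FALSE is modelled as Python `False`, the three primitives are passed in as parameters so stand-ins can be used, and `trace` with its fake primitives is hypothetical test scaffolding.

```python
def data_aread(dc, idx, spc, data_open, data_iread, data_close, spd=3, chn=True):
    """Transliteration of the printed DATA-AREAD: open, read, close, return."""
    ident = data_open("READ", dc, idx)
    if ident is False:
        return False
    dat = data_iread(dc, ident, spc, spd, chn)
    if dat is False:
        return False          # DATA-CLOSE is never reached on this path
    data_close(dc, ident)     # result discarded, as in the MDL definition
    return dat


def trace(iread_result):
    """Run data_aread with fake primitives and record which ones were called."""
    calls = []

    def fake_open(mode, dc, idx):
        calls.append("open")
        return ("MANIAC", idx)

    def fake_iread(dc, ident, spc, spd, chn):
        calls.append("iread")
        return iread_result

    def fake_close(dc, ident):
        calls.append("close")
        return False          # even a "failed" close does not change the result

    return data_aread("ASY", 2, "SPC", fake_open, fake_iread, fake_close), calls


garbage, garbage_calls = trace("T")      # read yields a non-FALSE garbage value
failed, failed_calls = trace(False)      # read yields FALSE
```

Read this way, any non-FALSE value from the read step (including a wrong one such as T) is passed straight back to the caller after the close, and a FALSE from the read step skips the close entirely.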

eswenson1 commented 1 year ago

Here is a simple example, using MADMAN only:

mud55↑K
MUDDLE 55 IN OPERATION.
LISTENING-AT-LEVEL 1 PROCESS 1
<RESTORE "ejs4;comsys 55save">◊
"RESTORED"
<SET SPC <AFIND 1>>◊

        PGS        HIGH WORD             LAST WORD
#PBLOCK [1 #WORD *000000661777* #WORD *000000000000*]
CURRENT LOCATION = #WORD *000000661777*
LOWEST LOCATION  = #WORD *000000660000*
FVC LOCATION     = #WORD *000000000000*
FREE LIST LENGTH = 0
<SET STR <ASTRING .SPC "foo">>◊
"foo"
<SET FW <OPEN "PRINTB" "ejs4;madman file">>◊
#CHANNEL [4 "PRINTB" "MADMAN" "file" "DSK" "EJS4" "MADMAN" "FILE" "DSK" "EJS4" 227 23748404489 80 0 0 0 0 10 ""]
<APRINT .SPC .STR .FW>◊
#CHANNEL [4 "PRINTB" "MADMAN" "file" "DSK" "EJS4" "MADMAN" "FILE" "DSK" "EJS4" 227 23748404489 80 0 0 0 0 10 ""]
<CLOSE .FW>◊
#CHANNEL [0 "PRINTB" "MADMAN" "file" "DSK" "EJS4" "MADMAN" "FILE" "DSK" "EJS4" 227 23085704747 80 0 0 0 0 10 ""]

<SET FR <OPEN "READB" "ejs4;madman file">>◊
#CHANNEL [4 "READB" "MADMAN" "file" "DSK" "EJS4" "MADMAN" "FILE" "DSK" "EJS4" 163 23748404430 <ERROR END-OF-FILE!-ERRORS> 0 0 0 0
10 ""]
<SET X <AREAD .SPC .FR>>◊
"foo"
<CLOSE .FR>◊
#CHANNEL [0 "READB" "MADMAN" "file" "DSK" "EJS4" "MADMAN" "FILE" "DSK" "EJS4" 163 23085704747 <ERROR END-OF-FILE!-ERRORS> 0 0 0 0
10 ""]

This shows that when a regular ITS file is used as the destination for an APRINT, an AREAD will read the object back in successfully.

ASYLUM is not involved here, just a regular file.