Automated backup.

larsbrinkhoff commented 11 months ago

This is a program to make an unattended backup. It runs DUMP as an inferior, mounts a remote tape, and runs an incremental dump. The tape number is determined as one more than the highest number in the tape database. When the backup has finished, it leaves a log in DRAGON;BACKUP LOG.
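The tape-numbering rule can be sketched in a few lines. This is only an illustration of "one more than the highest number": the real program reads the ITS tape database, whose format is not shown in this thread, so the list of integers and the name `next_tape_number` are assumptions made for the example.

```python
def next_tape_number(existing_numbers):
    """Return one more than the highest tape number seen so far.

    `existing_numbers` stands in for the tape database (hypothetical
    representation: a plain iterable of integers).
    """
    # Assumption: with an empty database, numbering starts at 1.
    return max(existing_numbers, default=0) + 1


print(next_tape_number([46, 47, 48]))  # prints 49
```

With tapes 46 through 48 already recorded, the next incremental dump would be written as tape 49.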

The program is installed on TT and HX as DRAGON;WEEKLY BACKUP. Anyone else?

larsbrinkhoff commented 11 months ago

This may look unfamiliar, but is not a typo: DEFINE LOOK [ZZZ

The bracket gives the ZZZ dummy a "balanced" bindclass. It means the macro argument must have balanced brackets, if it has any, and the brackets are part of the argument. This is needed because the LOOK macro is used either with a plain symbol or with a [] literal. Without the [ bindclass, the brackets would be taken to enclose the macro argument without being part of it.
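The difference can be modelled with a toy function. This is not MIDAS and does not reproduce its parser; it only mirrors the two behaviours described in this comment. The function name `macro_argument` and the flag `balanced_bindclass` are made up for the illustration.

```python
def macro_argument(text, balanced_bindclass):
    """Toy model of how a bracketed macro argument is handed to a macro.

    balanced_bindclass=True  -> brackets must balance and are kept.
    balanced_bindclass=False -> brackets only delimit the argument.
    """
    if not text.startswith("["):
        return text  # a plain symbol is passed unchanged either way

    depth = 0
    for i, ch in enumerate(text):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                break  # i is now the index of the matching close bracket
    else:
        raise ValueError("unbalanced brackets in macro argument")

    bracketed = text[: i + 1]
    return bracketed if balanced_bindclass else bracketed[1:-1]


print(macro_argument("FOO", True))              # FOO
print(macro_argument("[ASCIZ /X/]", True))      # [ASCIZ /X/]
print(macro_argument("[ASCIZ /X/]", False))     # ASCIZ /X/
```

In this model, the balanced form lets the macro body receive a complete [] literal, while the plain-symbol case is unaffected.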

eswenson1 commented 11 months ago

Thanks. I was looking at that and had no clue what was going on.

larsbrinkhoff commented 11 months ago

A word of warning. The auto backup program uses core links for communication. If everything goes well, they go away after the program completes. However, if something goes wrong, there's a chance there will be data remaining in the core links. This may interfere with a future run.

It's possible to view active core links by making a file listing of any of the core link devices, e.g. CLO:. See SYSDOC;CLO > for details. Often it's possible to empty core links by typing e.g. ^R CLO:BACKUP;DUMP TYO. If there's nothing there, DDT will hang, which is remedied with ^G.

It should be possible to delete core links, but the SYSDOC file warns there is a bug. I seem to have hit this bug, and I had to clear all the core link data structures by runtime patching ITS.

larsbrinkhoff commented 11 months ago

I wanted to empty the core links before use, but I don't know how! .IOT on an empty file just hangs. Anyone know?

eswenson1 commented 11 months ago

I tried running BACKUP on ES and it seems to have gotten stuck, with the log file locked, of course, so I can't see why. PEEK says:

  7 EJS    HACTRN LARS       HANG    >        30   9   0%
 17 EJS    EMACS  EJS        10!0    <        75  27   0%          REALTM
 20 EJS    BACKUP EJS        CLOI    T11       1   0   0%
 21 EJS    DUMP   EAK        CLOI    <        24   1   0%

The fact that it is doing a CLOI, and that the SNAME of DUMP is EAK, suggests it is prompting for something after having gotten only as far as the EAK directory in the dump.

PEEK details show:

 Ch Idx Uname  Jname Mode Bks+Wds   Rd%  Pk  File Name
  1  21 EJS    DUMP   R     4+0     16%  3   EAK; EAKPUR 314
  0  20 EJS    BACKUP W     0+475        2   DRAGON; BACKUP LOG

Chaos network connections:
Idx Usr Uname  Jname  State  Ibf Pbf Nos Ack R Win T Foreign Addr  Flag
 33  21 EJS    DUMP   OPEN    15   0  13   0 15   13 BRIDGE 75230
24 buffers, 5 of which are free.

rtape shows this:

Peer 05460: Read record: 40 octets
Peer 05460: Read mark
Peer 05460: Read continuous records
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
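When chasing record-length problems, it can help to tally the record sizes in an rtape log like the one above. The following is a minimal sketch that assumes the log lines look exactly as pasted here ("Read record: N octets"); the helper name `record_sizes` is invented for this example and is not part of rtape.

```python
import re
from collections import Counter

# Matches lines such as "Peer 05460: Read record: 5120 octets".
RECORD_RE = re.compile(r"Read record: (\d+) octets")


def record_sizes(log_text):
    """Count how many records of each size appear in an rtape log."""
    return Counter(int(m.group(1)) for m in RECORD_RE.finditer(log_text))


sample = """\
Peer 05460: Read record: 40 octets
Peer 05460: Read mark
Peer 05460: Read continuous records
Peer 05460: Read record: 5120 octets
Peer 05460: Read record: 5120 octets
"""

for size, count in sorted(record_sizes(sample).items()):
    print(f"{size:5d} octets x {count}")
```

Run over the full log above, this would report one 40-octet record followed by fourteen 5120-octet records, so nothing unusual shows up on the rtape side before DUMP stops reading.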

Stopping the DUMP job shows that it is here:

input/   .IOT 1,4
INPUT+1/   JUMPL 4,INPUT1
INPUT+2/   .IOT 3,4
INPUT+3/   POPJ 17,
INPUT1/   .CALL 701 (OPEN)
INPUT1+1/   .LOSE 1400
INPUT1+2/   JRST INPUT

At INPUT.

eswenson1 commented 11 months ago

I killed the BACKUP job and looked at DRAGON;BACKUP LOG. It ended like this:

_DUMP I
TAPE NO=48
 REEL      0 FIRST USER =  %SYS   LAST USER =  _MSGS_

REWINDING
CHECKING INCREMENTAL DUMPTAPE NO     48 CREATION DATE  231104
REEL NO      0 OF INCREMENTAL DUMP
Remote-Tape protocol error--Record-stream input buffer overflow--record too long ?
_
ES>

So it got the error we frequently get. I'm running a cbridge built from the HEAD of the cleaning-windows branch of chaosnet-bridge. And I'm running with an rtape built from the HEAD of the lars/rtape branch of chaosnet-tools.

So I think I'm running with the latest everything.

larsbrinkhoff commented 11 months ago

Rtape is now on the master branch of chaosnet-tools, but there's no change regarding packet handling. I have run complete "DUMP I" backups many times during testing, and I have not seen this error lately. Your host is Linux, right? I'm on some old software: Ubuntu 16/18, Linux 4.15, glibc 2.23/2.27.

eswenson1 commented 11 months ago

> Rtape is now on the master branch of chaosnet-tools, but there's no change regarding packet handling. I have run complete "DUMP I" backups many times during testing, and I have not seen this error lately. Your host is Linux, right? I'm on some old software: Ubuntu 16/18, Linux 4.15, glibc 2.23/2.27.

I compared the branch I used with the master branch and there was nothing significantly different. I haven’t managed to complete a BACKUP on ES yet, although I’ve tried several times. It always fails in the same way. I’m running Ubuntu 20.04 on the Linode that hosts ES. I can try it on my Ubuntu 22.04 laptop and report the results, but all three of my ITS systems there have tiny file systems compared to ES. They also all run pdp10-k* simulators rather than KLH10, which ES uses.

larsbrinkhoff commented 11 months ago

Ah, KLH10. I will test that.

eswenson1 commented 11 months ago

I ran BACKUP again on ES and got the same error:

_REMOTE
TAPE SERVER HOST=5401
DRIVE=i231104-49.dump
READ-ONLY? N
REMOTE TAPE REWOUND
_DUMP I
TAPE NO=49
 REEL      0 FIRST USER =  %SYS   LAST USER =  _MSGS_

REWINDING
CHECKING INCREMENTAL DUMPTAPE NO     49 CREATION DATE  231104
REEL NO      0 OF INCREMENTAL DUMP
Remote-Tape protocol error--record type other than data, read-file-mark, or status ?
_

This is the same read error I get when I list a dump.

Note: this time, I used the master branch of chaosnet-tools for rtape. So this still doesn't work for me.

If I skip BACKUP and use DUMP directly, I get the same problem while LISTing the tape that BACKUP successfully created. So perhaps we should go ahead and merge this anyway, and continue to track down the problem with RTAPE and CBRIDGE.

larsbrinkhoff commented 11 months ago

Right, it's not likely this is a problem with the BACKUP program, but rather rtape.c, cbridge, or KLH10. I don't have KLH10 set up yet. Eric, have you tested with pdp10-ka? That's what I have been using, and I have not seen the protocol errors for quite a while.

eswenson1 commented 11 months ago

Are you going to address any of my review comments/suggestions? If not, I can approve and we can merge.

eswenson1 commented 11 months ago

I took the incremental backup tape I created with BACKUP on ES and moved it over to the host where EXA (ITS under pdp10-ka) lives. I used DUMP/LIST to list that tape. I did not get the error I reported doing the ICHECK or DUMP/LIST on ES (klh10) on the same tape.

So it does look like the issue is KLH10-related. Please try to run under a KLH10 ITS and see if you can find the issue.

larsbrinkhoff commented 11 months ago

I'm good to go. And I'm building ITS for KLH10 now.

PDP-10 / its

Automated backup. #2254