PDP-10 / klh10

Community maintained version of Kenneth L. Harrenstien's PDP-10 emulator.
Other
60 stars 8 forks source link

Tape read failure #4

Open larsbrinkhoff opened 7 years ago

larsbrinkhoff commented 7 years ago

There seems to be some problem reading from tapes. This is an extract from my configuration file (based on run/ksits/klh10-pi.ini):

devdef mta0 rh1.0 tm02 fmtr=tm03 type=tu45
devmo mta0 minsys.tape
load @.nsalv-260-u
go

minsys.tape is a tape image created with itstar. It works fine with SIMH. @.nsalv-260-u is the ITS disk salvager from run/ksits.

When I type go and then mark<ESC>g, the disk formatting procedure works ok. (Answer 0, y, 0, n, 3000, foobar.)

But next, if I type tran<ESC>g to restore files from a tape, things sometimes go badly. (Answer 0, y.)

On some of my host machines, the start of a file is found, but then I get

Tape offline or controller not ready
*** ERROR *** THE SYSTEM MAY NOT BE BROUGHT UP

On some other host machines, sometimes all files are restored, or the error appears in the middle of the tape.

If we look into the source code of NSALV 260, we see

MREAD2: MOVEI B,<%TMD16_8.>+<%TMFCD_4.>+TMBTS
        IOWRI B,%TMTC           ;Tell TM03
        MOVEI B,%TMNOP          ;You may need this to set status values
        IOWRI B,%TMCS1          ; but I'm not really sure
        IODRI B,%TMCS1
        TRNE B,%TM1GO           ;Wait till done
         JRST .-2
        IODRI B,%TMFS           ;Get status
        TRNE B,%TMSOL           ;On line?
         TRNN B,%TMSFR          ;Formatter ready?
          LOSE "~&Tape offline or controller not ready"

So presumably, the tape controller is ready, but the tape formatter is not.

Rhialto commented 7 years ago

Does the problem occur on faster machines? I'm asking because in the past there has already been a problem with that (with other devices though) with a sort of workaround, as in commit 79e1a9e89855a410b9bed65a0947d919c267abb6 . Although, on the other hand, the network, disk and tape use a separate process and InterProcessCommunication to handle the work, so this may work out differently.

larsbrinkhoff commented 7 years ago

It may be worst on my fastest machine.

I rebuilt KLH10 with those changed options, but I don't think it made any difference.

Using NSALV to restore files works very reliably in SIMH, so I'll try to add some debug logging to both emulators to see what the difference is.

larsbrinkhoff commented 7 years ago

Looking at the TM03 tape controller right before the failure. This is from KLH10 with debug=1 passed to the mta0 devdef.

Tape #1
.;DSKDMP RP06
[tm03_rdreg: r1/ 14400]
[tm03_wrreg: r11/ 2000 = 2000]
[tm03_wrreg: r0/ 4070 = 1]
[tm_cmdxct: 01]
[tm03_rdreg: r0/ 4000]
[tm03_rdreg: r1/ 14400]
Tape offline or controller not ready
*** ERROR *** THE SYSTEM MAY NOT BE BROUGHT UP

The last rdreg is interesting. It has the "tape online" bit set, but the "formatter ready" bit is off. Compare with SIMH:

Tape #1
.;DSKDMP RP06
TU read MTFS(r5): 10650
TU write MTTC(r15) 2000
TU write MTCS1(r0) 1
TU read MTCS1(r0): 4200
TU read MTFS(r5): 10650

There is a 6 in the last line where there was a 4 in KLH10. That't the "formatter ready" bit.

larsbrinkhoff commented 7 years ago

So apparently, the TM_WCF and TM_RDF case in tm_cmdxct turns off the ready bit, TM_SRDY. Now, if I just comment out that line, everything works fine! Not saying that is the right fix, of course.

Rhialto commented 7 years ago

I have just pushed a fix so that you can compile with -DKLH10_DEV_DPTM03=0 (in mk/top-ks-its.mk). This eliminates the device process for the tape and changes a lot of timings and such. You should give that a try. It never turns off the TM_SDRY bit, if I read the code correctly:

    sts |= TM_SDRY;     /* Assume "slave" always there */

Apart from that, in this code that you mentioned,

    case TM_WCF:    /* Write Check Forward (same as Read Forward) */
    case TM_RDF:    /* Read Forward */
...

#if KLH10_DEV_DPTM03
    if (!tm_io(tm, 1)) {        /* Errors handled by tm_io now */
        TMREG(tm, RHR_CSR) &= ~TM_1GO;  /* Turn off GO */
    }
    TMREG(tm, RHR_STS) &= ~TM_SDRY; /* Drive busy, note GO still on! */
#else
    (void) tm_io(tm, 1);    /* Errors handled by tm_io now */
    tm_cmddon(tm);      /* Do post-xct stuff, includes clearing GO */
#endif
    return;     /* IO operations don't trigger ATTN when done */

it smells like the clearing of TM_SDRY races with tm_io(), where it causes messages to be exchanged with the device process. It may well be that the command is executed before the bit is cleared, and then it likely won't be set again (although I didn't trace that in detail).

When the device process signals, with SIGUSR1, that the command is completed, tm03_evhsdon() is called, which calls tm_cmddon() which calls ts_ssta() which sets TM_SDRY. I'm not sure if this is called actually from the signal handler, i.e. asynchronously.

In any case, it looks safer to invert the order: try putting the line that clears TM_SDRY before the call to tm_io(). There are other places where it is done in that order, and multiple places where the order seems suspect.

Rhialto commented 7 years ago

Alternatively, if the signal handler doesn't directly call the completion routine, in which case I'm looking for cases where GO is turned off, and TM_SDRY is not turned back on by that time.

It is in theory possible that tm_io() returns 0, in which case GO is cleared (indicating the command has completed) (and no message to the DP is sent), but also TM_SDRY is cleared. And that one won't be set again for a while (since the DP won't signal anything back). At least not until tm_ssta() is called.

So in this case, the code (also in a few instances) should be changed such that it reads

if KLH10_DEV_DPTM03

if (!tm_io(tm, 1)) {        /\* Errors handled by tm_io now _/
    TMREG(tm, RHR_CSR) &= ~TM_1GO;  /_ Turn off GO _/
} else {
    TMREG(tm, RHR_STS) &= ~TM_SDRY; /_ Drive busy, note GO still on! */
}

else

(which of course is incompatible with my previous suggestion to try, but it has the advantage that the comment "note GO still on" is more accurate in this case)

Now it seems unlikely that tm_io() randomly returns 0 from time to time. That's a weak point of this theory.

larsbrinkhoff commented 7 years ago

Some more testing with an unmodified KLH10:

This seems consistent with a timing-dependent race condition.

I don't quite have the knowledge to see what the best fix would be. Are there some tests you want me to run? I can insert logging to view the state changes for the GO and SRDY bits; in both the failing and successful scenarious.

Rhialto commented 7 years ago

When the DP (device process) has processed a command it sends a signal. Given that I tend to think now that the signal just sets a flag, which gets checked in some specific locations (with INSBRKTEST()) there can't really be a race condition as in comment 259556278 but more likely something like the next one. So one thing to log could be all places where GO gets turned off and check if SRDY is not on.

The normal completion path would be something like

Somewhere this process must go wrong.

Now in the tape reading code you show way above, if I read it correctly, the actual command isn't RDR but NOP. NOP always turns off GO right away and doesn't even involve the device process, and doesn't mess with SRDY. So it might be some interplay between the RDR command and the NOP It can't be that the NOP command gets given before RDR completes, because you can't even write a new command while GO is still on from the previous one (https://github.com/Rhialto/klh10/blob/master/src/dvtm03.c#L1154).

Rhialto commented 7 years ago

I see that the TM_ER3 command gets sent to the device proc, but on completion, tm_cmddon() doesn't call tm_ssta() for it. So this command would seem to leave the SRDY bit off.

larsbrinkhoff commented 7 years ago

-DKLH10_DEV_DPTM03=0 works fine, so I'm happy to use that as a workaround. I'm just not sure how to pass it to the build system.

EDIT: fixed by adding the new CONFFLAGS_USR thing.