UPTIME leaves stopped MLDEV jobs around

eswenson1 commented 5 years ago

I have UPTIME running hourly on ES. Frequently, I log in to find one or more dead (stopped) MLDEV jobs lying around:

 14 PFTHMG JOB.05 SYS       _10!0    ? DSN     3   3   0%          PCLSR .VALUE  20
 16 PFTHMG JOB.04 SYS        10!0    ? DSN     3   1   0%          PCLSR .VALUE  22

The PC is 2067 (LOSE2:) and the instruction there is .VALUE

The instruction that precedes this instruction (2066, LOSE1:) also has a .VALUE. This means that the actual reason for stopping was the .VALUE at LOSE1. The only way MLDEV transfers to LOSE1 is by doing a "JRST @CMDTB(A)" where A is 0. Inspecting the A register at the time of fault shows that it is, indeed, 0. The command table is laid out thus:

;SLAVE REPLY ROUTINES DISPATCHED TO WITH LH(B) = - # ARGS MUST BE READ FROM NET
CMDTB:  LOSE1
    NTDI
    OPNSI
    OPNSO
    EOF
    FDELST
    XNOOP
    XACC
    XCALL
    XICLOS
    XOCLOS
    XIOC

This suggests that the slave (MLSLV) responded with a reply of 0, which the above code suggests is invalid. The call to jump through the command table is preceded by:

NTINT:  PUSHJ P,NTCHK
     JRST GOLOOP
    PUSHJ P,REPLY1
    CAIL A,RMAX
     .VALUE
    MOVEM B,LREPLY'
    JRST @CMDTB(A)

The call to NTCHK (from the comments) tests the input network channel status. If there is input, NTCHK does a skip return. Since we are getting to the JRST @CMDTB(A), we must have had input and skip-returned. We then call REPLY1. This processes a reply from the server (MLSLV on the destination host). This does an NTIIOT and then moves the result to the A register. It checks for a RLOGIN reply and doesn't skip, but returns the result in A.
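The dispatch described above can be modeled in a few lines of Python. This is only a sketch: the labels are copied from the CMDTB listing, but the value of RMAX is not shown in the excerpt, so it is assumed here to be one past the last table entry.

```python
# Model of the dispatch at NTINT: the reply code in A indexes CMDTB.
CMDTB = ["LOSE1", "NTDI", "OPNSI", "OPNSO", "EOF", "FDELST",
         "XNOOP", "XACC", "XCALL", "XICLOS", "XOCLOS", "XIOC"]

# Assumption: RMAX is one past the last table entry (its value is not in the listing).
RMAX = len(CMDTB)

def dispatch(reply_code):
    """Return the label that would be reached for a given reply code."""
    if reply_code >= RMAX:        # CAIL A,RMAX does not skip, so the .VALUE runs
        return ".VALUE"
    return CMDTB[reply_code]      # JRST @CMDTB(A)

for code in (0, 1, 11, 12):
    print(code, dispatch(code))
```

Under these assumptions, reply code 0 is the only in-range value that lands on LOSE1, which matches A being 0 in the dead jobs.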

I was curious to see what host MLDEV was talking to. I think the device should be found in the JBCDEV address. That value was $1'DSK'. Maybe it is in FDEVN -- that value was also $1'DSK'. I expected to find a foreign address here. My UPTIME DATA file has entries for DB, ES, NO, and UP in it. I wonder if this is happening due to the entry for ES and it is the attempt to read M.F.D. (FILE) on DSK: that is dying. I checked both of the dead jobs I saw in PEEK, and they both have $1'DSK' as the device.

Does anyone know what happens when you attempt to access a file on the ES: device from ES? -- that is, rather than using DSK:, you use the machine name as the device? JOBDEV ES is linked to ATSIGN MLDEV. But is there any short circuiting done because the device (ES:) is specifying the localhost?

I'm going to patch my UPTIME DATA to not use ES, but rather some non-existent host, like MC. I will look to see if this cures my problem of dead MLDEV jobs.

I'm wondering, since all the dead MLDEV jobs are for the device DSK, whether these only fail when the host is my localhost.

eswenson1 commented 5 years ago

I just confirmed that MLDEV replaces the contents of JBCDEV and FDEVN with $1' DSK' in some situation. I expected to have it do this if the host specified in these locations originally matched that of the local host. But I don't see how it does this. Here is the relevant code:

    MOVEI A,BUF
    PUSHJ P,NETWRK"HSTLOOK
     .VALUE             ;Host not in host table?
    MOVEM A,HOSTN'          ;Host number
    MOVEM TT,NETWRK'        ;Network number
    PUSHJ P,NETWRK"HSTUNMAP     ;Don't need host table any more
     .VALUE
    LDB A,[3000,,JBCDEV]        ;Right-hand four characters
    JUMPN A,.+2
     MOVE A,[SIXBIT /  DSK/]
    LSH A,14
    MOVEM A,JBCDEV
    MOVEM A,FDEVN

Note the "MOVE A,[SIXBIT /  DSK/]" instruction. The "JUMPN A,.+2" before it jumps over the MOVE when A is nonzero, so the MOVE is executed only when A is 0. The instruction that set up A was "LDB A,[3000,,JBCDEV]", and its comment says it loads the right-hand four characters. The original contents of JBCDEV were $1' ES'. If the host name sits in the left-hand two characters, the right-hand four characters are 0 whenever nothing follows the host name, so it appears these locations get $' DSK' for any host whose device name is just the host name, not only for the local host.
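As a cross-check, here is a small Python model of the LDB / JUMPN / MOVE / LSH sequence. It assumes the usual PDP-10 semantics (JUMPN jumps when the accumulator is nonzero, and LSH A,14 shifts left by octal 14 = 12 bits) and assumes JBCDEV holds a left-justified SIXBIT device name. The device name "ESARC" is purely hypothetical, chosen to show the nonzero case.

```python
MASK36 = (1 << 36) - 1

def sixbit(text):
    """Pack up to six characters into a 36-bit word, left-justified SIXBIT."""
    word = 0
    for ch in text.upper().ljust(6)[:6]:
        word = (word << 6) | (ord(ch) - 32)
    return word

def unsixbit(word):
    """Unpack a 36-bit word into its six SIXBIT characters."""
    return "".join(chr(((word >> shift) & 0o77) + 32)
                   for shift in range(30, -1, -6))

def patch_device(jbcdev):
    a = jbcdev & 0o77777777       # LDB A,[3000,,JBCDEV]: low 24 bits
    if a == 0:                    # JUMPN A,.+2 jumps over the MOVE when A is nonzero
        a = sixbit("  DSK")       # MOVE A,[SIXBIT /  DSK/]
    return (a << 12) & MASK36     # LSH A,14

for name in ("ES", "NO", "ESARC"):   # "ESARC" is a made-up device name
    print(name, "->", repr(unsixbit(patch_device(sixbit(name)))))
```

In this model a bare host name such as ES or NO always comes out as DSK, regardless of which host it names, so finding DSK in JBCDEV and FDEVN says nothing about the host.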

So the fact that I found $1' DSK' in these locations in my post-mortem doesn't appear to prove that we're trying to perform file system i/o on ES. I would have had to look at the HOSTN location, where the host number is stored after the host lookup (and before patching JBCDEV and FDEVN). Next time I see one of these jobs dead, I'll look in HOSTN and NETWRK (the latter holds the network number).

eswenson1 commented 5 years ago

And it appears no short-circuiting is happening. Even when the target host is the same as the local host, we still attempt a Chaosnet connection to the localhost. And when I step through MLDEV (debugging using the OJB device), it works every time. No JRST to LOSE1, no .VALUE, and I see the resulting DSK:M.F.D. (FILE) output on the console of the i/o requesting job.

So I'm no closer to understanding why sometimes this fails.

eswenson1 commented 5 years ago

I've also confirmed that in this "local" case (when the MLDEV device is ES: and that is the machine we're on), ITS runs an MLSLV handler, with which MLDEV is communicating. So it must be MLSLV that, in some situations, is returning a bogus reply that causes MLDEV to .VALUE.

eswenson1 commented 5 years ago

I wonder how you debug one of these handlers (MLSLV). They may well be launched with the JOB device, or maybe some other mechanism in ITS, but I don't think they are subject to the same translation hack for OJB: because they are not run as jobs under a DDT, and thus I don't know how to make translations apply to them.

eswenson1 commented 5 years ago

It's easy to debug MLSLV. Just run it under DDT! It detects that it is being run this way and listens for connections over the chaos net (if jname is MLDEV) or tcp net (if jname is TCP). I just set up a debugging MLDEV in one HACTRN, an ES:M.F.D. (FILE) requesting job in another HACTRN, and an MLSLV in a third HACTRN. I'm able to debug both ends of the MLDEV <-> MLSLV connection that way.

But of course, I can't reproduce the .VALUE problem described above, and a perusal of the MLSLV code shows no obvious reason it would reply with a 0 rather than the various reply codes, which are defined to be >1.

eswenson1 commented 5 years ago

Well, it turns out it is not an issue with a local host/device. I got the same error with an MLDEV instance talking to NO over chaosnet. I found a crashed MLDEV whose PC was LOSE1. We get there when the response from MLSLV is 0 (not a valid response). I verified that the host name that was resolved was "NO" and that the resolved host number was 40700003150, which is NO's chaos address. So we either have a bug in MLSLV or data is getting garbaged somehow.

I did notice a couple of times when I did a NO^F from ES that the directory listing had garbage in the middle of it. So I suspect the issue is not with MLSLV, but with the robustness of the chaosnet over UDP that we're using with the emulator.

bictorv commented 5 years ago

I've not dug into this, but I've had occasional garbage in NO files accessed from UP. It seems very odd, since I'd expect the UDP checksums to catch this - unless it's garbage already when packed into UDP packets. I've only seen this from NO, never from other machines.

If it is indeed a UDP-related problem, it should be solved by using the new Chaos-over-TLS option instead. My experience (using it from home to MX-11, and the rest of the net through it) is that it's very often quite a bit faster than the UDP option, although I don't really see why. I'll send you instructions for setting it up, separately.

larsbrinkhoff commented 5 years ago

Try removing NO from the data file for say a week.

eswenson1 commented 5 years ago

In trying to debug why my new EX ITS can't talk to my ES ITS, I noticed that when EX (or ES) sends chaosnet packets through no.nocrew.org, I get very frequent checksum errors in the UDP packets:

01:38:31.609220 IP (tos 0x0, ttl 64, id 62274, offset 0, flags [DF], proto UDP (17), length 92)
    ip-10-0-0-55.ec2.internal.42043 > static.74.191.99.88.clients.your-server.de.42042: [bad udp cksum 0x223e -> 0x9402!] UDP, length 64
        0x0000:  4500 005c f342 4000 4011 256a 0a00 0037  E..\.B@.@.%j...7
        0x0010:  5863 bf4a a43b a43a 0048 223e 0101 0000  Xc.J.;.:.H">....
        0x0020:  0900 0025 0668 cc21 0671 8827 0004 0004  ...%.h.!.q.'....
        0x0030:  6f43 6e6e 6365 6974 6e6f 6420 656f 2073  oCnnceitnod.eo.s
        0x0040:  6f6e 2074 7865 7369 2074 7461 7420 6968  on.txesi.ttat.ih
        0x0050:  2073 6e65 e064 0668 0671 288c            .sne.d.h.q(.

I don't see any of these when sending/receiving packets through up.update.uu.se.
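For reference, the UDP payload in the dump above can be pulled apart with a short Python sketch. The framing is an assumption on my part, not something confirmed in this thread: a 4-byte CHUDP header (version, function, two argument bytes), then a 16-byte Chaosnet header read as big-endian 16-bit words, then the data with the two bytes of each word swapped, then a three-word trailer (destination, source, checksum). The opcode name LOS for opcode 9 comes from the Chaosnet protocol, not from the dump.

```python
import struct

# The 64-byte UDP payload, copied from the dump starting at offset 0x1c.
PAYLOAD = bytes.fromhex(
    "01010000"
    "090000250668cc210671882700040004"
    "6f436e6e636569746e6f6420656f2073"
    "6f6e2074786573692074746174206968"
    "20736e65e06406680671288c"
)

# Assumed CHUDP header: version, function, arg1, arg2.
version, function, arg1, arg2 = PAYLOAD[:4]

# Chaosnet header: eight 16-bit words.
words = struct.unpack(">8H", PAYLOAD[4:20])
opcode = words[0] >> 8                 # opcode is the high byte of word 0
nbytes = words[1] & 0x0FFF             # low 12 bits are the data byte count
dest, dest_idx, src, src_idx, pkt_no, ack_no = words[2:]

# Data is padded to a whole word; swap each byte pair to get the text.
padded = nbytes + (nbytes & 1)
raw = PAYLOAD[20:20 + padded]
text = bytes(raw[i ^ 1] for i in range(padded))[:nbytes].decode("ascii")

# Assumed trailer: destination, source, checksum.
trailer = struct.unpack(">3H", PAYLOAD[20 + padded:20 + padded + 6])

print("opcode", opcode, "bytes", nbytes)
print("dest", oct(dest), "src", oct(src))
print("data:", text)
```

Read this way, the packet is an opcode 9 (LOS) packet addressed to Chaosnet host 3150 whose data is the text "Connection does not exist at this end". The contents look intact even though tcpdump flags the UDP checksum as bad.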

eswenson1 commented 5 years ago

Spoke too soon. Now I'm seeing them through up.update.uu.se too. Oh well. Thought I'd found the source of the issues with NO.

larsbrinkhoff commented 5 years ago

It has seemed to me that NO talking Chaosnet is particularly slow. If the checksum errors are more frequent to/from NO than others, that may explain the slowness.

eswenson1 commented 5 years ago

Just had another dead MLDEV -- same error -- bad response from MLSLV. And the target host was NO, again. When these die, I load symbols and check the HOSTN contents:

*hostn/'BOJ◊:   .IOT IOP,PKTBUF+12   =40700003150

As I think I mentioned earlier, MLDEV changes the JBCDEV value from 'NO to 'DSK prior to opening the connection to the remote host, so HOSTN is the place to look to see which host it was talking to.
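A quick way to read the HOSTN value, as a Python sketch. Chaosnet addresses are 16 bits (subnet in the high byte, host in the low byte). The assumption here is that the low 16 bits of HOSTN hold that address and the higher bits are a network tag, which this sketch does not try to decode.

```python
HOSTN = 0o40700003150           # value read from the dead MLDEV job

# Assumption: the low 16 bits of HOSTN are the Chaosnet address.
chaos_addr = HOSTN & 0xFFFF
subnet = chaos_addr >> 8        # high byte: subnet number
host = chaos_addr & 0xFF        # low byte: host number on that subnet

print("address", oct(chaos_addr), "=", hex(chaos_addr))
print("subnet", oct(subnet), "host", oct(host))
```

Under that assumption the address is 3150 octal (0x0668), i.e. subnet 6, host 150 octal. The same 0668 shows up as the destination address in the packet dump posted earlier in this thread.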

bictorv commented 5 years ago

Googling the UDP checksum problem, it seems it might be related to "hardware offloading of checksums", which seems to be used e.g. in virtual environments. It can be turned off with ethtool, perhaps using ethtool --offload eth0 rx off tx off but YMMV, of course. Maybe you should google yourselves first. :-)
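The offloading theory can be checked against the packet dump posted earlier in this thread. The Python sketch below recomputes the UDP checksum over the pseudo-header and the UDP segment, and also computes the folded sum of the pseudo-header alone. As far as I understand it, with tx checksum offload the kernel leaves that partial pseudo-header sum in the checksum field for the NIC to finish, so a capture taken on the sending host sees an unfinished checksum.

```python
# The full 92-byte IP packet, copied from the tcpdump hex dump.
PACKET = bytes.fromhex(
    "4500005cf34240004011256a0a000037"
    "5863bf4aa43ba43a0048223e01010000"
    "090000250668cc210671882700040004"
    "6f436e6e636569746e6f6420656f2073"
    "6f6e2074786573692074746174206968"
    "20736e65e06406680671288c"
)

def sum16(data):
    """Sum the data as big-endian 16-bit words (zero-padded to even length)."""
    if len(data) % 2:
        data += b"\x00"
    return sum(int.from_bytes(data[i:i + 2], "big")
               for i in range(0, len(data), 2))

def fold(total):
    """Fold carries back in to get a 16-bit ones-complement sum."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

ihl = (PACKET[0] & 0x0F) * 4                 # IP header length in bytes
udp = PACKET[ihl:]
udp_len = int.from_bytes(udp[4:6], "big")
stored = int.from_bytes(udp[6:8], "big")     # checksum field as captured

# Pseudo-header: source IP, destination IP, zero, protocol, UDP length.
pseudo = PACKET[12:20] + bytes([0, PACKET[9]]) + udp[4:6]
pseudo_sum = fold(sum16(pseudo))

# Full checksum: pseudo-header plus UDP segment with the checksum field zeroed.
segment = udp[:6] + b"\x00\x00" + udp[8:udp_len]
expected = fold(sum16(pseudo) + sum16(segment)) ^ 0xFFFF

print("stored     ", hex(stored))
print("expected   ", hex(expected))
print("pseudo only", hex(pseudo_sum))
```

For this packet the stored value 0x223e equals the pseudo-header-only sum, and the full computation gives the 0x9402 that tcpdump expected. That is consistent with the "bad udp cksum" lines being a capture artifact of offloading on the sending host rather than corruption on the wire, though one packet is not proof.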

eswenson1 commented 5 years ago

Wanna try that on NO, Lars? I can report on whether the errors go away.

larsbrinkhoff commented 5 years ago

Ok, tx checksum offloading is off. Couldn't change rx.

bictorv commented 5 years ago

Eric, did you see any differences at your end after the offloading change?

eswenson1 commented 5 years ago

I haven't seen any MLDEV hangs from NO since your change and UPTIME is always showing up-to-date up times for NO.

eswenson1 commented 5 years ago

Bummer. I got a dead MLDEV job on ES. Its PC was at the same .VALUE (LOSE1), and HOSTN indicated it was NO that caused the issue. So I guess the tx checksum fix didn't fix this.

eswenson1 commented 5 years ago

While I haven't yet seen any MLDEV jobs dead on ES or EX, as a result of bad data being returned by MLSLV on one machine or the other, I have seen some data corruption when I've tried to list directories or retrieve files between the two machines. This is the same corruption that I've seen when using the chaosnet over udp with NO. I think the chaosnet-over-udp implementation in KLH10 is just not robust with respect to errors. When I ran ES, EX, and the chaosnet bridge on the same linux VM, I had no issues. When (due to memory issues on my instance) I moved EX to another instance, and thus had to go over the Internet for my chaosnet traffic between ES and EX, I started seeing corruption -- not always, of course, but periodically.

For example, here is a fragment of a directory listing of a directory on ES from EX:

  0   TS     BKG    21   3/22/1978 04:40:04
  0   TS     CLD    1   3/9/1979 19:53:03
  0   TS     D    ≥·@@EE<j^IADp@EMhdfiMd≠∀↓↓@@@αRL@@↓↓↓∀@↓↓·@@E↓·@h=I\^be]`@bIiLrtIL4∀@↓A···  TS     EG     6   1/16/1979 17:01:17
  0   TS     EL     1   1/27/1978 07:14:13
  0   TS     EM     6   1/16/1979 17:10:43

larsbrinkhoff commented 5 years ago

Interesting. At least there's a clear indication something's wrong.

bictorv commented 5 years ago

Could you post your "devdef chaos" and "link chudp" lines from your configs? It's really strange that none of the checksums (both UDP and CHUDP use them) detect this. Did you double-check the checking in your version of dpchaos.c (should be in chaostohost_chudp(), search for ch_checksum)?

If the checksums work, the error would be somewhere before sending the packet, or after receiving it. Concurrency bug?

UPTIME leaves stopped MLDEV jobs around #1491