PDP-10 / klh10

Community maintained version of Kenneth L. Harrenstien's PDP-10 emulator.

FreeBSD and runaway shm #12

Closed b4 closed 7 years ago

b4 commented 7 years ago

klh10 absolutely eats every sysvshm resource I have on my FreeBSD system - and then for some reason I can't ipcrm them, despite it saying it succeeded.

No particularly helpful errors are returned indicating it is shm-related, either.

Rhialto commented 7 years ago

With some searching I found the FreeBSD under qemu that I installed previously. It's FreeBSD 10.2:

FreeBSD fbsd 10.2-RELEASE FreeBSD 10.2-RELEASE #0 r286666: Wed Aug 12 15:26:37 UTC 2015 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64

For now I'm just starting without any configuration or OS:

$ ./kn10-kl
KLH10 2.0j-Rhialto (MyKL) built Jan  8 2017 21:03:56
    Copyright © 2002 Kenneth L. Harrenstien -- All Rights Reserved.
This program comes "AS IS" with ABSOLUTELY NO WARRANTY.

Compiled for unknown-freebsd10.2 on x86_64 with word model USEINT
Emulated config:
         CPU: KL10-extend   SYS: T20   Pager: KL  APRID: 3600
         Memory: 8192 pages of 512 words  (SHARED)
         Time interval: INTRP   Base: OSGET
         Interval default: 60Hz
         Internal clock: OSINT
         Other: MCA25 CIRC JPC DEBUG PCCACHE CTYINT EVHINT
         Devices: DTE RH20 RPXX(DP) TM03(DP) NI20(DP)
[MEM: Allocating 8192 pages shared memory, clearing...done]

KLH10# 

Now the output of ipcs -a is

$ ipcs -a
Message Queues:
T           ID          KEY MODE        OWNER    GROUP    CREATOR  CGROUP                 CBYTES                 QNUM               QBYTES        LSPID        LRPID STIME    RTIME    CTIME   

Shared Memory:
T           ID          KEY MODE        OWNER    GROUP    CREATOR  CGROUP         NATTCH        SEGSZ         CPID         LPID ATIME    DTIME    CTIME   
m       131072            0 --rw------- rhialto  rhialto  rhialto  rhialto             1     33554432         2518         2518 21:09:13 no-entry 21:09:13

Semaphores:
T           ID          KEY MODE        OWNER    GROUP    CREATOR  CGROUP          NSEMS OTIME    CTIME   

When I quit kn10, it is back to no sysv-ipc things in use:

$ ipcs -a
Message Queues:
T           ID          KEY MODE        OWNER    GROUP    CREATOR  CGROUP                 CBYTES                 QNUM               QBYTES        LSPID        LRPID STIME    RTIME    CTIME   

Shared Memory:
T           ID          KEY MODE        OWNER    GROUP    CREATOR  CGROUP         NATTCH        SEGSZ         CPID         LPID ATIME    DTIME    CTIME   

Semaphores:
T           ID          KEY MODE        OWNER    GROUP    CREATOR  CGROUP          NSEMS OTIME    CTIME   

$ 

If you test like this, do you get the same?

b4 commented 7 years ago

Bring up an OS w/ networking and then stop (or try to!) the simulator.

Rhialto commented 7 years ago

I see that terminating the emulator uncleanly may leave some dp* processes behind. They seem to keep references to the shared memory (NATTCH = 1). If you kill them, does that go away? I suspect that shared memory without any users doesn't really use system resources, although on my normal NetBSD system I see that shared memory with NATTCH = 0 simply does not occur.

Rhialto commented 7 years ago

Oh and running klh10 inside qemu is insanely slow... I can't test much with that really! It has been running now for nearly 10 minutes and it still hasn't given me the question about checking the file system structure.

larsbrinkhoff commented 7 years ago

I'm regularly running Linux in VirtualBox. It's quite fast, good enough to get work done.

Rhialto commented 7 years ago

Qemu is probably a lot slower than virtualbox, which does not exist for NetBSD to my knowledge, unfortunately. For compiling it is fast enough though.

b4 commented 7 years ago

Unless the processes are killed, I will eventually run out of SHM resources and klh10 will fail to run. However, I need to kill them /all/, impacting every simulator, because it's difficult to trace which went to which.

(there's also the fact I can't boot TOPS-20 without attaching truss to a dpni20 process but that's unrelated)

Rhialto commented 7 years ago

At least the device processes get the shm id as command line argument.
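To tie a leftover dp* process back to its segment, that argument can be parsed and the segment queried. The following is only a sketch: it assumes the argument has the form `-DPM:<decimal shm id>`, and `parse_dpm_arg` and `shm_attach_count` are hypothetical names, not klh10 functions.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations in strict ISO C mode */
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: parse "-DPM:<decimal id>" (assumed argument format).
 * Returns 0 and stores the id on success, -1 if the argument does not match. */
int parse_dpm_arg(const char *arg, int *shmid_out)
{
    static const char prefix[] = "-DPM:";
    const size_t plen = sizeof(prefix) - 1;
    char *end = NULL;
    long val;

    assert(arg != NULL && shmid_out != NULL);
    if (strncmp(arg, prefix, plen) != 0 || !isdigit((unsigned char)arg[plen]))
        return -1;
    errno = 0;
    val = strtol(arg + plen, &end, 10);
    if (errno != 0 || *end != '\0' || val > INT_MAX)
        return -1;
    *shmid_out = (int)val;
    return 0;
}

/* Hypothetical helper: current number of attachments (NATTCH) of a segment,
 * or -1 if the id cannot be queried (errno is set by shmctl). */
long shm_attach_count(int shmid)
{
    struct shmid_ds ds;

    if (shmctl(shmid, IPC_STAT, &ds) == -1)
        return -1;
    return (long)ds.shm_nattch;
}
```

With the id in hand, only the dp* processes that belong to one stuck simulator would need to be killed, rather than all of them.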

Rhialto commented 7 years ago

Since I'm right now not in the position to do some proper investigation I have been pondering a little. So it seems that even on clean shutdown, shm segments are left over, with no users. That doesn't happen for me. Either NetBSD simply removes such segments, or whatever klh is doing to clean up isn't working or sufficient on FreeBSD. A quick look suggests this code is supposed to do it. If this is not working, maybe you can verify that and maybe even determine why?

int
os_mmkill(osmm_t mm, char *ptr)
{
#if CENV_SYS_UNIX && KLH10_DEV_DP
    shmdt((caddr_t)ptr);                        /* Detach attached segment */
    shmctl(mm, IPC_RMID,                        /* then try to flush it */
                (struct shmid_ds *)NULL);
#endif
    return TRUE;
}
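The function above discards the results of both calls, so a failure on FreeBSD would go unnoticed. A variant that reports errors could look roughly like this. It is a sketch only: `shm_release_checked` is a hypothetical name, and plain `int`/`void *` stand in for klh10's own types.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations in strict ISO C mode */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical variant of the cleanup step: detach, then mark the segment
 * for removal, reporting any failure on stderr.
 * Returns 0 if both calls succeeded, -1 otherwise. */
int shm_release_checked(int shmid, const void *addr)
{
    int rc = 0;

    /* (void *)-1 is shmat()'s failure value, never a valid mapping. */
    assert(addr != (const void *)-1);

    if (shmdt(addr) == -1) {
        fprintf(stderr, "shmdt(%p): %s\n", (void *)addr, strerror(errno));
        rc = -1;
    }
    /* Still attempt removal so a failed detach does not leak the id. */
    if (shmctl(shmid, IPC_RMID, NULL) == -1) {
        fprintf(stderr, "shmctl(%d, IPC_RMID): %s\n", shmid, strerror(errno));
        rc = -1;
    }
    return rc;
}
```

If both calls report success and segments still pile up, the likelier cause is another process that remains attached.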

Rhialto commented 7 years ago

I played around a bit more, and something strange is going on. I was (on NetBSD) attaching debuggers to processes and killing them at inopportune moments, etc, and now I have a bunch of shared memory segments that are claimed to have 2 users, but I can't find those users with ps ax. The termination sequence of ni20 is a bit weird, too. Seemingly it is left running as tops20 shuts down, so that the final clean-up only happens correctly if kn10-kl does that when quitting. Also, my tops20 (panda) install doesn't seem to shut down cleanly any more (it used to do it just fine, as I recall). The instructions for this read

Now that you've done all that, shut the system down with CTRL/E CEASE NOW, wait for it to say "Shutdown complete", then do CTRL-\ SHUT followed by CTRL-\ QUIT

After CTRL-\ SHUT I suspiciously drop to DDT:

@operator 
@enable
$^Ecease now
 TOPS20 Will be shut down IMMEDIATELY 
[Confirm]

[Timesharing is over]
14-Jan-2017 21:11:44 ACJ: System shutdown set by job 9, user OPERATOR, program CEASE, TTY5

        OPERATOR - Wait for the message "Shutdown complete" before
        entering commands to PARSER.
$[ni20_cmdchk: Illop=0 wd=4,,0 qe=1443151]
[dpni20-W: Deleting "tap0" multicast entry: ff:ff:ff:ff:ff:ff]
[ni20_cmdchk: Illop=0 wd=4,,0 qe=1443161]
[dpni20-W: Deleting "tap0" multicast entry: ab:0:0:1:0:0]

SJ  0: OPR>Killed Job 1, User OPERATOR, TTY13, at 14-Jan-2017 21:11:44
SJ  0:  Used 0:00:00 in 0:00:35
[ni20_cmdchk: Illop=0 wd=4,,0 qe=164061]
14-Jan-2017 21:11:47 HSYS: Shutdown complete
[HALTED: FE interrupt]
KLH10> shut
Continuing KN10 at loc 01142476...
**HALTED**
$11B>>SWHLT4#/   XCT CHKADR   BUGCHK/   0 

and then

KLH10> quit
Are you sure you want to quit? [Confirm]y
Shutting down...

which doesn't seem to finish. I'm not sure what's wrong there since basically the main program just sends SIGTERM to the 2 ni20 processes. Or is supposed to do that.

Rhialto commented 7 years ago

It's not related to this issue, but while I was looking around I saw this in dpsup.c:

void dp_exit(register struct dp_s *dp, int res)
{
    if (dp->dp_chpid) {
        kill(dp->dp_chpid, SIGKILL);
        /* Perhaps later wait for that specific child */
    }

    exit(res);
}

It is called by the device proc parent processes. So I think they are killing the wrong process here: themselves, instead of their child. They should kill dp->dp_adr->dpc_frdp.dpx_donpid. Or do I see this wrong?
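The "wait for that specific child" idea from the comment in dp_exit() could be sketched as follows. This is not the klh10 code: `terminate_child` is a hypothetical helper, and the SIGTERM-then-SIGKILL escalation with a grace period is an assumption about what a sturdier shutdown might do.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations in strict ISO C mode */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: ask child `pid` to exit with SIGTERM, poll for up to
 * about `grace_ms` milliseconds, then fall back to SIGKILL and a blocking
 * wait. Returns 0 if the child was reaped within the grace period, 1 if
 * SIGKILL was needed, -1 on error (errno set). */
int terminate_child(pid_t pid, int grace_ms)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    int waited_ms;
    pid_t r;

    assert(pid > 0);  /* 0 or a negative value would signal a process group */
    if (kill(pid, SIGTERM) == -1)
        return -1;

    for (waited_ms = 0; waited_ms <= grace_ms; waited_ms += 10) {
        r = waitpid(pid, NULL, WNOHANG);
        if (r == pid)
            return 0;               /* child reaped */
        if (r == -1 && errno != EINTR)
            return -1;              /* e.g. ECHILD: not our child */
        nanosleep(&tick, NULL);     /* r == 0: child is still running */
    }

    if (kill(pid, SIGKILL) == -1)
        return -1;
    while ((r = waitpid(pid, NULL, 0)) == -1 && errno == EINTR)
        ;                           /* retry if interrupted by a signal */
    return r == pid ? 1 : -1;
}
```

A single `waitpid(..., WNOHANG)` straight after `kill()` can return 0 because the child has not exited yet, which is why the sketch polls before giving up.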

b4 commented 7 years ago

Here's a syscall trace of klh10 attempting to shut down on FreeBSD:

1696: 63.737992438 write(1,"Are you sure you want to quit? ["...,40) = 40 (0x28)
1696: 63.738010526 SIGNAL 23 (SIGIO)
1696: 63.738034900 ioctl(0,FIONREAD,0xffffdc44) = 0 (0x0)
1696: 63.738059343 sigreturn(0x7fffffffdc80) = 40 (0x28)
1710: 64.030288057 recvfrom(4,"\^A\0^\0\0\M-{\M-Ps\M-U\^S\M^[-"...,1600,0x0,NULL,0x0) = 277 (0x115)
1710: 64.030418865 kill(1696,SIGUSR1) = 0 (0x0)
1696: 64.030468591 read(0,"\0 \M-i",1) = 3 (0x3)
1696: 64.030507700 SIGNAL 30 (SIGUSR1)
1696: 64.030551210 sigreturn(0x7fffffffe340) = 3 (0x3)
1696: 64.104690594 read(0,"\0 \M-i",1) = 3 (0x3)
1696: 64.104733056 SIGNAL 23 (SIGIO)
1696: 64.104843192 ioctl(0,FIONREAD,0xffffe304) = 0 (0x0)
1696: 64.104881463 sigreturn(0x7fffffffe340) = 3 (0x3)
1696: 64.104913170 SIGNAL 23 (SIGIO)
1696: 64.104940896 ioctl(0,FIONREAD,0xffffe304) = 0 (0x0)
1696: 64.104966876 sigreturn(0x7fffffffe340) = 3 (0x3)
1696: 64.248695224 read(0,"\0 \M-i",1) = 3 (0x3)
1696: 64.248781056 SIGNAL 23 (SIGIO)
1696: 64.248820445 ioctl(0,FIONREAD,0xffffe304) = 0 (0x0)
1696: 64.248847961 sigreturn(0x7fffffffe340) = 3 (0x3)
1696: 64.248861720 SIGNAL 23 (SIGIO)
1696: 64.248884208 ioctl(0,FIONREAD,0xffffe304) = 0 (0x0)
1696: 64.248908652 sigreturn(0x7fffffffe340) = 3 (0x3)
1696: 64.248932606 read(0,"y",1) = 1 (0x1)
1696: 64.248982401 write(1,"Shutting down...",16) = 16 (0x10)
1696: 64.249043231 SIGNAL 23 (SIGIO)
1696: 64.249102803 ioctl(0,FIONREAD,0xffffdc44) = 0 (0x0)
1696: 64.249128923 sigreturn(0x7fffffffdc80) = 16 (0x10)
1696: 64.249174179 sigprocmask(SIG_SETMASK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
1696: 64.249203791 sigprocmask(SIG_SETMASK,{ },0x0) = 0 (0x0)
1696: 64.249247370 kill(1697,SIGTERM) = 0 (0x0)
1696: 64.249376572 wait4(1697,{ EXITED,val=-2 },WNOHANG,0x0) = 0 (0x0)
1696: 64.249484473 shmdt(0x80069b000) = 0 (0x0)
1696: 64.249542788 shmctl(0x3000a,0x0,0x0) = 0 (0x0)
1696: 64.249612138 sigprocmask(SIG_SETMASK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
1696: 64.249651737 sigprocmask(SIG_SETMASK,{ },0x0) = 0 (0x0)
1696: 64.249691266 kill(1698,SIGTERM) = 0 (0x0)
1696: 64.249773955 wait4(1698,{ SIGNALED,sig=SIGTERM },WNOHANG,0x0) = 1698 (0x6a2)
1696: 64.249835273 shmdt(0x80069d000) = 0 (0x0)
1696: 64.249862022 shmctl(0x3000b,0x0,0x0) = 0 (0x0)
1696: 64.249901550 sigprocmask(SIG_SETMASK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
1696: 64.249930184 sigprocmask(SIG_SETMASK,{ },0x0) = 0 (0x0)
1696: 64.249973065 kill(1699,SIGTERM) = 0 (0x0)
1696: 64.250030054 wait4(1699,{ SIGNALED,sig=SIGTERM },WNOHANG,0x0) = 1699 (0x6a3)
1696: 64.250068046 shmdt(0x80069f000) = 0 (0x0)
1696: 64.250093398 shmctl(0x3000c,0x0,0x0) = 0 (0x0)
1696: 64.250131879 sigprocmask(SIG_SETMASK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
1696: 64.250160303 sigprocmask(SIG_SETMASK,{ },0x0) = 0 (0x0)
1696: 64.250222390 kill(1700,SIGTERM) = 0 (0x0)
1696: 64.250298654 wait4(1700,{ SIGNALED,sig=SIGTERM },WNOHANG,0x0) = 1700 (0x6a4)
1696: 64.250352151 shmdt(0x8006a1000) = 0 (0x0)
1696: 64.250378340 shmctl(0x3000d,0x0,0x0) = 0 (0x0)
1696: 64.250426948 sigprocmask(SIG_SETMASK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
1696: 64.250455582 sigprocmask(SIG_SETMASK,{ },0x0) = 0 (0x0)
1696: 64.250520672 kill(1701,SIGTERM) = 0 (0x0)
1696: 64.250601475 wait4(1701,{ SIGNALED,sig=SIGTERM },WNOHANG,0x0) = 1701 (0x6a5)
1696: 64.250642610 shmdt(0x8006a3000) = 0 (0x0)
1696: 64.250668311 shmctl(0x3000e,0x0,0x0) = 0 (0x0)
1696: 64.250706583 sigprocmask(SIG_SETMASK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
1696: 64.250734937 sigprocmask(SIG_SETMASK,{ },0x0) = 0 (0x0)
1696: 64.250787805 kill(1702,SIGTERM) = 0 (0x0)
1696: 64.250836623 wait4(1702,{ SIGNALED,sig=SIGTERM },WNOHANG,0x0) = 1702 (0x6a6)
1696: 64.250871961 shmdt(0x8006a5000) = 0 (0x0)
1696: 64.250897103 shmctl(0x30010,0x0,0x0) = 0 (0x0)
1696: 64.250935095 sigprocmask(SIG_SETMASK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
1696: 64.250963729 sigprocmask(SIG_SETMASK,{ },0x0) = 0 (0x0)
1696: 64.250992224 shmdt(0x8006a7000) = 0 (0x0)
1696: 64.251018273 shmctl(0x10014,0x0,0x0) = 0 (0x0)

Yet:

[root@green /usr/home/csmelosky/vm/PDP-10/marley]# ipcs | grep "65556"
[root@green /usr/home/csmelosky/vm/PDP-10/marley]#

Killing dpni20 manually causes kn10-kl to notice, but not proceed.

1710: 445.370801895 SIGNAL 15 (SIGTERM)
1709: 445.370833811 SIGNAL 15 (SIGTERM)
1710: 445.370986618 process killed, signal = 15
1709: 445.371011271 process killed, signal = 15

and there it is, hanging - dprpxx et al exited correctly, dpni20 is still running.

I have several processes.

(NOTE: my copy is patched to send SIGTERM)

b4 commented 7 years ago

I changed dp_term() to just fprintf something so I could see what the last thing before shm cleanup was:

Shutting down...fall down go boom (we're at dp_term)
fall down go boom (we're at dp_term)
fall down go boom (we're at dp_term)
fall down go boom (we're at dp_term)
fall down go boom (we're at dp_term)
fall down go boom (we're at dp_term)
fall down go boom (we're at dp_term)

[csmelosky@green ~/vm/PDP-10/build/bld-kl]$ ps aux | grep "dpni"
csmelosky 1991 0.0 0.1 12756 2448 1 I+ 20:08 0:00.00 /usr/home/csmelosky/vm/PDP-10/klh10-bin/dpni20 -DPM:65570
csmelosky 1992 0.0 0.1 12756 2448 1 I+ 20:08 0:00.00 /usr/home/csmelosky/vm/PDP-10/klh10-bin/dpni20 -DPM:65570

I think it's not killing dpni20 and as a result is trying to remove in-use segments...which should succeed unless I am misreading shmctl(2)'s manpage.

 IPC_RMID     Removes the segment from the system.  The removal will not
      take effect until all processes having attached the segment
      have exited; however, once the IPC_RMID operation has taken
      place, no further processes will be allowed to attach the
      segment.  For the operation to succeed, the calling
      process's effective uid must match shm_perm.uid or
      shm_perm.cuid, or the process must have superuser privileges.
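The deferred removal described in that excerpt can be checked from a single process. The sketch below uses a hypothetical function, `rmid_is_deferred`, which marks a private segment for removal while it is still attached, confirms the mapping stays usable, and then confirms the id is gone after the last detach. It deliberately avoids IPC_STAT after IPC_RMID, since systems may differ on whether a removed id can still be queried.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations in strict ISO C mode */
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical check: returns 1 if a segment stays usable after IPC_RMID
 * until its last detach and cannot be attached afterwards, 0 if not, and
 * -1 if SysV shared memory could not be set up at all. */
int rmid_is_deferred(size_t size)
{
    volatile unsigned char *p;
    void *again;
    int id;

    assert(size > 0);
    id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1)
        return -1;
    p = shmat(id, NULL, 0);
    if (p == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }

    /* Mark the segment for removal while we are still attached. */
    if (shmctl(id, IPC_RMID, NULL) == -1) {
        shmdt((const void *)p);
        return 0;
    }
    p[0] = 0x5a;                    /* the mapping must still be usable */
    if (p[0] != 0x5a || shmdt((const void *)p) == -1)
        return 0;

    /* The last attachment is gone, so the id should no longer attach. */
    again = shmat(id, NULL, 0);
    if (again != (void *)-1) {
        shmdt(again);
        return 0;
    }
    return 1;
}
```

If this returns 1, IPC_RMID on an in-use segment succeeds as the manpage says, and the segment should disappear once every attached process, such as a lingering dpni20, has exited or detached.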

Rhialto commented 7 years ago

If you try the following in tops-20:

@enable
$knildr
KNILDR>halt 0

does this cleanly terminate dpni20 for you? Somehow, it doesn't work for me at the moment, and I was sure that it did before. I added a "quit" ipc command to dpni20 so that it can clean up network stuff when it is terminated, and I tested that several times. But now it somehow does not work any more. I must have messed something up....

b4 commented 7 years ago

The latest commits seem to have fixed the problem.

Rhialto commented 7 years ago

Ok, that sounds like the issue can be closed. (I'll ignore the comment to the contrary in this thread that seems to have been deleted)