9fans / plan9port

Plan 9 from User Space
https://9fans.github.io/plan9port/
Other
1.63k stars 322 forks source link

devdraw: GUI programs hang on AIX #398

Closed bhuntsman closed 3 years ago

bhuntsman commented 4 years ago

On AIX, devdraw seems to hang any time it would draw a character on the screen. This occurs in any program that uses devdraw, including src/libdraw/test.c. In sam, the hang can be triggered by typing a character in the command window, or displaying the menu. Upon right-clicking, the following is displayed:

Screen Shot 2020-05-14 at 4 02 36 PM

I enabled some of the debugging statements in devdraw and libdraw, and I can see that the window still responds to mouse position events, window resizes, and even receives character input, but only a limited quantity after. Approximately 3-4 additional keyboard events get processed before devdraw totally locks up and stops responding to any events.

This has always been an issue on AIX, but I haven’t been very interested in the GUI programs until recently.

I could use some intuition as to if this sounds like a problem with the font code, X11 draw code, or perhaps even libthread.

rsc commented 4 years ago

How to debug:

  1. Try updating to incorporate the latest changes. (You will need to redo the edit to mkmk.sh that says installbsd instead of install, which was blown away by rebuilding that script using "mk mkmk.sh". We need to figure out a better way to bring that back for real.)
  2. export LIBTHREAD=pthreadperthread
  3. Run sam as before and make it hang.
  4. Attach gdb to the samterm and devdraw processes (you want the higher-numbered ones when there is a pair) and run "thread apply all where".

Thanks.

bhuntsman commented 4 years ago

Ok, here's what we have running:

 $ ps -ef | grep sam
     ben  9502924    10354924    0 14:34:11  pts/0  0:00 samterm samterm (devdraw)
     ben  10354924   15663122  120 14:34:11  pts/0  0:03 samterm
     ben  14549002    7667848    0 14:34:11  pts/0  0:00 sam
     ben  15663122   14549002    0 14:34:11  pts/0  0:00 samterm
$

I assume you want the (devdraw) process. On AIX, we have only have dbx, not gdb. But here we go:

$ dbx -a 9502924
Waiting to attach to process 9502924 ...
Successfully attached to devdraw.
warning: Directory containing devdraw could not be determined.
Apply 'use' command to initialize source path.

Type 'help' for help.
reading symbolic information ...
stopped in read at 0xd0122fcc ($t2)
0xd0122fcc (read+0x28c) 80410014            lwz   r2,0x14(r1)
(dbx) thread
 thread  state-k     wchan    state-u    k-tid   mode held scope function
 $t1     wait                 running  93323343     k   no   sys
>$t2     run                  running  72941713     k   no   sys  read
(dbx) where
read(??, ??, ??) at 0xd0122fcc
serveproc(v = 0x20039598), line 146 in "srv.c"
procmain(p = 0x2005d9e8), line 384 in "thread.c"
startprocfn(v = 0x2003b968), line 99 in "pthread.c"
(dbx) thread 1
 thread  state-k     wchan    state-u    k-tid   mode held scope function
 $t1     wait                 running  93323343     k   no   sys
(dbx) where
read(??, ??, ??) at 0xd0122fcc
serveproc(v = 0x20039598), line 146 in "srv.c"
procmain(p = 0x2005d9e8), line 384 in "thread.c"
startprocfn(v = 0x2003b968), line 99 in "pthread.c"
(dbx)

Just for some more info, and curiosity, lets see what the other two are up to... PID 15663122, per above:

$ dbx -a 15663122
Waiting to attach to process 15663122 ...
Successfully attached to samterm.
warning: Directory containing samterm could not be determined.
Apply 'use' command to initialize source path.

Type 'help' for help.
reading symbolic information ...
stopped in read.read at 0xd0122fcc ($t1)
0xd0122fcc (read+0x28c) 80410014            lwz   r2,0x14(r1)
(dbx) thread
 thread  state-k     wchan    state-u    k-tid   mode held scope function
>$t1     run                  running  83886173     k   no   sys  read
(dbx) where
read.read(??, ??, ??) at 0xd0122fcc
_threadsetupdaemonize(), line 151 in "daemonize.c"
p9main(argc = 1, argv = 0x2ff22298), line 840 in "thread.c"
main.main(argc = 1, argv = 0x2ff22298), line 10 in "main.c"
(dbx)

And PID 10354924:

$ dbx -a 10354924
Waiting to attach to process 10354924 ...
Successfully attached to samterm.
warning: Directory containing samterm could not be determined.
Apply 'use' command to initialize source path.

Type 'help' for help.
reading symbolic information ...
stopped in _event_sleep at 0xd0544924 ($t2)
0xd0544924 (_event_sleep+0x584) 80410014            lwz   r2,0x14(r1)
(dbx) thread
 thread  state-k     wchan    state-u    k-tid   mode held scope function
 $t1     run                  running  78643213     u   no   sys  _string                                                                                                        
>$t2     run                  blocked  54657225     k   no   sys  _event_sleep                                                                                                   
 $t3     wait      0xf1000e0000b014c8 running  79626241     k   no   sys                                                                                                         
 $t4     wait      0xf1000e0000b128c8 running  90243225     k   no   sys                                                                                                         
 $t5     wait      0xf1000a05d0356e50 running  71041093     k   no   sys                                                                                                         
(dbx) where
_event_sleep(??, ??, ??, ??, ??, ??) at 0xd0544924
_event_wait(??, ??) at 0xd054544c
_cond_wait_local(??, ??, ??) at 0xd0553e38
_cond_wait(??, ??, ??) at 0xd0554444
pthread_cond_wait(??, ??) at 0xd05550dc
_procsleep(r = 0x20029a40), line 60 in "pthread.c"
procscheduler(p = 0x20029948), line 441 in "thread.c"
_threadswitch(), line 259 in "thread.c"
channel.chanalt(a = 0x20045980), line 306 in "channel.c"
_chanop(c = 0x20029868, op = 1, p = 0x20045a28, canblock = 1), line 326 in "channel.c"
chansend(c = 0x20029868, v = 0x20045a28), line 334 in "channel.c"
mouse._ioproc(arg = 0x20029438), line 63 in "mouse.c"
procmain(p = 0x20029948), line 384 in "thread.c"
startprocfn(v = 0x20028e58), line 99 in "pthread.c"
(dbx) thread 1
 thread  state-k     wchan    state-u    k-tid   mode held scope function
 $t1     run                  running  78643213     u   no   sys  _string
(dbx) where
_event_sleep(??, ??, ??, ??, ??, ??) at 0xd0544924
_event_wait(??, ??) at 0xd054544c
_cond_wait_local(??, ??, ??) at 0xd0553e38
_cond_wait(??, ??, ??) at 0xd0554444
pthread_cond_wait(??, ??) at 0xd05550dc
_procsleep(r = 0x20029a40), line 60 in "pthread.c"
procscheduler(p = 0x20029948), line 441 in "thread.c"
_threadswitch(), line 259 in "thread.c"
channel.chanalt(a = 0x20045980), line 306 in "channel.c"
_chanop(c = 0x20029868, op = 1, p = 0x20045a28, canblock = 1), line 326 in "channel.c"
chansend(c = 0x20029868, v = 0x20045a28), line 334 in "channel.c"
mouse._ioproc(arg = 0x20029438), line 63 in "mouse.c"
procmain(p = 0x20029948), line 384 in "thread.c"
startprocfn(v = 0x20028e58), line 99 in "pthread.c"
(dbx) thread 3
 thread  state-k     wchan    state-u    k-tid   mode held scope function
 $t3     wait      0xf1000e0000b014c8 running  79626241     k   no   sys
(dbx) where
_event_sleep(??, ??, ??, ??, ??, ??) at 0xd0544924
_event_wait(??, ??) at 0xd054544c
_cond_wait_local(??, ??, ??) at 0xd0553e38
_cond_wait(??, ??, ??) at 0xd0554444
pthread_cond_wait(??, ??) at 0xd05550dc
_procsleep(r = 0x20029a40), line 60 in "pthread.c"
procscheduler(p = 0x20029948), line 441 in "thread.c"
_threadswitch(), line 259 in "thread.c"
channel.chanalt(a = 0x20045980), line 306 in "channel.c"
_chanop(c = 0x20029868, op = 1, p = 0x20045a28, canblock = 1), line 326 in "channel.c"
chansend(c = 0x20029868, v = 0x20045a28), line 334 in "channel.c"
mouse._ioproc(arg = 0x20029438), line 63 in "mouse.c"
procmain(p = 0x20029948), line 384 in "thread.c"
startprocfn(v = 0x20028e58), line 99 in "pthread.c"
(dbx) thread 4
 thread  state-k     wchan    state-u    k-tid   mode held scope function
 $t4     wait      0xf1000e0000b128c8 running  90243225     k   no   sys
(dbx) where
_event_sleep(??, ??, ??, ??, ??, ??) at 0xd0544924
_event_wait(??, ??) at 0xd054544c
_cond_wait_local(??, ??, ??) at 0xd0553e38
_cond_wait(??, ??, ??) at 0xd0554444
pthread_cond_wait(??, ??) at 0xd05550dc
_procsleep(r = 0x20029a40), line 60 in "pthread.c"
procscheduler(p = 0x20029948), line 441 in "thread.c"
_threadswitch(), line 259 in "thread.c"
channel.chanalt(a = 0x20045980), line 306 in "channel.c"
_chanop(c = 0x20029868, op = 1, p = 0x20045a28, canblock = 1), line 326 in "channel.c"
chansend(c = 0x20029868, v = 0x20045a28), line 334 in "channel.c"
mouse._ioproc(arg = 0x20029438), line 63 in "mouse.c"
procmain(p = 0x20029948), line 384 in "thread.c"
startprocfn(v = 0x20028e58), line 99 in "pthread.c"
(dbx) thread 5
 thread  state-k     wchan    state-u    k-tid   mode held scope function
 $t5     wait      0xf1000a05d0356e50 running  71041093     k   no   sys
(dbx) where
_event_sleep(??, ??, ??, ??, ??, ??) at 0xd0544924
_event_wait(??, ??) at 0xd054544c
_cond_wait_local(??, ??, ??) at 0xd0553e38
_cond_wait(??, ??, ??) at 0xd0554444
pthread_cond_wait(??, ??) at 0xd05550dc
_procsleep(r = 0x20029a40), line 60 in "pthread.c"
procscheduler(p = 0x20029948), line 441 in "thread.c"
_threadswitch(), line 259 in "thread.c"
channel.chanalt(a = 0x20045980), line 306 in "channel.c"
_chanop(c = 0x20029868, op = 1, p = 0x20045a28, canblock = 1), line 326 in "channel.c"
chansend(c = 0x20029868, v = 0x20045a28), line 334 in "channel.c"
mouse._ioproc(arg = 0x20029438), line 63 in "mouse.c"
procmain(p = 0x20029948), line 384 in "thread.c"
startprocfn(v = 0x20028e58), line 99 in "pthread.c"
(dbx)

I hope that provides some helpful information! Thank you!

rsc commented 4 years ago

dbx is showing the same stack for all the threads, so it's unfortunately not too helpful. There's only one thread with that particular stack (mouse._ioproc), I'm sure.

One stab in the dark would be to try

export font=/lib/font/bit/lucm/unicode.9.font

before starting sam.

bhuntsman commented 4 years ago

That may also have been an artifact of dbx not behaving the way I expected. I tried setting font per your suggestion, but that did not change the behavior. However, I also pulled in all the latest changes, and used the dbx "where all" command. Here we go:

$ ps -ef | grep sam
     ben 17694836 20185276 120 21:33:48  pts/0  0:51 samterm
     ben 19005588 17694836   0 21:33:48  pts/0  0:00 samterm samterm (devdraw)
     ben 19071186  7667848   0 21:33:48  pts/0  0:00 sam
     ben 19267606 18088064   0 21:34:46  pts/2  0:00 grep sam
     ben 20185276 19071186   0 21:33:48  pts/0  0:00 samterm

devdraw (PID 19005588):

$ dbx -a 19005588
Waiting to attach to process 19005588 ...
Successfully attached to devdraw.
warning: Directory containing devdraw could not be determined.
Apply 'use' command to initialize source path.

Type 'help' for help.
reading symbolic information ...
stopped in read at 0xd0122fcc ($t2)
0xd0122fcc (read+0x28c) 80410014            lwz   r2,0x14(r1)
(dbx) thread
 thread  state-k     wchan    state-u    k-tid   mode held scope function
 $t1     wait                 running  83689597     k   no   sys
>$t2     run                  running  93192207     k   no   sys  read
(dbx) where all
Thread $t1
warning: Thread is in kernel mode, not all registers can be accessed.
.() at 0x5c012200
x11-screen.select(__fds = 6, __readlist = 0x2ff1bf30, __writelist = 0x2ff1df30, __exceptlist = 0x2ff1ff30, __timeout = (nil)), line 244 in "time.h"
xloop(), line 301 in "x11-screen.c"
gfx_main(), line 274 in "x11-screen.c"
threadmain(argc = 2, argv = 0x2ff22268), line 85 in "srv.c"
threadmainstart(v = (nil)), line 813 in "thread.c"
procmain(p = 0x20037b48), line 384 in "thread.c"
p9main(argc = 3, argv = 0x2ff22264), line 871 in "thread.c"
main(argc = 3, argv = 0x2ff22264), line 10 in "main.c"
Thread $t2
read(??, ??, ??) at 0xd0122fcc
serveproc(v = 0x20038748), line 146 in "srv.c"
procmain(p = 0x2005cb98), line 384 in "thread.c"
startprocfn(v = 0x2003ab18), line 99 in "pthread.c"
(dbx)

PID 20185276:

$ dbx -a 20185276
Waiting to attach to process 20185276 ...
Successfully attached to samterm.
warning: Directory containing samterm could not be determined.
Apply 'use' command to initialize source path.

Type 'help' for help.
reading symbolic information ...
stopped in read.read at 0xd0122fcc ($t1)
0xd0122fcc (read+0x28c) 80410014            lwz   r2,0x14(r1)
(dbx) thread
 thread  state-k     wchan    state-u    k-tid   mode held scope function
>$t1     run                  running  90374233     k   no   sys  read
(dbx) where all
Thread $t1
read.read(??, ??, ??) at 0xd0122fcc
_threadsetupdaemonize(), line 151 in "daemonize.c"
p9main(argc = 1, argv = 0x2ff22298), line 840 in "thread.c"
main.main(argc = 1, argv = 0x2ff22298), line 10 in "main.c"
(dbx)

PID 17694836:

$ dbx -a 17694836
Waiting to attach to process 17694836 ...
Successfully attached to samterm.
warning: Directory containing samterm could not be determined.
Apply 'use' command to initialize source path.

Type 'help' for help.
reading symbolic information ...
stopped in _event_sleep at 0xd0544924 ($t2)
0xd0544924 (_event_sleep+0x584) 80410014            lwz   r2,0x14(r1)
(dbx) thread
 thread  state-k     wchan    state-u    k-tid   mode held scope function
 $t1     run                  running  78184693     u   no   sys  _string
>$t2     run                  blocked  78381249     k   no   sys  _event_sleep
 $t3     wait      0xf1000e0000ad68c8 running  91160609     k   no   sys
 $t4     wait      0xf1000e00001b6cc8 running  81461303     k   no   sys
 $t5     wait      0xf1000a05d0356580 running  78643293     k   no   sys
(dbx) where all
Thread $t1
_string(dst = 0x200294c8, pt = (...), src = 0x20028ef8, sp = (...), f = 0x20029ec8, s = "", r = 0x100520d0, len = 16777213, clipr = (...), bg = (nil), bgp = (...), op = SoverD), line 84 in "string.c"
string.string(dst = 0x200294c8, pt = (...), src = 0x20028ef8, sp = (...), f = 0x20029ec8, s = "new"), line 13 in "string.c"
paintitem(m = 0x200294c8, menu = 0x20003158, textr = (...), off = 0, i = 0, highlight = 0, save = (nil), restore = (nil)), line 99 in "menuhit.c"
menupaint(m = 0x200294c8, menu = 0x20003158, textr = (...), off = 0, nitemdrawn = 6), line 134 in "menuhit.c"
menuhit(but = 3, mc = 0x20029568, menu = 0x20003158, scr = (nil)), line 239 in "menuhit.c"
menu3hit(), line 131 in "menu.c"
threadmain(argc = 1, argv = 0x2ff22298), line 144 in "main.c"
threadmainstart(v = (nil)), line 813 in "thread.c"
procmain(p = 0x200181d8), line 384 in "thread.c"
p9main(argc = 1, argv = 0x2ff22298), line 871 in "thread.c"
main.main(argc = 1, argv = 0x2ff22298), line 10 in "main.c"
Thread $t2
_event_sleep(??, ??, ??, ??, ??, ??) at 0xd0544924
_event_wait(??, ??) at 0xd054544c
_cond_wait_local(??, ??, ??) at 0xd0553e38
_cond_wait(??, ??, ??) at 0xd0554444
pthread_cond_wait(??, ??) at 0xd05550dc
_procsleep(r = 0x20029770), line 60 in "pthread.c"
procscheduler(p = 0x20029678), line 441 in "thread.c"
_threadswitch(), line 259 in "thread.c"
channel.chanalt(a = 0x20047980), line 306 in "channel.c"
_chanop(c = 0x20029598, op = 1, p = 0x20047a28, canblock = 1), line 326 in "channel.c"
chansend(c = 0x20029598, v = 0x20047a28), line 334 in "channel.c"
mouse._ioproc(arg = 0x20029568), line 63 in "mouse.c"
procmain(p = 0x20029678), line 384 in "thread.c"
startprocfn(v = 0x20029458), line 99 in "pthread.c"
Thread $t3
warning: Thread is in kernel mode, not all registers can be accessed.
.() at 0x5c011c90
readn(f = 5, av = 0x20066514, n = 4), line 12 in "readn.c"
_drawrecv(mux = 0x20028ce8, canblock = 1, vp = 0x20066570), line 177 in "drawclient.c"
drawrecv(mux = 0x20028ce8), line 198 in "drawclient.c"
_muxrecv(mux = 0x20028ce8, canblock = 1, vp = 0x20066618), line 98 in "io.c"
muxrpc(mux = 0x20028ce8, tx = 0x20029458), line 141 in "mux.c"
displayrpc(d = 0x20018bf8, tx = 0x200666d0, rx = 0x20066870, freep = (nil)), line 271 in "drawclient.c"
_displayrdkbd(d = 0x20018bf8, r = 0x20066a50), line 337 in "drawclient.c"
keyboard._ioproc(arg = 0x20029b88), line 32 in "keyboard.c"
procmain(p = 0x2004a278), line 384 in "thread.c"
startprocfn(v = 0x20029458), line 99 in "pthread.c"
Thread $t4
warning: Thread is in kernel mode, not all registers can be accessed.
.() at 0x5c011c90
hostproc(arg = 0x20029c98), line 295 in "plan9.c"
procmain(p = 0x20069128), line 384 in "thread.c"
startprocfn(v = 0x20029d38), line 99 in "pthread.c"
Thread $t5
warning: Thread is in kernel mode, not all registers can be accessed.
.() at 0x5c011c90
extproc(argv = 0x20008eb0), line 131 in "plan9.c"
procmain(p = 0x20087fd8), line 384 in "thread.c"
startprocfn(v = 0x20029dd8), line 99 in "pthread.c"
(dbx)

Hopefully that's a little better...

rsc commented 4 years ago

Sorry for the delayed response. I don't understand the thread t1 that is stuck "running" in _string. Maybe it is in an infinite loop of some kind. It is in a loop calling cachechars but it is supposed to make progress on each iteration. Maybe add some debugging prints inside _string or at the top of cachechars to see what the calls look like?

bhuntsman commented 3 years ago

Sorry for leaving this hanging for so long. Something is definitely looping and not making progress. I put some print statements in src/libdraw/font.c in cachechars:

int
cachechars(Font *f, char **ss, Rune **rr, ushort *cp, int max, int *wp, char **subfontname)
{
        int i, th, sh, h, ld, w, rw, wid, nc;
        char *sp;
        Rune r, *rp, vr;
        ulong a;
        Cacheinfo *c, *tc, *ec;

/* Begin AIX debugging */
        print("Font: %s\n", f->name);
        print("ss: %s\n", ss);
        print("rr: %S\n", rr);
        print("cp: %d\n", cp);
        print("max: %d\n", max);
        print("wp: %d\n", wp);
        print("subfontname: %s\n", subfontname);
/* End AIX debugging */
...

I run sam again and get the following output repeated forever:

Font: *default*
ss: Ao
rr: <nil>
cp: 804395824
max: 100
wp: 804395816
subfontname: 

Adding a print after the call to cachechars indicates that it always returns 0 each time as well. Perhaps not a thread issue after all?

Thank you!

bhuntsman commented 3 years ago

This is interesting. I think the problem is with cachechars returning 0. I inserted the following two print statements to the end of the cachechars function in src/libdraw/font.c:

            Found:
                wid += c->width;
                c->age = f->age;
                cp[i] = h;
                i++;
                print("\n--- i: %d\n", i);
        }
        if(ss)
                *ss = sp;
        else
                *rr = rp;
        *wp = wid;
        print("cachecars returning: %d\n", i);
        return i;

When I run sam, cachechars returns 1 a few times, but as soon as I right-click or type a character, I just get endless "cachechars returning: 0".

bhuntsman commented 3 years ago

I did some more digging, comparing a step-by-step run on Linux against a step-by-step on AIX, and tracked the issue down to src/libdraw/string.c:80:

        while((*s || *r) && len){

For example, if I start sam, attach a debugger to the samterm process, and set a breakpoint on cachechars, and then type "a" in the sam window, I can see that we end up calling cachechars twice. The second call is for our "a". This call works correctly, and in cachechars sp gets incremented by the value of w, which is 1 in this case. Now, *sp = "".

On Linux, we get to the while statement above, and the statement evaluates to false and we move on to line 146:

    return pt;

On AIX, the statement evaluates to true and we enter the loop. Now we end up passing garbage arguments to cachechars and it always returns 0 and we end up looping forever.

The following change to string.c resolves the issue for AIX, but causes a segfault on other platforms:

        while((*s || *rptr) && len){

Not sure this is the best way to resolve this, but something like this seems to be a solution that allows AIX to work without breaking anything anywhere else:

diff --git a/src/libdraw/string.c b/src/libdraw/string.c
index 5ba5016a..4f08b1b8 100644
--- a/src/libdraw/string.c
+++ b/src/libdraw/string.c
@@ -81,7 +81,11 @@ _string(Image *dst, Point pt, Image *src, Point sp, Font *f, char *s, Rune *r, i
        }else
                rptr = &r;
        sf = nil;
+#if defined(__AIX__)
+       while((*s || *rptr) && len){
+#else
        while((*s || *r) && len){
+#endif
                max = Max;
                if(len < max)
                        max = len;

I will submit a pull request with this change.