9fans / plan9port

Plan 9 from User Space
https://9fans.github.io/plan9port/
Other
1.63k stars 322 forks source link

factotum: 'no rpc pending' on p9sk1 client #455

Open fhs opened 4 years ago

fhs commented 4 years ago

factotum seems to be failing to do p9sk1 correctly most of the time. For example, when trying to connect to u9fs (running with -a p9any), it succeeds only 7/20 times here:

% for(i in `{seq 20}) { 9p -a 'tcp!localhost!3210' ls README.md }
9p: mount: authentication failed
README.md
9p: mount: authentication failed
README.md
9p: mount: authentication failed
9p: mount: authentication failed
README.md
README.md
9p: mount: authentication failed
9p: mount: authentication failed
9p: mount: authentication failed
README.md
9p: mount: authentication failed
README.md
README.md
9p: mount: authentication failed
9p: mount: authentication failed
9p: mount: authentication failed
9p: mount: authentication failed
9p: mount: authentication failed

I was also able to reproduce the issue trying to connect to fossil serving 9P. I was NOT able reproduce the issue with native Plan 9 factotum. The code also looks very different compared to p9p factotum. The issue only becomes apparent on Plan 9 when I bind /mnt/term/mnt/factotum /mnt/factotum in drawterm.

When authentication fails, factotum 9P protocol trace shows this:

% 9 factotum -D -n -d
....
 <-4- Tattach tag 1 fid 2 afid -1 uname fhs aname 
-4-> Rattach tag 1 qid (0000000000000001 0 d)
<-4- Twalk tag 1 fid 2 newfid 0 nwname 1 0:rpc 
-4-> Rwalk tag 1 nwqid 1 0:(0000000000000002 0 ) 
<-4- Topen tag 1 fid 0 mode 2
fid mode is 0x2
-4-> Ropen tag 1 qid (0000000000000002 0 ) iounit 0
<-4- Tclunk tag 1 fid 2
-4-> Rclunk tag 1
<-4- Twrite tag 1 fid 0 offset 0 count 29 'start proto=p9any role=client'
-4-> Rwrite tag 1 count 29
<-4- Tread tag 1 fid 0 offset 29 count 4096
-4-> Rread tag 1 count 2 'ok'
<-4- Twrite tag 1 fid 0 offset 31 count 5 'read '
-4-> Rwrite tag 1 count 5
<-4- Tread tag 1 fid 0 offset 36 count 4096
-4-> Rread tag 1 count 40 'phase in state 'read offer' want 'write''
<-4- Twrite tag 1 fid 0 offset 76 count 6 'write '
-4-> Rwrite tag 1 count 6
<-4- Tread tag 1 fid 0 offset 82 count 4096
-4-> Rread tag 1 count 13 'toosmall 4096'
<-4- Twrite tag 1 fid 0 offset 95 count 22 'write p9sk1@localhost\0'
-4-> Rwrite tag 1 count 22
<-4- Tread tag 1 fid 0 offset 117 count 4096
-4-> Rread tag 1 count 2 'ok'
<-4- Twrite tag 1 fid 0 offset 119 count 5 'read '
-4-> Rwrite tag 1 count 5
<-4- Tread tag 1 fid 0 offset 124 count 4096
-4-> Rread tag 1 count 19 'ok p9sk1 localhost\0'
<-4- Twrite tag 1 fid 0 offset 143 count 5 'read '
-4-> Rwrite tag 1 count 5
<-4- Tread tag 1 fid 0 offset 148 count 4096
-4-> Rread tag 1 count 11 '6f6b20f5 0894b757 903e38'
<-4- Twrite tag 1 fid 0 offset 159 count 5 'read '
-4-> Rwrite tag 1 count 5
<-4- Tread tag 1 fid 0 offset 164 count 4096
-4-> Rread tag 1 count 42 'phase in state 'read tickreq' want 'write''
<-4- Twrite tag 1 fid 0 offset 206 count 6 'write '
-4-> Rwrite tag 1 count 6
<-4- Tread tag 1 fid 0 offset 212 count 4096
-4-> Rread tag 1 count 12 'toosmall 141'
<-4- Twrite tag 1 fid 0 offset 224 count 147 '77726974 65200167 6c656e64 61000000 00000000 00000000 00000000 00000000 0000006c 6f63616c 686f7374 00000000 00000000 00000000 00000000 00000000'
-4-> Rwrite tag 1 count 147
<-4- Tread tag 1 fid 0 offset 371 count 4096
-4-> Rread tag 1 count 2 'ok'
<-4- Twrite tag 1 fid 0 offset 373 count 5 'read '
-4-> Rwrite tag 1 count 5
authdial: Connection refused
<-4- Tread tag 1 fid 0 offset 378 count 4096
-4-> Rerror tag 1 ename no rpc pending
<-4- Tclunk tag 1 fid 0
-4-> Rclunk tag 1

Digging a bit deeper, it looks like factotum p9sk1 client (in convthread) is in write ticket+auth state and setting c->rpc.op = RpcUnknown in rpc.c:/^rpcrespondn. Then, fs.c:/^fsread (in fsreqthread thread) is checking if c->rpc.op == RpcUnknown and thus fails with no rpc pending. The Conv *c is shared between the two threads. Shouldn't there be a lock?

vestein463 commented 3 years ago

p9p factotum definitely contains a Heisenbug. On FreeBSD, "factotum -ndD" hangs after "rpc start". The answer "ok" is not sent. Whe I remove the treadcreate for convthread, I got the "ok", but the next step hangs again. Now I tried running factotum under valgrind, and: the same program just works as expected.

The valgrind log is:

==1217== Using Valgrind-3.17.0.GIT and LibVEX; rerun with -h for copyright info
==1217== Command: factotum -n
==1217== 
==1217== Warning: ignored attempt to set SIGSTOP handler in sigaction();
==1217==          the SIGSTOP signal is uncatchable
==1217== Conditional jump or move depends on uninitialised value(s)
==1217==    at 0x486FD93: ??? (in /lib/libthr.so.3)
==1217==    by 0x241609: _threadsetupdaemonize (daemonize.c:147)
==1217==    by 0x2406A0: p9main (thread.c:840)
==1217==    by 0x24D118: main (main.c:10)
==1217== 
==1217==