POETSII / tinsel

Manythread RISC-V overlay for FPGA clusters

Hostlink not connecting on Byron #76

Closed m8pple closed 4 years ago

m8pple commented 5 years ago

(This may not be diagnosable, as I'm not running a standard tinsel version, but thought I'd post it here just in case).

I was running things on byron, and the hardware+hostlink seemed to be working fine. However, at some point I started getting the message:

dt10@byron:~/poets-ecosystem$ pts-serve --code code.v --data data.v --elf tinsel.elf --headless true
waiting for all externals... done
Error writing to socket
dt10@byron:~/poets-ecosystem$ top

After 10 minutes or so it was still showing the same problem, even though the exact same command had previously worked. It was running as part of a script working through parameters, and it stopped working after I Ctrl-C'd out of the script, so I'm wondering if that caused some kind of protocol problem on the server side. It was a Ctrl-C flood (holding down the key), and the last process to run was starting up hostlink, so maybe it managed to terminate the link at some delicate point? No idea.

I ran it under gdb, and the problem seems to be a broken pipe:

(gdb) break exit
Breakpoint 1 at 0xecd0
(gdb) r
Starting program: /home/dt10/poets-ecosystem/pts-serve/pts-serve --code code.v --data data.v --elf tinsel.elf --headless true
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
waiting for all externals... done

Program received signal SIGPIPE, Broken pipe.
0x00007ffff6ec0c4d in __libc_send (fd=fd@entry=4, buf=buf@entry=0x7fffffffd100, len=len@entry=32, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/send.c:28
28      ../sysdeps/unix/sysv/linux/send.c: No such file or directory.
(gdb) bt
#0  0x00007ffff6ec0c4d in __libc_send (fd=fd@entry=4, buf=buf@entry=0x7fffffffd100, len=len@entry=32, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/send.c:28
#1  0x0000555555598662 in socketBlockingPut (fd=4, buf=0x7fffffffd100 "", numBytes=32) at SocketUtils.cpp:152
#2  0x0000555555596465 in HostLink::send (this=0x7fffffffd380, dest=<optimized out>, numFlits=<optimized out>, payload=<optimized out>, block=<optimized out>)
    at HostLink.cpp:227
#3  0x0000555555596638 in HostLink::boot (this=0x7fffffffd380, codeFilename=<optimized out>, dataFilename=<optimized out>) at HostLink.cpp:301
#4  0x000055555556426e in main ()
(gdb)

This is running on commit b5faf14bc9e5dcdda7d276a134173b4448138682 of tinsel, so the bottom of that stack is:

https://github.com/POETSII/tinsel/blob/b5faf14bc9e5dcdda7d276a134173b4448138682/hostlink/HostLink.cpp#L301

and it is trying to write the very first code byte into RAM (x=0,y=0,i=0). As far as I can tell, the PCIe socket connection has dropped for some reason.
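For anyone hitting the same trace: the process dies because a write to a socket whose peer has gone away raises SIGPIPE by default. The sketch below shows one common way on Linux to get an error code back instead, by passing MSG_NOSIGNAL to send(). The helper names `putAllNoSignal` and `demoBrokenPipe` are hypothetical and this is not tinsel's `socketBlockingPut`; it only illustrates the failure mode using a local socket pair.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper (NOT tinsel's socketBlockingPut): write numBytes to fd.
// Returns 0 on success or the errno value on failure. MSG_NOSIGNAL (Linux)
// makes send() fail with EPIPE instead of raising SIGPIPE when the peer
// has closed its end.
int putAllNoSignal(int fd, const char* buf, size_t numBytes) {
  assert(buf != nullptr || numBytes == 0);
  size_t sent = 0;
  while (sent < numBytes) {
    ssize_t n = send(fd, buf + sent, numBytes - sent, MSG_NOSIGNAL);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted by a signal: retry
      return errno;                  // e.g. EPIPE or ECONNRESET
    }
    sent += static_cast<size_t>(n);
  }
  return 0;
}

// Demo on a local socket pair: close one end, then write to the other.
// Returns whatever error putAllNoSignal reports (or -1 if setup fails).
int demoBrokenPipe() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;
  close(fds[1]);            // simulate the peer going away
  char payload[32] = {0};   // 32 bytes, like the failing send in the trace
  int err = putAllNoSignal(fds[0], payload, sizeof(payload));
  close(fds[0]);
  return err;
}
```

With this approach the caller can report the lost connection and exit cleanly rather than being killed by the signal.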

I also notice that there are about 200 sockets hanging around in CLOSE_WAIT, all of them with local port 10101, which seems to be the pci daemon port:

dt10@byron:~/poets-ecosystem/submodules/graph_schema$ netstat --tcp
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 localhost.localdo:10101 localhost.localdo:46712 CLOSE_WAIT
tcp        0      0 localhost.localdo:10101 localhost.localdo:46886 CLOSE_WAIT
tcp        0      0 byron.cl.cam.ac.u:10101 byron.cl.cam.ac.u:35034 CLOSE_WAIT
tcp        0      0 byron.cl.cam.ac.u:10101 byron.cl.cam.ac.u:39282 CLOSE_WAIT
tcp        0      0 byron.cl.cam.ac.u:10101 byron.cl.cam.ac.u:38548 CLOSE_WAIT
tcp        0      0 localhost.localdo:10101 localhost.localdo:36628 CLOSE_WAIT
...
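To track that number over time without eyeballing netstat, something like the following could be used. It is a rough sketch that assumes Linux and IPv4 only: it parses text in the `/proc/net/tcp` format, where ports are hexadecimal (10101 is 0x2775) and state `08` is CLOSE_WAIT. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Count sockets in CLOSE_WAIT whose local port equals `port`, reading text
// in the /proc/net/tcp format (Linux, IPv4 only).
int countCloseWait(std::istream& in, unsigned port) {
  assert(port <= 65535);
  int count = 0;
  std::string line;
  std::getline(in, line);  // skip the header row
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string slot, local, remote, state;
    if (!(fields >> slot >> local >> remote >> state)) continue;
    std::string::size_type colon = local.rfind(':');
    if (colon == std::string::npos) continue;
    // The port is the hexadecimal part after the colon, e.g. "2775".
    unsigned long localPort =
        std::strtoul(local.c_str() + colon + 1, nullptr, 16);
    if (localPort == port && state == "08") ++count;  // 08 = CLOSE_WAIT
  }
  return count;
}

// Convenience wrapper for the live table; returns -1 if it cannot be opened.
int countCloseWaitOnHost(unsigned port) {
  std::ifstream in("/proc/net/tcp");
  return in ? countCloseWait(in, port) : -1;
}
```

Calling `countCloseWaitOnHost(10101)` before and after a run would show whether each run leaves another socket behind.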

There is a possibility that I messed up something with the merge, as I was merging tinsel across quite long commit-history differences. Tomorrow I'll try to see if it is still happening if I clone the master tinsel repo (which I think is the one that I should be using for Byron, according to poets-cloud?)

I can run it on Defoe instead, but thought I'd try one of the bleeding edge boards :)

mn416 commented 5 years ago

Hi David,

Sorry, I only just saw this issue. Seems github doesn't send me email notifications for this anymore, or else gmail is just hiding it.

Byron has been my dev machine with experimental bitfiles and software for the last while so I'm not sure what state it was in when you saw this.

Port 10101 is the board control daemon (which provides remote access to the JTAG UARTs).

I haven't seen this problem myself, but I will look into the CLOSE_WAIT thing -- maybe that is indicating I am doing something dodgy somewhere.

m8pple commented 5 years ago

No worries, it was more of an FYI given it was a dev machine.

Having a few CLOSE_WAIT sockets hanging around for a bit is normal, but there seemed more than usual given all the connections are local. Could be completely benign and unrelated though.

(In reply to mn416's comment of 04/04/2019 17:47.)

mn416 commented 4 years ago

I think there was an issue with CLOSE_WAIT sockets hanging around, but it got fixed in commit e9f122d4. I forgot to link that commit to this issue.
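For readers unfamiliar with the state: a TCP socket sits in CLOSE_WAIT after the peer has closed its end and until the local process calls close() on its own descriptor. The sketch below is a generic loopback demonstration of that, assuming Linux (it uses the TCP_INFO socket option). It is not the content of commit e9f122d4 and not the board control daemon's code, and the function name is made up.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Returns true if an accepted loopback socket is observed in CLOSE_WAIT
// after its peer has closed. Illustration only.
bool peerCloseLeavesCloseWait() {
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  if (listener < 0) return false;
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;  // let the kernel pick a free port
  socklen_t addrLen = sizeof(addr);
  sockaddr* sa = reinterpret_cast<sockaddr*>(&addr);
  bool ok = bind(listener, sa, sizeof(addr)) == 0 &&
            listen(listener, 1) == 0 &&
            getsockname(listener, sa, &addrLen) == 0;  // learn the port
  int client = socket(AF_INET, SOCK_STREAM, 0);
  ok = ok && client >= 0 && connect(client, sa, sizeof(addr)) == 0;
  int conn = ok ? accept(listener, nullptr, nullptr) : -1;
  if (client >= 0) close(client);  // the peer goes away
  bool inCloseWait = false;
  if (conn >= 0) {
    char byte;
    ssize_t n = recv(conn, &byte, 1, 0);  // 0 means the peer has closed
    tcp_info info{};
    socklen_t infoLen = sizeof(info);
    if (n == 0 &&
        getsockopt(conn, IPPROTO_TCP, TCP_INFO, &info, &infoLen) == 0) {
      inCloseWait = (info.tcpi_state == TCP_CLOSE_WAIT);
    }
    close(conn);  // this close() is what takes the socket out of CLOSE_WAIT
  }
  close(listener);
  return inCloseWait;
}
```

A server that sees recv() return 0 but never closes the descriptor would accumulate entries like the ones in the netstat listing above.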