bo-yang / plan9front

Automatically exported from code.google.com/p/plan9front

x61 ethernet stops working after some days #104

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Preliminary report; I don't have properly-collected data on this. This is me 
figuring out what data to collect.

A week or two ago my 9front box's ethernet stopped responding. Existing cpu 
connections stopped passing any data and timed out. Pings to and from the box 
got no replies (I forget the exact error), and no new connections could be 
made.

Today it happened again. I was using a telnet connection and there was some 
warning in the form of considerable delays in passing lines of text before it 
stopped working altogether.

As a very rough estimate, about the same amount of data passed between boot and 
failure both times.

Original issue reported on code.google.com by tereniao...@gmail.com on 30 Dec 2011 at 12:00

GoogleCodeExporter commented 9 years ago
First question:
What driver exactly?

Ideas on reproducing
********************

12:03:53 < cinap_lenrek> was there high traffic when it happens? or was the 
card idle for a long time?
12:05:31 < EthanG_> not idle either time
12:06:01 < EthanG_> high-ish traffic the second time; there was a vnc 
connection in use shortly before it happened
12:06:40 < EthanG_> small vnc screen (600x600) playing a game that redrew its 
whole window every time the mouse went down or up
12:07:03 < cinap_lenrek> try to stress it
12:07:21 < EthanG_> aye
12:07:42 < cinap_lenrek> you can also read /mnt/term/dev/zero over cpu 
connection or something like that
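
The stress test suggested above (reading an endless file across the cpu connection) could be sketched roughly as below. This is an illustration only, not something from the chat: the helper name `drain`, its parameters, and the idea of timing each read are assumptions. Timing the slowest read is included because the earlier failure was preceded by long delays.

```python
import time


def drain(stream, limit, chunk_size=8192):
    """Read up to `limit` bytes from a binary stream (hypothetical helper).

    Returns (bytes_read, slowest_read_seconds) so that a stalling link
    shows up as one unusually slow read.
    """
    total = 0
    slowest = 0.0
    while total < limit:
        start = time.monotonic()
        data = stream.read(min(chunk_size, limit - total))
        elapsed = time.monotonic() - start
        if not data:  # end of stream reached before the limit
            break
        total += len(data)
        slowest = max(slowest, elapsed)
    return total, slowest
```

To try it, open the file named in the chat (or any large file reached over the network) in binary mode and pass it to `drain` with a byte limit large enough to keep the link busy for a while.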

Driver modding
**************

11:57:50 < cinap_lenrek> igbeinterrupt() doesnt print anything
11:58:40 < EthanG_> should i put something in there or would it go off with 
every packet?
11:59:35 < cinap_lenrek> EthanG_: first, get the spec, then check what the bits 
in the interrupt status register mean
12:02:06 < cinap_lenrek> also, find/google/bribe/steal the hardware spec

When it happens
***************

11:54:16 < cinap_lenrek> run snoopy
11:54:26 < cinap_lenrek> and wireshark or whatever on another machine
11:54:32 < cinap_lenrek> then ping around
11:55:05 < cinap_lenrek> this way, you can figure out if it still works in some 
direction
11:55:24 < cinap_lenrek> maybe it just fails to receive packets, but is still 
able to send them
11:55:37 < cinap_lenrek> sometimes it can receive, but sending packets is fucked

11:56:23 < cinap_lenrek> check for any messages on the console
11:56:33 < EthanG_> aye
11:56:37 < cinap_lenrek> maybe it did a print when it hit some error condition
11:56:59 < EthanG_> I don't think it did last time.

12:00:19 < cinap_lenrek> EthanG_: even without modifying the code, you can cat 
the status files of the ethernet device and check if interrupt counters still 
increase when sending/receiving packets

12:00:50 < cinap_lenrek> EthanG_: and do that basic snoopy/tcpdump check
12:01:10 < cinap_lenrek> that should get us some better symptoms than "it stops 
working randomly"
12:01:24 < EthanG_> yeah
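
The counter check described above (cat the status file twice and see which numbers still move) could be automated along these lines. This is a sketch under assumptions: it treats the status text as lines of the form `name: number`, which may not match what a given driver actually prints, and the function names `parse_counters` and `changed` are made up for illustration.

```python
import re


def parse_counters(text):
    """Map each 'name: number' line to its first integer value.

    Lines without a colon or without digits after it are skipped.
    The 'name: number' layout is an assumption about the status file.
    """
    counters = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        match = re.search(r"\d+", rest)
        if sep and match:
            counters[name.strip()] = int(match.group())
    return counters


def changed(before, after):
    """Return {name: delta} for counters present in both snapshots
    whose value differs between them."""
    old, new = parse_counters(before), parse_counters(after)
    return {k: new[k] - old[k] for k in new if k in old and new[k] != old[k]}
```

Feeding it two snapshots taken a few seconds apart while pinging shows at a glance whether receive, transmit, or interrupt counters have stopped moving. An empty result means nothing changed at all.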

Possible fixes
**************

12:02:46 < cinap_lenrek> if we're unable to fix it, we might just reset the 
card if it happens
12:02:51 < cinap_lenrek> that often gets stuff working again

Stray thoughts
**************

12:08:22 < cinap_lenrek> maybe its not even the network card
12:08:26 < EthanG_> aye, aye
12:08:31 < cinap_lenrek> but some other shit is locked up in the ipstack
12:08:50 < EthanG_> yeah could be
12:09:20 < cinap_lenrek> maybe you can add a 2nd network card?
12:09:39 < EthanG_> It would have to be usb
12:10:05 < cinap_lenrek> fun :)
12:10:14 < EthanG_> no thanks :)

Original comment by tereniao...@gmail.com on 30 Dec 2011 at 1:43

GoogleCodeExporter commented 9 years ago
Happened again. Always happens when I want to relax.

Telnet received a line soon after the failure, but not 7-8 minutes later. It 
was failing to send before the one received line came through.

From /net/ether0/ifstats, good packets received increased by 9 between the 
first 2 cats after failure, as did broadcast packets received. 1 or 2 telnet 
packets were expected in this same timeframe; since both counters rose by the 
same amount, I guess they didn't arrive.

Possibly related to the ethernet lead slipping out of its socket. It always 
reconnects when the lead is pushed back in, but both this time and last time 
the problem occurred shortly after reconnecting.

Original comment by tereniao...@gmail.com on 31 Dec 2011 at 10:03

GoogleCodeExporter commented 9 years ago
Note, completely disconnecting and reconnecting the ethernet lead does not fix 
this.

More next time it happens.

I was wrong about it happening after a certain amount of data or time.

Original comment by tereniao...@gmail.com on 31 Dec 2011 at 10:06

GoogleCodeExporter commented 9 years ago
Closing this. It's either been fixed en passant or was hardware trouble which 
I'm no longer triggering (loose socket).

Original comment by tereniao...@gmail.com on 17 Oct 2012 at 10:03