Just a couple updates.
Jumbo frames don't matter - I got the same crazy byte stats (184467...) with only a 1500-byte frame size.
I dropped the packet rate down to 50,000 pps and still get the error (with 1500-byte frames).
The stream is 1500 bytes long, random data payload (or all zeros), send fixed packets,
1 packet, 50,000 packets/sec, after stream - goto first. Protocol is L1 Mac,
untagged; L2, L3, L4, L5 all none.
Random data and all-zeros payloads give the same issue.
Original comment by whoehenr...@gmail.com
on 20 Mar 2012 at 9:18
Is your system/OS 32-bit or 64-bit?
The latest revision that you are using has a different stats implementation for
Linux. Can you move to rev f7e9c1843041 (old stats implementation) and check if
you still see the problem?
I'm not able to recreate the problem on my 64-bit Ubuntu 11.04. Can you save and
attach your stream? How frequently do you see these large values?
I suspect some counter rollover issue or similar ...
Original comment by pstav...@gmail.com
on 21 Mar 2012 at 2:07
The system is 64-bit openSUSE 11.4 running on a Dell i5 desktop with 4 GB of memory.
I'll try the previous rev when I get the chance.
The crazy byte stats occur frequently, like every couple of seconds. It appears to
happen even more when I'm pumping more data through (higher line rate).
I have another system (64-bit openSUSE with the earlier stats code, the first rev you
had with error counts) and haven't seen the issue there. It's a much slower
system though, so I want to try the old code on the same machine that has the issue.
Thanks
Original comment by whoehenr...@gmail.com
on 21 Mar 2012 at 3:02
Ok - I downgraded to the f7e9c1843041 version and don't see this issue any more
(it's been running for several minutes, and I usually got the "crazy" stats
within a few seconds). So - it looks like an issue with the updated stats method
you mentioned.
I've also attached the streams I use (one for each of my 10GE ports).
Thanks again.
-Werner
Original comment by whoehenr...@gmail.com
on 21 Mar 2012 at 10:13
Attachments:
Just one other bit of data. I've duplicated the crazy byte stats on two
different machines (actually Dell i3 systems, not i5). Reverting to the
f7e9c1843041 version corrected the stats both times.
I did notice one side effect: the older stats code seems to take up some processing
power, as my line rate dropped a bit (using identical streams). With the
latest revision code (which had the periodic byte count errors), I averaged a
line rate of 995 MB/s out of my ports. With the older stats version, my
line rate drops to around 940 MB/s. So - I can see why you changed your stats
counting method (it's easier on resources and you can pump out more data).
Let me know if you need more info.
Thanks,
Werner
Original comment by whoehenr...@gmail.com
on 23 Mar 2012 at 5:20
Yes, the stats method was changed to improve performance; it also has the
benefit of providing link state (up/down) and, in the future, more detailed error
stats and 64-bit stats.
Since I haven't yet been able to recreate the problem, can you add some debug
prints and check?
Look at linuxport.cpp, line 637
stats->rxBytes = rtnlStats->rx_bytes;
This is where the stats retrieved from the kernel (rtnlStats) are assigned into
the application's data struct (stats). Try adding a print like this just after
this line -
if (stats->rxBytes > 0xffffffff)
{
    qDebug("Crazy Stats: %llu", stats->rxBytes);
    qDebug("Kernel Stat: %lu", rtnlStats->rx_bytes);
}
The print should be added within the if condition because otherwise there will
be a print every second, the rate will slow down drastically, and we may
not see the problem. You can play around with the value in the if condition.
Sorry, I don't have access to my development machine right now, otherwise I
would have given you a proper patch to try out.
Original comment by pstav...@gmail.com
on 23 Mar 2012 at 6:53
Happy to help out.
The "crazy stats" seem to occur most on the rx Bps counters, so I added
qDebug statements for those too.
if (!stats)
    break;

stats->rxPps = (rtnlStats->rx_packets - stats->rxPkts)
                / kRefreshFreq_;
stats->rxBps = (rtnlStats->rx_bytes - stats->rxBytes)
                / kRefreshFreq_;
if (stats->rxBps > 0xfffffffffff)
{
    qDebug("Crazy Stats rxBps: %llu", stats->rxBps);
    qDebug("Kernel rtnlStat rx_bytes: %lu", rtnlStats->rx_bytes);
    qDebug("stats-rxBytes: %lu", stats->rxBytes);
    qDebug("refresh: %lu", kRefreshFreq_);
}

stats->rxPkts = rtnlStats->rx_packets;
stats->rxBytes = rtnlStats->rx_bytes;
if (stats->rxBytes > 0xfffffffffff)
{
    qDebug("Crazy Stats rxBytes: %llu", stats->rxBytes);
    qDebug("Kernel Stat: %lu", rtnlStats->rx_bytes);
}
Here is a sample of the qDebug output (below). It looks like, from time to time,
the updated kernel rx_bytes value is smaller than the previous value. Since you
don't expect a negative number, I assume that's what's causing the trouble.
Let me know what else I can do to help debug.
Regards,
Werner
Crazy Stats rxBps: 18446744070389094012
Kernel rtnlStat rx_bytes: 603730944
stats-rxBytes: 3924188548
refresh: 1
Crazy Stats rxBps: 18446744070382077132
Kernel rtnlStat rx_bytes: 629050613
stats-rxBytes: 3956525097
refresh: 1
Crazy Stats rxBps: 18446744070389318912
Kernel rtnlStat rx_bytes: 204121708
stats-rxBytes: 3524354412
refresh: 1
Crazy Stats rxBps: 18446744070382329020
Kernel rtnlStat rx_bytes: 202804221
stats-rxBytes: 3530026817
refresh: 1
Crazy Stats rxBps: 18446744070389372888
Kernel rtnlStat rx_bytes: 782242632
stats-rxBytes: 4102421360
refresh: 1
Crazy Stats rxBps: 18446744070382194080
Kernel rtnlStat rx_bytes: 745624841
stats-rxBytes: 4072982377
refresh: 1
Crazy Stats rxBps: 18446744070389372888
Kernel rtnlStat rx_bytes: 385170168
stats-rxBytes: 3705348896
refresh: 1
Crazy Stats rxBps: 18446744070382832796
Kernel rtnlStat rx_bytes: 321672329
stats-rxBytes: 3648391149
refresh: 1
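Doing the arithmetic on the first sample: 3924188548 - 603730944 = 3320457604, and
2^64 - 3320457604 = 18446744070389094012, which matches the printed rxBps exactly -
so it does look like the unsigned wrap of a negative difference.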
Original comment by whoehenr...@gmail.com
on 23 Mar 2012 at 11:16
@werner: Thanks for the debug! The rx/tx Bps (and even the pps) will expectedly
have a large value when the 32-bit kernel stat counter rolls over - which in
your case will happen often since you are pumping large packets at a high rate.
I'll fix the rate calculation for the rollover condition.
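For illustration, one way the per-second rate could be made rollover-safe is to do
the subtraction in 32-bit unsigned arithmetic, so the wrap cancels out. This is only
a sketch with made-up names (rxRate, kernelBytes, prevBytes), not the actual fix:

// Sketch only: rollover-safe per-second Bps when the kernel exposes a
// 32-bit byte counter. Names are illustrative, not the Ostinato code.
#include <cstdint>
#include <cstdio>

uint64_t rxRate(uint64_t kernelBytes, uint64_t prevBytes, uint64_t refreshSecs)
{
    // Truncating the difference to 32 bits makes the subtraction modular,
    // so a wrapped counter (kernelBytes < prevBytes) still yields the small
    // positive delta instead of a huge unsigned value.
    uint32_t delta = (uint32_t)(kernelBytes - prevBytes);
    return delta / refreshSecs;
}

int main()
{
    // Values from the first debug sample above: previous 3924188548,
    // new kernel value 603730944, 1-second refresh.
    printf("%llu bytes/sec\n",
           (unsigned long long) rxRate(603730944ULL, 3924188548ULL, 1));
    return 0;
}

With the sample values from the debug output, this comes out to about 974,509,692
bytes/sec, which is consistent with the roughly 1 GB/s line rate reported earlier.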
But I still don't know why the rxBytes and txBytes have the extra-large values.
Remove the Bps prints, retain the rxBytes prints, and send me the output for
that. You can reduce the if condition value to 0xffffffff (the max 32-bit value).
Thanks once again for your help in debugging this
Original comment by pstav...@gmail.com
on 24 Mar 2012 at 6:26
Ok - well this is interesting.
I removed the Bps prints and retained the rxBytes prints (reducing the if
condition to 0xffffffff). After the rebuild, the system (with my current very
high traffic rate) frequently has crazy values on all the byte counters (Bytes
received, Bytes sent, and Bps) - but there are NO qDebug messages. I double-checked
the debug statement, and everything looks ok.
I'm running the streams with an average bps of 10,300,000,000 to maximize data.
Any thoughts? The bytes sent and received counters are almost constantly at
the crazy values (the lower digits change every second, but stay at the
184467440xxxxxxxxxxx size).
Original comment by whoehenr...@gmail.com
on 26 Mar 2012 at 4:19
Are you clearing the stats counters while traffic is running? If so, the problem
is possibly in server/abstractport.cpp, in the AbstractPort::Stats() method, where
stats_.XXXX is less than epochStats_.XXXX. Whenever you clear the counters, the
current counters are saved as epochStats_.
As a test, change the following line in this function
stats->rxBytes = stats_.rxBytes - epochStats_.rxBytes;
to
stats->rxBytes = (stats_.rxBytes > epochStats_.rxBytes) ?
    stats_.rxBytes - epochStats_.rxBytes :
    stats_.rxBytes + (0xFFFFFFFFFFFFFFFF - epochStats_.rxBytes);
Be careful with that all-Fs magic value - it's the maximum possible 64-bit value (16 Fs).
Original comment by pstav...@gmail.com
on 26 Mar 2012 at 7:30
I think you have the right idea on the problem. I didn't get a chance to
incorporate any code changes, but I tried the following.
I brought up a fresh run of Ostinato and didn't clear any of the counters. It
worked fine - crazy Bps rate counters, but bytes received/sent stayed
reasonable. It ran this way for several minutes. Then I cleared the counters
while traffic was running. Immediately, the bytes received and sent counters
went to crazy 184467440xxxxx... values.
One question: how big can the bytes received/sent counters get in the
current implementation? Since I'm running at 10 Gbps, I pump a lot of data and
sent/received grow quickly. I just wondered if I need to expect a rollover at
some point?
Thanks,
Werner
Original comment by whoehenr...@gmail.com
on 26 Mar 2012 at 10:25
@werner: The current stats (in the previous implementation as well) are 32-bit,
so if you are sending large packets at a high rate, the rollover will happen soon.
With the new stats implementation, it is possible to support 64-bit stats if the
kernel supports them - please file a new bug requesting this so that I can track
the enhancement.
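For a rough sense of scale: a 32-bit byte counter tops out at 4,294,967,296 bytes,
and 10 Gbps is about 1.25 GB/s, so at your rates the counter wraps roughly every
3.4 seconds.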
Let me know once you get a chance to try the code change I suggested.
Original comment by pstav...@gmail.com
on 29 Mar 2012 at 3:56
Ok - well, I tried the code change in abstractport.cpp (see the stats capture for
the change), but the RX bytes counter still went wacky right after I cleared the
stats (see the other screen capture).
Let me know what else you'd like to try.
Thanks,
Werner
Original comment by whoehenr...@gmail.com
on 2 Apr 2012 at 6:22
Attachments:
@werner: My mistake - use 0xFFFFFFFF instead of 0xFFFFFFFFFFFFFFFF. Although the
variable is 64-bit, the actual stat is only 32-bit, so we need to use the max
32-bit value.
Original comment by pstav...@gmail.com
on 3 Apr 2012 at 1:58
Ok - we are heading in the right direction, but with a minor side effect.
Using the 0xFFFFFFFF value corrected the bytes_rx stat counter so it never
"went crazy". I also made the same type of change to the bytes_tx stat counter
(see the changes below). The values do frequently roll over (I can see them
increment and then start over again as, I assume, they reach the 32-bit maximum).
Now the side effect: when you clear the stats (the Clear All button on the
Statistics screen), the Bytes Received and Bytes Sent don't go to zero. Rather,
they go to 4294967295 (see the other capture below). An annoying side effect
(although I figure you were just experimenting with this code change anyway).
Thanks,
Werner
Original comment by whoehenr...@gmail.com
on 3 Apr 2012 at 4:21
Attachments:
@werner: change the condition
from
(stats_.rxBytes > epochStats_.rxBytes)
to
(stats_.rxBytes >= epochStats_.rxBytes)
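With the strict >, the branch taken right after a clear (when stats_.rxBytes equals
epochStats_.rxBytes) returns stats_.rxBytes + (0xFFFFFFFF - epochStats_.rxBytes) =
0xFFFFFFFF = 4294967295, which is exactly the value you saw. Putting the pieces
together, a self-contained sketch of a rollover-safe epoch diff could look like the
following; the names (bytesSinceClear, current, epoch) are illustrative and this is
not the committed Ostinato patch. It also adds a +1 for the wrap step from
0xFFFFFFFF back to 0, which the quick test change above leaves out:

// Sketch only: "bytes since last clear" for a 32-bit kernel counter kept
// in a 64-bit variable. Names are illustrative, not the actual code.
#include <cstdint>
#include <cstdio>

// current: latest counter value from the kernel (wraps at 2^32)
// epoch:   counter value saved when the stats were last cleared
uint64_t bytesSinceClear(uint64_t current, uint64_t epoch)
{
    if (current >= epoch)
        return current - epoch;
    // The counter wrapped past 2^32 since the clear; +1 accounts for the
    // step from 0xFFFFFFFF back to 0.
    return current + (0xFFFFFFFFULL - epoch) + 1;
}

int main()
{
    // Right after a clear (current == epoch) the result is 0, not 4294967295.
    printf("%llu\n", (unsigned long long) bytesSinceClear(1000, 1000));
    // After a wrap: the epoch was near the 32-bit max, counter restarted from 0.
    printf("%llu\n", (unsigned long long) bytesSinceClear(500, 0xFFFFFF00ULL));
    return 0;
}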
Original comment by pstav...@gmail.com
on 3 Apr 2012 at 5:44
Good call - that change (> to >=) fixes the "clear stats" issue. The counters
all clear to zero now (I made the change for both rxBytes and txBytes).
Thanks,
Werner
Original comment by whoehenr...@gmail.com
on 3 Apr 2012 at 6:32
@werner: Thanks for all the help with debugging and testing. I will fix this
shortly.
Notes based on the debugging/testing -
Stats should take care of counter rollover for -
1. Diff from epoch
2. Rx/Tx pps/Bps rates (for each platform)
3. Rollover calculation to account for 32-bit/64-bit stats
Original comment by pstav...@gmail.com
on 3 Apr 2012 at 6:39
This issue was closed by revision 12c46735e7b3.
Original comment by pstav...@gmail.com
on 14 Apr 2012 at 6:07
@werner: I've committed the changes to fix this. Can you please verify and let
me know that you don't see the problem anymore?
Original comment by pstav...@gmail.com
on 15 Apr 2012 at 3:06
Yes - it looks like you fixed the problem. I used the latest 1355baddf744 build
and ran for some time without any "crazy" values. I also cleared the statistics
several times.
I do still see the bytes received and sent values reset (because they are 32-bit
counters and get overrun so quickly with the traffic I'm pumping), but that's
another issue I already filed.
Thanks again,
Werner
Original comment by whoehenr...@gmail.com
on 17 Apr 2012 at 9:20
@werner: Thanks for the confirmation that the fix works. I'll implement the
64-bit stats when I get some time.
Original comment by pstav...@gmail.com
on 18 Apr 2012 at 4:42
Original issue reported on code.google.com by
whoehenr...@gmail.com
on 20 Mar 2012 at 8:48
Attachments: