danilogio / ostinato

Automatically exported from code.google.com/p/ostinato
GNU General Public License v3.0

Byte counters (received, sent, send rate, receive rate) momentarily show crazy values #67

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. openSUSE Linux 11.4, i5 desktop
2. 10 Gb/s CNA installed (dual-port card)
3. Jumbo frames, 9000 bytes/frame (supported on the CNA)
4. Simple random-data stream, 1 packet, but 10,000,000 packets/sec selected

What is the expected output? What do you see instead?

The stream is transmitted and works, but the port counters (Bytes Received, Bytes Sent,
Byte Send Rate) momentarily show incredibly large values
(18446744073063945534 bytes received, for example) and then go back to the
normal reading.  The byte counters keep bouncing between these crazy values
(they always seem to start with 184467...) and regular values.
The "Frame" counters seem normal.

What version and revision of the product are you using (available in the
About dialog)? On what operating system?
openSUSE Linux - Ostinato 0.5 rev b781e68848cd
(latest beta build)

Please provide any additional information below.

I don't know if it's related to jumbo frames; I will need to test more.  I am
transmitting at a very high throughput and am not seeing any errors.

This is the latest beta build.

See the attached file for a screen capture with the crazy values shown.  Note that
the "Frame" counters seem normal; only the byte counters are affected.

Thanks for all your efforts - Werner.

Original issue reported on code.google.com by whoehenr...@gmail.com on 20 Mar 2012 at 8:48

Attachments:

GoogleCodeExporter commented 9 years ago
Just a couple of updates.

Jumbo frames don't matter; I got the same crazy byte stats (184467...)
with only a 1500-byte frame size.

I dropped the packet rate down to 50,000 pps and still get the error (with 1500-byte frames).

The stream is 1500 bytes long, random data payload (or all zeros), send fixed packets,
1 packet, 50,000 packets/sec, after stream: goto first.  The protocol is L1 MAC,
untagged; L2, L3, L4, and L5 are all none.

A random data payload and an all-zeros payload show the same issue.

Original comment by whoehenr...@gmail.com on 20 Mar 2012 at 9:18

GoogleCodeExporter commented 9 years ago
Is your system/OS 32-bit or 64-bit?

The latest revision that you are using has a different stats implementation for 
Linux. Can you move to rev f7e9c1843041 (old stats implementation) and check if 
you still see the problem?

I'm not able to recreate the problem on my 64-bit Ubuntu 11.04. Can you save and
attach your stream? How frequently do you see these large values?

I suspect some counter rollover issue or similar ...

Original comment by pstav...@gmail.com on 21 Mar 2012 at 2:07

GoogleCodeExporter commented 9 years ago
The system is 64-bit openSUSE 11.4 running on a Dell i5 desktop with 4 GB of memory.

I'll try the previous rev when I get the chance. 

The crazy byte stats occur frequently, every couple of seconds.  They appear to
happen even more when I'm pumping more data through (higher line rate).

I have another system (64-bit openSUSE with the earlier stats code, the first rev you
had with error counts) and haven't seen the issue there. It's a much slower
system, though, so I want to try the old code on the same machine that has the issue.

Thanks

Original comment by whoehenr...@gmail.com on 21 Mar 2012 at 3:02

GoogleCodeExporter commented 9 years ago
Ok - I downgraded to the f7e9c1843041 version and don't see this issue any more
(it's been running for several minutes, and I usually got the "crazy" stats
within a few seconds).  So it looks like an issue with the updated stats method
you mentioned.

I've also attached the streams I use (one for each of my 10GE ports).

Thanks again.

-Werner

Original comment by whoehenr...@gmail.com on 21 Mar 2012 at 10:13

Attachments:

GoogleCodeExporter commented 9 years ago
Just one other bit of data.  I've duplicated the crazy byte stats on two 
different machines (actually Dell i3 systems, not i5).  Reverting to the 
f7e9c1843041 version corrected the stats both times.

I did notice one side effect: the older stats code seems to take up some processing
power, as my line rate dropped a bit (using identical streams).  With the
latest revision code (which had the periodic byte count errors), I averaged a
line rate of 995 MB/s out of my ports.  With stats changed to the older version, my
line rate drops to around 940 MB/s.  So I can see why you changed your stats
counting method (it's easier on resources and you can pump out more data).

Let me know if you need more info.

Thanks,
 Werner

Original comment by whoehenr...@gmail.com on 23 Mar 2012 at 5:20

GoogleCodeExporter commented 9 years ago
Yes, the stats method was changed to improve performance, and it also has the
benefit of providing link state (up/down) and, in the future, more detailed error
stats and 64-bit stats.

Since I haven't yet been able to recreate the problem, can you add some debug 
prints and check?

Look at linuxport.cpp, line 637
stats->rxBytes = rtnlStats->rx_bytes;

This is where the stats retrieved from the kernel (rtnlStats) are assigned into
the application's data struct (stats). Try adding a print like this just after
that line:

if (stats->rxBytes > 0xffffffff)
{
    qDebug("Crazy Stats: %llu", stats->rxBytes);
    qDebug("Kernel Stat: %lu", rtnlStats->rx_bytes);
}

The print should be added within the if condition because otherwise there will
be a print every second, the rate will slow down drastically, and we may
not see the problem. You can play around with the value in the if condition.

Sorry, I don't have access to my development machine right now, otherwise I 
would have given you a proper patch to try out.

Original comment by pstav...@gmail.com on 23 Mar 2012 at 6:53

GoogleCodeExporter commented 9 years ago
Happy to help out.

The "crazy stats" seem to occur the most on the rx Bps counters - so I added 
qDebug statements for that too.

if (!stats)
    break;

stats->rxPps  = (rtnlStats->rx_packets - stats->rxPkts)
                    / kRefreshFreq_;
stats->rxBps  = (rtnlStats->rx_bytes - stats->rxBytes)
                    / kRefreshFreq_;

// debug prints added for the rate counter
if (stats->rxBps > 0xfffffffffff)
{
    qDebug("Crazy Stats rxBps: %llu", stats->rxBps);
    qDebug("Kernel rtnlStat rx_bytes: %lu", rtnlStats->rx_bytes);
    qDebug("stats-rxBytes: %lu", stats->rxBytes);
    qDebug("refresh: %lu", kRefreshFreq_);
}

stats->rxPkts  = rtnlStats->rx_packets;
stats->rxBytes = rtnlStats->rx_bytes;

// debug prints added for the cumulative byte counter
if (stats->rxBytes > 0xfffffffffff)
{
    qDebug("Crazy Stats rxBytes: %llu", stats->rxBytes);
    qDebug("Kernel Stat: %lu", rtnlStats->rx_bytes);
}

Here is a sample of the qDebug output below.  It looks like, from time to
time, the updated kernel rx_bytes value is smaller than the previous
value.  Since you don't expect a negative number, I assume that's what's
causing the trouble.

Let me know what else I can do to help debug.

Regards,
 Werner

Crazy Stats rxBps: 18446744070389094012
Kernel rtnlStat rx_bytes: 603730944
stats-rxBytes: 3924188548
refresh: 1
Crazy Stats rxBps: 18446744070382077132
Kernel rtnlStat rx_bytes: 629050613
stats-rxBytes: 3956525097
refresh: 1
Crazy Stats rxBps: 18446744070389318912
Kernel rtnlStat rx_bytes: 204121708
stats-rxBytes: 3524354412
refresh: 1
Crazy Stats rxBps: 18446744070382329020
Kernel rtnlStat rx_bytes: 202804221
stats-rxBytes: 3530026817
refresh: 1
Crazy Stats rxBps: 18446744070389372888
Kernel rtnlStat rx_bytes: 782242632
stats-rxBytes: 4102421360
refresh: 1
Crazy Stats rxBps: 18446744070382194080
Kernel rtnlStat rx_bytes: 745624841
stats-rxBytes: 4072982377
refresh: 1
Crazy Stats rxBps: 18446744070389372888
Kernel rtnlStat rx_bytes: 385170168
stats-rxBytes: 3705348896
refresh: 1
Crazy Stats rxBps: 18446744070382832796
Kernel rtnlStat rx_bytes: 321672329
stats-rxBytes: 3648391149
refresh: 1
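
(For reference, the first "Crazy Stats" line above is exactly what an unsigned
underflow would produce: 603730944 - 3924188548, computed in 64-bit unsigned
arithmetic, wraps to 2^64 - 3320457604 = 18446744070389094012, the value printed.)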

Original comment by whoehenr...@gmail.com on 23 Mar 2012 at 11:16

GoogleCodeExporter commented 9 years ago
@werner: Thanks for the debug output! The rx/tx Bps (and even the pps) will, as
expected, show a large value when the 32-bit kernel stat counter rolls over, which in
your case will be often since you are pumping large packets at a high rate.
I'll fix the rate calculation for the rollover condition; a sketch of one possible approach is below.
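
For illustration only, here is a minimal, self-contained sketch (not the actual
Ostinato code) of a rate calculation that tolerates a single wrap of a 32-bit
kernel counter. The diff32 helper name is made up, and the sample values are
taken from the debug output above.

// Sketch only: rollover-tolerant rate calculation, assuming the kernel
// counter is 32 bits wide even though we keep it in a 64-bit field.
#include <cstdint>
#include <cstdio>

static uint64_t diff32(uint64_t newVal, uint64_t oldVal)
{
    // If the new sample is smaller, assume the 32-bit counter wrapped once.
    return (newVal >= oldVal) ? newVal - oldVal
                              : newVal + (uint64_t(1) << 32) - oldVal;
}

int main()
{
    const uint64_t kRefreshFreq = 1;       // polling interval in seconds
    uint64_t prevRxBytes = 4102421360ULL;  // previous sample, near the 32-bit max
    uint64_t currRxBytes = 782242632ULL;   // new sample, taken after the counter wrapped

    uint64_t rxBps = diff32(currRxBytes, prevRxBytes) / kRefreshFreq;
    // Prints about 975 million bytes/sec instead of an 18446744... value.
    std::printf("rxBps = %llu\n", (unsigned long long)rxBps);
    return 0;
}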

But I still don't know why rxBytes and txBytes have the extra-large values.
Remove the Bps prints, retain the rxBytes prints, and send me the output for
that. You can reduce the if condition value to 0xffffffff (the max 32-bit value).

Thanks once again for your help in debugging this.

Original comment by pstav...@gmail.com on 24 Mar 2012 at 6:26

GoogleCodeExporter commented 9 years ago
Ok - well, this is interesting.
I removed the Bps prints and retained the rxBytes prints (reducing the if
condition to 0xffffffff).  After the rebuild, the system (with my current very
high traffic rate) frequently has crazy values on all the byte counters (Bytes
Received, Bytes Sent, and Bps), but there are NO qDebug messages.  I double-checked
the debug statement, and everything looks OK.

I'm running the streams with an Avg bps of 10,300,000,000 to maximize data.

Any thoughts?  The Bytes Sent and Bytes Received counters are almost constantly at
the crazy values (the lower digits change every second, but they stay at the
184467440xxxxxxxxxxx size).

Original comment by whoehenr...@gmail.com on 26 Mar 2012 at 4:19

GoogleCodeExporter commented 9 years ago
Are you clearing the stats counters while traffic is running? If so, the problem
is possibly in server/abstractport.cpp, in the AbstractPort::Stats() method, where
stats_.XXXX is less than epochStats_.XXXX. Whenever you clear the counters, the
current counters are saved as epochStats_.
current counters are saved as epochStats_. 

As a test, change the following line in this function:
stats->rxBytes = stats_.rxBytes - epochStats_.rxBytes;
to
stats->rxBytes = (stats_.rxBytes > epochStats_.rxBytes) ? 
                     stats_.rxBytes - epochStats_.rxBytes : 
                     stats_.rxBytes + (0xFFFFFFFFFFFFFFFF - epochStats_.rxBytes);

Be careful with that all-Fs magic value; it's the maximum possible 64-bit value (16
Fs).

Original comment by pstav...@gmail.com on 26 Mar 2012 at 7:30

GoogleCodeExporter commented 9 years ago
I think you have the right idea about the problem.  I didn't get a chance to
incorporate any code changes, but I tried the following.
I brought up a fresh run of Ostinato and didn't clear any of the counters.  It
worked fine: crazy Bps rate counters, but Bytes Received/Sent stayed
reasonable.  It ran this way for several minutes.  Then I cleared the counters
while traffic was running.  Immediately, the Bytes Received and Sent counters
went to crazy 184467440xxxxx... values.

One question: how big can the Bytes Received / Sent counters get in the
current implementation?  Since I'm running at 10 Gbps, I pump a lot of data, and
sent/received grow quickly.  I just wondered whether I should expect a
rollover at some point.

Thanks,
 Werner

Original comment by whoehenr...@gmail.com on 26 Mar 2012 at 10:25

GoogleCodeExporter commented 9 years ago
@werner: the current stats (and the previous implementation also) are
32-bit, so if you are sending large packets at a high rate, the rollover
will happen soon. With the new stats implementation it is possible to support
64-bit stats if the kernel supports the same; please file a new bug requesting
that so I can track the enhancement.

Let me know once you get a chance to try the code change I suggested.

Original comment by pstav...@gmail.com on 29 Mar 2012 at 3:56

GoogleCodeExporter commented 9 years ago
Ok - well, I tried the code change in abstractport.cpp (see the stats capture for the
change), but the RX bytes counter still went wacky right after I cleared the stats
(see the other screen capture).

Let me know what else you'd like to try.

Thanks,
 Werner

Original comment by whoehenr...@gmail.com on 2 Apr 2012 at 6:22

Attachments:

GoogleCodeExporter commented 9 years ago
@werner: my mistake. Use 0xFFFFFFFF instead of 0xFFFFFFFFFFFFFFFF. Although the
variable is 64-bit, the actual stat is only 32-bit, so we need to use the max
32-bit value.

Original comment by pstav...@gmail.com on 3 Apr 2012 at 1:58

GoogleCodeExporter commented 9 years ago
Ok - we are heading in the right direction, but with a minor side effect.
Using the 0xFFFFFFFF value corrected the bytes_rx stat counter, so it
never "went crazy".  I also made the same type of change to the bytes_tx stat
counter (see changes below).  The values do frequently roll over (I can see
them increment and then start over again as they presumably reach the 32-bit
maximum).

Now the side effect: when you clear the stats (Clear All button on the Statistics
screen), Bytes Received and Bytes Sent don't go to zero.  Rather, they go
to 4294967295 (see the other capture below).  An annoying side effect (although I
figure you were just experimenting with this code change anyway).

Thanks,
 Werner

Original comment by whoehenr...@gmail.com on 3 Apr 2012 at 4:21

Attachments:

GoogleCodeExporter commented 9 years ago
@werner: change the condition 
from 
(stats_.rxBytes > epochStats_.rxBytes)
to
(stats_.rxBytes >= epochStats_.rxBytes)
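
To see why >= matters, here is a tiny illustrative sketch (assumed helper and
values, not the Ostinato source). Clearing the stats saves the current counters
as epochStats_, so at that moment stats_ equals epochStats_; with > the equal
case falls into the wrap branch and yields 0xFFFFFFFF (the 4294967295 seen
after a clear), while >= correctly yields 0.

// Illustrative only: why the display showed 4294967295 right after a clear.
#include <cstdint>
#include <cstdio>

static uint64_t epochDiff(uint64_t cur, uint64_t epoch, bool useGreaterEqual)
{
    bool noWrap = useGreaterEqual ? (cur >= epoch) : (cur > epoch);
    return noWrap ? cur - epoch : cur + (0xFFFFFFFFULL - epoch);
}

int main()
{
    // Right after "clear stats" the epoch snapshot equals the live counter.
    uint64_t cur = 3924188548ULL, epoch = 3924188548ULL;

    std::printf("with > : %llu\n", (unsigned long long)epochDiff(cur, epoch, false)); // 4294967295
    std::printf("with >=: %llu\n", (unsigned long long)epochDiff(cur, epoch, true));  // 0
    return 0;
}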

Original comment by pstav...@gmail.com on 3 Apr 2012 at 5:44

GoogleCodeExporter commented 9 years ago
Good call - that change (> to >=) fixes the "clear stats" issue.  They
all clear to zero now (I made the change for both rxBytes and txBytes).

Thanks,
Werner

Original comment by whoehenr...@gmail.com on 3 Apr 2012 at 6:32

GoogleCodeExporter commented 9 years ago
@werner: Thanks for all the help with debugging and testing. I will fix this 
shortly.

Notes based on the debugging/testing -

The stats should take care of counter rollover for:
1. Diff from epoch
2. Rx/Tx pps/Bps rates (for each platform)
3. Rollover calculation to take care of 32-bit/64-bit stats

Original comment by pstav...@gmail.com on 3 Apr 2012 at 6:39

GoogleCodeExporter commented 9 years ago
This issue was closed by revision 12c46735e7b3.

Original comment by pstav...@gmail.com on 14 Apr 2012 at 6:07

GoogleCodeExporter commented 9 years ago
@werner: I've committed the changes to fix this. Can you please verify and let 
me know that you don't see the problem anymore?

Original comment by pstav...@gmail.com on 15 Apr 2012 at 3:06

GoogleCodeExporter commented 9 years ago
Yes - it looks like you fixed the problem.  I used the latest 1355baddf744 build
and ran for some time without any "crazy" values.  I also cleared the
statistics several times.

I do still see the Bytes Received and Sent values reset (because they are 32-bit
counters and get overrun so quickly with the traffic I'm pumping), but that's
another issue I already filed.

Thanks again,
 Werner

Original comment by whoehenr...@gmail.com on 17 Apr 2012 at 9:20

GoogleCodeExporter commented 9 years ago
@werner: Thanks for the confirmation that the fix works. I will implement the
64-bit stats when I get some time.

Original comment by pstav...@gmail.com on 18 Apr 2012 at 4:42