cisco-system-traffic-generator / trex-core

trex-core site
https://trex-tgn.cisco.com/

TRex incorrectly treats HTTP pipelining pkt flow (TCP ACKs) as `rtt` delay value instead of `ipg` #156

Open mcallaghan-sandvine opened 5 years ago

mcallaghan-sandvine commented 5 years ago

As per deep investigations into #143 and #146, we closely examined frame.time_delta values for a long Netflix flow.

Consider long_flow.yaml:

    - duration : 9999
      generator :
        distribution : "seq"
        clients_start : "4.0.0.1"
        clients_end : "4.0.0.1"
        servers_start : "5.0.0.1"
        servers_end : "5.0.20.255"
      cap_ipg : false        <-- do NOT use pcap timings, it is not sanitized
      cap_info :
        - name: LAB-5928_longflow/netflix_480p-wide_avc_Thor_noGRO_B_45secs_21k_pkts.pcap
          w : 1
          cps : 0.001        <-- for testing purposes, only send 1 flow and halt
          ipg : 10           <-- shoot for a "small as possible" inter-packet-delay for non-TCP-control-pkts
          rtt : 10000        <-- real-world 10ms RTT is fine

Invoked with:

sudo ./t-rex-64 -f long_flow.yaml -c 1 -m 1 -d 9999

 Version : v2.43   
 DPDK version : DPDK 17.11.0   
 User    : hhaim   
 Date    : Jul 11 2018 , 09:40:05 
 Uuid    : 3cb87d62-84d5-11e8-8eba-0006f62b3e88    
 Git SHA : e74fc281e57dcd60cac05c2fc43df65967a63671    
$ tshark -Q -z io,stat,500 -r netflix_480p-wide_avc_Thor_noGRO_B_45secs_21k_pkts.pcap

==================================== IO Statistics
Duration: 44.0 secs
Interval: 44.0 secs
Col 1: Frames and bytes
----------------------------------
1
Interval Frames Bytes
----------------------------------
0.0 <> 44.0 20911 20936860

====================================


 * the resulting TRex output here is MUCH longer (as noted in #146), which we initially misattributed to #143 -- upon further review, we suspect that TRex is treating the data ACKs and the subsequent client data payload pkts with the RTT delay, when in reality it should only be applying the ipg delay

https://github.com/cisco-system-traffic-generator/trex-core/issues/146#issuecomment-425560148
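To make the suspicion concrete, here is our (unverified) mental model of the STF replay logic, sketched in Python. The function name and the direction-flip rule are our assumptions, not TRex internals:

```python
# Illustrative sketch of the behavior we believe we are seeing --
# NOT TRex source code. Assumption: the replayer treats any direction
# change as a request/response turnaround and charges it the full rtt,
# while same-direction packets get the ipg.

IPG_US = 10        # from long_flow.yaml: ipg : 10
RTT_US = 10_000    # from long_flow.yaml: rtt : 10000 (10 ms)

def replay_delay_us(prev_direction: str, direction: str) -> int:
    """Delay inserted before a packet, given the previous packet's direction."""
    if direction != prev_direction:
        return RTT_US   # treated as a turnaround
    return IPG_US

# A pipelined burst: server data ("s"), client ACK ("c"), more server data...
dirs = ["s", "s", "c", "s", "s", "c", "s"]
delays = [replay_delay_us(a, b) for a, b in zip(dirs, dirs[1:])]
print(delays)  # every client ACK and the data pkt after it pays a full RTT
```

Under this model each interleaved ACK costs two RTTs (one before the ACK, one before the next data packet), which matches the inflated deltas below.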

$ tshark -Q -z io,stat,500 -r trex_ipg_10us_netflix_long_flow.pcap

====================================== IO Statistics
Duration: 143.8 secs
Interval: 143.8 secs
Col 1: Frames and bytes
------------------------------------
1
Interval Frames Bytes
------------------------------------
0.0 <> 143.8 20911 20936860

======================================
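The two tshark summaries are consistent with that reading. A quick back-of-the-envelope check, using only the numbers above:

```python
# How many packets would have to be charged rtt instead of ipg to
# stretch the replay from 44.0 s to 143.8 s?

ipg_s = 10e-6       # ipg : 10 us
rtt_s = 10_000e-6   # rtt : 10000 us = 10 ms

source_s = 44.0     # tshark duration of the source pcap
replay_s = 143.8    # tshark duration of the TRex replay

extra_per_pkt_s = rtt_s - ipg_s                      # cost of one misclassified pkt
mischarged = (replay_s - source_s) / extra_per_pkt_s
print(round(mischarged))  # ~9990, roughly half of the 20911 frames
```

Roughly half the frames paying a full RTT is about what you would expect if every data ACK (and the payload packet following it) is being scheduled with rtt.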



----

The capture was taken client-side, without GRO (this shouldn't be a problem/concern).

As one may expect, the incoming flux of packets for a Netflix flow is _quite_ frequent (<10us inter-arrival), but some packets are slightly delayed (a few hundred microseconds)
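That inter-arrival profile is easy to summarize from frame timestamps. `interarrival_profile` below is our own helper, shown with synthetic timestamps; in practice the list would come from `tshark -r flow.pcap -T fields -e frame.time_epoch`:

```python
# Summarize inter-arrival times from a list of frame timestamps (seconds).

def interarrival_profile(ts, fast_us=10.0):
    deltas = [b - a for a, b in zip(ts, ts[1:])]
    fast = sum(1 for d in deltas if d * 1e6 < fast_us)
    return {
        "deltas_us": [round(d * 1e6, 1) for d in deltas],
        f"under_{int(fast_us)}us": fast,
    }

# Synthetic example: a tight burst with one slightly delayed packet
profile = interarrival_profile([0.0, 0.000004, 0.000008, 0.000300, 0.000304])
print(profile)  # most gaps are a few microseconds, one is a few hundred
```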

Here is an example of these streaming pipelined HTTP packets (from raw source pcap)
![netflix_raw_pcap_frame_3000](https://user-images.githubusercontent.com/34753045/46500996-34ffea00-c7f2-11e8-82c3-6e17d876f09e.png)

Compare with the output from TRex
![trex_netflix_output_frame_3000](https://user-images.githubusercontent.com/34753045/46501105-87410b00-c7f2-11e8-9869-321524bf7ee8.png)
 * the time deltas here are significantly higher than the source, 10ms :( - TRex clearly interprets these as TCP-control-flow pkts, and imposes the RTT delay on those sequences
 * the "pktB" sequences (e.g. frame 3003) have an inter-packet-delay of 0us (pulled from the ipg=10us value, but not accurate due to #143)
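One cheap shape such handling could take, purely as a sketch (`is_pipelined_burst` is a hypothetical helper, not anything TRex implements today): a pure ACK that sits between payload packets from the same sender is not a request/response turnaround, so it should inherit ipg rather than rtt.

```python
# Hypothetical pre-processing heuristic for spotting HTTP-pipelining-style
# bursts in a flow -- illustrative only, not TRex code.

def is_pipelined_burst(pkts, min_run=3):
    """pkts: list of (direction, payload_len) tuples in capture order.
    True if some sender emits >= min_run payload pkts in a row with only
    pure ACKs (payload_len == 0) from the peer in between."""
    run, run_dir = 0, None
    for direction, payload_len in pkts:
        if payload_len == 0:
            continue                       # pure ACK: not a turnaround
        run = run + 1 if direction == run_dir else 1
        run_dir = direction
        if run >= min_run:
            return True
    return False

burst = [("s", 1460), ("c", 0), ("s", 1460), ("s", 1460), ("c", 0)]  # pipelined
echo  = [("c", 100), ("s", 100), ("c", 100), ("s", 100)]             # request/response
print(is_pipelined_burst(burst), is_pipelined_burst(echo))  # True False
```

Packets inside a flagged burst would then be scheduled with ipg, with rtt reserved for real turnarounds.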

----

We recognize that this is probably a costly thing for TRex to process... somehow, in pre-processing, it would need to recognize that this flow sequence is HTTP PIPELINING, and apply the ipg delay to these pkt sequences rather than the rtt delay

Perhaps this handling could be added behind a flag that enables/disables HTTP pipelining inspection, to avoid a performance hit by default?
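For illustration only, such a knob might sit next to the existing per-cap fields; `http_pipeline_inspect` is an invented name, not a supported option:

```yaml
cap_info :
  - name: LAB-5928_longflow/netflix_480p-wide_avc_Thor_noGRO_B_45secs_21k_pkts.pcap
    ipg : 10
    rtt : 10000
    http_pipeline_inspect : true   # hypothetical flag, off by default
```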

Thoughts welcome!
hhaim commented 5 years ago

I suggest running this pcap file through our ASTF offline tool. The offline tool will simulate MSS and RTT using the TCP stack, so the IPG and RTT will be what you would expect from TCP. We will send instructions on how to run it. There are other ways to do the same.

Thanks, Hanoh


hhaim commented 5 years ago

Using v2.46 you can run this on the netflix pcap and it will be replayed with a network RTT of 5 msec:

./astf-sim -f [netflix.pcap] --rtt 5 -o netflix_5msec.pcap

The output netflix_5msec.pcap will have the right IPG/RTT.

mcallaghan-sandvine commented 5 years ago

I would tend to agree that ASTF would likely not have this issue as it would track packet-by-packet and hopefully know to interpret HTTP pipelining as inter-packet-delay timings rather than RTT.

However, we would need this to be accommodated in STF. (For now we're going to have to proceed w/out this, and sometime in the future we'll come back to it and hopefully address/fix in the code base.)

Not sure, though, whether pcap re-capture (processing) by ASTF would fix it... (needs testing, as Hanoh suggests)

hhaim commented 5 years ago

I think we are not on the same page. The idea is to run the offline tool on the pcap so the result can be replayed in STF mode; for ASTF there is no need for the offline tool. The tool will solve the problems you are describing here.


mcallaghan-sandvine commented 5 years ago

Hm. Well, currently our use case and solution is to use STF (we chose this mode to accomplish our goals), and this is one of the limitations/issues uncovered with that mode.

Switching modes to ASTF is not an option for us at this time.

hhaim commented 5 years ago

I’m not suggesting to move to ASTF

Hanoh
