facebookarchive / augmented-traffic-control

Augmented Traffic Control: A tool to simulate network conditions
https://facebook.github.io/augmented-traffic-control

Enabling ATC shaping breaks TCP flows when using ATC and ATC client Vagrant setup #123

Closed jrabek closed 9 years ago

jrabek commented 9 years ago

Setup:

On Macbook Pro running OSX 10.10.3

ATC using following commit

commit 3fc5cd95e8370b680d18b8a0618463003c591792
Author: Zeal Jagannatha <zealjagannatha@gmail.com>
Date:   Tue May 26 11:25:24 2015 -0700

    Error handling in restore-profiles.sh

Running ATC using Vagrant:

jrabek:~/projects/atc/chef/atc: (master) $ vagrant up trusty

Running ATC client using Vagrant:

jrabek:~/projects/atc/chef/atcclient: (master) $ vagrant up atcclient01
# Setup output not shown
jrabek:~/projects/atc/chef/atcclient: (master) $ cd ../..
jrabek:~/projects/atc: (master) $ ./utils/restore-profiles.sh 127.0.0.1:8080
# restore-profiles.sh ran successfully

Run wget as a sanity check of connectivity and unshaped network speed:

vagrant@atcclient01:~$ wget -O - www.cnn.com > /dev/null
--2015-05-27 17:53:51--  http://www.cnn.com/
Resolving www.cnn.com (www.cnn.com)... 199.27.79.73
Connecting to www.cnn.com (www.cnn.com)|199.27.79.73|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 86146 (84K) [text/html]
Saving to: `STDOUT'

100%[=========================================================>] 86,146      --.-K/s   in 0.04s

2015-05-27 17:53:51 (2.06 MB/s) - written to stdout [86146/86146]

From the atcclient01 instance, open the ATC web UI at 100.64.33.3:8000/atc_demo_ui/, set the profile to anything (including DSL or Cable), and then repeat the same transfer:

vagrant@atcclient01:~$ wget -O - www.cnn.com > /dev/null
--2015-05-27 18:46:55--  http://www.cnn.com/
Resolving www.cnn.com (www.cnn.com)... 199.27.79.73
Connecting to www.cnn.com (www.cnn.com)|199.27.79.73|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 83913 (82K) [text/html]
Saving to: `STDOUT'

22% [===================================>                     ] 19,079       368B/s  eta 5m 33s

No matter what the profile, the transfer rate is slowed to a crawl.

I did a quick sanity check on the ATC Vagrant instance to make sure the wget network traffic is going through the ATC instance, and it is:

vagrant@trusty:~$ sudo tcpdump -i eth1

jrabek commented 9 years ago

Note that if I turn off the shaping in the demo ui, the wget transfer is fast again.

jrabek commented 9 years ago

I ran atcd from the shell and captured the logs (below) when I applied the DSL profile.

INFO:AtcdVService.AtcdLinuxShaper:Request startShaping TrafficControl(device=TrafficControlledDevice(controllingIP='100.64.33.101', controlledIP='100.64.33.101'), timeout=86400, settings=TrafficControlSetting(down=Shaping(loss=Loss(percentage=0.0, correlation=0.0), delay=Delay(delay=5, jitter=0, correlation=0.0), rate=2000, iptables_options=[], corruption=Corruption(percentage=0.0, correlation=0.0), reorder=Reorder(percentage=0.0, correlation=0.0, gap=0)), up=Shaping(loss=Loss(percentage=0.0, correlation=0.0), delay=Delay(delay=5, jitter=0, correlation=0.0), rate=256, iptables_options=[], corruption=Corruption(percentage=0.0, correlation=0.0), reorder=Reorder(percentage=0.0, correlation=0.0, gap=0))))
INFO:AtcdVService.AtcdLinuxShaper:Shaping ip 100.64.33.101 on interface eth0
INFO:AtcdVService.AtcdLinuxShaper:create new HTB class on IFID eth0, classid 1:2,parent 1:0, rate 256kbits
INFO:AtcdVService.AtcdLinuxShaper:create new Netem qdisc on IFID eth0, parent 1:2, loss 0.0%, delay 5000
INFO:AtcdVService.AtcdLinuxShaper:create new FW filter on IFID eth0, classid 1:2, handle 2, rate: 256kbits
INFO:AtcdVService.AtcdLinuxShaper:Running /sbin/iptables -t mangle -A FORWARD -d 100.64.33.101 -i eth0  -j MARK --set-mark 2
INFO:AtcdVService.AtcdLinuxShaper:Shaping ip 100.64.33.101 on interface eth1
INFO:AtcdVService.AtcdLinuxShaper:create new HTB class on IFID eth1, classid 1:2,parent 1:0, rate 2000kbits
INFO:AtcdVService.AtcdLinuxShaper:create new Netem qdisc on IFID eth1, parent 1:2, loss 0.0%, delay 5000
INFO:AtcdVService.AtcdLinuxShaper:create new FW filter on IFID eth1, classid 1:2, handle 2, rate: 2000kbits
INFO:AtcdVService.AtcdLinuxShaper:Running /sbin/iptables -t mangle -A FORWARD -s 100.64.33.101 -i eth1  -j MARK --set-mark 2
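
For anyone skimming the log above, a small sketch like the following can pull out what was applied to each interface. It only understands the "create new HTB class" and "create new Netem qdisc" line formats exactly as quoted in this comment; the function name `summarize_atcd_log` is hypothetical and not part of ATC. Reading the netem value `5000` as microseconds (5 ms, matching `delay=5` in the request) is an assumption, so the raw number is kept as-is.

```python
import re

# Patterns match the line formats quoted in the log above; other atcd
# versions may log differently (assumption).
HTB_RE = re.compile(
    r"create new HTB class on IFID ([^,\s]+), classid ([^,\s]+),\s*"
    r"parent [^,\s]+, rate (\d+)kbits"
)
NETEM_RE = re.compile(
    r"create new Netem qdisc on IFID ([^,\s]+), parent [^,\s]+, "
    r"loss ([\d.]+)%, delay (\d+)"
)


def summarize_atcd_log(lines):
    """Return {interface: {"rate_kbit", "loss_pct", "delay_raw"}} from log lines."""
    summary = {}
    for line in lines:
        m = HTB_RE.search(line)
        if m:
            iface, _classid, rate = m.groups()
            summary.setdefault(iface, {})["rate_kbit"] = int(rate)
            continue
        m = NETEM_RE.search(line)
        if m:
            iface, loss, delay = m.groups()
            entry = summary.setdefault(iface, {})
            entry["loss_pct"] = float(loss)
            # Unit is not stated in the log; kept raw rather than converted.
            entry["delay_raw"] = int(delay)
    return summary


# The four relevant lines, copied from the log above.
LOG = [
    "INFO:AtcdVService.AtcdLinuxShaper:create new HTB class on IFID eth0, classid 1:2,parent 1:0, rate 256kbits",
    "INFO:AtcdVService.AtcdLinuxShaper:create new Netem qdisc on IFID eth0, parent 1:2, loss 0.0%, delay 5000",
    "INFO:AtcdVService.AtcdLinuxShaper:create new HTB class on IFID eth1, classid 1:2,parent 1:0, rate 2000kbits",
    "INFO:AtcdVService.AtcdLinuxShaper:create new Netem qdisc on IFID eth1, parent 1:2, loss 0.0%, delay 5000",
]

if __name__ == "__main__":
    for iface, settings in sorted(summarize_atcd_log(LOG).items()):
        print(iface, settings)
```

In this log, the 256 kbit class lands on eth0 and the 2000 kbit class on eth1, which lines up with the `up` and `down` rates in the request.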

After capturing these logs I reran

wget -O - www.cnn.com > /dev/null

and still saw the extremely slow transfer.
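
To put the slowdown in numbers, here is a rough back-of-the-envelope sketch. It compares the ideal transfer time at the DSL profile's configured down rate (2000 kbit/s, from the request logged above) with the rate wget displayed in the earlier shaped run (368 B/s). It assumes 1 kbit = 1000 bits and ignores latency and protocol overhead; the helper names are made up for illustration.

```python
def ideal_seconds(size_bytes, rate_kbit):
    """Transfer time at a configured rate, ignoring latency and overhead."""
    return size_bytes * 8 / (rate_kbit * 1000)


def seconds_at(size_bytes, bytes_per_second):
    """Transfer time if the observed byte rate held for the whole download."""
    return size_bytes / bytes_per_second


PAGE_BYTES = 83913     # "Length" reported by wget in the shaped run
DSL_DOWN_KBIT = 2000   # down rate in the startShaping request
OBSERVED_BPS = 368     # rate wget displayed at 22% of the shaped transfer

ideal = ideal_seconds(PAGE_BYTES, DSL_DOWN_KBIT)
actual = seconds_at(PAGE_BYTES, OBSERVED_BPS)

if __name__ == "__main__":
    print("ideal at configured rate: %.2f s" % ideal)
    print("at observed rate:         %.0f s" % actual)
    print("slowdown factor:          %.0fx" % (actual / ideal))
```

Under these assumptions the page should take roughly a third of a second at the configured rate, versus nearly four minutes at the observed rate, so the gap is far larger than the shaping profile alone would explain.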

jrabek commented 9 years ago

I looked through some of the closed issues and noticed that someone else saw something similar but not the same here: https://github.com/facebook/augmented-traffic-control/issues/86#issuecomment-88655287

chantra commented 9 years ago

@jrabek what happens if you run an end-to-end test with hosts directly before and after ATC?

Do you want to try https://github.com/facebook/augmented-traffic-control/issues/86#issuecomment-90631200 ?

jrabek commented 9 years ago

@chantra Thanks! I cherry-picked the commit and it seems to resolve the issue. I confirmed using the same test I described previously in this bug.

commit cee20a691b361c81ccb163db55caa129c040a9c5
Author: chantra <chantra@fb.com>
Date:   Sun Apr 5 13:18:15 2015 -0700

    Command line argument to buffer oackets instead of dropping

    atcd --atcd-dont-drop-packets

Any reason why this has not been merged into master yet and included by default in the arguments to atcd? My setup is vanilla out of the box, so I am surprised more people aren't hitting it.

jrabek commented 9 years ago

I should clarify that after cherry-picking the commit, I completely recreated the trusty instance, ssh'd into trusty, stopped atcd, and added the --atcd-dont-drop-packets option and restarted atcd before retesting. Wanted to make sure that someone reading the bug didn't think that just cherry-picking the commit is enough.

chantra commented 9 years ago

@jrabek well, I plan to review that part of the code when I get time for it, so I did not want to land something that may change in the future.

jrabek commented 9 years ago

@chantra, sounds good. This can be resolved then. Thanks again for your help. I went ahead and forked the repo (https://github.com/airtimemedia/augmented-traffic-control) so we can have something that works out of the box for us.

jrabek commented 9 years ago

Just as a note, I think what is happening might be related to this comment from https://github.com/facebook/augmented-traffic-control/pull/125#issuecomment-109472048:

1) You assume a policing behavior with no buffer. While an unlimited buffer is not realistic, policing is not that common either, and you end-up having TCP collapsing.
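
As a toy illustration of the distinction in that quote (not ATC's implementation, and not a TCP model), the sketch below feeds the same bursty arrivals through a bufferless policer and through a shaper with a finite queue. The function names, the tick-based model, and the numbers are all assumptions chosen for illustration.

```python
def police(arrivals, rate_per_tick):
    """Forward at most rate_per_tick packets per tick; drop the excess (no buffer)."""
    sent = dropped = 0
    for n in arrivals:
        forwarded = min(n, rate_per_tick)
        sent += forwarded
        dropped += n - forwarded
    return sent, dropped


def shape(arrivals, rate_per_tick, buffer_size):
    """Queue up to buffer_size packets, drain rate_per_tick per tick, drop on overflow."""
    queue = sent = dropped = max_queue = 0
    for n in arrivals:
        accepted = min(n, buffer_size - queue)
        dropped += n - accepted
        queue += accepted
        max_queue = max(max_queue, queue)
        out = min(queue, rate_per_tick)
        queue -= out
        sent += out
    return sent, dropped, max_queue


# Bursts of 10 packets every 5 ticks: the average arrival rate (2 per tick)
# equals the configured rate, but the traffic is bursty.
ARRIVALS = [10, 0, 0, 0, 0] * 4

if __name__ == "__main__":
    print("policer (sent, dropped):", police(ARRIVALS, 2))
    print("shaper  (sent, dropped, max queue):", shape(ARRIVALS, 2, 10))
```

Even though the average load matches the rate, the bufferless policer discards most of each burst, while the shaper delivers everything at the cost of queueing delay. Heavy loss of that kind is the sort of thing that makes TCP throughput collapse.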

jrabek commented 9 years ago

@chantra @zfjagann, so in its current state, is ATC usable and accurate for Facebook? I ask because TCP flows currently seem to break when shaping is enabled. If the following commit is used as suggested in this bug, TCP works, but the network delay grows continually (which makes sense since nothing is dropped).

commit cee20a691b361c81ccb163db55caa129c040a9c5
Author: chantra <chantra@fb.com>
Date:   Sun Apr 5 13:18:15 2015 -0700

    Command line argument to buffer oackets instead of dropping

    atcd --atcd-dont-drop-packets

Are there any other fixes or workarounds until issue #60 is fixed that would allow ATC to properly shape TCP flows without breaking them?

Whatever packet dropping is being done doesn't seem to be correct.

Thanks again for the tool and open sourcing it. Please let me know if there are any packet captures that would be useful.
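
The growing delay reported above with --atcd-dont-drop-packets is what one would expect from an unbounded queue fed faster than it drains. A minimal sketch, assuming a sender that keeps a constant rate (a real TCP sender would eventually react to the rising round-trip time) and using made-up numbers:

```python
def queueing_delay_over_time(arrival_per_tick, service_per_tick, ticks):
    """Delay (in ticks) a packet arriving at the end of each tick would wait,
    for an unbounded queue that never drops."""
    backlog = 0
    delays = []
    for _ in range(ticks):
        backlog += arrival_per_tick
        backlog -= min(backlog, service_per_tick)
        delays.append(backlog / service_per_tick)
    return delays


if __name__ == "__main__":
    # 3 packets arrive per tick, only 2 can leave: the backlog grows by 1 per tick.
    for tick, delay in enumerate(queueing_delay_over_time(3, 2, 5), start=1):
        print("tick %d: queueing delay %.1f ticks" % (tick, delay))
```

With no drops to signal congestion, nothing bounds the backlog, so the delay rises linearly for as long as the offered load exceeds the configured rate.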

chantra commented 9 years ago

@jrabek there is no quick fix/workaround currently.

As much as it may not be super accurate, we have a bunch of profiles that are meant to be representative of some of the situations we are trying to emulate.

jrabek commented 9 years ago

@chantra, so I just got an interesting result and may have reopened the bug too soon.

Ignoring the accuracy issues, I previously had problems with TCP transfers not working when using the vagrant setup.

I subsequently set up ATC on a Linux box in a configuration that matches the one mentioned in the main ATC README. The wget transfers seem to work in the bare metal configuration. There may be some issue with the Vagrant setup.

I'll update the bug title.

chantra commented 9 years ago

Oh yeah, vagrant (virtualbox) is not recommended... the main issue is most likely due to scheduling in the VM that will differ from bare metal.

This has been brought up a few times already. As much as vagrant is convenient to set up a dev environment, virtualization is not going to provide accurate scheduling.
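
One rough way to compare timer behavior between a VM and a bare metal host is to measure how late short sleeps wake up. The sketch below is only a user-space proxy, under the assumption that wake-up lateness hints at scheduling accuracy; it does not measure the kernel timers the qdiscs actually use, and `sleep_overshoot_ms` is a hypothetical helper name.

```python
import statistics
import time


def sleep_overshoot_ms(interval_s=0.001, samples=200):
    """Return how many milliseconds each sleep ran past the requested interval."""
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval_s)
        elapsed = time.perf_counter() - start
        overshoots.append((elapsed - interval_s) * 1000.0)
    return overshoots


if __name__ == "__main__":
    data = sleep_overshoot_ms()
    print("median overshoot: %.3f ms" % statistics.median(data))
    print("max overshoot:    %.3f ms" % max(data))
```

Running it on both the Vagrant guest and a bare metal box and comparing the median and maximum values gives a first impression of how much extra jitter the virtualized setup adds.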

chantra commented 9 years ago

this is definitely not something we can support though.

jrabek commented 9 years ago

Makes total sense to not support it and just recommend a bare metal setup in the main README. That said, I think having a huge disclaimer in the README would be helpful since there is a vagrant setup provided. It made it seem like vagrant would be a valid option for evaluating/using ATC.

Thanks again for the quick responses.

chantra commented 9 years ago

That's fair. I think I did somewhere... but will double-check.

chantra commented 9 years ago

:metal: