you got bufferbloat issues on multiple fronts

dtaht commented 9 years ago

0) Policing is a horrible idea. However it is commonly used in the field. It is impossible to set the burst parameter correctly for any range of RTTs.

1) qdisc netem 8001: parent 1:2 limit 1000 delay 10.0ms loss 1%

The arbitrary setting of limit 1000 does not match reality. We have seen both much larger and much smaller values in the field. Try fiddling with this parameter in your testing.

Also, setting 3 queues of limit 1000 will do "interesting things" if you flood all the queues.

2) You have htb sending traffic to the "direct" queue for some reason; I am not sure if htb "default 0" is intentional. For some pithy comments on how to screw up with htb, see: http://www.bufferbloat.net/projects/cerowrt/wiki/Wondershaper_Must_Die

3) I have never really trusted netem in conjunction with other qdiscs.

4) It would be nice to be able to inject cures like fq_codel into the emulations (but see 3)

5) I would love it if you were to try netperf-wrapper (also on github) against your tool, with emulations of cablemodems created with this tool (cable uses byte fifos, not packet fifos, btw), and see if your results line up with what we get from real world examples like

http://snapon.lab.bufferbloat.net/~cero2/jimreisert/results.html

http://burntchrome.blogspot.com/2014/05/fixing-bufferbloat-on-comcasts-blast.html

We have plenty of results for wifi as well, perhaps we can find a meeting of the tools that makes sense.

Always nice to see more network analysis and debugging tools out there!

6) If you wish to discuss further, it is saner to do it on the bufferbloat.net "bloat" mailing list.
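To make point 0 above concrete, here is a minimal sketch of why a single policer burst value cannot suit a range of RTTs. It assumes the common rule of thumb that the burst should be roughly one bandwidth-delay product (rate times RTT); that rule, the 10 Mbit/s rate, and the RTT values are illustrative assumptions, not figures from this thread.

```python
def bdp_bytes(rate_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes, using integer math only."""
    # bits/s * ms / 1000 gives bits; divide by 8 to get bytes.
    return rate_bps * rtt_ms // 8000


RATE_BPS = 10_000_000          # hypothetical policed rate: 10 Mbit/s
RTTS_MS = [10, 50, 100, 300]   # hypothetical RTTs seen by different flows

for rtt_ms in RTTS_MS:
    burst = bdp_bytes(RATE_BPS, rtt_ms)
    print(f"RTT {rtt_ms:>3} ms -> burst of about {burst} bytes")
```

Under these assumptions the "right" burst differs by a factor of 30 between the 10 ms and 300 ms flows, so any one setting is too small for some flows or too large for others.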

chantra commented 9 years ago

Hi @dtaht ,

Thanks for the feedback. I guess it will take some time to process, but I really appreciate the feedback.

Let me try to answer some of the questions/statements though:

0) Policing is the only way we found to drop packets when going above the allocated bandwidth. To be honest, there should be better ways of doing it, but this seemed to work and reproduced the behaviour we expected. That being said, if you have any better suggestions, please feel free to tell us.

1) IIRC, 1000 is just set by default

2) Same here, I think this is just a default

3) What would be better? Anything you believe could be better?

4) I have never heard of fq_codel, is this in mainline kernels? I see ubuntu trusty has it, I am not sure about CentOS 6 which we want to support too.

5) Thanks for pointing to this. cc @zfjagann, who started implementing e2e tests.

6) Thanks. I will check it out.

dtaht commented 9 years ago

I am depressed that all our attempts at outreach on this issue are only now reaching more developers of test tools. A search on Google Scholar will give you many of the relevant papers and research on this topic, including 3g/4g LTE networks: https://scholar.google.com/scholar?as_ylo=2011&q=bufferbloat&hl=en&as_sdt=0,5

-1) many of the problems you are seeing in the field can be attributed to bufferbloat.

0) Policing is commonly used but not to my knowledge all that much in 3g/4g networks.

1) You will get interesting results if you try the netem limit figure between 100 and 10000 also. Try it. Notably with netperf-wrapper.

2) The htb default 0 sends stuff out the non-managed htb direct queue. Looking at your statistics, it seems stuff was going out that way, which I don't think was your intent, so it is saner to default to one of the defined queues in htb in case you missed a rule.

3) toke ended up using something closer to ipfw and an iptables redirect rule to get better control while testing some of the solutions to bufferbloat on his testbed.

relevant talk at stanford here: https://www.youtube.com/watch?v=kePhqfKA3SM

(while I am at it, other relevant talks here: http://www.bufferbloat.net/projects/cerowrt/wiki/Bloat-videos by the likes of Van Jacobson, Jim Gettys, Stephen Hemminger, Toke, myself, Fred Baker and Jesper Brouer. The shortest talk I ever gave on this subject was at UKNOF.)

we also documented everything that can go wrong in a testbed setup with various qdiscs here:

http://www.bufferbloat.net/projects/codel/wiki/Best_practices_for_benchmarking_Codel_and_FQ_Codel

We (bug hemminger!) really do need a better version of netem to cope with the variable rtt case while still holding to the same overall packet limit.

4) bufferbloat was the problem. fq_codel is probably the best answer we have - now standard in openwrt, most aftermarket router firmwares, and on by default in systemd, etc. Early versions of streamboost used it also. There is an ietf draft on it and other new aqm technologies like docsis-pie, but it has been slow as yet to hit the wifi or 3g markets. bufferbloat.net's next big project is make-wifi-fast, which we hope will apply similar techniques to wifi also.

In the meantime, having good emulations of current behavior is helpful for developers of e2e apps.

The "sqm-scripts" we developed and distributed are capable of comprehensive emulations of cable and dsl modem behavior, although that is not what is shipped in the default version (just the cures). The actual emulations use bfifos and I guess I should make them available as part of the default versions.

That said, I would recommend looking those over for correct and interesting ways to configure htb.

5) there are now 30+ tests in netperf-wrapper, and it comes with extensive gui and multiple test run analytic support. Patches and feedback welcomed. In particular I would like to see the web traffic emulation improved.

6) Look forward to seeing your questions and feedback on the lists. We have been at fixing this for 4+ years. Google and Red Hat have been all over it, and I am glad to see Facebook also taking a harder look. Together we can make a faster internet!
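As a rough illustration of point 1 above (netem limits between 100 and 10000) and of the packet fifo versus byte fifo distinction, the sketch below estimates the worst-case delay a full FIFO adds when it drains at a fixed bottleneck rate. It is a simplified model: it treats the limit purely as a standing queue, and the 10 Mbit/s rate, packet sizes, and 256 KiB byte limit are hypothetical values chosen for the example.

```python
def packet_fifo_delay_ms(limit_pkts: int, pkt_bytes: int, rate_bps: int) -> float:
    """Time to drain a full packet-counted FIFO, in milliseconds."""
    return limit_pkts * pkt_bytes * 8 * 1000 / rate_bps


def byte_fifo_delay_ms(limit_bytes: int, rate_bps: int) -> float:
    """Time to drain a full byte-counted FIFO (bfifo style), in milliseconds."""
    return limit_bytes * 8 * 1000 / rate_bps


RATE_BPS = 10_000_000  # hypothetical bottleneck rate: 10 Mbit/s

# Packet FIFO: the delay depends on both the limit and the packet size.
for limit in (100, 1000, 10000):
    for size in (64, 1500):
        delay = packet_fifo_delay_ms(limit, size, RATE_BPS)
        print(f"limit {limit:>5} pkts, {size:>4}-byte packets: {delay:.1f} ms")

# Byte FIFO: the delay is bounded regardless of packet size.
print(f"256 KiB byte fifo: {byte_fifo_delay_ms(256 * 1024, RATE_BPS):.1f} ms")
```

In this model a 1000-packet limit full of 1500-byte packets holds about 1.2 seconds of data at 10 Mbit/s, while the same limit full of 64-byte packets holds only about 51 ms, which is one reason a packet-counted queue can behave quite differently from a byte-counted one.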

chantra commented 9 years ago

I still have not had a chance to read through or watch those documents, but at least for 2), that was example output which did not represent real traffic, which is why the stats are not showing anything going through.

zealws commented 9 years ago

Wishlisting since this is a decent portion of work and needs more discussion before it will be actionable.

dtaht commented 9 years ago

The example stats showed packets going through the direct portion of the qdisc, which I think was not your intent.

Better test tools are on my wishlist also! Do keep trying to get more folk to run netperf-wrapper's rrul tests...

ghost commented 9 years ago

Thank you for reporting this issue and appreciate your patience. We've notified the core team for an update on this issue. We're looking for a response within the next 30 days or the issue may be closed.

dtaht commented 8 years ago

Did y'all get anywhere?

facebookarchive / augmented-traffic-control

you got bufferbloat issues on multiple fronts #60