tsproxy performance can drop significantly after a period of time (intermittent)

acreskeyMoz commented 5 years ago

When using tsproxy for repeated browsertime page loads we intermittently see a huge drop off in performance after a few minutes.

We are using browsertime to drive Chrome for Android, with tsproxy as the network throttler. Generally WebPageReplay is used as the http archive.

Once the performance has dropped, it will persist in that state until the tsproxy process is restarted.

I believe the problem is coming from tsproxy because when I remove WebPageReplay from the equation we can still reproduce the issue (i.e. when using tsproxy to throttle access to lives sites we still see the problem).

Similarly, we've not seen this issue when using browsertime and Chrome directly against live sites.

I'm seeing this behaviour on MacOS (High Sierra) and my colleague is seeing it on Ubuntu linux.

I've attached both the tsproxy logs (verbose) as well as browsertime's.

In this case it the problem surfaced in slightly under 5 minutes at around 10:57:52 in the logs. You can see the browsertime backendTime metric jump from ~2100ms to ~18000ms on iteration 4 onwards.

I noticed that from tsproxy's logs there are no more long tick times (e.g. 1 second), and you can see a continuous spew of 0.001 Tick Time's:

10:57:51.450 - Tick Time: 1.000
...
10:57:52.688 - Tick Time: 0.001
10:57:52.689 - Tick Time: 0.001
10:57:52.691 - Tick Time: 0.001
10:57:52.693 - Tick Time: 0.001
...

 tsproxy is started as follows (verbose logging added here), in this case bound and directed to the host machine's local ip where WebPageReplay is running: python tsproxy.py -b 192.168.86.244 -d 192.168.86.244 -m '80:8080,443:8081' --rtt=200 -vvvvv

For reasons unknown, we see this problem more frequently with the following urls,

https://booking.com
http://www.espn.com/nba/story/_/page/allstarweekend25788027/the-comparison-lebron-james-michael-jordan-their-own-words
https://www.bbc.com/news/business-47245877

We're on the latest commit of tsproxy: 6ccc0eb5a8d308608d155545d265bd425765d724

Let me know if there is additional information I can provide.

tsproxy_logs.txt webpagereplay logs.txt

acreskeyMoz commented 5 years ago

Here are browsertime logs for a session in which it looks like tsproxy got into a bad state (it was logging Tick Time: 0.001 continuously, but I restarted tsproxy after iteration 5 and you can see the performance metrics return to normal. browsertime_logs_restart_tsproxy_after_iteration_5.txt

acreskeyMoz commented 5 years ago

I've found a simpler method to reproduce this issue (the method doesn't require Browsertime).

Start tsproxy on your host machine, e.g. python tsproxy.py -b {host_ip} --rtt=200 -vvvv
Configure Chrome for android to point to tsproxy on your host machine, e.g. --proxy-server=socks://{host_ip}:1080
Run the following bash script to repeatedly clear and launch Chrome, loading the provided website
```
i=0
```

while [ $i -le 40 ] do adb shell am force-stop com.android.chrome adb shell pm clear com.android.chrome

adb shell am start -n com.android.chrome/com.google.android.apps.chrome.Main -a android.intent.action.VIEW -d "https://www.bbc.com/news/business-47245877"

echo Number: $i ((i++)) sleep 10s done


4. Notice that initially the page will have time to load.
After a number of iterations, `tsproxy` will be begin to output something like:

11:51:06.587 - Tick Time: 0.001 11:51:06.589 - Tick Time: 0.001 11:51:06.590 - Tick Time: 0.001


and at this point Chrome will no longer load the page within the 10 seconds given.

If the bash script is cancelled at the point, Chrome will generally hang mid load.

pmeenan commented 5 years ago

No promises on a fix but I'm looking at it now. Managed to reproduce it with a simple puppeteer script:

const puppeteer = require('puppeteer');

(async () => {
  while (true) {
    const browser = await puppeteer.launch({
        headless: false,
        args: ['--proxy-server=socks://127.0.0.1:1080'],
        timeout: 10000
        });
    const page = await browser.newPage();
    await page.goto('https://www.bbc.com/news/business-47245877');
    await browser.close();
  }
})();

I'm hoping it's a clock problem with a packet getting stuck in one of the queues with an invalid time and maybe it will be a simple check to detect when that happens.

pmeenan commented 5 years ago

Fix is inbound (just letting it test for a while to make sure it is stable first). Looks like the task_done in the queue was somehow getting called more often than there were queue items and it would start throwing an exception. I don't see how that's possible since it's paired 1:1 with pulling items from the queue but since we don't join() the packet queue we don't actually need to know the task status so I just removed it.

So far it has been running for ~30 minutes with a test every 10 seconds with now issues.

acreskeyMoz commented 5 years ago

Thank you @pmeenan With the latest commit tsproxy is passing both my shell script test as well as long-running browsertime runs. Nicely done!

catchpoint / WebPageTest.tsproxy

tsproxy performance can drop significantly after a period of time (intermittent) #20