catchpoint / WebPageTest

Official repository for WebPageTest

Sporadically high "SSL Negotiation" times in new Linux agents #1106

Closed · wildlyinaccurate closed this issue 6 years ago

wildlyinaccurate commented 6 years ago

I've noticed that "SSL Negotiation" times are sometimes very high (~3x higher than usual) in Chrome on the EC2 agents. I can't replicate it on the other agents ~~or in Firefox~~ (edit: I can replicate it in Firefox now).

Some example test runs:

Let me know if there's any more information I can provide or testing that I can do.

wildlyinaccurate commented 6 years ago

Initially I thought this was something to do with switching from the old wptdriver to the new wptagent, but now I'm not so sure. In the graph below, the switch to the new agents happened on 13 December 2017, but the sporadically high SSL negotiation times didn't start appearing until 16 January 2018.

[screenshot from 2018-02-22 12-05-50]

I'm also performing some other tests and finding that I can't reproduce the long negotiation times on every site. Some sites have consistently normal times.

pmeenan commented 6 years ago

Looks like Chrome 64 didn't go out until January 24th (in case it had to do with Chrome)

pmeenan commented 6 years ago

Might be worth enabling tcpdump to see if we can capture what is going on during the slower negotiations in case there is something not showing up in the waterfall.
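For reference, WebPageTest can capture a tcpdump as a test option, but a roughly equivalent manual capture on the agent would look something like this (the interface name and output path are just placeholders):

```
# Capture the TLS handshake traffic on the agent for later inspection
# (eth0 and the output path are placeholders for the actual setup)
sudo tcpdump -i eth0 -s 0 -w /tmp/ssl-negotiation.pcap 'tcp port 443'
```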

wildlyinaccurate commented 6 years ago

Sure! Here's a tcpdump of the normal times and here's one of the slow times.

Here's a screencap of the certificate exchange in the slow one. A retransmission so early in the exchange is a little bit suspect. I wonder if this is an EC2 networking issue - do you have any non-EC2 Linux agents running on WPT that I could run tests on?

[screenshot from 2018-02-22 12-43-18]
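For reference, one quick way to pull just the suspected retransmissions out of a capture like this is with tshark (assuming Wireshark's CLI tools are installed; the filename is a placeholder):

```
# List suspected TCP retransmissions from the slow capture
tshark -r slow.pcap -Y 'tcp.analysis.retransmission' \
  -T fields -e frame.time_relative -e tcp.seq -e tcp.len
```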

pmeenan commented 6 years ago

Sure. Most locations that offer Chrome are Linux now (except for the Thinkpad and Aspire locations which are Windows). Dulles for example.

wildlyinaccurate commented 6 years ago

I was able to reproduce it on Dulles, although it seems to happen much less often. Here's a test where you can see it happening: https://www.webpagetest.org/result/180226_K0_3578e48ac0493c01148fd1b22d255545/4/details/#waterfall_view_step1

wildlyinaccurate commented 6 years ago

I can also reproduce it on Firefox. At this point I'm not really sure how to proceed with the investigation. I'm definitely no expert on this stuff so I could be completely missing the mark but here's a brain dump:

pmeenan commented 6 years ago

I don't think it is an OCSP/CRL check - the tcpdump doesn't show it.

I managed to reproduce it with Chrome's netlog enabled (run 6 here). With a 150ms RTT it had a 577ms SSL negotiation and the netlog shows 578ms, so the agent's measurement and the netlog agree.
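For anyone wanting to reproduce this outside of the agent, a netlog like this can be captured by launching Chrome with the --log-net-log flag (the output path and URL below are just examples):

```
# Record a NetLog while loading the page under test
google-chrome --log-net-log=/tmp/netlog.json https://example.com/
```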

I stripped out the raw hex bytes of the payload but the key events look to be:

```
t= 3165 [st=    0] +SOCKET_ALIVE  [dt=10766]
t= 3165 [st=    0]   +TCP_CONNECT  [dt=152]
t= 3165 [st=    0]      TCP_CONNECT_ATTEMPT  [dt=152]
                        --> address = "104.28.12.62:443"
t= 3317 [st=  152]   -TCP_CONNECT
t= 3317 [st=  152]   +SOCKET_IN_USE  [dt=10614]
t= 3317 [st=  152]     +SSL_CONNECT  [dt=578]
t= 3318 [st=  153]        SSL_HANDSHAKE_MESSAGE_SENT
t= 3318 [st=  153]        SOCKET_BYTES_SENT
                          --> byte_count = 208
t= 3475 [st=  310]        SOCKET_BYTES_RECEIVED
                          --> byte_count = 1460
t= 3475 [st=  310]        SSL_HANDSHAKE_MESSAGE_RECEIVED
                          --> type = 2
t= 3863 [st=  698]        SOCKET_BYTES_RECEIVED
                          --> byte_count = 3470
t= 3863 [st=  698]        SSL_HANDSHAKE_MESSAGE_RECEIVED
                          --> type = 11
t= 3864 [st=  699]        SSL_HANDSHAKE_MESSAGE_RECEIVED
                          --> type = 22
t= 3864 [st=  699]        SSL_HANDSHAKE_MESSAGE_RECEIVED
                          --> type = 12
t= 3865 [st=  700]        SSL_HANDSHAKE_MESSAGE_RECEIVED
                          --> type = 14
t= 3865 [st=  700]        SSL_HANDSHAKE_MESSAGE_SENT
                          --> type = 16
t= 3865 [st=  700]        SSL_HANDSHAKE_MESSAGE_SENT
                          --> type = 20
t= 3865 [st=  700]        SOCKET_BYTES_SENT
                          --> byte_count = 93
t= 3865 [st=  700]        SSL_CERTIFICATES_RECEIVED
t= 3866 [st=  701]       +CERT_VERIFIER_REQUEST  [dt=27]
t= 3893 [st=  728]       -CERT_VERIFIER_REQUEST
t= 3893 [st=  728]        SIGNED_CERTIFICATE_TIMESTAMPS_RECEIVED
t= 3894 [st=  729]        SIGNED_CERTIFICATE_TIMESTAMPS_CHECKED
t= 3895 [st=  730]        CERT_CT_COMPLIANCE_CHECKED
                          --> build_timely = true
                          --> ct_compliance_status = "COMPLIES_VIA_SCTS"
t= 3895 [st=  730]     -SSL_CONNECT
                        --> cipher_suite = 49195
                        --> is_resumed = false
                        --> next_proto = "h2"
                        --> version = "TLS 1.2"
t= 3895 [st=  730]     +SOCKET_IN_USE  [dt=10036]
```

From the in-browser net stack's perspective, it looks like it got the first part of the TLS handshake from the server ~150ms after sending the client hello, but there was a ~380ms gap between the first packet's worth of data (1460 byte read) and the remaining 3470 bytes of the server hello.

The tcpdump captures packets outside of the traffic-shaping, so there is a ~150ms offset between what you see in the tcpdump and what the kernel actually sees.

The tcpdump doesn't look like it is showing the same delay but it is showing a retransmit: [image]

The ACKs for everything after the first packet get stalled and don't seem to go out until after the retransmit, though the retransmit is only for the last bit, which is the certificate. There isn't any other traffic in the tcpdump so it doesn't look like a contention issue.

From the netlog and tcpdump it looks like some interaction between the traffic-shaping and TCP stacks on either end of the connection. It's still not 100% clear what happened in the 300ms between the ACK for 1461 and the ACK for 2921. It could be a delayed ACK where ACKs are only sent for every other packet (with up to a 500ms delay), but there also shouldn't have been more than a few milliseconds between 1461 and 2921 arriving (with the 1.6Mbps tc profile).
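As a back-of-the-envelope check on that last point (assuming a 1460-byte segment, matching the first read in the netlog):

```
# Serialization time of one full-size segment at the 1.6Mbps profile
echo '1460 * 8 / 1600000' | bc -l    # ~0.0073s, i.e. only ~7ms per packet
```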

It's possible that there is some scheduling voodoo going on either with the VM itself or with the accuracy netem is getting from the kernel ticks, but neither should be nearly that big (maybe ~4ms of jitter).

I'll try on a few different configs (Linux vs Windows, VM vs real hardware, etc) to see what conditions it appears to show up in.

wildlyinaccurate commented 6 years ago

Thanks Pat, really appreciate the detailed response. Let me know if there's any further testing or investigation I can do on my end.

pmeenan commented 6 years ago

I have more data but not necessarily answers yet.

It feels a lot like a VM scheduling issue but it's still hard to be sure. I'm going to try reducing the VM stacking on the Dulles Linux agents further to see if it can be completely eliminated or how low it needs to go for that to happen. That doesn't help with the c4.large results though. I may try c5.large as they switched from Xen to KVM to see if that impacts anything.

pmeenan commented 6 years ago

Running EC2 video capture at 10fps to match Dulles (instead of 30fps) showed 2 outliers. It looks like at a minimum there may be some contention with video capture at high frame rates.

I also tested against the actual WPT origin instead of the Fastly edge (to match the Dulles tests) and it was a bit cleaner, though there were still 7 outliers.

I have some test VMs on GCE (standard-2) and they had 1 outlier, though it's hard to be 100% sure since that VM is on the west coast.

The frame rate can be specified in the user data string as fps=10 or wpt_fps=10 (or on a test-by-test basis with fps=10 as a test parameter). I may also see about releasing new AMIs that default to 10fps.
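For anyone running their own agents: the EC2 user data is a space-separated list of key=value pairs, so the override would look roughly like the line below. The server, location and key values are placeholders, and the exact keys depend on your setup.

```
wpt_server=www.webpagetest.org wpt_location=MyEC2Location wpt_key=XXXXXXXX fps=10
```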

Checking now to see if there is anything the agent can do to reduce the contention as well (using nice on ffmpeg for example).
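The nice approach would be something along these lines; the ffmpeg arguments here are only a stand-in for whatever the agent's actual capture command is, not the real thing:

```
# Run screen capture at the lowest CPU priority so it doesn't compete with the browser
nice -n 19 ffmpeg -f x11grab -framerate 10 -i :0.0 -y /tmp/capture.mp4
```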

wildlyinaccurate commented 6 years ago

This is super interesting! I still see outliers when I uncheck the Capture Video checkbox, but maybe that's not doing what I expect it to do. Do you have any ideas about why this would only happen with wptagent? Would it be worthwhile for me to attempt to diff the wptagent and wptdriver AMIs?

pmeenan commented 6 years ago

Video is technically always captured, just not processed and uploaded if the option isn't set. That way start render, speed index, etc can all still be calculated.

It might be worth testing the Windows wptagent AMIs to see if the Windows scheduler and traffic-shaping do better than Linux. My guess is some interaction between the guest and host where maybe the Linux guest goes idle or for some reason the host isn't scheduling it fast enough.

Even on my desktop with just one agent running I still managed to reproduce it once out of 50 runs, so even a really lightly loaded machine doesn't look like it completely clears it.

I'm still playing around with a bunch of things but I pushed a change to lower the priority of video capture and to default to 10fps on EC2 unless overridden by user data or test option. That should help but may not eliminate it entirely.

I may experiment with different kernels in the guest to see if using a non-tickless kernel fixes it, in case there are also some guest scheduling issues with netem.

pmeenan commented 6 years ago

With the latest changes, the EC2 Linux wptagent agents are looking pretty good (one outlier).

I'm testing out the lowlatency kernel locally right now to see if it cleans up the remaining scheduler outliers.
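For reference, on Ubuntu the lowlatency kernel is just a package install plus a reboot, roughly:

```
# Install the Ubuntu lowlatency kernel variant and boot into it
sudo apt-get update && sudo apt-get install -y linux-lowlatency
sudo reboot
```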

pmeenan commented 6 years ago

The lowlatency kernel may have reduced the variability a bit but didn't change the outlier behavior. I'll see about trying it out on EC2 to see if it does better there.

pmeenan commented 6 years ago

Not sure it is completely solved, but I just pushed out new AMIs with the lowlatency kernel and it's certainly a lot better. I hedge a bit because those are c5.large (US East doesn't have spot capacity for c4.large right now), but it looks pretty good from California (c4) as well (one outlier).

I'm in the process of rolling out the lowlatency kernel to the Dulles agents right now but I get the feeling this is about as good as it is going to get.

wildlyinaccurate commented 6 years ago

Wow, amazing work! We'll roll out the updated agents and see what sort of impact that has. I'll get back to you once we have enough data to confirm the effect.

wildlyinaccurate commented 6 years ago

We've only been running tests on the updated agents for about 24 hours, but so far we're not seeing a reduction in variability. Here are some tests from earlier in the week and also some from the last 24 hours. The tests from earlier in the week were all run from EC2 in ap-southeast-2 and sa-east-1. The tests from the last 24 hours are the same but with some from eu-west-2, just because I wondered whether the region would make a difference (fairly sure it doesn't).

I'll keep an eye on it over the next few days.

pmeenan commented 6 years ago

Argh. That sound you heard was me screaming. It seems to be really sensitive to timing because I don't see it when hitting https://agent.webpagetest.org/ (my origin) but hitting wildlyinaccurate causes it to happen a LOT even with the c5 instances.

I looked through a couple of the wildlyinaccurate tcpdumps where it did go wide, and they all show the same pause and retransmission of the certificate packet with no real indication of why it chose to retransmit from there (the last ACK out was for much earlier in the stream).

Digging into it more now to see if I can find anything. Looking into the ipfw configs for shaping and the tcp options for the kernel to see if I can find anything to help with it.

pmeenan commented 6 years ago

In testing, it looks like doing only latency shaping but not bandwidth shaping eliminated the issue. The shaping used two filters: netem for latency and packet loss, followed by tbf for bandwidth. I'm trying out a change right now that simplifies it and does the bandwidth shaping in netem as well (I think older versions of netem didn't support rate limiting, which is why it was split).
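To make that concrete, the before/after looks roughly like the sketch below. These are illustrative tc commands rather than the agent's exact ones, and the interface, latency and rate values are placeholders:

```
# Old approach: netem for latency/loss with tbf chained underneath for bandwidth
tc qdisc add dev eth0 root handle 1: netem delay 150ms loss 0%
tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 1.6mbit burst 32kb latency 400ms

# Simplified approach: let netem do the rate limiting as well
tc qdisc replace dev eth0 root netem delay 150ms rate 1.6mbit
```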

It'll take about an hour for the update to roll out to all of the agents and I'll do some testing again to see if it helped.

pmeenan commented 6 years ago

So far it is looking good but I thought that before as well:

wildlyinaccurate:

webpagetest:

There was some slight variation but nothing like it was before.

If you are using the public AMIs, they should all be updated now as well, so just keep an eye on it for the next few days.

wildlyinaccurate commented 6 years ago

I'm really sorry that this issue caused you to scream! The good news is that the last few test runs for a bunch of the sites I was monitoring are looking stable. I'll continue to monitor this weekend and close the issue if it's resolved.

pmeenan commented 6 years ago

Sorry, the screaming was directed at the tbf implementation in Linux. I'm very grateful for the report and continued investigation. Highly accurate network characteristics are the foundation for most of WebPageTest and I'd much rather know about an issue than not. Glad to hear it is looking cleaner. Fingers crossed it holds.

wildlyinaccurate commented 6 years ago

When looking at individual URLs it looks like the variability is totally gone and the backend times are much more predictable. I've also tried to aggregate the backend times across several hundred URLs, but the data is too noisy. I'm pretty sure that the sample of individual URLs I'm looking at is fair, though (whatever that means), so I'm going to say that this issue is resolved.

Thanks a MILLION for chasing this one down. I learned a lot through your detailed debugging.

zeman commented 6 years ago

Woot, thanks @pmeenan