aws / s2n-tls

An implementation of the TLS/SSL protocols
https://aws.github.io/s2n-tls/usage-guide/
Apache License 2.0

ci: flaky well-known endpoints test #3999

Open jmayclin opened 1 year ago

jmayclin commented 1 year ago

Problem:

The well-known endpoints test occasionally fails to negotiate a TLS connection with Amazon.com.

Another failure, for the well-known endpoint Amazon on TLS 1.0:

__________ test_well_known_endpoints[None-S2N-www.amazon.com-TLS1.0] ___________
Command '['s2nc', '--non-blocking', '-e', '-T', '-f', 
'../pems/trust-store/ca-bundle.trust.crt', '-c', 'test_all_tls12', 
'--enter-fips-mode', 'www.amazon.com', '443']' timed out after 5 seconds
 s2nc --non-blocking -e -T -f ../pems/trust-store/ca-bundle.trust.crt -c
 test_all_tls12 --enter-fips-mode www.amazon.com 443

log link

Solution:

Uncertain at the moment. I'm going to continue pasting failures here as I see them, until I can hopefully establish a pattern of behavior. E.g. does it always fail on the same TLS version?

Requirements / Acceptance Criteria:

This test must be reliable enough that when it fails, my first thought is "oh no, I have broken something with TLS negotiation" and not "I need to restart the test".

jmayclin commented 1 year ago

TLS 1.2 failure with amazon.com

logs

FAILED test_well_known_endpoints.py::test_well_known_endpoints[KMS-PQ-TLS-1-0-2019-06-S2N-www.amazon.com-TLS1.2]
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!
============== 1 failed, 75 passed, 10 rerun in 62.03s (0:01:02) ===============
py39: exit 2 (62.25 seconds) /codebuild/output/src275075414/src/github.com/aws/s2n-tls/tests/integrationv2> pytest -x -n=2 --maxfail=1 --reruns=2 --cache-clear -rpfsq -o log_cli=true --log-cli-level=INFO --provider-version=openssl-1.0.2 --provider-criterion=off --fips-mode=0 --no-pq=0 /codebuild/output/src275075414/src/github.com/aws/s2n-tls/tests/integrationv2/test_well_known_endpoints.py pid=23849
  py39: FAIL code 2 (62.68=setup[0.43]+cmd[62.25] seconds)
  evaluation failed :( (62.74 seconds)
AssertionError: Command '['s2nc', '--non-blocking', '-e', '-T', '-f', '../pems/trust-store/ca-bundle.crt', '-c', 'KMS-PQ-TLS-1-0-2019-06', 'www.amazon.com', '443']' timed out after 5 seconds
 s2nc --non-blocking -e -T -f ../pems/trust-store/ca-bundle.crt -c KMS-PQ-TLS-1-0-2019-06 www.amazon.com 443

jmayclin commented 1 year ago

TLS 1.1 failure with amazon.com

FAILED test_well_known_endpoints.py::test_well_known_endpoints[PQ-SIKE-TEST-TLS-1-0-2019-11-S2N-www.amazon.com-TLS1.1]

maddeleine commented 1 year ago

Issue #756 has an interesting idea: we could run both s_client and s2nc against the endpoint, and the test would fail only if s2nc fails while s_client succeeds. If both fail, we conclude there's something wrong with the endpoint and not us.
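
A rough sketch of that differential check, in case it helps the discussion. This is not how the integrationv2 test is currently written. The names `Verdict`, `run_ok`, `classify`, and `check_endpoint` are hypothetical, the s2nc flags are an abbreviated copy of the ones in the logs above (the `-f` trust store and `-c` policy arguments are left out), and it assumes `openssl s_client -connect host:port` is an acceptable reference client.

```python
import subprocess
from enum import Enum


class Verdict(Enum):
    PASS = "pass"                      # s2nc negotiated successfully
    S2N_FAILURE = "s2n failure"        # only s2nc failed: likely our bug
    ENDPOINT_FLAKY = "endpoint flaky"  # both clients failed: blame the endpoint


def run_ok(cmd, timeout=5):
    """Return True if cmd exits with status 0 within `timeout` seconds."""
    try:
        result = subprocess.run(
            cmd,
            input=b"",  # empty stdin so the client sees EOF and exits
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A timeout or a missing binary both count as a failed attempt.
        return False
    return result.returncode == 0


def classify(s2nc_ok, s_client_ok):
    """Map the two client outcomes onto a verdict."""
    if s2nc_ok:
        return Verdict.PASS
    if s_client_ok:
        return Verdict.S2N_FAILURE
    return Verdict.ENDPOINT_FLAKY


def check_endpoint(host, port=443, run=run_ok):
    """Try s2nc first; consult s_client only when s2nc fails."""
    s2nc_ok = run(["s2nc", "--non-blocking", "-e", "-T", host, str(port)])
    if s2nc_ok:
        return Verdict.PASS
    s_client_ok = run(["openssl", "s_client", "-connect", f"{host}:{port}"])
    return classify(s2nc_ok, s_client_ok)
```

With something like this, the test would assert that the verdict is not `Verdict.S2N_FAILURE`, and could report `Verdict.ENDPOINT_FLAKY` as a skip or expected failure instead of a hard failure. One caveat: s_client would need to be restricted to a comparable protocol version and cipher set, otherwise it can succeed for reasons unrelated to s2n.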

jmayclin commented 1 year ago

Certainly seems worth a try.

Although the easier solution is to just remove it from our list and replace it with some other well-known endpoint.

I'd also love for this to be considered a high priority, since adding the nix job to the CI checks has made this test's failure rate worse: we now need 2 successes instead of just 1.
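
To put a rough number on that: if each job fails independently with probability p, the chance that at least one of n required jobs fails is 1 - (1 - p)^n. The helper name `combined_failure_rate` and the 5% per-job rate below are made up for illustration; the real flake rate has not been measured in this thread, and the two jobs may not be fully independent since they hit the same endpoint.

```python
def combined_failure_rate(per_job_failure_rate, required_successes):
    """Probability that at least one of the required jobs fails,
    assuming each job fails independently at the same rate."""
    return 1 - (1 - per_job_failure_rate) ** required_successes


# Hypothetical 5% flake rate per job (an assumption, not a measurement).
one_job = combined_failure_rate(0.05, 1)
two_jobs = combined_failure_rate(0.05, 2)
print(f"one job: {one_job:.4f}, two jobs: {two_jobs:.4f}")
```

Under those assumptions, requiring two successes nearly doubles the chance of a spurious red build, from about 5% to about 9.75%.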