envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.65k stars 4.74k forks source link

FilterIntegrationTest.AltSvcCachedH2Slow is flakey #32151

Open RyanTheOptimist opened 7 months ago

RyanTheOptimist commented 7 months ago

I've seen this test flake from time to time in CI. Here's a recent example:

https://dev.azure.com/cncf/envoy/_build/results?buildId=161250&view=logs&jobId=767be981-567e-57d8-68c3-2140ede0a0bd&j=767be981-567e-57d8-68c3-2140ede0a0bd&t=2181edf2-f610-59f2-c43a-04bb9d0bca00

@abeyad @alyssawilk

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

jmarantz commented 6 months ago

This flaked for me in an unrelated PR. Revert for now?

https://dev.azure.com/cncf/envoy/_build/results?buildId=164249&view=logs&j=767be981-567e-57d8-68c3-2140ede0a0bd&t=2181edf2-f610-59f2-c43a-04bb9d0bca00

2024-03-06T20:24:18.6721900Z [ RUN      ] Protocols/FilterIntegrationTest.AltSvcCachedH2Slow/IPv6_Http2Downstream_Http3UpstreamHttpParserNghttp2WithDeferredProcessingLegacy
2024-03-06T20:24:18.6723371Z ./test/integration/server.h:455: Failure
2024-03-06T20:24:18.6724597Z Value of: TestUtility::waitForGaugeEq(statStore(), name, value, time_system_, timeout)
2024-03-06T20:24:18.6726361Z   Actual: false (timed out waiting for cluster.cluster_0.circuit_breakers.default.remaining_cx_pools to be 100, current value 99)
2024-03-06T20:24:18.6727727Z Expected: true
2024-03-06T20:24:18.6728283Z Stack trace:
2024-03-06T20:24:18.6729250Z   0xaec5fc: Envoy::(anonymous namespace)::FilterIntegrationTest_AltSvcCachedH2Slow_Test::TestBody()
2024-03-06T20:24:18.6730644Z   0x1d5eec8: testing::internal::HandleExceptionsInMethodIfSupported<>()
2024-03-06T20:24:18.6731596Z   0x1d5ed64: testing::Test::Run()
2024-03-06T20:24:18.6732399Z   0x1d5ff84: testing::TestInfo::Run()
2024-03-06T20:24:18.6733088Z ... Google Test internal frames ...
2024-03-06T20:24:18.6733567Z 
2024-03-06T20:24:18.6734954Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_enable_universal_header_validator to: false
2024-03-06T20:24:18.6737303Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_http1_use_balsa_parser to: true
2024-03-06T20:24:18.6739581Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_no_extension_lookup_by_name to: true
2024-03-06T20:24:18.6741870Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_runtime_initialized to: false
2024-03-06T20:24:18.6744178Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_quic_always_support_server_preferred_address to: true
2024-03-06T20:24:18.6746451Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_http2_use_oghttp2 to: true
2024-03-06T20:24:18.6749185Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_defer_processing_backedup_streams to: true
2024-03-06T20:24:18.6752775Z [  FAILED  ] 
abeyad commented 6 months ago

sorry just saw this, yes I'm noticing it flaked again too in other runs of CI, I'll revert the PR

abeyad commented 6 months ago

@alyssawilk this test flakes with this error:

[ RUN      ] Protocols/FilterIntegrationTest.AltSvcCachedH2Slow/IPv4_Http2Downstream_Http3UpstreamHttpParserNghttp2NoDeferredProcessingLegacy
./test/integration/server.h:455: Failure
Value of: TestUtility::waitForGaugeEq(statStore(), name, value, time_system_, timeout)
  Actual: false (timed out waiting for cluster.cluster_0.circuit_breakers.default.remaining_cx_pools to be 100, current value 99)
Expected: true
Stack trace:
  0xaebfbc: Envoy::(anonymous namespace)::FilterIntegrationTest_AltSvcCachedH2Slow_Test::TestBody()
  0x1d5e3f8: testing::internal::HandleExceptionsInMethodIfSupported<>()
  0x1d5e294: testing::Test::Run()
  0x1d5f4b4: testing::TestInfo::Run()
... Google Test internal frames ...

For example: https://dev.azure.com/cncf/envoy/_build/results?buildId=164325&view=logs&j=767be981-567e-57d8-68c3-2140ede0a0bd&t=2181edf2-f610-59f2-c43a-04bb9d0bca00

I can't reproduce it locally after 1000 runs.

It's happening on this line in the test: https://github.com/envoyproxy/envoy/blob/b7bc42064b030c6bae4998e3b1568bb0023562b9/test/extensions/filters/http/alternate_protocols_cache/filter_integration_test.cc#L325.

We don't set a timeout in our waitForGaugeEq call, so the DEFAULT_TIMEOUT of 10s is used: https://github.com/envoyproxy/envoy/blob/b7bc42064b030c6bae4998e3b1568bb0023562b9/test/integration/server.h#L454. That should be enough for the pool to tear down.

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

adisuissa commented 4 months ago

Just observed another test flake failure, but maybe it is of a different nature:

[ RUN      ] Protocols/FilterIntegrationTest.AltSvcCachedH2Slow/IPv4_Http2Downstream_Http3UpstreamHttpParserOghttp2WithDeferredProcessingLegacy
[mutex.cc : 1959] RAW: Check (v & (kMuWait | kMuWrWait)) != kMuWrWait failed: Lock: Mutex corrupt: waiting writer with no waiters: 0x67efe291e60
[2024-04-18 11:09:45.770][28517][critical][backtrace] [./source/server/backtrace.h:104] Caught Aborted, suspect faulting address 0x3e900006c1c
[2024-04-18 11:09:45.770][28517][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2024-04-18 11:09:45.770][28517][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: 0/1.29.4-dev/test/RELEASE/BoringSSL
[2024-04-18 11:09:45.770][28517][critical][backtrace] [./source/server/backtrace.h:96] #0: __kernel_rt_sigreturn [0xffffb41d38fc]
[2024-04-18 11:09:45.770][28517][critical][backtrace] [./source/server/backtrace.h:96] #1: abort [0xffffb3f48aac]
[2024-04-18 11:09:45.780][28517][critical][backtrace] [./source/server/backtrace.h:96] #2: absl::lts_20230802::raw_log_internal::RawLog() [0x1dc3944]
[2024-04-18 11:09:45.790][28517][critical][backtrace] [./source/server/backtrace.h:96] #3: absl::lts_20230802::Mutex::LockSlowLoop() [0x1d9c404]
[2024-04-18 11:09:45.801][28517][critical][backtrace] [./source/server/backtrace.h:96] #4: absl::lts_20230802::Mutex::LockSlowWithDeadline() [0x1d9b568]
[2024-04-18 11:09:45.811][28517][critical][backtrace] [./source/server/backtrace.h:96] #5: absl::lts_20230802::Mutex::LockSlow() [0x1d9b364]
[2024-04-18 11:09:45.821][28517][critical][backtrace] [./source/server/backtrace.h:96] #6: absl::lts_20230802::Notification::WaitForNotification() [0x1d9e010]
[2024-04-18 11:09:45.831][28517][critical][backtrace] [./source/server/backtrace.h:96] #7: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x1742418]
[2024-04-18 11:09:45.840][28517][critical][backtrace] [./source/server/backtrace.h:96] #8: event_process_active_single_queue [0x1bea698]
[2024-04-18 11:09:45.850][28517][critical][backtrace] [./source/server/backtrace.h:96] #9: event_base_loop [0x1be91dc]
[2024-04-18 11:09:45.860][28517][critical][backtrace] [./source/server/backtrace.h:96] #10: Envoy::FakeUpstream::threadRoutine() [0xbadad0]
[2024-04-18 11:09:45.870][28517][critical][backtrace] [./source/server/backtrace.h:96] #11: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x1be1b84]
[2024-04-18 11:09:45.870][28517][critical][backtrace] [./source/server/backtrace.h:96] #12: start_thread [0xffffb40a2624]

Original details here: https://dev.azure.com/cncf/envoy/_build/results?buildId=168331&view=logs&j=767be981-567e-57d8-68c3-2140ede0a0bd&t=2181edf2-f610-59f2-c43a-04bb9d0bca00&l=289