Open RyanTheOptimist opened 7 months ago
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This flaked for me in an unrelated PR. Revert for now?
2024-03-06T20:24:18.6721900Z [ RUN ] Protocols/FilterIntegrationTest.AltSvcCachedH2Slow/IPv6_Http2Downstream_Http3UpstreamHttpParserNghttp2WithDeferredProcessingLegacy
2024-03-06T20:24:18.6723371Z ./test/integration/server.h:455: Failure
2024-03-06T20:24:18.6724597Z Value of: TestUtility::waitForGaugeEq(statStore(), name, value, time_system_, timeout)
2024-03-06T20:24:18.6726361Z Actual: false (timed out waiting for cluster.cluster_0.circuit_breakers.default.remaining_cx_pools to be 100, current value 99)
2024-03-06T20:24:18.6727727Z Expected: true
2024-03-06T20:24:18.6728283Z Stack trace:
2024-03-06T20:24:18.6729250Z 0xaec5fc: Envoy::(anonymous namespace)::FilterIntegrationTest_AltSvcCachedH2Slow_Test::TestBody()
2024-03-06T20:24:18.6730644Z 0x1d5eec8: testing::internal::HandleExceptionsInMethodIfSupported<>()
2024-03-06T20:24:18.6731596Z 0x1d5ed64: testing::Test::Run()
2024-03-06T20:24:18.6732399Z 0x1d5ff84: testing::TestInfo::Run()
2024-03-06T20:24:18.6733088Z ... Google Test internal frames ...
2024-03-06T20:24:18.6733567Z
2024-03-06T20:24:18.6734954Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_enable_universal_header_validator to: false
2024-03-06T20:24:18.6737303Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_http1_use_balsa_parser to: true
2024-03-06T20:24:18.6739581Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_no_extension_lookup_by_name to: true
2024-03-06T20:24:18.6741870Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_runtime_initialized to: false
2024-03-06T20:24:18.6744178Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_quic_always_support_server_preferred_address to: true
2024-03-06T20:24:18.6746451Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_http2_use_oghttp2 to: true
2024-03-06T20:24:18.6749185Z [external/com_google_absl/absl/flags/internal/flag.cc : 115] RAW: Restore saved value of envoy_reloadable_features_defer_processing_backedup_streams to: true
2024-03-06T20:24:18.6752775Z [ FAILED ]
sorry just saw this, yes I'm noticing it flaked again too in other runs of CI, I'll revert the PR
@alyssawilk this test flakes with this error:
[ RUN ] Protocols/FilterIntegrationTest.AltSvcCachedH2Slow/IPv4_Http2Downstream_Http3UpstreamHttpParserNghttp2NoDeferredProcessingLegacy
./test/integration/server.h:455: Failure
Value of: TestUtility::waitForGaugeEq(statStore(), name, value, time_system_, timeout)
Actual: false (timed out waiting for cluster.cluster_0.circuit_breakers.default.remaining_cx_pools to be 100, current value 99)
Expected: true
Stack trace:
0xaebfbc: Envoy::(anonymous namespace)::FilterIntegrationTest_AltSvcCachedH2Slow_Test::TestBody()
0x1d5e3f8: testing::internal::HandleExceptionsInMethodIfSupported<>()
0x1d5e294: testing::Test::Run()
0x1d5f4b4: testing::TestInfo::Run()
... Google Test internal frames ...
I can't reproduce it locally after 1000 runs.
It's happening on this line in the test: https://github.com/envoyproxy/envoy/blob/b7bc42064b030c6bae4998e3b1568bb0023562b9/test/extensions/filters/http/alternate_protocols_cache/filter_integration_test.cc#L325.
We don't set a timeout in our waitForGaugeEq call, so the DEFAULT_TIMEOUT of 10s is used: https://github.com/envoyproxy/envoy/blob/b7bc42064b030c6bae4998e3b1568bb0023562b9/test/integration/server.h#L454. That should be enough for the pool to tear down.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Just observed another test flake failure, but maybe it is of a different nature:
[ RUN ] Protocols/FilterIntegrationTest.AltSvcCachedH2Slow/IPv4_Http2Downstream_Http3UpstreamHttpParserOghttp2WithDeferredProcessingLegacy
[mutex.cc : 1959] RAW: Check (v & (kMuWait | kMuWrWait)) != kMuWrWait failed: Lock: Mutex corrupt: waiting writer with no waiters: 0x67efe291e60
[2024-04-18 11:09:45.770][28517][critical][backtrace] [./source/server/backtrace.h:104] Caught Aborted, suspect faulting address 0x3e900006c1c
[2024-04-18 11:09:45.770][28517][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2024-04-18 11:09:45.770][28517][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: 0/1.29.4-dev/test/RELEASE/BoringSSL
[2024-04-18 11:09:45.770][28517][critical][backtrace] [./source/server/backtrace.h:96] #0: __kernel_rt_sigreturn [0xffffb41d38fc]
[2024-04-18 11:09:45.770][28517][critical][backtrace] [./source/server/backtrace.h:96] #1: abort [0xffffb3f48aac]
[2024-04-18 11:09:45.780][28517][critical][backtrace] [./source/server/backtrace.h:96] #2: absl::lts_20230802::raw_log_internal::RawLog() [0x1dc3944]
[2024-04-18 11:09:45.790][28517][critical][backtrace] [./source/server/backtrace.h:96] #3: absl::lts_20230802::Mutex::LockSlowLoop() [0x1d9c404]
[2024-04-18 11:09:45.801][28517][critical][backtrace] [./source/server/backtrace.h:96] #4: absl::lts_20230802::Mutex::LockSlowWithDeadline() [0x1d9b568]
[2024-04-18 11:09:45.811][28517][critical][backtrace] [./source/server/backtrace.h:96] #5: absl::lts_20230802::Mutex::LockSlow() [0x1d9b364]
[2024-04-18 11:09:45.821][28517][critical][backtrace] [./source/server/backtrace.h:96] #6: absl::lts_20230802::Notification::WaitForNotification() [0x1d9e010]
[2024-04-18 11:09:45.831][28517][critical][backtrace] [./source/server/backtrace.h:96] #7: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x1742418]
[2024-04-18 11:09:45.840][28517][critical][backtrace] [./source/server/backtrace.h:96] #8: event_process_active_single_queue [0x1bea698]
[2024-04-18 11:09:45.850][28517][critical][backtrace] [./source/server/backtrace.h:96] #9: event_base_loop [0x1be91dc]
[2024-04-18 11:09:45.860][28517][critical][backtrace] [./source/server/backtrace.h:96] #10: Envoy::FakeUpstream::threadRoutine() [0xbadad0]
[2024-04-18 11:09:45.870][28517][critical][backtrace] [./source/server/backtrace.h:96] #11: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x1be1b84]
[2024-04-18 11:09:45.870][28517][critical][backtrace] [./source/server/backtrace.h:96] #12: start_thread [0xffffb40a2624]
Original details here: https://dev.azure.com/cncf/envoy/_build/results?buildId=168331&view=logs&j=767be981-567e-57d8-68c3-2140ede0a0bd&t=2181edf2-f610-59f2-c43a-04bb9d0bca00&l=289
I've seen this test flake from time to time in CI. Here's a recent example:
https://dev.azure.com/cncf/envoy/_build/results?buildId=161250&view=logs&jobId=767be981-567e-57d8-68c3-2140ede0a0bd&j=767be981-567e-57d8-68c3-2140ede0a0bd&t=2181edf2-f610-59f2-c43a-04bb9d0bca00
@abeyad @alyssawilk