elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Random failures with filebeat Python unit tests (`tests/system/*.py`) #40237

Closed dliappis closed 4 weeks ago

dliappis commented 1 month ago

Flaky Test

filebeat/tests/system/test_reload_modules.py

Stack Trace

tests/system/test_reload_modules.py:162:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <test_reload_modules.Test testMethod=test_start_stop>
cond = <function Test.test_start_stop.<locals>.<lambda> at 0x107171a20>
max_timeout = 5, poll_interval = 0.1, name = 'cond', err_msg = ''
    def wait_until(self, cond, max_timeout=20, poll_interval=0.1, name="cond", err_msg=""):
        """
        TODO: this can probably be a "wait_until_output_count", among other things, since that could actually use `self`, and this can become an internal function
        Waits until the cond function returns true,
        or until the max_timeout is reached. Calls the cond
        function every poll_interval seconds.
        If the max_timeout is reached before cond() returns
        true, an exception is raised.
        """
        start = datetime.now()
        while not cond():
            if datetime.now() - start > timedelta(seconds=max_timeout):
                print("Test has failed, here are the Beat logs")
                for l in self.get_log_lines():
                    print(l)
>               raise WaitTimeoutError(
                    f"Timeout waiting for condition '{name}'. Waited {max_timeout} seconds: {err_msg}")
E               beat.beat.WaitTimeoutError: Timeout waiting for condition 'cond'. Waited 5 seconds:
../libbeat/tests/system/beat/beat.py:449: WaitTimeoutError

However, many other Python tests also fail at random on this architecture, e.g. `tests/system/test_registrar.py`.

See the failures in https://buildkite.com/elastic/filebeat/builds/7757#_ as an example of various different Python tests failing.
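For anyone reading the trace without the repository at hand, here is a minimal standalone sketch of the polling pattern shown in `wait_until`. It is not the Beats implementation: it swaps `datetime.now()` for `time.monotonic()`, leaves out the log dump, and the condition name `module_reloaded` and its message are hypothetical. The point is only that passing `name` and `err_msg` makes the timeout message say what was being waited for, unlike the `'cond'` / empty message in the trace above.

```python
import time


class WaitTimeoutError(Exception):
    """Raised when the condition does not become true in time."""


def wait_until(cond, max_timeout=20, poll_interval=0.1, name="cond", err_msg=""):
    """Poll cond() until it returns a truthy value or max_timeout seconds pass."""
    # Assumption of this sketch: time.monotonic() replaces datetime.now()
    # so that wall-clock adjustments cannot shorten or lengthen the wait.
    deadline = time.monotonic() + max_timeout
    while not cond():
        if time.monotonic() > deadline:
            raise WaitTimeoutError(
                f"Timeout waiting for condition '{name}'. "
                f"Waited {max_timeout} seconds: {err_msg}")
        time.sleep(poll_interval)


def demo_timeout_message():
    """Return the error text produced when a (hypothetical) condition never holds."""
    try:
        wait_until(lambda: False, max_timeout=0.2, poll_interval=0.05,
                   name="module_reloaded",  # hypothetical condition name
                   err_msg="expected reload log line not seen")
    except WaitTimeoutError as exc:
        return str(exc)
    return None
```

Calling `demo_timeout_message()` returns a message that names the condition and the reason, which should make a CI failure easier to triage than a bare `'cond'`.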

rowlandgeoff commented 1 month ago

@dliappis Should this be owned by elastic-agent-data-plane?

dliappis commented 1 month ago

> @dliappis Should this be owned by elastic-agent-data-plane?

Hard to say from https://github.com/elastic/beats/blob/d40a9e37285850a28ee95c474d4c9b3ac9e9e792/.github/CODEOWNERS#L29-L52 (there is no category for tests), but it seems to fall under the default for /filebeat, and therefore under @elastic/elastic-agent-data-plane.

elasticmachine commented 1 month ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

dliappis commented 1 month ago

Another occurrence: https://buildkite.com/elastic/beats/builds/10061

oakrizan commented 1 month ago

Win 10 tests\system\test_reload_inputs.py::Test::test_start_stop is failing frequently as well on 8.15 for Filebeat

cmacknz commented 1 month ago

These tests all seem like they interact with the disk by changing or creating files.
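As a purely illustrative sketch (not taken from the Beats test suite, and not claimed to be the root cause here), this is one way a disk-dependent condition can be written so that a file which has not been created yet counts as "not yet" instead of raising. The helper name `file_contains` and the file name `test.log` are hypothetical.

```python
import os
import tempfile


def file_contains(path, needle):
    """Return True if the file exists and contains needle, else False."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return needle in f.read()
    except FileNotFoundError:
        # The file has not been created yet: report "not yet" rather than crash,
        # so a polling loop can simply try again.
        return False


def demo():
    """Check the predicate before and after the file is written."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.log")  # hypothetical file name
        before = file_contains(path, "hello")  # file does not exist yet
        with open(path, "w", encoding="utf-8") as f:
            f.write("hello world\n")
        after = file_contains(path, "hello")  # file exists and matches
        return before, after
```

A predicate like this can be handed to a polling helper as the condition, so the wait only ends once the file is actually on disk with the expected content.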

dliappis commented 1 month ago

Another occurrence in https://buildkite.com/elastic/filebeat/builds/8096#0190fe43-0eea-4be6-b97e-7af23c996ad3

VihasMakwana commented 1 month ago

@dliappis AFAIK, the issue lies with the flaky condition.

I'll raise a PR to fix it once I test it.

dliappis commented 1 month ago

> @dliappis AFAIK, the issue lies with the flaky condition.
>
> I'll raise a PR to fix it once I test it.

Thank you @VihasMakwana! Looking forward to the PR.

dliappis commented 4 weeks ago

Also seen on macOS x86_64, e.g. https://buildkite.com/elastic/filebeat/builds/8112#019102c2-b2f0-4d5e-9d1e-fea0709fd0f0 ; I changed the issue title accordingly. I believe it's the same root cause described in https://github.com/elastic/beats/issues/40237#issuecomment-2256030191

oakrizan commented 4 weeks ago

Detected failures on main for the following OSes:

- Filebeat: Ubuntu x86_64 Unit Tests: https://buildkite.com/elastic/filebeat/builds/8084#0190f0cb-272d-44bf-9ecd-258706ab0c01/97-440
- Filebeat: Win 10 Unit Tests: https://buildkite.com/elastic/filebeat/builds/8125#019103ca-184c-4981-b10a-63bdfaeaea68/65-691, https://buildkite.com/elastic/filebeat/builds/8088#0190fb32-0e95-419a-bdfa-7c6ae3b83d75/64-690
- Filebeat: macOS x86_64 Unit Tests: https://buildkite.com/elastic/filebeat/builds/8112#019102c2-b2f0-4d5e-9d1e-fea0709fd0f0/1938-2344, https://buildkite.com/elastic/filebeat/builds/8092#0190fdc4-b6fc-4828-a17d-5b9e32360c48/1936-2342
- Filebeat: macOS arm64 Unit Tests: https://buildkite.com/elastic/filebeat/builds/8096#0190fe02-4452-4815-b560-8f9502aae7c6/1859-2078

VihasMakwana commented 4 weeks ago

@dliappis I forgot to mention this earlier today, but I think these test failures are not specific to any platform or architecture. I guess they're flaky on all platforms.

VihasMakwana commented 4 weeks ago

I've changed the title.