elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats
Other
12.16k stars 4.91k forks source link

Filebeat (Python) Unit Tests are flaky on Buildkite macOS x86_64 agents especially on `8.13` branch #39613

Closed dliappis closed 4 months ago

dliappis commented 5 months ago

Flaky Test

Stack Trace

tests/system/test_registrar.py

Buildkite failure link: https://buildkite.com/elastic/filebeat/builds/6343#018f825c-39a5-4588-8257-9bac31116e69/1804-1935

=================================== FAILURES ===================================
________________________ Test.test_restart_state_reset _________________________
self = <test_registrar.Test testMethod=test_restart_state_reset>
    def test_restart_state_reset(self):
        """
        Test that ttl is set to -1 after restart and no inputs covering it
        """
        self.render_config_template(
            path=os.path.abspath(self.working_dir) + "/log/*",
            clean_inactive="10s",
            ignore_older="5s"
        )
        os.mkdir(self.working_dir + "/log/")
        testfile_path = self.working_dir + "/log/test.log"
        with open(testfile_path, 'w') as testfile:
            testfile.write("Hello World\n")
        filebeat = self.start_beat()
        # Wait until state written
>       self.wait_until(
            lambda: self.output_has(lines=1),
            max_timeout=30)
tests/system/test_registrar.py:991:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <test_registrar.Test testMethod=test_restart_state_reset>
cond = <function Test.test_restart_state_reset.<locals>.<lambda> at 0x10eb79870>
max_timeout = 30, poll_interval = 0.1, name = 'cond', err_msg = ''
    def wait_until(self, cond, max_timeout=20, poll_interval=0.1, name="cond", err_msg=""):
        """
        TODO: this can probably be a "wait_until_output_count", among other things, since that could actually use `self`, and this can become an internal function
        Waits until the cond function returns true,
        or until the max_timeout is reached. Calls the cond
        function every poll_interval seconds.
        If the max_timeout is reached before cond() returns
        true, an exception is raised.
        """
        start = datetime.now()
        while not cond():
            if datetime.now() - start > timedelta(seconds=max_timeout):
                print("Test has failed, here are the Beat logs")
                for l in self.get_log_lines():
                    print(l)
>               raise WaitTimeoutError(
                    f"Timeout waiting for condition '{name}'. Waited {max_timeout} seconds: {err_msg}")
E               beat.beat.WaitTimeoutError: Timeout waiting for condition 'cond'. Waited 30 seconds:
../libbeat/tests/system/beat/beat.py:449: WaitTimeoutError

tests/system/test_shutdown.py

Buildkite failure link: https://buildkite.com/elastic/filebeat/builds/6343#018f858e-f9e1-46d0-ac2a-806e13454195/1889-1939

______________________________ Test.test_shutdown ______________________________
self = <test_shutdown.Test testMethod=test_shutdown>
    @unittest.skipIf(platform.platform().startswith("Windows-7"),
                     "Flaky test: https://github.com/elastic/beats/issues/22795")
    def test_shutdown(self):
        """
        Test starting and stopping Filebeat under load.
        """
        self.nasa_logs()
        self.render_config_template(
            path=os.path.abspath(self.working_dir) + "/log/*",
            ignore_older="1h"
        )
        for i in range(1, 5):
            proc = self.start_beat(logging_args=["-e", "-v"])
            time.sleep(.5)
>           proc.check_kill_and_wait()
tests/system/test_shutdown.py:31:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../libbeat/tests/system/beat/beat.py:153: in check_kill_and_wait
    return self.check_wait(exit_code=exit_code)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <beat.beat.Proc object at 0x105d7a9e0>, exit_code = 0
    def check_wait(self, exit_code=0):
        """
        check_wait waits for the process to exit, and checks the return code of the process
        """
        actual_exit_code = self.wait()
>       assert actual_exit_code == exit_code, f"Expected exit code to be {exit_code}, but it was {actual_exit_code}"
E       AssertionError: Expected exit code to be 0, but it was -15
../libbeat/tests/system/beat/beat.py:133: AssertionError
- generated xml file: /Users/admin/builds/bk-agent-prod-orka-1715932705515043799/elastic/filebeat/filebeat/build/TEST-python-unit.xml -
============================= slowest 20 durations =============================
27.73s call     filebeat/tests/system/test_crawler.py::Test::test_encodings
15.23s call     filebeat/tests/system/test_harvester.py::Test::test_close_timeout
10.66s call     filebeat/tests/system/test_reload_inputs.py::Test::test_reload_same_input
9.70s call     filebeat/tests/system/test_stdin.py::Test::test_stdin_eof
9.20s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl_with_space
8.49s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state
8.15s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl_no_clean_inactive
8.05s call     filebeat/tests/system/test_registrar.py::Test::test_registry_file_default_permissions
7.30s call     filebeat/tests/system/test_registrar.py::Test::test_rotating_file_with_restart
7.28s call     filebeat/tests/system/test_registrar.py::Test::test_rotating_file_inode
7.03s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl
6.95s call     filebeat/tests/system/test_reload_inputs.py::Test::test_reload_add
6.91s call     filebeat/tests/system/test_reload_inputs.py::Test::test_start_stop
6.87s call     filebeat/tests/system/test_registrar.py::Test::test_state_after_rotation_ignore_older
6.82s call     filebeat/tests/system/test_reload_inputs.py::Test::test_reload_same_config
6.54s call     filebeat/tests/system/test_harvester.py::Test::test_decode_error
6.48s call     filebeat/tests/system/test_crawler.py::Test::test_utf8
6.36s call     filebeat/tests/system/test_input.py::Test::test_rotating_close_inactive_low_write_rate
6.27s call     filebeat/tests/system/test_input.py::Test::test_harvester_limit
6.13s call     filebeat/tests/system/test_input.py::Test::test_input_filter_dropfields
=========================== short test summary info ============================
FAILED tests/system/test_shutdown.py::Test::test_shutdown - AssertionError: Expected exit code to be 0, but it was -15
============ 1 failed, 189 passed, 175 skipped in 608.83s (0:10:08) ============
elasticmachine commented 5 months ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

dliappis commented 4 months ago

Also seen in CI runs for merge commits against main e.g. in https://buildkite.com/elastic/filebeat/builds/6466

alexsapran commented 4 months ago

@pierrehilbert this is something that is affecting your CI metrics.

VihasMakwana commented 4 months ago

This test is related to this https://elastic.slack.com/archives/C047NDNGUMU/p1717091056290869. On MacOS, the FQDN lookup takes some time (~5 seconds) before signal handlers are even set up.

We can fix this behaviour in either of the following ways:

  1. Wait for a bit longer (~10 seconds) for MacOS for the filebeat to start up.
  2. Skip this test, as the underlying FQDN is itself flaky and throws warning.

cc: @pierrehilbert

pierrehilbert commented 4 months ago

Let's go with the option 1 and if this is not solving the issue, let's skip the test only on MacOS