elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats
Other
91 stars 4.92k forks source link

Build timeouts on macos workflows #30510

Open matschaffer opened 2 years ago

matschaffer commented 2 years ago

I hit 4 macos failures while working on some beats PRs. All were macos jobs and appeared to be timeout related.

Flaky Test

Stack Trace

=================================== FAILURES ===================================
__________________ Test.test_http_endpoint_missing_auth_value __________________

self = <test_http_endpoint.Test testMethod=test_http_endpoint_missing_auth_value>

        def test_http_endpoint_missing_auth_value(self):
            """
            Test http_endpoint input with missing basic auth values.
            """
            options = """
      basic_auth: true
      username: testuser
      password:
    """
            self.get_config(options)
            filebeat = self.start_beat()
>           self.wait_until(lambda: self.log_contains(
                "username and password required when basicauth is enabled"))

tests/system/test_http_endpoint.py:192: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test_http_endpoint.Test testMethod=test_http_endpoint_missing_auth_value>
cond = <function Test.test_http_endpoint_missing_auth_value.<locals>.<lambda> at 0x10fb6e550>
max_timeout = 10, poll_interval = 0.1, name = 'cond'

    def wait_until(self, cond, max_timeout=10, poll_interval=0.1, name="cond"):
        """
        Waits until the cond function returns true,
        or until the max_timeout is reached. Calls the cond
        function every poll_interval seconds.

        If the max_timeout is reached before cond() returns
        true, an exception is raised.
        """
        start = datetime.now()
        while not cond():
            if datetime.now() - start > timedelta(seconds=max_timeout):
>               raise TimeoutError("Timeout waiting for '{}' to be true. ".format(name) +
                                   "Waited {} seconds.".format(max_timeout))
E               beat.beat.TimeoutError: Timeout waiting for 'cond' to be true. Waited 10 seconds.

../../libbeat/tests/system/beat/beat.py:365: TimeoutError
=================================== FAILURES ===================================
_________________ TestKeystore.test_keystore_with_present_key __________________

self = <test_keystore.TestKeystore testMethod=test_keystore_with_present_key>

    def test_keystore_with_present_key(self):
        """
        Test that we correctly do string replacement with values from the keystore
        """

        key = "elasticsearch_host"
        secret = "myeleasticsearchsecrethost"

        self.render_config_template(keystore_path=self.keystore_path, elasticsearch={
            'host': "${%s}:9200" % key
        })

        exit_code = self.run_beat(extra_args=["keystore", "create"])
        assert exit_code == 0

        self.add_secret(key, secret, True)
        proc = self.start_beat()
        self.wait_until(lambda: self.log_contains("myeleasticsearchsecrethost"))
        assert self.log_contains(secret)
>       proc.kill_and_wait()

tests/system/test_keystore.py:36: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../libbeat/tests/system/beat/beat.py:142: in kill_and_wait
    return self.wait()
../libbeat/tests/system/beat/beat.py:124: in wait
    return self.proc.wait()
/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py:1189: in wait
    return self._wait(timeout=timeout)
/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py:1917: in _wait
    (pid, sts) = self._try_wait(0)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Popen: returncode: None args: ['/Users/runner/work/beats/beats/filebeat/fil...>
wait_flags = 0

    def _try_wait(self, wait_flags):
        """All callers to this function MUST hold self._waitpid_lock."""
        try:
>           (pid, sts) = os.waitpid(self.pid, wait_flags)
E           Failed: Timeout >90.0s

/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py:[187](https://github.com/elastic/beats/runs/5268753394?check_suite_focus=true#step:7:187)5: Failed
=================================== FAILURES ===================================
___________________________ Test.test_non_recursive ____________________________

self = <test_file_integrity.Test testMethod=test_non_recursive>

    @unittest.skipIf(os.getenv("BUILD_ID") is not None and platform.system() == 'Darwin',
                     'Flaky test: elastic/beats/issues/246[78](https://github.com/elastic/beats/runs/5268753269?check_suite_focus=true#step:7:78)')
    def test_non_recursive(self):
        """
        file_integrity monitors watched directories (non recursive).
        """

        dirs = [self.temp_dir("auditbeat_test"),
                self.temp_dir("auditbeat_test")]

        with PathCleanup(dirs):
            self.render_config_template(
                modules=[{
                    "name": "file_integrity",
                    "extras": {
                        "paths": dirs,
                        "scan_at_start": False
                    }
                }],
            )
            proc = self.start_beat()

            # wait until the directories to watch are printed in the logs
            # this happens when the file_integrity module starts.
            # Case must be ignored under windows as capitalisation of paths
            # may differ
            self.wait_log_contains(escape_path(dirs[0]), max_timeout=30, ignore_case=True)

            file1 = os.path.join(dirs[0], 'file.txt')
            self.create_file(file1, "hello world!")

            file2 = os.path.join(dirs[1], 'file2.txt')
            self.create_file(file2, "Foo bar")

            # wait until file1 is reported before deleting. Otherwise the hash
            # might not be calculated
>           self.wait_log_contains("\"path\":\"{0}\"".format(escape_path(file1)), ignore_case=True)

tests/system/test_file_integrity.py:101: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../libbeat/tests/system/beat/beat.py:471: in wait_log_contains
    self.wait_until(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test_file_integrity.Test testMethod=test_non_recursive>
cond = <function TestCase.wait_log_contains.<locals>.<lambda> at 0x10520e[79](https://github.com/elastic/beats/runs/5268753269?check_suite_focus=true#step:7:79)0>
max_timeout = 10, poll_interval = 0.1, name = 'log_contains'

    def wait_until(self, cond, max_timeout=10, poll_interval=0.1, name="cond"):
        """
        TODO: this can probably be a "wait_until_output_count", among other things, since that could actually use `self`, and this can become an internal function
        Waits until the cond function returns true,
        or until the max_timeout is reached. Calls the cond
        function every poll_interval seconds.

        If the max_timeout is reached before cond() returns
        true, an exception is raised.
        """
        start = datetime.now()
        while not cond():
            if datetime.now() - start > timedelta(seconds=max_timeout):
>               raise WaitTimeoutError(
                    f"Timeout waiting for condition '{name}'. Waited {max_timeout} seconds.")
E               beat.beat.WaitTimeoutError: Timeout waiting for condition 'log_contains'. Waited 10 seconds.

../libbeat/tests/system/beat/beat.py:429: WaitTimeoutError
=== Failed
Error: go test returned a non-zero value: exit status 1
=== FAIL: heartbeat/scheduler TestSchedTaskLimits/runs_100_with_limit_not_configured (0.00s)
    testing.go:1169: test executed panic(nil) or runtime.Goexit: subtest may have called FailNow on a parent test
    --- FAIL: TestSchedTaskLimits/runs_100_with_limit_not_configured (0.00s)

=== FAIL: heartbeat/scheduler TestSchedTaskLimits (0.00s)
    assertion_compare.go:323: 
            Error Trace:    scheduler_test.go:239
                                        scheduler_test.go:268
            Error:          "5" is not greater than or equal to "100"
            Test:           TestSchedTaskLimits
            Messages:       []
matschaffer commented 2 years ago

3 passed on the first retry.

The auditbeat test continues to fail. The test looks like it's supposed to be skipped https://github.com/matschaffer/beats/blob/987250d475a88279782c00ec860acdf7f34c3741/auditbeat/tests/system/test_file_integrity.py#L66

But it's running anyway. I'm guessing BUILD_ID is something that jenkins was providing at the time the skip was added. Thinking we should probably add a CI check since github provides that https://docs.github.com/en/actions/learn-github-actions/environment-variables and jenkins probably should if it doesn't already.