Open 90n20 opened 1 month ago
After a deep analysis of pipeline/job triggers and their resulting logs, the nature of the "Cannot retrieve result as auto_remove is enabled" error suggests that the intermittent failures in Zuul stem from a combination of system resource settings and specific Docker configurations on the worker nodes; that is, the problem could be linked to the environment or configuration of the workers rather than to the pipeline itself.
After each pipeline run, Zuul provides the following worker-specific logs:
All worker nodes are identically configured, so Docker configuration, resource limits, and available disk space are the same on every run; a shared misconfiguration in any of these could therefore contribute to the issue.
With the available information, there are no signs of heavy load during execution.
Intermittent issues can sometimes result from network-related problems between the Zuul controller and the workers or storage backends. This could explain why some runs hit timeouts, delays, or disconnects affecting the Docker instances.
It could be useful to examine the Zuul scheduler and executor logs for patterns, such as peak usage times or specific worker nodes that are more prone to failure, which could highlight systemic issues.
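As a rough starting point, the error could simply be counted per executor host. The sketch below assumes Ansible access to the executors and a conventional log path (`/var/log/zuul/executor-debug.log`); both the path and the task names are assumptions to adapt to the actual deployment.

```yaml
# Sketch: count occurrences of the error on each executor host.
# The log path is an assumption; adjust it to the actual deployment.
- name: Count auto_remove errors in the executor log
  ansible.builtin.shell: >
    grep -c 'Cannot retrieve result as auto_remove is enabled'
    /var/log/zuul/executor-debug.log || true
  register: autoremove_errors
  changed_when: false

- name: Report the per-host error count
  ansible.builtin.debug:
    msg: "{{ inventory_hostname }}: {{ autoremove_errors.stdout }} occurrences"
```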
The error message "Cannot retrieve result as auto_remove is enabled" suggests that the container may be automatically removed before Zuul can collect the necessary output. This setting could interfere with the job completion process, especially if the container exits quickly after finishing.
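This message matches the behaviour of Ansible's `community.docker.docker_container` module: with `detach` disabled it tries to read the container's output after it exits, but `auto_remove` lets Docker delete the container first. A minimal sketch of the safer combination is below; the task names and image are placeholders, not the actual job definition.

```yaml
# Sketch: keep the container around long enough to read its output,
# then remove it explicitly. Names and image are placeholders.
- name: Run the scan container and collect its output
  community.docker.docker_container:
    name: scan-job
    image: example/scanner:latest
    detach: false        # wait for the container to finish
    auto_remove: false   # do not let Docker delete it before the result is read
    cleanup: true        # remove it after the module has collected the output
  register: scan_run

- name: Show what the container printed
  ansible.builtin.debug:
    var: scan_run.container.Output
```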
Although this behavior might not appear locally, different configurations on the worker node (especially around resource usage or Docker runtime) can lead to such discrepancies.
The worker node has no swap memory configured, meaning any spike in memory usage could lead to an Out-of-Memory (OOM) kill of the job's processes. With only physical memory available, sudden resource demands might terminate processes unexpectedly, especially under load.
However, this does not seem to be the case (though it is worth noting, as it might impact future runs).
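One way to confirm or rule this out after a failed run is to look for OOM-killer entries in the worker's kernel log; a minimal sketch, assuming Ansible access to the worker:

```yaml
# Sketch: search the kernel ring buffer for OOM-killer activity on the worker.
- name: Check for OOM kills
  ansible.builtin.shell: dmesg -T | grep -i 'out of memory' || true
  register: oom_hits
  become: true
  changed_when: false

- name: Show any OOM-killer entries
  ansible.builtin.debug:
    var: oom_hits.stdout_lines
```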
The available network interface on the worker has a high MTU of 8942. This could contribute to packet fragmentation or drops if MTU values are mismatched across network segments. Lowering the MTU to the standard 1500 may improve connectivity stability, reducing the risk of network-induced job failures.
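For reference, a non-persistent way to try the standard MTU on a worker would be the following; the interface name `eth0` is an assumption, and a permanent change would go through netplan or the image's network configuration instead.

```yaml
# Sketch: lower the MTU at runtime only; this does not survive a reboot.
# "eth0" is a placeholder for the worker's actual interface name.
- name: Set the interface MTU to 1500
  ansible.builtin.command: ip link set dev eth0 mtu 1500
  become: true
```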
AppArmor is enabled, hence its policies could be affecting container execution. While AppArmor generally allows containerized applications to function normally, specific restrictions might impact networking or filesystem operations in the container, leading to intermittent failures.
In fact, after discussions with @gtema, we saw that SELinux (in lab environments) blocks parts of the pipelines, such as gvmd PID and socket creation, and that some of the tools (the projectdiscovery and Greenbone ones) are not very tolerant of specific configurations, such as being executed with root or non-root privileges.
Note that SELinux is disabled on the workers, according to the logs.
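Since SELinux is disabled on the workers, AppArmor is the remaining LSM to rule out. A quick diagnostic (not a permanent fix) is to run the job's container unconfined and check whether the intermittent failures disappear; a sketch with placeholder names:

```yaml
# Sketch, diagnostic only: run the container without AppArmor confinement
# to see whether the profile is what intermittently breaks the job.
- name: Run the scan container unconfined
  community.docker.docker_container:
    name: scan-job-apparmor-test    # placeholder name
    image: example/scanner:latest   # placeholder image
    detach: false
    auto_remove: false
    cleanup: true
    security_opts:
      - apparmor=unconfined
```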
The DNS configuration relies on a local resolver (127.0.0.53), which can sometimes cause intermittent resolution issues if the cache becomes unreliable. Configuring fallback nameservers might stabilize DNS lookups, especially for remote scan targets or API calls within the pipeline jobs.
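Assuming the workers use systemd-resolved (which is what the 127.0.0.53 stub resolver indicates), fallback nameservers can be added to /etc/systemd/resolved.conf; the addresses below are placeholders, not a recommendation for this environment.

```yaml
# Sketch: add fallback nameservers to systemd-resolved and restart it.
# The chosen resolvers are placeholders; use whatever fits the environment.
- name: Configure FallbackDNS for systemd-resolved
  ansible.builtin.lineinfile:
    path: /etc/systemd/resolved.conf
    regexp: '^#?FallbackDNS='
    line: 'FallbackDNS=1.1.1.1 8.8.8.8'
  become: true

- name: Restart systemd-resolved to apply the change
  ansible.builtin.service:
    name: systemd-resolved
    state: restarted
  become: true
```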
All of this is being tested on a local Zuul instance with pipeline debug mode enabled, in order to get more granularity on what is happening.
The auto_remove flag was removed in the latest changes, hopefully allowing us to debug the issue better.
It seems that after PRs #10 and #13 were merged, the pipeline instability has been resolved.
This suggests that the errors could be related to resource constraints, as part of the changes involves increasing the worker VM vCPUs to 4 (using a larger Ubuntu image).
We are still monitoring triggers in order to identify other potential issues and provide a better explanation.
Since mid-September, I have been observing anomalous behavior in the execution of security pipelines in Zuul, resulting in errors that prevent them from completing.
As can be seen at , there are executions that proceed without any issues; however, many others fail with the error "Cannot retrieve result as auto_remove is enabled".
At first, it might seem like an error in the Docker container being spawned, but both the container and the pipeline itself work correctly in a local test environment.
After reviewing the logs, I have not been able to determine the source of the problem, especially since runs succeed and fail without following any logical pattern.
Could this be an issue related to Zuul or the worker executing the pipeline?