falcosecurity / test-infra

Falco workflow & testing infrastructure
https://prow.falco.org
Apache License 2.0
30 stars 109 forks source link

Drivers build not triggered for x86_64 #1158

Open maxgio92 opened 1 year ago

maxgio92 commented 1 year ago

Describe the bug

Not all driver building-postsubmit jobs have been triggered after new driverkit configurations have been introduced - as usual weekly crawling result from kernel-crawler.

How to reproduce it

Please check:

Expected behaviour

All the jobs for x86 to have been triggered accordingly.

FedeDP commented 1 year ago

Note: all the jobs have always been instead triggered correctly during normal weekly kernel-crawler CI. This time, instead, as we added new driver version (therefore copy/pasting an existing driver version to a new folder) and removed the oldest one (2.0.0+driver), not all jobs were triggered.

At the same time, PR CI correctly triggered all presubmit jobs instead.

FedeDP commented 1 year ago

I might have found a good candidate; note the difference between the presubmit and postsubmit Prow documentation about run_if_changed:

Pre:

Presubmit jobs apply run_if_changed and skip_if_only_changed filters based on which files were modified in any of the commits in the pull request.

Post:

Postsubmit jobs apply run_if_changed and skip_if_only_changed filters based on which files were modified by the commits included in the specific push event from github.

So, postsubmit are based on the specific push event triggered by github. What if, given the large size of the diff/high number of files changed, the push event is somewhat truncated, ie: not all changed files are pushed?

Payloads are capped to 25M: https://docs.github.com/en/webhooks-and-events/webhooks/webhook-events-and-payloads#push

Note: Payloads are capped at 25 MB. If your event generates a larger payload, a webhook will not be fired. This may happen, for example, on a create event if many branches or tags are pushed at once. We suggest monitoring your payload size to ensure delivery.

But this is not our case imho because Prow receives the event with a full payload (otherwise it won't be able to decode it); but there might be some other hard limits on the number of files changes pushed with the event.

Prow code confirms this behavior: https://github.com/kubernetes/test-infra/blob/b7c54c4d7991d27ee926d184d3a09c7b7738c65b/prow/plugins/trigger/push.go#L28: as you can see, it uses PushEvent data to fetch all changed files. Note how the presubmit instead uses normal git client to fetch list of changed files: https://github.com/kubernetes/test-infra/blob/b7c54c4d7991d27ee926d184d3a09c7b7738c65b/prow/plugins/trigger/pull-request.go#L358C37-L358C37 (https://github.com/kubernetes/test-infra/blob/master/prow/config/jobs.go#L537)

FedeDP commented 1 year ago

So, github http api does limit list of files changed to 3000: https://docs.github.com/en/rest/pulls/pulls?apiVersion=2022-11-28#list-pull-requests-files

This is the method used to fetch pull request changes; therefore i'd expect them to be cut at 3000 files too. Instead, we are experiencing the issue only on push to master. Push events have the same 3000 changes hard limit.

I don't get why this is working on pull request but not on push to master :/ I'd expect the issue to be present on pull request too!

FedeDP commented 1 year ago

I now get why it is working on presubmits! Our presubmits jobs are not arch dependent; therefore we are effectively mostly triggering on aarch64 configs, BUT we trigger presubmit jobs for both arch (since they are not arch dependent). On postsubmit, instead, given that we have aarch64 and x86_64 build jobs, we mostly trigger only aarch64 build and some x86_64 until we reach the 3000 files changed limit.

FedeDP commented 1 year ago

I see 2 main paths forward:

FedeDP commented 12 months ago

Adding 2 more possible "fixes":

FedeDP commented 10 months ago

split our workflow to open up a PR on test-infra for each crawled distro; this does not require any change to Prow (so, much quicker fix), but at the same time, it means we will open (and have to check/approve) like 15PRs each week...

Run the update-kernels workflow daily instead of weekly: running it more frequently will generate less diff thus we would be more confident that all jobs get triggered; it has also the benefit that we would be much quicker to support new kernels

Note: these ones would not fix the "new driverversion added" issues.

FedeDP commented 10 months ago

A short term solution would be to change all post submit triggers to trigger to any change to driverkit/config/... folder, ie from this:

run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/almalinux_.+'

to this:

run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/.+'

In this case, we would trigger ALL jobs everytime, even for unmodified configs, then each job would build its own configs.

PROs:

CONs:

Let's assume that each week we build likely ~90% of supported distro, it would not be that impactful (ie: we would just spawn 10% useless pods). The "big" issue would rise when we manually trigger kernel-crawler on a single distro; in that case, instead of just building new drivers for that distro, we would spawn all pods to build all drivers (whose build should be skipped anyway because they are already present on download.falco.org).

@maxgio92 wdyt?

FedeDP commented 10 months ago

In this case, we would trigger ALL jobs everytime, even for unmodified configs, then each job would build its own configs.

We could also let the entrypoint of each job do the diff HEAD~1 and decide whether it should do anything, even if i'd avoid adding too much logic in those scripts.

maxgio92 commented 10 months ago

A short term solution would be to change all post submit triggers to trigger to any change to driverkit/config/... folder, ie from this:

run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/almalinux_.+'

to this:

run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/.+'

In this case, we would trigger ALL jobs everytime, even for unmodified configs, then each job would build its own configs.

PROs:

* super easy solution

CONs:

* we would spawn useless pods

* we would spawn more nodes than needed

Let's assume that each week we build likely ~90% of supported distro, it would not be that impactful (ie: we would just spawn 10% useless pods). The "big" issue would rise when we manually trigger kernel-crawler on a single distro; in that case, instead of just building new drivers for that distro, we would spawn all pods to build all drivers (whose build should be skipped anyway because they are already present on download.falco.org).

@maxgio92 wdyt?

Thank you for this detailed investigation and sorry for my late response.

I'd try as hard as possible to avoid triggering all distros jobs.

maxgio92 commented 10 months ago

What about instead of reducing the granularity of the grid to increase the jobs frequency as you suggested, from weekly to one every two days for example, @FedeDP?

PS: in any case IMHO are both short term solutions.

FedeDP commented 10 months ago

What about instead of reducing the granularity of the grid to increase the jobs frequency as you suggested, from weekly to one every two days for example

That would unfortunately not solve the issues when we add a completely new driver version from scratch.

FedeDP commented 10 months ago

PS: in any case IMHO are both short term solutions.

Agree btw :/

poiana commented 7 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

FedeDP commented 7 months ago

/remove-lifecycle stale

poiana commented 4 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

FedeDP commented 4 months ago

/remove-lifecycle stale

poiana commented 1 month ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

FedeDP commented 1 month ago

/remove-lifecycle stale