Open maxgio92 opened 1 year ago
Note: all the jobs have always been instead triggered correctly during normal weekly kernel-crawler CI. This time, instead, as we added new driver version (therefore copy/pasting an existing driver version to a new folder) and removed the oldest one (2.0.0+driver), not all jobs were triggered.
At the same time, PR CI correctly triggered all presubmit jobs instead.
I might have found a good candidate; note the difference between the presubmit and postsubmit Prow documentation about run_if_changed
:
Pre:
Presubmit jobs apply run_if_changed and skip_if_only_changed filters based on which files were modified in any of the commits in the pull request.
Post:
Postsubmit jobs apply run_if_changed and skip_if_only_changed filters based on which files were modified by the commits included in the specific push event from github.
So, postsubmit are based on the specific push event triggered by github. What if, given the large size of the diff/high number of files changed, the push event is somewhat truncated, ie: not all changed files are pushed?
Payloads are capped to 25M: https://docs.github.com/en/webhooks-and-events/webhooks/webhook-events-and-payloads#push
Note: Payloads are capped at 25 MB. If your event generates a larger payload, a webhook will not be fired. This may happen, for example, on a create event if many branches or tags are pushed at once. We suggest monitoring your payload size to ensure delivery.
But this is not our case imho because Prow receives the event with a full payload (otherwise it won't be able to decode it); but there might be some other hard limits on the number of files changes pushed with the event.
Prow code confirms this behavior: https://github.com/kubernetes/test-infra/blob/b7c54c4d7991d27ee926d184d3a09c7b7738c65b/prow/plugins/trigger/push.go#L28: as you can see, it uses PushEvent
data to fetch all changed files.
Note how the presubmit instead uses normal git client to fetch list of changed files: https://github.com/kubernetes/test-infra/blob/b7c54c4d7991d27ee926d184d3a09c7b7738c65b/prow/plugins/trigger/pull-request.go#L358C37-L358C37 (https://github.com/kubernetes/test-infra/blob/master/prow/config/jobs.go#L537)
So, github http api does limit list of files changed to 3000: https://docs.github.com/en/rest/pulls/pulls?apiVersion=2022-11-28#list-pull-requests-files
This is the method used to fetch pull request changes; therefore i'd expect them to be cut at 3000 files too. Instead, we are experiencing the issue only on push to master. Push events have the same 3000 changes hard limit.
I don't get why this is working on pull request but not on push to master :/ I'd expect the issue to be present on pull request too!
I now get why it is working on presubmits! Our presubmits jobs are not arch dependent; therefore we are effectively mostly triggering on aarch64 configs, BUT we trigger presubmit jobs for both arch (since they are not arch dependent). On postsubmit, instead, given that we have aarch64 and x86_64 build jobs, we mostly trigger only aarch64 build and some x86_64 until we reach the 3000 files changed limit.
I see 2 main paths forward:
before
and after
refs from Push github event: https://docs.github.com/en/webhooks-and-events/webhooks/webhook-events-and-payloads#pushAdding 2 more possible "fixes":
driverkit/config
path is changed, and its entrypoint would diff HEAD~1
on master to manually check changed files and then trigger the correct job for each changed <distro,arch>. split our workflow to open up a PR on test-infra for each crawled distro; this does not require any change to Prow (so, much quicker fix), but at the same time, it means we will open (and have to check/approve) like 15PRs each week...
Run the update-kernels workflow daily instead of weekly: running it more frequently will generate less diff thus we would be more confident that all jobs get triggered; it has also the benefit that we would be much quicker to support new kernels
Note: these ones would not fix the "new driverversion added" issues.
A short term solution would be to change all post submit triggers to trigger to any change to driverkit/config/...
folder, ie from this:
run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/almalinux_.+'
to this:
run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/.+'
In this case, we would trigger ALL jobs everytime, even for unmodified configs, then each job would build its own configs.
PROs:
CONs:
Let's assume that each week we build likely ~90% of supported distro, it would not be that impactful (ie: we would just spawn 10% useless pods). The "big" issue would rise when we manually trigger kernel-crawler on a single distro; in that case, instead of just building new drivers for that distro, we would spawn all pods to build all drivers (whose build should be skipped anyway because they are already present on download.falco.org).
@maxgio92 wdyt?
In this case, we would trigger ALL jobs everytime, even for unmodified configs, then each job would build its own configs.
We could also let the entrypoint of each job do the diff HEAD~1
and decide whether it should do anything, even if i'd avoid adding too much logic in those scripts.
A short term solution would be to change all post submit triggers to trigger to any change to
driverkit/config/...
folder, ie from this:run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/almalinux_.+'
to this:
run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/.+'
In this case, we would trigger ALL jobs everytime, even for unmodified configs, then each job would build its own configs.
PROs:
* super easy solution
CONs:
* we would spawn useless pods * we would spawn more nodes than needed
Let's assume that each week we build likely ~90% of supported distro, it would not be that impactful (ie: we would just spawn 10% useless pods). The "big" issue would rise when we manually trigger kernel-crawler on a single distro; in that case, instead of just building new drivers for that distro, we would spawn all pods to build all drivers (whose build should be skipped anyway because they are already present on download.falco.org).
@maxgio92 wdyt?
Thank you for this detailed investigation and sorry for my late response.
I'd try as hard as possible to avoid triggering all distros jobs.
What about instead of reducing the granularity of the grid to increase the jobs frequency as you suggested, from weekly to one every two days for example, @FedeDP?
PS: in any case IMHO are both short term solutions.
What about instead of reducing the granularity of the grid to increase the jobs frequency as you suggested, from weekly to one every two days for example
That would unfortunately not solve the issues when we add a completely new driver version from scratch.
PS: in any case IMHO are both short term solutions.
Agree btw :/
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
/remove-lifecycle stale
Describe the bug
Not all driver building-postsubmit jobs have been triggered after new driverkit configurations have been introduced - as usual weekly crawling result from kernel-crawler.
How to reproduce it
Please check:
Expected behaviour
All the jobs for x86 to have been triggered accordingly.