apache / nuttx

Apache NuttX is a mature, real-time embedded operating system (RTOS)
https://nuttx.apache.org/
Apache License 2.0

[URGENT] Reducing our usage of GitHub Runners #14376

Open lupyuen opened 5 days ago

lupyuen commented 5 days ago

Hi All: We have an ultimatum to reduce (drastically) our usage of GitHub Actions. Or our Continuous Integration will halt totally in Two Weeks. Here's what I'll implement within 24 hours for nuttx and nuttx-apps repos:

  1. When we submit or update a Complex PR that affects All Architectures (Arm, RISC-V, Xtensa, etc): CI Workflow shall run only half the jobs. Previously CI Workflow ran arm-01 to arm-14, now we will run only arm-01 to arm-07. (This will reduce GitHub Cost by 32%)

  2. When the Complex PR is Merged: CI Workflow will still run all jobs arm-01 to arm-14

    (Simple PRs with One Single Arch / Board will build the same way as before: arm-01 to arm-14)

  3. For NuttX Admins: Our Merge Jobs are now at github.com/nuttxpr/nuttx. We shall have only Two Scheduled Merge Jobs per day

    I shall quickly Cancel any Merge Jobs that appear in nuttx and nuttx-apps repos. Then at 00:00 UTC and 12:00 UTC: I shall start the Latest Merge Job at nuttxpr. (This will reduce GitHub Cost by 17%)

  4. macOS and Windows Jobs (msys2 / msvc): They shall be totally disabled until we find a way to manage their costs. (GitHub charges 10x premium for macOS runners, 2x premium for Windows runners!)

    Let's monitor the GitHub Cost after disabling macOS and Windows Jobs. It's possible that macOS and Windows Jobs are contributing a huge part of the cost. We could re-enable and simplify them after monitoring.

    (This must be done for BOTH nuttx and nuttx-apps repos. Sadly the ASF Report for GitHub Runners doesn't break down the usage by repo, so we'll never know how much macOS and Windows Jobs are contributing to the cost. That's why we need https://github.com/apache/nuttx/pull/14377)

    (Wish I could run NuttX CI Jobs on my M2 Mac Mini. But the CI Script only supports Intel Macs sigh. Buy a Refurbished Intel Mac Mini?)
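
The job-halving rule in items 1 and 2 can be sketched roughly as follows. This is only an illustration in Python, not the actual CI Workflow: the helper name `select_arm_jobs` and its `is_complex` / `is_merge` flags are hypothetical, and only the arm-01 to arm-14 range comes from the description above.

```python
def select_arm_jobs(is_complex: bool, is_merge: bool, total: int = 14) -> list:
    """Return the arm-NN build jobs to run (hypothetical helper)."""
    all_jobs = [f"arm-{n:02d}" for n in range(1, total + 1)]
    # Merged PRs and Simple PRs still build every job.
    if is_merge or not is_complex:
        return all_jobs
    # Complex PR that is being submitted or updated: run only the first half.
    return all_jobs[: total // 2]


if __name__ == "__main__":
    print(select_arm_jobs(is_complex=True, is_merge=False))
```

The same shape would apply to the other architectures; the cut-off is kept as a parameter so that a different subset can be tried without touching the rule itself.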

We have done an Analysis of CI Jobs over the past 24 hours:

https://docs.google.com/spreadsheets/d/1ujGKmUyy-cGY-l1pDBfle_Y6LKMsNp7o3rbfT1UkiZE/edit?gid=0#gid=0

Many CI Jobs are Incomplete: We waste GitHub Runners on jobs that eventually get superseded and cancelled

Screenshot 2024-10-17 at 1 18 14 PM

When we Halve the CI Jobs: We reduce the wastage of GitHub Runners

Screenshot 2024-10-17 at 1 15 30 PM

Scheduled Merge Jobs will also reduce wastage of GitHub Runners, since most Merge Jobs don't complete (only 1 completed yesterday)

Screenshot 2024-10-17 at 1 16 16 PM

See the ASF Policy for GitHub Actions: https://infra.apache.org/github-actions-policy.html

lupyuen commented 4 days ago

As commented by @xiaoxiang781216:

can we reduce the boards built on the Linux host in order to keep macOS/Windows? it's very easy to break these hosts without this basic coverage.

I suggest that we monitor the GitHub Cost after disabling macOS and Windows Jobs. It's possible that macOS and Windows Jobs are contributing a huge part of the cost. We could re-enable and simplify them after monitoring.

raiden00pl commented 4 days ago

One of the methods proposed by, if I remember correctly @btashton, is to replace many simple configurations for some boards (mostly for peripherals testing) with one large jumbo config activating everything possible. This won't work for chips with low memory, but it will save some CI resources anyway.

lupyuen commented 4 days ago

@raiden00pl Yep I agree. Or we could test a complex target like board:lvgl?

lupyuen commented 4 days ago

Here's another comment about macOS and Windows by @yamt: https://github.com/apache/nuttx/pull/14377#issuecomment-2418914068

yamt commented 4 days ago

sorry, let me ask a dumb question. what plan are we using? https://github.com/pricing is apache paying for it?

lupyuen commented 4 days ago

what plan are we using? https://github.com/pricing

@yamt It's probably a special plan negotiated by ASF and GitHub? It's not mentioned in the ASF Policy for GitHub Actions: https://infra.apache.org/github-actions-policy.html

I find this "contract" a little strange. Why are all ASF Projects subjected to the same quotas? And why can't we increase the quota if we happen to have additional funding?

Update: More info here: https://cwiki.apache.org/confluence/display/INFRA/GitHub+self-hosted+runners

If your project uses GitHub Actions, you share a queue with all other Apache projects using Github Actions, which can quickly lead to frustration for everyone involved. Builds can be stuck in "queued" for 6+ hours.

One option (if you want to stick with GitHub and don't want to use the Infra-managed Jenkins) is for your project to create its own self-hosted runners, which means your jobs will run on a virtual machine (VM) under your project's control. However this is not something to tackle lightly, as Infra will not manage or secure your VM - that is up to you.

Update 2: This sounds really complicated. I'd rather use my own Mac Mini to execute the NuttX CI Tests, once a day?

yamt commented 4 days ago

what plan are we using? https://github.com/pricing

@yamt It's probably a special plan negotiated by ASF and GitHub? It's not mentioned in the ASF Policy for GitHub Actions: https://infra.apache.org/github-actions-policy.html

do you know if the macos/windows premium applies as usual? the policy page seems to have no mention about it.

I find this "contract" a little strange. Why are all ASF Projects subjected to the same quotas? And why can't we increase the quota if we happen to have additional funding?

yea, i guess projects have very different sizes/demands. (i feel nuttx is using too much anyway though :-)

TimJTi commented 4 days ago

...I'd rather use my own Mac Mini to execute the NuttX CI Tests, once a day?

Is there any merit in "farming out" CI tests to those with boards? I think there was a discussion about NuttX owning a suite of boards but not sure where that got to - and would depend on just 1 or 2 people managing it.

As an aside, is there a guide to self-running CI? As I work on a custom board it would be good for me to do this occasionally but I have no idea where to start!

lupyuen commented 4 days ago

@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a

TimJTi commented 4 days ago

@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a

And I just RTFM...the "official" guide is here: https://nuttx.apache.org/docs/latest/guides/citests.html so I'll review both and hopefully get it working - and submit any tweaks/corrections/enhancements I find are needed to the NuttX "How To" documentation

jerpelea commented 4 days ago

[like] Jerpelea, Alin reacted to your message:

michallenc commented 4 days ago

@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a

And I just RTFM...the "official" guide is here so I'll review both and hopefully get it working - and submit any tweaks/corrections/enhancements I find are needed to the NuttX "How To" documentation

These work, but they do not describe the entire CI, just how to run the pytest checks for the sim:citest configuration.

cederom commented 4 days ago

Yes let's cut what we can (but keep at least minimal functional configure, build, syntax testing) and see what the cost reduction is. We need to show Apache we are working on the problem. So far optimizations did not cut the use and we are in danger of losing all CI :-(

On the other hand it seems unfair to share the same CI quota as small projects. NuttX is a fully featured RTOS working on ~1000 different devices. In order to keep project code quality we need the CI.

Maybe it's time to rethink / redesign from scratch the CI test architecture and implementation?

cederom commented 4 days ago

Another problem is that people very often send unfinished, undescribed PRs that are updated without a comment or request, which triggers the whole big CI process several times :-(

Some changes are sometimes required and we cannot avoid that; this is part of the process. But maybe we can make something more "adaptive" so only minimal CI is launched by default, preferably only in the area that was changed, then with all approvals we can make one manually triggered final big check before merge?

Long story short: We can switch CI test runs to manual trigger for now to see how it reduces costs. I would see two buttons to start Basic and Advanced (maybe also Full = current setup) CI.

lupyuen commented 4 days ago

@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make. Before starting the whole CI Build?

cederom commented 4 days ago

@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make. Before starting the whole CI Build?

People often can't fill in even one single sentence to describe Summary, Impact, Testing :D This may be detected automatically.. or we can just see what architecture is the cheapest one and use it for all basic tests..?

raiden00pl commented 4 days ago

Another problem is that people very often send unfinished, undescribed PRs that are updated without a comment or request, which triggers the whole big CI process several times :-(

Often contributors use CI to test all configurations instead of testing changes locally. On one hand I understand this because compiling all configurations on a local machine takes a lot of time, on the other hand I'm not sure if CI is for this purpose (especially when we have limits on its use).

@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make. Before starting the whole CI Build?

It won't work. Users are lazy, and in order to choose what needs to be compiled correctly, you need a comprehensive knowledge of the entire NuttX, which is not that easy. The only reasonable option is to automate this process.
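
One way such automation could look, as a rough sketch only: classify a PR by the paths it touches, relying on the `arch/<arch>/` and `boards/<arch>/` directories of the NuttX tree. The helper name `affected_archs` and the example file names are hypothetical, and nothing here is taken from the actual CI scripts.

```python
from pathlib import PurePosixPath


def affected_archs(changed_files):
    """Return the set of architectures a PR touches, or None for "build all".

    Hypothetical helper: any file outside arch/<arch>/ or boards/<arch>/ is
    treated as shared code that may affect every architecture.
    """
    archs = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        if len(parts) >= 3 and parts[0] in ("arch", "boards"):
            archs.add(parts[1])  # e.g. "arm", "risc-v", "xtensa"
        else:
            return None  # shared code: assume all architectures are affected
    return archs or None  # an empty change list is also treated as "build all"


if __name__ == "__main__":
    # Illustrative paths only.
    print(affected_archs(["arch/arm/src/example.c", "boards/arm/chip/board/defconfig"]))
    print(affected_archs(["sched/example.c"]))
```

A PR that yields a single architecture would count as a Simple PR; a result of `None` would mean a Complex PR that falls back to the wider build.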

cederom commented 4 days ago

So it looks like for now, where dramatic steps need to be taken, we need to mark all PRs as drafts and start CI by hand when we are sure all is ready for merge? o_O

jerpelea commented 4 days ago

[like] Jerpelea, Alin reacted to your message:

lupyuen commented 4 days ago

Stats for the past 24 hours: We consumed 61 Full-Time Runners, still a long way from our target of 25 Full-Time Runners (otherwise ASF will halt our servers in 12 days)

Screenshot 2024-10-18 at 6 14 48 AM
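
For readers unfamiliar with the metric, here is a rough sketch of how a "Full-Time Runners" figure could be derived. The formula is an assumption (total runner time divided by elapsed wall-clock time), not something stated in this thread. The function name is hypothetical, and the 1464 runner-hours input is a made-up value chosen so that the result matches the 61 reported above.

```python
ASF_TARGET = 25  # target number of Full-Time Runners quoted in this thread


def full_time_runners(runner_hours: float, elapsed_hours: float) -> float:
    """Average number of runners busy over the period (assumed definition)."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return runner_hours / elapsed_hours


if __name__ == "__main__":
    # Hypothetical input: 1464 runner-hours consumed over a 24-hour window.
    usage = full_time_runners(runner_hours=1464, elapsed_hours=24)
    print(f"{usage:.0f} Full-Time Runners, target {ASF_TARGET}")
```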

cederom commented 4 days ago

Okay it's 00:00 UTC. We are really short on time. I have merged the changes. Let's monitor the use now for 24h, we need metrics. We can always revert the commits.

Looking at the pie chart 99.7% use comes from the builds, other tasks are barely visible. So we need to focus on the builds :-)

cederom commented 4 days ago

Sorry no clue why it closed in my name o_O

Ah, GH seems to close issues on its own when related PR gets merged. Probably this also happened before when Xiang merged prior PR :D

lupyuen commented 4 days ago

The builds are so much faster today yay! https://github.com/apache/nuttx/actions/runs/11395811301

Screenshot 2024-10-18 at 9 36 30 AM

michallenc commented 3 days ago

We can also disable CI checks for draft merge requests. I think it doesn't make much sense to run them as further commits/force pushes are expected.
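
A minimal sketch of that check, assuming the decision is made from the `pull_request` webhook payload, whose `draft` field is a boolean. The helper name `should_run_ci` is hypothetical, and in practice this would more likely be a condition in the workflow file than a Python function.

```python
def should_run_ci(event: dict) -> bool:
    """Skip CI for draft pull requests (hypothetical helper)."""
    pr = event.get("pull_request")
    if pr is None:
        return True  # not a pull request event: leave the behaviour unchanged
    return not pr.get("draft", False)


if __name__ == "__main__":
    print(should_run_ci({"pull_request": {"draft": True}}))
    print(should_run_ci({"pull_request": {"draft": False}}))
```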

lupyuen commented 3 days ago

Something That Bugs Me: Timeout Errors will cost us precious GitHub Minutes. The remaining jobs get killed, and restarting these remaining jobs from scratch will consume extra GitHub Minutes. (The restart below costs us 6 extra GitHub Runner Hours)

(1) How do we retry these Timeout Errors?

(2) Can we have Restartable Builds? Doesn't quite make sense to build everything from scratch (arm6, arm7, riscv7) just because one job failed (xtensa2)

(3) Or xtensa2 should wait for others to finish, before it declares a timeout and dies? Hmmm...

Configuration/Tool: esp32s2-kaluga-1/lvgl_st7789
curl: (28) Failed to connect to github.com port 443 after 133994 ms: Connection timed out

https://github.com/apache/nuttx/actions/runs/11395811301/attempts/1
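
On question (1), one generic approach is to retry the failing step with exponential backoff instead of failing the whole job. The sketch below is an illustration under that assumption, not how the NuttX CI handles downloads. The `retry` helper and its parameters are hypothetical.

```python
import time


def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on TimeoutError wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries: let the job fail as before
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x ... base_delay


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_download():
        # Stand-in for a network fetch that times out twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise TimeoutError("connection timed out")
        return "ok"

    print(retry(flaky_download, base_delay=0.01))
```

The `sleep` argument is injectable so that the backoff schedule can be checked without actually waiting.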

cederom commented 3 days ago

@lupyuen: (2) Can we have Restartable Builds? Doesn't quite make sense to build everything from scratch (arm6, arm7, riscv7) just because one job failed (xtensa2)

It is possible to restart only failed tasks on GitHub :-)

If you mean the task could restart where it left off.. I am not sure it's possible, because the underlying configuration could change after an update and would need to be verified, so things need to be started from scratch? :-)

lupyuen commented 3 days ago

11 Days To Doomsday: But we're doing much better already! In the past 24 hours, we consumed 36 Full-Time GitHub Runners. We're getting closer to the ASF Target of 25 Full-Time Runners! Today we shall:

Hopefully we'll reach the ASF Target tomorrow, and ASF won't kill our servers no more! Thanks!

Screenshot 2024-10-19 at 7 15 11 AM

lupyuen commented 3 days ago

When NuttX merges our PR, the Merge Job won't run until 00:00 UTC and 12:00 UTC. How can we be really sure that our PR was merged correctly?

Let's create a GitHub Org (at no cost), fork the NuttX Repo and trigger the CI Workflow. (Which won't charge any extra GitHub Runner Minutes to NuttX Project!)

(I think this might also work if ASF shuts down our CI Servers. We can create many many orgs actually)

cederom commented 3 days ago

Sounds like a repo clone that will verify nuttx and nuttx-apps master independently twice a day?

lupyuen commented 3 days ago

@cederom You read my mind :-)

lupyuen commented 3 days ago

Hi All: Our Merge Jobs are now at github.com/nuttxpr/nuttx

Yesterday we spent One-Third of our GitHub Runner Minutes on Merge Jobs. This is not sustainable, so I moved them to nuttxpr repo. (Which won't be charged)

Screenshot 2024-10-19 at 11 33 46 AM

The data from yesterday shows that our Scheduled Merge Job keeps getting disrupted by newer Merged PRs. And when we restart a Scheduled Merge Job, we waste GitHub Minutes. (101 GitHub Hours for one single Scheduled Merge Job!)

Two-Thirds of our GitHub Runner Minutes were spent on Creating and Updating PRs. That's why we're skipping half the jobs today.

Hi @xiaoxiang781216 @GUIDINGLI @cederom @raiden00pl @acassis @jerpelea: With immediate effect, please see github.com/nuttxpr/nuttx for our Merge Jobs. I will trigger the jobs daily at 00:00 UTC and 12:00 UTC. I have given you Admin Access to nuttxpr in case you need to restart the jobs. Thanks!

cederom commented 3 days ago

Thank you @lupyuen !! :-) We have the https://github.com/nuttx organization too maybe we can make use of it too? :-)

lupyuen commented 2 days ago

10 Days to Shutdown: Or maybe not, because our GitHub Usage has dropped to 5 Full-Time Runners yay! Let's keep this below 25 Full-Time Runners, and make ASF super happy!

Thank you so much for your patience, let's keep this up! 🙏

Screenshot 2024-10-20 at 7 04 07 AM

lupyuen commented 2 days ago

We have the https://github.com/nuttx organization too maybe we can make use of it too? :-)

I think we learnt a Painful Lesson today: Freebies Won't Last Forever! The new GitHub Org for NuttX should probably be a Paid GitHub Org:

xiaoxiang781216 commented 2 days ago

@lupyuen could we select arm-01, arm-03, arm-05... instead of arm-01, arm-02,... arm-07, which could improve the chip coverage?

lupyuen commented 2 days ago

could we select arm-01, arm-03, arm-05... instead of arm-01, arm-02,... arm-07, which could improve the chip coverage?

@xiaoxiang781216 Good idea. RP2040 is in arm-06, should we select arm-06 instead of arm-13?

So we build arm-01, arm-03, arm-05, arm-06, arm-07, arm-09, arm-11?

cederom commented 1 day ago

I am not sure if we should diverge from Apache that much, but it's also sad to see zero Apache support when/where we need them most. We sent two support request emails in reply to the warning, with no response.

Having financial support from companies that use NuttX would be nice, but should we use it to pay GitHub or better pay freelancers to develop more NuttX code? GitHub benefits all here, while NuttX should.

Maybe we should look for alternatives to GitHub like GitLab where other OS runners are also possible as a complementary / backup solution to GitHub?

https://gitlab.com/nuttx <- @patacongo owns the repo.

And the independent decentralized testing farms seem the best alternative, so we keep only core stuff on GitHub to remain within CI quotas. But it's a long road.

I just hope all this won't impact NuttX quality.

lupyuen commented 1 day ago

Yeah it doesn't sound right that an Unpaid Volunteer is monitoring our CI Servers 24 x 7 🤔

xiaoxiang781216 commented 1 day ago

could we select arm-01, arm-03, arm-05... instead of arm-01, arm-02,... arm-07, which could improve the chip coverage?

@xiaoxiang781216 Good idea. RP2040 is in arm-06, should we select arm-06 instead of arm-13?

So we build arm-01, arm-03, arm-05, arm-06, arm-07, arm-09, arm-11?

sure.

lupyuen commented 1 day ago

9 Days to Nirvana: Yesterday (quiet Sunday) we consumed 3 Full-Time GitHub Runners. Let's keep this below 25 Full-Time Runners, and we'll break free from suffering!

Screenshot 2024-10-21 at 6 07 32 AM

lupyuen commented 1 day ago

What if we could run the CI Jobs on our own Ubuntu PCs? Without any help from GitHub Actions?

I'm experimenting with a "Build Farm" at home (refurbished PC) that runs NuttX CI Jobs all day non-stop 24 x 7:

How does it work?


lupyuen commented 7 hours ago

8 Days to Diwali: Will our CI Servers go Dark? Sorry we're not sure, because the ASF Infra Reports are Down (sigh). But I think we briefly hit a peak of 21 Full-Time GitHub Runners, which is still within our target of 25 Full-Time Runners.

Screenshot 2024-10-22 at 6 07 29 AM

lupyuen commented 2 hours ago

ASF Infra Reports are still down. But now we have our own Live Metrics for Full-Time GitHub Runners! (reload for updates)

(Live Image) (Live Log)

This shows the number of Full-Time Runners for the Day, computed since 00:00 UTC. (Remember: We should keep this below 25) How it works:

xiaoxiang781216 commented 2 hours ago

@lupyuen the new number is very small, should we try restoring some macOS/msys2/windows ci?

lupyuen commented 2 hours ago

@xiaoxiang781216 Let's monitor for the rest of the day. Towards the end of the day, the number of Full-Time Runners will probably jump to 21. (Like yesterday)

We could mirror the NuttX Repo to another GitHub Org account and run the Windows and macOS Jobs there (so they won't add to our quota). We just need minor changes to build.yml and arch.yml: https://github.com/apache/nuttx/issues/14407 (at the bottom of the doc)

xiaoxiang781216 commented 2 hours ago

could we use this https://github.com/nuttx account, which is our official mirror?

lupyuen commented 2 hours ago

@xiaoxiang781216 The macOS and Windows Builds are now running in our NuttX Mirror: https://github.com/NuttX/nuttx/actions/runs/11452528962

I made 2 fixes to enable the macOS and Windows Builds: build.yml and arch.yml

Lemme figure out how to automate this 🤔

Update: This script will enable macOS and Windows Builds for our NuttX Mirror