lupyuen opened this issue 5 days ago
As commented by @xiaoxiang781216:
can we reduce the boards on the Linux host to keep macOS/Windows? It's very easy to break these hosts without this basic coverage.
I suggest that we monitor the GitHub Cost after disabling macOS and Windows Jobs. It's possible that macOS and Windows Jobs are contributing a huge part of the cost. We could re-enable and simplify them after monitoring.
One of the methods proposed (if I remember correctly, by @btashton) is to replace many simple configurations for some boards (mostly for peripheral testing) with one large jumbo config activating everything possible.
This won't work for chips with low memory, but it will save some CI resources anyway.
@raiden00pl Yep I agree. Or we could test a complex target like board:lvgl?
Here's another comment about macOS and Windows by @yamt: https://github.com/apache/nuttx/pull/14377#issuecomment-2418914068
sorry, let me ask a dumb question. what plan are we using? https://github.com/pricing is apache paying for it?
what plan are we using? https://github.com/pricing
@yamt It's probably a special plan negotiated by ASF and GitHub? It's not mentioned in the ASF Policy for GitHub Actions: https://infra.apache.org/github-actions-policy.html
I find this "contract" a little strange. Why are all ASF Projects subjected to the same quotas? And why can't we increase the quota if we happen to have additional funding?
Update: More info here: https://cwiki.apache.org/confluence/display/INFRA/GitHub+self-hosted+runners
If your project uses GitHub Actions, you share a queue with all other Apache projects using Github Actions, which can quickly lead to frustration for everyone involved. Builds can be stuck in "queued" for 6+ hours.
One option (if you want to stick with GitHub and don't want to use the Infra-managed Jenkins) is for your project to create its own self-hosted runners, which means your jobs will run on a virtual machine (VM) under your project's control. However this is not something to tackle lightly, as Infra will not manage or secure your VM - that is up to you.
Update 2: This sounds really complicated. I'd rather use my own Mac Mini to execute the NuttX CI Tests, once a day?
what plan are we using? https://github.com/pricing
@yamt It's probably a special plan negotiated by ASF and GitHub? It's not mentioned in the ASF Policy for GitHub Actions: https://infra.apache.org/github-actions-policy.html
do you know if the macos/windows premium applies as usual? the policy page seems to have no mention about it.
I find this "contract" a little strange. Why are all ASF Projects subjected to the same quotas? And why can't we increase the quota if we happen to have additional funding?
yea, i guess projects have very different sizes/demands. (i feel nuttx is using too much anyway though :-)
...I'd rather use my own Mac Mini to execute the NuttX CI Tests, once a day?
Is there any merit in "farming out" CI tests to those with boards? I think there was a discussion about NuttX owning a suite of boards, but I'm not sure where that got to - and it would depend on just 1 or 2 people managing it.
As an aside, is there a guide to self-running CI? As I work on a custom board it would be good for me to do this occasionally, but I have no idea where to start!
@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a
@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a
And I just RTFM...the "official" guide is at https://nuttx.apache.org/docs/latest/guides/citests.html so I'll review both and hopefully get it working - and submit any tweaks/corrections/enhancements I find are needed to the NuttX "How To" documentation
@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a
And I just RTFM...the "official" guide is here so I'll review both and hopefully get it working - and submit any tweaks/corrections/enhancements I find are needed to the NuttX "How To" documentation
These work, but they don't describe the entire CI, just how to run the pytest checks for the sim:citest configuration.
Yes, let's cut what we can (but keep at least minimal functional configure, build and syntax testing) and see what the cost reduction is. We need to show Apache we are working on the problem. So far optimizations did not cut the usage and we are in danger of losing all CI :-(
On the other hand, it seems unfair that we share the same CI quota as small projects. NuttX is a fully featured RTOS running on ~1000 different devices. To keep project code quality we need the CI.
Maybe it's time to rethink / redesign the CI test architecture and implementation from scratch?
Another problem is that people very often send unfinished, undescribed PRs that are updated without a comment or request, which triggers the whole big CI process several times :-(
Some changes are sometimes required and we cannot avoid that; it's part of the process. But maybe we can make something more "adaptive", so only minimal CI is launched by default, preferably only in the area that was changed. Then, with all approvals, we can make one manual trigger for the final big check before merge?
Long story short: We can switch CI test runs to manual trigger for now to see how it reduces costs. I would see two buttons to start Basic and Advanced (maybe also Full = current setup) CI.
@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make before starting the whole CI Build?
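A rough sketch of how that pre-check might look, assuming a hypothetical `Config:` field in the PR description (the field name, the PR body format and the sample body below are made up for illustration; only `tools/configure.sh` is from the thread):

```shell
#!/bin/sh
# Hypothetical smoke-build gate: pull a "Config:" field out of the PR body,
# then configure and build just that one config before the full CI matrix.

extract_config() {
  # Print the value after "Config:", e.g. "rv-virt:nsh" (first match only)
  printf '%s\n' "$1" | sed -n 's/^Config:[[:space:]]*//p' | head -n 1
}

# Sample PR body (an assumption, not a real NuttX PR)
pr_body="Summary: fix UART driver
Config: rv-virt:nsh
Testing: ran nsh on QEMU"

config="$(extract_config "$pr_body")"
echo "Smoke-building: $config"

# In the real workflow this step would then run (inside the NuttX tree):
#   ./tools/configure.sh "$config" && make
```

If the field is missing or the smoke build fails, the workflow could stop early and skip the expensive matrix entirely.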
@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make before starting the whole CI Build?
People often can't fill in even one single sentence to describe Summary, Impact, Testing :D This may be detected automatically... or we can just see which architecture is the cheapest one and use it for all basic tests?
Another problem is that people very often send unfinished, undescribed PRs that are updated without a comment or request, which triggers the whole big CI process several times :-(
Often contributors use CI to test all configurations instead of testing changes locally. On one hand I understand this, because compiling all configurations on a local machine takes a lot of time; on the other hand, I'm not sure CI is for this purpose (especially when we have limits on its use).
@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make. Before starting the whole CI Build?
It won't work. Users are lazy, and in order to choose correctly what needs to be compiled, you need comprehensive knowledge of the entire NuttX, which is not that easy. The only reasonable option is to automate this process.
So it looks like for now, when dramatic steps need to be taken, we need to mark all PRs as drafts and start CI by hand when we are sure all is ready for merge? o_O
Stats for the past 24 hours: We consumed 61 Full-Time Runners, still a long way from our target of 25 Full-Time Runners (otherwise ASF will halt our servers in 12 days)

- nuttx-apps has stopped macOS and Windows Jobs. But not much impact, since we don't compile nuttx-apps often
- nuttx repo to stop macOS and Windows Jobs (Update: merged!)
- nuttx repo to Halve The Jobs (Update: merged!)
- nuttx-apps to Halve The Jobs (probably not much impact, since we don't compile nuttx-apps often) (Update: merged!)

Okay it's 0000 UTC. We are really short on time. I have merged the changes. Let's monitor the usage now; for 24 hours we need metrics. We can always revert the commits.
Looking at the pie chart 99.7% use comes from the builds, other tasks are barely visible. So we need to focus on the builds :-)
Sorry no clue why it closed in my name o_O
Ah, GH seems to close issues on its own when a related PR gets merged. Probably this also happened before, when Xiang merged the prior PR :D
The builds are so much faster today yay! https://github.com/apache/nuttx/actions/runs/11395811301
We can also disable CI checks for draft merge requests. I think it doesn't make much sense to run them as further commits/force pushes are expected.
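The draft check itself is cheap. In a real GitHub Actions workflow it would likely be a one-line job condition on `github.event.pull_request.draft` in the workflow YAML; the shell sketch below just shows the idea against a faked webhook payload (the payload and function are illustrations, not the actual build.yml change):

```shell
#!/bin/sh
# Sketch: decide whether to run CI based on the PR's draft flag.
# The JSON below mimics the shape of the GitHub pull_request webhook payload.

is_draft() {
  # Crude check for `"draft": true` in the event payload (no jq needed)
  printf '%s' "$1" | grep -q '"draft":[[:space:]]*true'
}

event_json='{"pull_request": {"draft": true, "number": 14376}}'

if is_draft "$event_json"; then
  echo "Draft PR: skipping the build matrix"
else
  echo "Ready PR: running full CI"
fi
```

CI would then run only once the author marks the PR "Ready for review", instead of on every force-push to a draft.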
Something That Bugs Me: Timeout Errors will cost us precious GitHub Minutes. The remaining jobs get killed, and restarting these remaining jobs from scratch will consume extra GitHub Minutes. (The restart below costs us 6 extra GitHub Runner Hours)
(1) How do we retry these Timeout Errors?
(2) Can we have Restartable Builds? It doesn't quite make sense to build everything from scratch (arm6, arm7, riscv7) just because one job failed (xtensa2)
(3) Or should xtensa2 wait for the others to finish before it declares a timeout and dies? Hmmm...
Configuration/Tool: esp32s2-kaluga-1/lvgl_st7789
curl: (28) Failed to connect to github.com port 443 after 133994 ms: Connection timed out
https://github.com/apache/nuttx/actions/runs/11395811301/attempts/1
@lupyuen: (2) Can we have Restartable Builds? Doesn't quite make sense to build everything from scratch (arm6, arm7, riscv7) just because one job failed (xtensa2)
It is possible to restart only failed tasks on GitHub :-)
If you mean the task could restart where it left off... I'm not sure that's possible, because the underlying configuration could change after the update being verified, and things would need to start from scratch? :-)
11 Days To Doomsday: But we're doing much better already! In the past 24 hours, we consumed 36 Full-Time GitHub Runners. We're getting closer to the ASF Target of 25 Full-Time Runners! Today we shall:

- Halve the Jobs for RISC-V, Xtensa and Simulator for Complex PRs: https://github.com/apache/nuttx/pull/14400
- Do the same for the nuttx-apps repo: https://github.com/apache/nuttx-apps/pull/2758
- Our Merge Jobs are now at github.com/nuttxpr/nuttx
- Reduce the Scheduled Merge Jobs to Two Per Day at 00:00 / 12:00 UTC (down from Four Per Day)

Hopefully we'll reach the ASF Target tomorrow, and ASF won't kill our servers no more! Thanks!
When NuttX merges our PR, the Merge Job won't run until 00:00 UTC and 12:00 UTC. How can we be really sure that our PR was merged correctly?
Let's create a GitHub Org (at no cost), fork the NuttX Repo and trigger the CI Workflow. (Which won't charge any extra GitHub Runner Minutes to NuttX Project!)
(I think this might also work if ASF shuts down our CI Servers. We can create many many orgs actually)
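Under that assumption, the fork-and-trigger flow might look like the sketch below with the GitHub CLI. The org name is a placeholder, and the commands are echoed as a dry run rather than executed (the `gh` subcommands and flags are real CLI features, but the workflow filename and branch are assumptions):

```shell
#!/bin/sh
# Dry-run sketch: mirror the repo into a separate GitHub Org and trigger CI
# there, so the runner minutes are billed to that org, not to Apache.

ORG="nuttx-mirror"        # hypothetical org name

run() { echo "+ $*"; }    # dry run: print each command instead of executing

# Fork apache/nuttx into the org (no local clone needed)
run gh repo fork apache/nuttx --org "$ORG" --clone=false

# Kick off the CI workflow in the fork (workflow file name assumed)
run gh workflow run build.yml --repo "$ORG/nuttx" --ref master
```

Removing the `run` wrapper would execute the commands for real, given an authenticated `gh` session with permission to create repos in the org.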
Sounds like a repo clone that will verify nuttx and nuttx-apps master independently twice a day?
@cederom You read my mind :-)
Hi All: Our Merge Jobs are now at github.com/nuttxpr/nuttx
Yesterday we spent One-Third of our GitHub Runner Minutes on Merge Jobs. This is not sustainable, so I moved them to the nuttxpr repo. (Which won't be charged)
The data from yesterday shows that our Scheduled Merge Job keeps getting disrupted by newer Merged PRs. And when we restart a Scheduled Merge Job, we waste GitHub Minutes. (101 GitHub Hours for one single Scheduled Merge Job!)
Two-Thirds of our GitHub Runner Minutes were spent on Creating and Updating PRs. That's why we're skipping half the jobs today.
Hi @xiaoxiang781216 @GUIDINGLI @cederom @raiden00pl @acassis @jerpelea: With immediate effect, please see github.com/nuttxpr/nuttx for our Merge Jobs. I will trigger the jobs daily at 00:00 UTC and 12:00 UTC. I have given you Admin Access to nuttxpr in case you need to restart the jobs. Thanks!
Thank you @lupyuen !! :-) We have the https://github.com/nuttx organization too, maybe we can make use of it? :-)
10 Days to Shutdown: Or maybe not, because our GitHub Usage has dropped to 5 Full-Time Runners yay! Let's keep this below 25 Full-Time Runners, and make ASF super happy!
Today we split the Build Jobs for **Arm64 and x86_64** from the others, making them faster (Hello Goldfish :-)
Our Scheduled Merge Jobs are now at github.com/nuttxpr/nuttx, which I trigger manually at 00:00 UTC and 12:00 UTC daily. I'm still running this script, forever killing any Merge Jobs at the nuttx and nuttx-apps repos.
Excellent Initiative by @raiden00pl: We Merge Multiple Targets into One Target, and reduce the build time
Here's the Email I sent to ASF Infra Team (quite odd they don't respond to my emails, are my emails going into a black hole?)
Then Again: Yesterday was quiet, with few PRs. (But it's great for analysing the data!)
Thank you so much for your patience, let's keep this up! 🙏
We have the https://github.com/nuttx organization too maybe we can make use of it too? :-)
I think we learnt a Painful Lesson today: Freebies Won't Last Forever! The new GitHub Org for NuttX should probably be a Paid GitHub Org:
@lupyuen could we select arm-01, arm-03, arm-05... instead of arm-01, arm-02, ... arm-07, which could improve the chip coverage?
I am not sure if we should diverge from Apache that much, but it's also sad to see zero Apache support when/where we need them most. We sent two support request emails in response to the warning, with no response.
Having financial support from companies that use NuttX would be nice, but should we use it to pay GitHub, or better pay freelancers to develop more NuttX code? GitHub benefits here either way, while NuttX should be the one benefiting.
Maybe we should look for alternatives to GitHub like GitLab where other OS runners are also possible as a complementary / backup solution to GitHub?
https://gitlab.com/nuttx <- @patacongo owns the repo.
And independent decentralized testing farms seem the best alternative, so we keep only core stuff on GitHub to stay within the CI quotas. But it's a long road.
I just hope all this won't impact NuttX quality.
Yeah it doesn't sound right that an Unpaid Volunteer is monitoring our CI Servers 24 x 7 🤔
9 Days to Nirvana: Yesterday (quiet Sunday) we consumed 3 Full-Time GitHub Runners. Let's keep this below 25 Full-Time Runners, and we'll break free from suffering!
- No major changes for today, we'll watch and monitor
- Scheduled Merge Jobs are still at nuttxpr/nuttx
- Once in a while I'll run Ad Hoc Merge Jobs for NuttX Apps at nuttxpr/nuttx-apps (kinda redundant, because nuttx will also compile nuttx-apps)
Today is Monday, we expect the load to increase. Hoping for the best, thanks everyone!
What if we could run the CI Jobs on our own Ubuntu PCs? Without any help from GitHub Actions?
I'm experimenting with a "Build Farm" at home (refurbished PC) that runs NuttX CI Jobs all day non-stop 24 x 7:
- Download the latest master branch of nuttx, run CI Job arm-01
- Wait for arm-01 to complete (roughly 1.5 hours)
- Download the latest master branch of nuttx, run CI Job arm-02
- Wait for arm-02 to complete (roughly 1.5 hours)
- ...Until arm-14, then loop back to arm-01

How does it work? A script loops through arm-01 to arm-14, running the job, searching for errors and uploading the logs.

8 Days to Diwali: Will our CI Servers go Dark? Sorry we're not sure, because the ASF Infra Reports are Down (sigh). But I think we briefly hit a peak of 21 Full-Time GitHub Runners, which is still within our target of 25 Full-Time Runners.
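The build-farm loop described a few comments up (cycle through arm-01 to arm-14 forever, rebuilding the latest master each time) might be sketched roughly as below. The clone URLs are real; the `cibuild.sh` invocation and testlist paths are assumptions about the NuttX CI scripts, so treat them as placeholders:

```shell
#!/bin/sh
# Sketch of a home build farm: run CI jobs arm-01..arm-14 in a loop.

run_one_job() {
  job="$1"
  echo "Running CI job: $job"
  # A real farm would do something like (commands assumed, not verified):
  #   rm -rf nuttx apps
  #   git clone --depth 1 https://github.com/apache/nuttx
  #   git clone --depth 1 https://github.com/apache/nuttx-apps apps
  #   ./nuttx/tools/ci/cibuild.sh -c -A -N -R testlist/${job}.dat \
  #       2>&1 | tee ${job}.log
  #   grep -i error ${job}.log     # search for errors, then upload the log
}

# Build the job list arm-01 .. arm-14
job_list=""
i=1
while [ "$i" -le 14 ]; do
  job_list="$job_list arm-$(printf '%02d' "$i")"
  i=$((i + 1))
done

for job in $job_list; do
  run_one_job "$job"
done
# A real farm wraps the for-loop in `while true; do ...; done`
# so it runs 24 x 7, looping back to arm-01 after arm-14.
```

Each pass re-clones master, so the farm always tests the latest merged code, matching the "download, run, wait, repeat" description above.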
Simulator CI Job sim-01 has become the bottleneck. Today we'll split sim-01 and add sim-03:
Since The ASF Reports Are Down: I'll try calling the GitHub API to fetch the Elapsed Duration of every job. Then extrapolate to GitHub Runner Minutes.
(Done! Here's the script to compute Full-Time GitHub Runners)
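In case the linked script is unavailable, the core arithmetic is simple: runner-minutes consumed since 00:00 UTC divided by wall-clock minutes elapsed. A toy sketch with made-up numbers (the real script pulls per-job elapsed times from the GitHub API, `GET /repos/{owner}/{repo}/actions/runs`; the durations below are invented):

```shell
#!/bin/sh
# Sketch of the "Full-Time Runners" metric:
#   full_time_runners = total runner-minutes / minutes elapsed since 00:00 UTC

# Example per-job durations in minutes (made-up sample data)
job_minutes="90 45 120 200 75"

total=0
for m in $job_minutes; do
  total=$((total + m))
done

elapsed_minutes=60   # pretend it is 01:00 UTC
full_time_runners=$((total / elapsed_minutes))

echo "Runner-minutes consumed: $total"
echo "Full-time runners: $full_time_runners"
```

With these sample numbers, 530 runner-minutes over 60 wall-clock minutes is roughly 8 machines running flat-out, which is the figure to keep below the ASF target of 25.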
ASF Infra Reports are still down. But now we have our own Live Metrics for Full-Time GitHub Runners! (reload for updates)
This shows the number of Full-Time Runners for the Day, computed since 00:00 UTC. (Remember: We should keep this below 25) How it works:
@lupyuen the new number is very small, should we try restoring some macOS/msys2/windows ci?
@xiaoxiang781216 Let's monitor for the rest of the day. Towards the end of the day, the number of Full-Time Runners will probably jump to 21. (Like yesterday)
We could mirror the NuttX Repo to another GitHub Org account and run the Windows and macOS Jobs there (so they won't add to our quota). We just need minor changes to build.yml and arch.yml: https://github.com/apache/nuttx/issues/14407 (at the bottom of the doc)
could we use this https://github.com/nuttx account, which is our official mirror?
@xiaoxiang781216 The macOS and Windows Builds are now running in our NuttX Mirror: https://github.com/NuttX/nuttx/actions/runs/11452528962
I made 2 fixes to enable the macOS and Windows Builds: build.yml and arch.yml
Lemme figure out how to automate this 🤔
Update: This script will enable macOS and Windows Builds for our NuttX Mirror
Hi All: We have an ultimatum to reduce (drastically) our usage of GitHub Actions. Or our Continuous Integration will halt totally in Two Weeks. Here's what I'll implement within 24 hours for the nuttx and nuttx-apps repos:

When we submit or update a Complex PR that affects All Architectures (Arm, RISC-V, Xtensa, etc): CI Workflow shall run only half the jobs. Previously CI Workflow would run arm-01 to arm-14; now we will run only arm-01 to arm-07. (This will reduce GitHub Cost by 32%)

When the Complex PR is Merged: CI Workflow will still run all jobs arm-01 to arm-14. (Simple PRs with One Single Arch / Board will build the same way as before: arm-01 to arm-14)

For NuttX Admins: Our Merge Jobs are now at github.com/nuttxpr/nuttx. We shall have only Two Scheduled Merge Jobs per day. I shall quickly Cancel any Merge Jobs that appear in the nuttx and nuttx-apps repos. Then at 00:00 UTC and 12:00 UTC: I shall start the Latest Merge Job at nuttxpr. (This will reduce GitHub Cost by 17%)

macOS and Windows Jobs (msys2 / msvc): They shall be totally disabled until we find a way to manage their costs. (GitHub charges 10x premium for macOS runners, 2x premium for Windows runners!) Let's monitor the GitHub Cost after disabling macOS and Windows Jobs. It's possible that macOS and Windows Jobs are contributing a huge part of the cost. We could re-enable and simplify them after monitoring. (This must be done for BOTH nuttx and nuttx-apps repos. Sadly the ASF Report for GitHub Runners doesn't break down the usage by repo, so we'll never know how much macOS and Windows Jobs are contributing to the cost. That's why we need https://github.com/apache/nuttx/pull/14377)

(Wish I could run NuttX CI Jobs on my M2 Mac Mini. But the CI Script only supports Intel Macs, sigh. Buy a Refurbished Intel Mac Mini?)
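The submit-vs-merge halving described above (arm-01..arm-07 for complex PRs, all fourteen jobs after merge) could be selected with logic along these lines. The job names come from the thread; the event handling and function are a sketch, not the actual workflow change:

```shell
#!/bin/sh
# Sketch: pick the CI job list based on the triggering event.
# "pull_request" / "push" follow GitHub Actions event names.

select_jobs() {
  event="$1"
  if [ "$event" = "push" ]; then
    last=14    # merged to master: run all jobs, arm-01..arm-14
  else
    last=7     # PR submitted/updated: run half the jobs, arm-01..arm-07
  fi
  i=1
  while [ "$i" -le "$last" ]; do
    printf 'arm-%02d\n' "$i"
    i=$((i + 1))
  done
}

echo "PR jobs:    $(select_jobs pull_request | tr '\n' ' ')"
echo "Merge jobs: $(select_jobs push | tr '\n' ' ')"
```

In the real workflows this selection lives in the YAML job matrix rather than a shell script, but the halving arithmetic (7 of 14 jobs, hence roughly the quoted 32% cost reduction after overheads) is the same.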
We have done an Analysis of CI Jobs over the past 24 hours:
https://docs.google.com/spreadsheets/d/1ujGKmUyy-cGY-l1pDBfle_Y6LKMsNp7o3rbfT1UkiZE/edit?gid=0#gid=0
- Many CI Jobs are Incomplete: We waste GitHub Runners on jobs that eventually get superseded and cancelled
- When we Halve the CI Jobs: We reduce the wastage of GitHub Runners
- Scheduled Merge Jobs will also reduce wastage of GitHub Runners, since most Merge Jobs don't complete (only 1 completed yesterday)
See the ASF Policy for GitHub Actions