apache / nuttx

Apache NuttX is a mature, real-time embedded operating system (RTOS)
https://nuttx.apache.org/
Apache License 2.0

[FEATURE] CI Test should always terminate after 1 hour #14680

Closed: lupyuen closed this issue 1 week ago

lupyuen commented 3 weeks ago

Is your feature request related to a problem? Please describe.

CI Test will sometimes run for 6 hours (before getting killed by GitHub).

This is not so great because:

  1. It increases our usage of GitHub Runners, which may overrun the GitHub Actions Budget allocated by ASF.
  2. Suppose right after CI Test there's another build. If CI Test runs for all 6 hours, then the build after CI Test will never run.
  3. We are now running our own Ubuntu PCs as a NuttX Build Farm. The PCs will hang forever until we restart the Build Jobs.

Describe the solution you'd like

CI Test should complete within 1 hour. It should gracefully terminate itself (and report an error) if the runtime exceeds 1 hour.
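One way to get "terminate gracefully and report an error" is GNU coreutils `timeout`, which sends SIGTERM once the limit is reached and then exits with status 124. Below is a minimal sketch, assuming GNU `timeout` is installed (on macOS it usually comes from Homebrew's coreutils as `gtimeout`). `run_with_limit` is a hypothetical helper name, not part of the NuttX CI scripts.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU coreutils `timeout` is on the PATH.
# run_with_limit is a hypothetical helper, not part of tools/ci.

# run_with_limit SECONDS COMMAND [ARGS...]
# Runs COMMAND. After SECONDS it is sent SIGTERM, followed by SIGKILL
# 60 seconds later if it is still running. Prints an error and returns
# a non-zero status when the time limit was hit.
run_with_limit() {
  local limit="$1"
  shift
  local status=0
  timeout --kill-after=60 "$limit" "$@" || status=$?
  # 124: timed out and stopped by SIGTERM
  # 137: usually means SIGKILL was needed after the grace period
  if [ "$status" -eq 124 ] || [ "$status" -eq 137 ]; then
    echo "ERROR: '$*' exceeded ${limit} seconds and was terminated" >&2
  fi
  return "$status"
}
```

For example, `run_with_limit 3600 ./cibuild.sh -i -c -A -R testlist/$job.dat` would stop a hung CI Test after 1 hour and return a non-zero status, so the job is reported as failed instead of hanging.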

Describe alternatives you've considered

Right now I'm manually killing all CI Jobs that run over 3 hours, and restarting the Ubuntu PCs in our NuttX Build Farm.


simbit18 commented 2 weeks ago

@lupyuen Maybe we also need to set a maximum number of minutes for a job to run.

GitHub Actions timeout https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes

lupyuen commented 2 weeks ago

@simbit18 Yep right now it quits after 6 hours: https://github.com/NuttX/nuttx/actions/runs/11714861244

simbit18 commented 2 weeks ago

@lupyuen So we should add:

jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 180  # (3 hours) Decrease this timeout value as needed

lupyuen commented 2 weeks ago

@simbit18 Hmmm suppose right after CI Test there's another build. If CI Test runs for all 3 hours, then the build after CI Test will never run. So actually I prefer if CI Test could terminate itself (after 1 hour) and let other builds run.

Unless we always park CI Test at the end of the job?

simbit18 commented 2 weeks ago

> @simbit18 Hmmm suppose right after CI Test there's another build. If CI Test runs for all 3 hours, then the build after CI Test will never run. So actually I prefer if CI Test could terminate itself (after 1 hour) and let other builds run.

Right!!!

> Describe the solution you'd like
> CI Test should complete within 1 hour. It should gracefully terminate itself (and report an error) if the runtime exceeds 1 hour.

In my opinion, this is the right solution.

The GitHub Actions timeout is only a safety net, so that we don't fall back into the same problem as https://github.com/apache/nuttx/issues/14376

lupyuen commented 1 week ago

It's happening again.

lupyuen commented 1 week ago

I wonder if this will work for GitHub CI? I'm testing it on the macOS Build Farm: https://github.com/lupyuen/nuttx-build-farm/blob/main/run-job-macos.sh#L131-L144

## If CI Test Hangs: Kill it after 1 hour
( sleep 3600 ; echo Killing pytest... ; pkill -f pytest )&

## Run the CI Job
./cibuild.sh -i -c -A -R testlist/$job.dat

lupyuen commented 1 week ago

Yep, this kills the CI Test after 2 hours! (Assuming our jobs are not supposed to exceed 2 hours.)

We changed build.yml:

cd sources/nuttx/tools/ci
if [ "X${{matrix.boards}}" = "Xcodechecker" ]; then
  ./cibuild.sh -c -A -N -R --codechecker testlist/${{matrix.boards}}.dat
else
  ## Inserted this
  ( sleep 7200 ; echo Killing pytest... ; pkill -f pytest )&
  ./cibuild.sh -c -A -N -R -S testlist/${{matrix.boards}}.dat
fi

(The build log says "Killing pytest... Terminated", and the build fails correctly later.)
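On the Build Farm PCs, where jobs run one after another on the same machine, the `( sleep ... ; pkill -f pytest )&` watchdog keeps running after `cibuild.sh` finishes early, so it might kill the pytest of a later job. Below is a minimal sketch of a cancellable watchdog, assuming bash. `start_watchdog`, `stop_watchdog` and `WATCHDOG_PID` are hypothetical names, not part of the NuttX or Build Farm scripts.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for a watchdog that can be cancelled.

# start_watchdog SECONDS PATTERN
# After SECONDS, kill every process whose command line matches PATTERN.
# Stores the watchdog's PID in WATCHDOG_PID.
start_watchdog() {
  local limit="$1" pattern="$2"
  (
    # When stop_watchdog sends SIGTERM, stop the sleep too and exit quietly.
    trap 'kill "$sleep_pid" 2>/dev/null; exit 0' TERM
    sleep "$limit" &
    sleep_pid=$!
    # `wait` returns as soon as a trapped signal arrives.
    wait "$sleep_pid"
    echo "Killing $pattern..."
    pkill -f "$pattern"
  ) &
  WATCHDOG_PID=$!
}

# stop_watchdog: cancel the watchdog once the CI Job has finished.
stop_watchdog() {
  kill "$WATCHDOG_PID" 2>/dev/null || true
  wait "$WATCHDOG_PID" 2>/dev/null || true
}
```

Usage would look like `start_watchdog 3600 pytest`, then `./cibuild.sh -i -c -A -R testlist/$job.dat`, then `stop_watchdog`. The `trap`/`wait` pair matters because killing only the subshell would leave its `sleep` running in the background.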