apache / nuttx

Apache NuttX is a mature, real-time embedded operating system (RTOS)
https://nuttx.apache.org/
Apache License 2.0

[FEATURE] Bot that identifies Build Errors in the CI Logs #13827

Open lupyuen opened 5 days ago

lupyuen commented 5 days ago

Is your feature request related to a problem? Please describe.

Right now we spend a lot of time scanning the CI Build Logs to figure out why the build failed: Sample Log for arm-05

CMake Error at /github/workspace/sources/nuttx/build/_deps/mynewt-nimble-subbuild/mynewt-nimble-populate-prefix/src/mynewt-nimble-populate-stamp/download-mynewt-nimble-populate.cmake:170 (message):
  Each download failed!
    error: downloading 'https://github.com/apache/mynewt-nimble/archive/fb15c844542e812ceb49ab5ac8502dc93c167b90.tar.gz' failed

What if our PR Bot could scan the CI Build Logs, identify the error and post it as a PR Comment?

Hello! Your CI Build Failed because of this error: CMake Error at ... error: downloading ...

Describe the solution you'd like

How To Identify Errors

Our Bot will do this with diff, the Conventional Non-AI Way:
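The thread does not spell out how diff would be applied. A minimal sketch, assuming the bot compares the failed log against a log from a known-good build of the same target and reports the lines that appear only in the failed one. The function name and the sample log lines are hypothetical, and real logs would likely need timestamps stripped first:

```python
from difflib import SequenceMatcher


def lines_only_in_failed(good_log: str, failed_log: str) -> list[str]:
    """Return lines present in the failed log but absent from the good log."""
    good = good_log.splitlines()
    failed = failed_log.splitlines()
    # autojunk=False: do not let difflib discard frequently repeated lines
    matcher = SequenceMatcher(a=good, b=failed, autojunk=False)
    extra = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both mean the failed log has new lines here
        if tag in ("insert", "replace"):
            extra.extend(failed[j1:j2])
    return extra


# Hypothetical, shortened logs for illustration only
GOOD = "Configuration/Tool: demo\nBuilding NuttX...\nDone\n"
FAILED = (
    "Configuration/Tool: demo\n"
    "Building NuttX...\n"
    "CMake Error at download.cmake:170 (message):\n"
    "  Each download failed!\n"
)

if __name__ == "__main__":
    for line in lines_only_in_failed(GOOD, FAILED):
        print(line)
```

The lines returned this way are what the bot would paste into its PR Comment.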

Optional: Explain the Error with LLM

This is totally optional: Our Bot could pass the error to an LLM and explain it...

LLM Prompt

What's failing in this NuttX Job for Continuous Integration? Please explain concisely

CMake Error at /github/workspace/sources/nuttx/build/_deps/mynewt-nimble-subbuild/mynewt-nimble-populate-prefix/src/mynewt-nimble-populate-stamp/download-mynewt-nimble-populate.cmake:170 (message):
  Each download failed!
    error: downloading 'https://github.com/apache/mynewt-nimble/archive/fb15c844542e812ceb49ab5ac8502dc93c167b90.tar.gz' failed
          status_code: 28
          status_string: "Timeout was reached"
          log:
          --- LOG BEGIN ---
            Trying 140.82.114.3:443...
  Connection timed out after 30001 milliseconds
  Closing connection 0
          --- LOG END ---
FAILED: mynewt-nimble-populate-prefix/src/mynewt-nimble-populate-stamp/mynewt-nimble-populate-download /github/workspace/sources/nuttx/build/_deps/mynewt-nimble-subbuild/mynewt-nimble-populate-prefix/src/mynewt-nimble-populate-stamp/mynewt-nimble-populate-download 

Response from Gemini Pro 1.5

The CI job is failing because it's timing out while trying to download the mynewt-nimble source code from GitHub. This is likely due to a temporary network issue or GitHub server being overloaded.

Concisely: The CI job can't download the required mynewt-nimble source code due to a connection timeout.

Describe alternatives you've considered

No response

Verification

acassis commented 4 days ago

@lupyuen I think the download sometimes fails because of random network instability. Maybe just adding a way to retry could fix the issue.

lupyuen commented 4 days ago

@acassis Yep great idea! I think we need a privileged account to retry the build, but I don't think we should run bots with privileged accounts 🤔

lupyuen commented 3 days ago

Wonder if this will work: Instead of a Bot, we add a Job to our CI, that will watch for Timeout Errors and retry the Failed Job: https://stackoverflow.com/a/78314483

name: Retry workflow
on:
    # Triggered manually (or via the API) with the ID of the run to retry
    workflow_dispatch:
        inputs:
            run_id:
                required: true
jobs:
    rerun:
        runs-on: ubuntu-latest
        steps:
            - name: rerun ${{ inputs.run_id }}
              env:
                  GH_REPO: ${{ github.repository }}
                  GH_TOKEN: ${{ github.token }}
              run: |
                  ## TODO: Check for timeout errors
                  # Wait for the run to finish, discarding the progress output
                  gh run watch ${{ inputs.run_id }} > /dev/null 2>&1
                  # Re-run only the jobs that failed
                  gh run rerun ${{ inputs.run_id }} --failed
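
The "Check for timeout errors" TODO could be filled in along these lines. This is only a sketch: the function name is hypothetical, the patterns are taken from the CMake download log quoted earlier in this issue and are not exhaustive, and it assumes the log text has already been fetched (for example with `gh run view --log-failed`):

```python
import re

# Patterns copied from the download failure quoted earlier in this issue.
# Illustrative only: a real check would need a longer, reviewed list.
TRANSIENT_PATTERNS = [
    re.compile(r"Timeout was reached"),
    re.compile(r"Connection timed out"),
]


def is_transient_network_failure(log_text: str) -> bool:
    """Return True if the log contains a known network-timeout message."""
    return any(pattern.search(log_text) for pattern in TRANSIENT_PATTERNS)


if __name__ == "__main__":
    sample = 'status_code: 28\nstatus_string: "Timeout was reached"\n'
    print(is_transient_network_failure(sample))
```

Only runs that match would be retried, so genuine build errors still fail right away.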

simbit18 commented 3 days ago

Perhaps more simply, this might help:

https://everything.curl.dev/usingcurl/downloads/retry.html
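
curl's `--retry` option retries transient failures and doubles the wait between attempts. The same idea, sketched in Python for illustration: this is not the script NuttX uses, and the function name, the defaults, and the injectable `fetch` and `sleep` parameters are all assumptions made so that the logic can be exercised without a network:

```python
import time
import urllib.request


def fetch_with_retry(url, attempts=5, base_delay=1.0, fetch=None, sleep=time.sleep):
    """Download url, retrying on network errors with exponential backoff."""
    if fetch is None:
        # Default fetcher: plain HTTP(S) GET with a 30 second timeout
        fetch = lambda u: urllib.request.urlopen(u, timeout=30).read()
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            # URLError, TimeoutError and ConnectionError all derive from OSError.
            # A real script would likely not retry permanent errors such as 404.
            if attempt == attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt
            sleep(base_delay * 2 ** attempt)
```

With the defaults, a download that keeps timing out is attempted five times, with waits of 1, 2, 4 and 8 seconds in between, before the error is raised.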

lupyuen commented 2 days ago

Any idea which scripts are calling curl and failing? We should update them to retry. Every day I need to manually re-run a few CI Jobs, and this is getting tiring 😬

Update: Wonder if it's because we changed wget to curl: https://github.com/apache/nuttx/pull/13641 ? I think wget does Retry with Linear Backoff by default?

simbit18 commented 2 days ago

Hi @lupyuen, which packages give errors? On which boards do the errors occur?

PR #13641 only concerns the installation of dependencies and toolchains for Ubuntu and generic Linux, not for GitHub.

lupyuen commented 2 days ago

@simbit18 Here are 2 curl errors from today: https://github.com/apache/nuttx/actions/runs/11229551377/job/31215370724

Configuration/Tool: icicle/rpmsg-sbi
curl: (28) Failed to connect to github.com port 443 after 136303 ms: Connection timed out
make[1]: *** [opensbi/Make.defs:52: opensbi.tar.gz] Error 28

https://github.com/apache/nuttx/actions/runs/11226642457/job/31210068982

Configuration/Tool: esp32-audio-kit/wifi
error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function.
fatal: protocol error: bad pack header

Update: One more from nuttx-apps, but it looks like a git error: https://github.com/apache/nuttx-apps/actions/runs/11226739564/job/31207828223

Configuration/Tool: esp32c3-generic/rmt
fatal: unable to access 'https://github.com/espressif/esp-hal-3rdparty.git/': Failed to connect to github.com port 443 after 133496 ms: Connection timed out

Another one from my repo (is it caused by curl?): https://github.com/lupyuen5/label-nuttx-apps/actions/runs/11230244326/job/31217304232

Configuration/Tool: waveshare-rp2040-lcd-1.28/lvgl,CONFIG_ARM_TOOLCHAIN_GNU_EABI
[v9.1.0.zip]
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of v9.1.0.zip or
        v9.1.0.zip.zip, and cannot find v9.1.0.zip.ZIP, period.
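
The unzip message says the archive has no end-of-central-directory record, which is consistent with a truncated or non-zip download, though the log alone does not prove that. A sketch of a check that could run before unzipping, so that a bad download is retried instead of failing the build (the function name is hypothetical):

```python
import zipfile


def is_complete_zip(path) -> bool:
    """Return True if path is a readable zip whose members pass their CRC checks."""
    # is_zipfile() looks for the end-of-central-directory record
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the first corrupt member name, or None if all are fine
        return archive.testzip() is None
```

A download step could call this after fetching `v9.1.0.zip` and fetch the file again when it returns False.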

simbit18 commented 2 days ago

This one, https://github.com/apache/nuttx/actions/runs/11229551377/job/31215370724, is for opensbi.tar.gz: https://github.com/apache/nuttx/blob/master/arch/risc-v/src/opensbi/Make.defs

On nuttx-apps, adding the -S option will avoid problems with esp-hal-3rdparty.git: PR apache/nuttx#13301

I think we need to apply the same logic to the other packages that are downloaded with curl.