Open lupyuen opened 5 days ago
@lupyuen I think the download sometimes fails because of random network instability. Maybe just adding a way to retry could fix the issue.
@acassis Yep great idea! I think we need a privileged account to retry the build, but I don't think we should run bots with privileged accounts 🤔
Wonder if this will work: Instead of a Bot, we add a Job to our CI, that will watch for Timeout Errors and retry the Failed Job: https://stackoverflow.com/a/78314483
name: Retry workflow
on:
  workflow_dispatch:
    inputs:
      run_id:
        required: true
jobs:
  rerun:
    runs-on: ubuntu-latest
    steps:
      - name: rerun ${{ inputs.run_id }}
        env:
          GH_REPO: ${{ github.repository }}
          GH_TOKEN: ${{ github.token }}
        run: |
          ## TODO: Check for timeout errors
          gh run watch ${{ inputs.run_id }} > /dev/null 2>&1
          gh run rerun ${{ inputs.run_id }} --failed
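The "TODO: Check for timeout errors" step could be filled in roughly like this. This is only a sketch, not tested on our CI: the function names `is_network_failure` and `rerun_if_network_failure` are made up for illustration, and the patterns are simply copied from the error messages quoted in this thread, so they are assumptions about what a "network failure" looks like in our logs.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds when the log on stdin contains one of the
# network-related messages seen in this thread.
is_network_failure() {
  grep -q -E \
    -e 'Connection timed out' \
    -e 'RPC failed; curl' \
    -e 'End-of-central-directory signature not found'
}

# Hypothetical wrapper around the two gh calls in the workflow:
# wait for the run to finish, then rerun only if the failed steps
# look like a network problem.
rerun_if_network_failure() {
  local run_id="$1"
  gh run watch "$run_id" > /dev/null 2>&1
  if gh run view "$run_id" --log-failed | is_network_failure; then
    gh run rerun "$run_id" --failed
  else
    echo "Run $run_id did not fail on a known network error; not retrying."
  fi
}
```

With this, the `run:` step would call `rerun_if_network_failure` with the run ID, so genuine build errors are not retried forever.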
perhaps more simply this might help
Any idea which script is calling `curl` and failing? We should update them to retry. Every day I need to click and manually re-run a few CI Jobs, this is getting tiring 😬
Update: Wonder if it's because we changed `wget` to `curl`: https://github.com/apache/nuttx/pull/13641 ? I think `wget` does Retry with Linear Backoff by default?
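If we want Linear Backoff back regardless of which tool does the download, a small wrapper might be enough. This is a rough sketch: `retry` is a hypothetical helper name, and the number of tries and the delay are placeholders, not values taken from our scripts.

```shell
#!/usr/bin/env bash

# retry MAX_TRIES BASE_DELAY COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or MAX_TRIES is
# reached, sleeping BASE_DELAY * attempt seconds between tries
# (linear backoff).
retry() {
  local max_tries="$1" base_delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max_tries" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep $(( base_delay * attempt ))
    attempt=$(( attempt + 1 ))
  done
}
```

Usage would be something like `retry 5 10 curl -L --fail -o opensbi.tar.gz "$URL"` (where `$URL` is a placeholder), and the same wrapper works for `git clone`. For plain `curl` downloads, its own `--retry` and `--retry-delay` options may be simpler, but they won't help the `git` failures.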
Hi @lupyuen, which packages give errors? On which boards do the errors occur?
PR #13641 only concerns the installation of dependencies and toolchains for Ubuntu and generic Linux, not for GitHub.
@simbit18 Here are 2 curl errors from today: https://github.com/apache/nuttx/actions/runs/11229551377/job/31215370724
Configuration/Tool: icicle/rpmsg-sbi
curl: (28) Failed to connect to github.com port 443 after 136303 ms: Connection timed out
make[1]: *** [opensbi/Make.defs:52: opensbi.tar.gz] Error 28
https://github.com/apache/nuttx/actions/runs/11226642457/job/31210068982
Configuration/Tool: esp32-audio-kit/wifi
error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function.
fatal: protocol error: bad pack header
Update: One more from nuttx-apps, but it looks like a git error: https://github.com/apache/nuttx-apps/actions/runs/11226739564/job/31207828223
Configuration/Tool: esp32c3-generic/rmt
fatal: unable to access 'https://github.com/espressif/esp-hal-3rdparty.git/': Failed to connect to github.com port 443 after 133496 ms: Connection timed out
Another one from my repo (is it caused by curl?): https://github.com/lupyuen5/label-nuttx-apps/actions/runs/11230244326/job/31217304232
Configuration/Tool: waveshare-rp2040-lcd-1.28/lvgl,CONFIG_ARM_TOOLCHAIN_GNU_EABI
[v9.1.0.zip]
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of v9.1.0.zip or
v9.1.0.zip.zip, and cannot find v9.1.0.zip.ZIP, period.
This failure https://github.com/apache/nuttx/actions/runs/11229551377/job/31215370724 is the download of opensbi.tar.gz in https://github.com/apache/nuttx/blob/master/arch/risc-v/src/opensbi/Make.defs
On nuttx-apps, adding the -S option will avoid the problems with esp-hal-3rdparty.git: PR apache/nuttx#13301
I think we need to apply the same retry logic to the packages that are downloaded with curl.
Is your feature request related to a problem? Please describe.
Right now we spend a lot of time scanning the CI Build Logs to figure out why the build failed: Sample Log for arm-05
What if our PR Bot could scan the CI Build Logs, identify the error and post it as a PR Comment?
Describe the solution you'd like
How To Identify Errors
Our Bot will do this with `diff`, the Conventional Non-AI Way:
1. Start with the failed `arm-05` Build Log.
2. Our Bot will search the CI Build Logs, to find the Last Successful Merge Build of `arm-05`: Last Successful arm-05 Log
3. Run `diff`, to Compare the Failed Build with the Last Successful Build
4. Extract the `CMake Error`, then post it as a PR Comment
Optional: Explain the Error with LLM
This is totally optional: Our Bot could pass the error to an LLM and explain it...
LLM Prompt
What's failing in this NuttX Job for Continuous Integration? Please explain concisely
Response from Gemini Pro 1.5
The CI job is failing because it's timing out while trying to download the mynewt-nimble source code from GitHub. This is likely due to a temporary network issue or GitHub server being overloaded.
Concisely: The CI job can't download the required mynewt-nimble source code due to a connection timeout.
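For the non-AI `diff` steps under "How To Identify Errors", the core could be sketched as below. This is only an illustration: `new_error_lines` is a hypothetical name, the `error|fatal` pattern is a guess at what we want to surface, and it assumes both logs are already downloaded to local files with the per-line timestamps stripped (otherwise every line would differ).

```shell
#!/usr/bin/env bash

# new_error_lines GOOD_LOG FAILED_LOG
# Hypothetical helper: prints the lines that appear only in FAILED_LOG
# and that mention an error.
new_error_lines() {
  local good_log="$1" failed_log="$2"
  # diff marks lines found only in the second file with "> ".
  # diff exits with 1 when the files differ and grep exits with 1 when
  # nothing matches, so the pipeline status is ignored via "|| true".
  diff "$good_log" "$failed_log" \
    | grep '^> ' \
    | sed 's/^> //' \
    | grep -i -E 'error|fatal' || true
}
```

The output of `new_error_lines last-successful.log failed.log` (both file names are placeholders) is what the Bot would post as the PR Comment, and optionally pass to the LLM.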
Describe alternatives you've considered
No response
Verification