balena-io / deploy-to-balena-action

Official GitHub action to deploy releases to balenaCloud environments
Apache License 2.0

Build fails with Error: ECONNRESET: aborted #315

Open treygilliland opened 8 months ago

treygilliland commented 8 months ago

Occasionally the deploy-to-balena-action fails with the error message Error: ECONNRESET: aborted, even though the build is marked as succeeded on the balena dashboard. We run several build jobs, and at random one of them will flake with this error. Occasionally one of these flakes also seems to cause all of the other jobs to flake.

Here are the logs:

[service]         Successfully built 91063e1d9bd8

Error: ECONNRESET: aborted

Error: 
Error: aborted
    at connResetException (node:internal/errors:704:14)
    at TLSSocket.socketCloseListener (node:_http_client:425:19)
    at TLSSocket.emit (node:events:549:35)
    at TLSSocket.emit (node:domain:482:12)
    at node:net:747:14

Error:     at TCP.done (node:_tls_wrap:582:7)

Error: 
For further help or support, visit:
https://www.balena.io/docs/reference/balena-cli/#support-faq-and-troubleshooting

Our fix for this has always been to rerun the individual job that flaked. That eventually works most of the time, but when it doesn't we have to rerun the action as a whole.
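The manual rerun described above could be automated along these lines. This is only a sketch under stated assumptions, not how the action itself behaves: `run_with_retry`, `run_command`, and the `ECONNRESET` substring check are hypothetical helpers written for illustration, and the retry counts and delays are arbitrary. It retries a deploy step only when the failure output contains the connection-reset marker, and backs off exponentially between attempts.

```python
import subprocess
import time
from typing import Callable, Sequence, Tuple

# Assumption: a failure is "transient" if its output mentions ECONNRESET,
# as in the log excerpt above. Any other failure is returned immediately.
TRANSIENT_MARKER = "ECONNRESET"


def run_with_retry(
    run: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call run() until it succeeds, fails non-transiently, or attempts run out.

    run() must return (exit_code, combined_output).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    attempt = 1
    while True:
        code, output = run()
        transient = code != 0 and TRANSIENT_MARKER in output
        if not transient or attempt == max_attempts:
            return code, output
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
        attempt += 1


def run_command(cmd: Sequence[str]) -> Tuple[int, str]:
    """Run a command and return its exit code with stdout and stderr combined."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr
```

A caller might wrap its deploy step as `run_with_retry(lambda: run_command(["balena", "push", "my-fleet"]))`, where `my-fleet` is a placeholder fleet name. Since the first build can still succeed on the balena side after the connection drops, a blind retry may start a second build for the same fleet, so this hides the symptom rather than fixing the cause.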

klutchell commented 8 months ago

Hey @treygilliland, is there a specific builder that gives you the most trouble? The builder ID is usually printed at the start of the build logs.

Does it happen more with some projects than others? In the past we saw this with larger compose files with many services and it would be interesting to know if that was still a factor.

treygilliland commented 8 months ago

@klutchell it is a compose file with 5-6 services: 2 of them are ~300 MB, 2 are ~2 GB, and one is ~16 GB. Our builds normally make good use of the cache, so they stay around 3-5 minutes.

I don't know if it is a specific builder, but we are building for the NVIDIA Jetson Orin if that helps. The builder ID for this particular failure is 4b2d7d0, but the build seems to fail on the GH action side and pass on the balena dashboard.

treygilliland commented 8 months ago

It just happened again with 5 of our 6 build jobs; only 1 succeeded. This time it aborted very early on (while the cache images were still downloading). It may be related to running multiple builds at once for the same fleet, but it is hard to tell. The aborted build still shows the 'running...' status on the balena dashboard, even though the build triggered by the retry finished before it.

[Screenshot 2024-01-26 at 12 28 02 PM]

klutchell commented 8 months ago

Hey @treygilliland, is this still happening with the latest release of the action?