balena-io / balena-cli

The official balena CLI tool.
Apache License 2.0

ECONNRESET: aborted when pushing large multi-container builds #2768

Open timwedde opened 3 months ago

timwedde commented 3 months ago

Expected Behavior

Pushing arbitrarily-sized multi-container builds to Balena builders works fine and creates a new image successfully.

Actual Behavior

When pushing large multi-container docker-compose files to the Balena builders, the push operation fails in about 90% of cases with the error message below:

ECONNRESET: aborted

Error: aborted
    at TLSSocket.socketCloseListener (node:_http_client:462:19)
    at TLSSocket.emit (node:events:532:35)
    at TLSSocket.emit (node:domain:488:12)
    at node:net:338:12
    at TCP.done (node:_tls_wrap:659:7)

The behavior is not consistent:

The command used to build is very simple:

balena push myFleet --release-tag description "debug" --draft

Here is one of the builds that failed, on the machine that has a slightly higher success rate:

❯ balena push myFleet --release-tag description "debug" --draft --debug
----------------------------------------------------------------------
[Warn] Node.js version "22.2.0" does not satisfy requirement "^20.6.0"
[Warn] This may cause unexpected behavior.
----------------------------------------------------------------------
[debug] new argv=[/opt/homebrew/Cellar/node/22.2.0/bin/node,/opt/homebrew/bin/balena,push,jetson-test,--release-tag,description,lpm debug,--draft] length=8
[debug] Deprecation check: 6.81196 days since last npm registry query for next major version release date.
[debug] Will not query the registry again until at least 7 days have passed.
[Debug]   Using build source directory: . 
(node:28123) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
[Debug]   Pushing to cloud for fleet: myFleet
[debug] Event tracking error: Timeout awaiting 'response' for 0ms
| Packaging the project source...[Debug]   Tarring all non-ignored files...
[Debug]   docker-compose.yml file found at "/Users/user/Documents/Work/project"
/ Packaging the project source...[Debug]   Tarring complete in 353 ms
[debug] Connecting to builder at https://builder.balena-cloud.com/v3/build?slug=gh_timwedde%2Fjetson-test&dockerfilePath=&emulated=false&nocache=false&headless=false&isdraft=true
\ Uploading source package to https://builder.balena-cloud.com[debug] received HTTP 200 OK
[debug] handling message: {"type":"metadata","resource":"buildLogId","value":"3047436"}
[debug] handling message: {"message":"\u001b[36m[Info]\u001b[39m         Starting build for myFleet, user gh_timwedde"}
[Info]         Starting build for myFleet, user gh_timwedde
[debug] handling message: {"message":"\u001b[36m[Info]\u001b[39m         Dashboard link: https://dashboard.balena-cloud.com/apps/ID/devices"}
[Info]         Dashboard link: https://dashboard.balena-cloud.com/apps/ID/devices
ECONNRESET: aborted

Error: aborted
    at TLSSocket.socketCloseListener (node:_http_client:462:19)
    at TLSSocket.emit (node:events:532:35)
    at TLSSocket.emit (node:domain:488:12)
    at node:net:338:12
    at TCP.done (node:_tls_wrap:659:7)

For further help or support, visit:
https://www.balena.io/docs/reference/balena-cli/#support-faq-and-troubleshooting

[debug] Timeout reporting error to sentry.io

Steps to Reproduce the Problem

Hard to say, I don't know if this is generally reproducible. This seems to occur with larger multi-container builds though. My particular one is massive (in terms of final Docker image sizes at least), ending up at about 40-50GB. This is bad and I'm aware of that, but since I'm building for a Jetson and need multiple distinct containers that make use of GPU acceleration, I have to ship the entire driver stack several times, which bloats image sizes by a lot. I'm assuming I'm getting kicked off the builders because of cache or image sizes, but the error message is not clear about this nor could I find any hard limits on this, so I'm a bit confused as to the source of the issue.

Specifications

joshuaxdmb commented 3 months ago

Getting the same issue here on Apple M2 Pro. balena push suddenly stopped working a few weeks ago. I've been using balena build and balena deploy ever since.

timwedde commented 3 months ago

Intermediary status update: This is still an issue for me.

timwedde commented 3 months ago

Intermediary status update: This is still an issue for me. It's pretty bad right now, about 80% of my push attempts fail, leading to a lot of wasted time just endlessly redoing the command until it eventually decides to work every once in a while. Alas, building and pushing locally is also prohibitive because of the large container sizes (and seemingly no delta pushes with the local method) due to my somewhat slow internet.
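The manual workaround described in this comment (re-running the push until it goes through) could be scripted roughly as follows. This is only a sketch under assumptions: it assumes the CLI exits with a non-zero status when the push aborts with ECONNRESET, which the thread does not confirm, and the names run_with_retries, max_attempts and base_delay_s are hypothetical. It does not address the underlying cause.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_with_retries(
    cmd: Sequence[str],
    max_attempts: int = 5,
    base_delay_s: float = 10.0,
    runner: Optional[Callable[[Sequence[str]], int]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits with status 0 or max_attempts is reached.

    Returns the 1-based number of the attempt that succeeded, or 0 if
    every attempt failed. `runner` and `sleep` are injectable so the
    loop can be exercised without starting a real process.
    """
    if runner is None:
        # Default: run the command and report its exit status.
        runner = lambda c: subprocess.run(list(c)).returncode
    for attempt in range(1, max_attempts + 1):
        if runner(cmd) == 0:
            return attempt
        if attempt < max_attempts:
            # Exponential backoff between attempts: 10 s, 20 s, 40 s, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    return 0


# Example invocation with the command quoted in this issue
# (assumption: a non-zero exit status signals the aborted push):
# push = ["balena", "push", "myFleet", "--release-tag", "description", "debug", "--draft"]
# attempt = run_with_retries(push)
# print(f"succeeded on attempt {attempt}" if attempt else "all attempts failed")
```

The backoff is there only to avoid hammering the builder with immediate retries; the delay values are arbitrary.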

timwedde commented 2 months ago

A new bit of information emerges: It seems that when Balena CLI tells me that the build aborted due to a connection drop/reset, it still ends up on Balena Cloud. Seemingly it keeps running, but since I lose connection to the builder, I'm unable to see any logs. Release tags are also not applied (presumably because this happens after the build completes), so it's a weird half-state of kinda working, but not really. Would be nice if this behavior was consistent, seeing as it's one of the fundamental capabilities of the platform. I have not yet been able to test whether a build that completes in this manner is actually capable of running on a device or not.

Edit: The 'phantom build' seems to be stuck in the 'Running' state forever and never finishes, so I guess that's not really useful.

timwedde commented 2 months ago

Intermediary status update: This is still an issue for me. Is anybody actually triaging these? The repo has been filled with automatic dependency-bump PRs for months, with little to no human activity in the mix. Same goes for a lot of the issues here: Most of them have no replies at all, and if they do, it's often other users and not anybody from Balena. Is there a better way to actually reach a human about these issues? Currently it feels a bit like the dead internet theory, just scoped to Balena's GitHub organization.

I suspect this is an issue with the builder system, so making a PR to fix this behavior is next to impossible. If I can find some time I'll try to dig through the source code myself, though I expect it'll be challenging without any help, and if it is in the build system itself, then we're kind of powerless here.

otaviojacobi commented 2 months ago

Hello @timwedde, I am sorry you are having issues with our builders, and yes, we have other people reporting similar problems on the forums.

We currently use the builders to build our own Docker images, which are fairly large, and we have never faced this issue (our images have no priority; we just use the same build system as you do). So I don't think this is directly an issue with image sizes, but rather something specific to a few docker-compose files that can cause the intermittency.

I also finished running a script that did 100 pushes of different images (with different sizes) and I could not reproduce the problem. Is there any way you could share an example docker-compose file + resources with which you can reproduce the behaviour?
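The script used for the 100 pushes is not shown in the thread. As an illustration only, a harness that repeats a push and tallies the outcomes might look like the sketch below. The function names (classify, run_capture, tally) are hypothetical, and matching on the string "ECONNRESET" assumes the CLI prints it as in the log earlier in this issue.

```python
import subprocess
from collections import Counter
from typing import Callable, Sequence, Tuple


def classify(returncode: int, output: str) -> str:
    """Bucket one push attempt by exit status and captured output."""
    if returncode == 0:
        return "ok"
    if "ECONNRESET" in output:
        return "econnreset"
    return "other_failure"


def run_capture(cmd: Sequence[str]) -> Tuple[int, str]:
    """Run cmd and return (exit status, combined stdout and stderr)."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def tally(
    cmd: Sequence[str],
    runs: int,
    runner: Callable[[Sequence[str]], Tuple[int, str]] = run_capture,
) -> Counter:
    """Run cmd `runs` times and count how often each outcome occurs."""
    counts: Counter = Counter()
    for _ in range(runs):
        counts[classify(*runner(cmd))] += 1
    return counts


# Example (each run triggers a real cloud build, so keep `runs` small):
# print(tally(["balena", "push", "myFleet", "--draft"], runs=10))
```

Counting ECONNRESET separately from other failures makes it easier to compare failure rates between machines or projects.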

timwedde commented 2 months ago

Yup, will work on creating a reproducible example that I am able to share, I'll post here again once I have something! Thanks for responding, much appreciated :)
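One possible starting point for such a shareable example is to generate a synthetic multi-container project whose images are padded to an arbitrary size. The sketch below is an assumption-laden illustration, not the reporter's actual project: the service names, the alpine:3.19 base image, the random-data padding, and the function name write_test_project are all hypothetical, and padded images do not replicate a Jetson GPU driver stack.

```python
from pathlib import Path


def write_test_project(root: Path, services: int = 3, payload_mb: int = 512) -> Path:
    """Write a docker-compose project with `services` padded images.

    Each service gets its own directory with a Dockerfile that adds
    `payload_mb` MiB of random data, so the built image is at least
    that large. Returns the path to docker-compose.yml.
    """
    root.mkdir(parents=True, exist_ok=True)
    compose_lines = ['version: "2.1"', "services:"]
    for i in range(services):
        name = f"svc{i}"
        svc_dir = root / name
        svc_dir.mkdir(exist_ok=True)
        (svc_dir / "Dockerfile").write_text(
            "FROM alpine:3.19\n"
            # Random data does not compress, so the layer keeps its size.
            f"RUN dd if=/dev/urandom of=/payload.bin bs=1M count={payload_mb}\n"
            'CMD ["tail", "-f", "/dev/null"]\n'
        )
        compose_lines += [f"  {name}:", f"    build: ./{name}"]
    compose = root / "docker-compose.yml"
    compose.write_text("\n".join(compose_lines) + "\n")
    return compose


# Example: three services of roughly 2 GiB each in ./repro
# write_test_project(Path("repro"), services=3, payload_mb=2048)
```

Varying services and payload_mb independently would help separate the effect of total image size from the effect of the number of containers.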

timwedde commented 2 months ago

Sorry for the long absence, things got rather busy at work for a little bit so I didn't have time to work on an MWE for this. I have to say though, recently pushes have been more 'flaky', but in a good way: The builds go through more often than they used to, which is already nice. We're also now migrating away from Balena, so unfortunately this has been pushed down on the list of priorities. I'll write again if I can reproduce this reliably, but at the current point in time (while it's definitely not fixed) it works well enough to survive the migration, at least.