vipulgupta2048 opened this issue 3 years ago (status: Open)
Tried different connections, cli versions (previously v12.32), and base images (previously it was debian). balena os download works perfectly on x86 (my system, popOS 18.04) but not sure why it is having troubles inside a container.
@vipulgupta2048, these additional attempts you mentioned (other CLI versions and the Debian base image) -- did they also result in the download failing at around 83%? And, did you happen to try running the container on your laptop / desktop computer with Docker, instead of the Fin? I have just tested the following on my laptop (macOS):
$ docker run -it balenalib/intel-nuc-alpine-node:12-build /bin/bash
bash-5.0# npm install balena-cli@12.37.0 -g --production --unsafe-perm
bash-5.0# balena version -a
balena-cli version "12.37.0"
Node.js version "12.19.1"
bash-5.0# balena os download raspberrypi3 -o /tmp/raspberrypi3.img --version v2.58.3+rev1.prod
Getting device operating system for raspberrypi3
Downloading Device OS 2.58.3+rev1.prod [========================] 100% eta 0s
The image was downloaded successfully
bash-5.0# ls -l ~/.balena/cache
total 913412
-rw-r--r-- 1 root root 935329792 Jan 12 11:54 raspberrypi3-v2.58.3+rev1.prod.img
By the way, check the ls -l ~/.balena/cache command above. Any downloaded file is cached (so it does not need to be downloaded again). I wonder if it could happen that the initial download failed at 83%, and then the file was kept instead of deleted. Although, probably not, as otherwise subsequent commands would simply print "The image was downloaded successfully" instead of failing at any percentage.
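In case it helps with retries, here is a minimal Node.js sketch for clearing that cache between runs. It assumes the cache lives at ~/.balena/cache, as in the ls output above, and that its entries are plain files:

const fs = require('fs');
const os = require('os');
const path = require('path');

// Assumed location of the CLI download cache (see the `ls -l ~/.balena/cache` above).
const cacheDir = path.join(os.homedir(), '.balena', 'cache');
for (const name of fs.readdirSync(cacheDir)) {
    const file = path.join(cacheDir, name);
    console.log('removing cached file:', name, fs.statSync(file).size, 'bytes');
    fs.unlinkSync(file); // force a fresh download on the next run
}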
Anyway, I have tried the os download command 5 times in a balenalib/intel-nuc-alpine-node:12-build container on my laptop, and they all succeeded. It's not clear what the problem would be.
And, did you happen to try running the container on your laptop / desktop computer with Docker, instead of the Fin?
No, I haven't, since balena os download did work on a desktop, but I didn't try the container on a desktop. This is occurring a bit randomly as well. ~So, I will try it in a container on my desktop about 5+ times (deleting the cache each time) to see the results for sure.~ I see you already did the trials. Did you delete the cache every time too?
wonder if it could happen that initial download failed at 83%, then the file was kept instead of deleted.
Yes, after running and retrying balena os download 50 or so times :sweat_smile: We have done the following and have some observations:
This has been working quite well for us at this point. But yeah, we aren't clear what the problem is either. It's weird, very specific, and strangely elusive. Leaving this open in case anyone else comes across this, googles it, and finds this issue.
Did you delete the cache every time too?
Yes. It sounds like the issue was only reproducible on the device.
@pdcastro I got someone else deploying the same image this week. It is a completely new setup; let's see if they face the same issue as well.
I'm running into this with a fresh install of CLI v12.44.23 on Raspberry Pi OS (no container). Will see if I can find a root cause.
pi@raspberrypi:~ $ node -v
v12.22.3
pi@raspberrypi:~ $ npm -v
6.14.13
pi@raspberrypi:~ $ docker version
Client: Docker Engine - Community
Version: 20.10.7
API version: 1.41
Go version: go1.13.15
Git commit: f0df350
Built: Wed Jun 2 11:57:27 2021
OS/Arch: linux/arm
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.7
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: b0f5bc3
Built: Wed Jun 2 11:55:25 2021
OS/Arch: linux/arm
Experimental: false
containerd:
Version: 1.4.6
GitCommit: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc:
Version: 1.0.0-rc95
GitCommit: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
docker-init:
Version: 0.19.0
GitCommit: de40ad0
pi@raspberrypi:~ $ uname -a
Linux raspberrypi 5.10.17-v7l+ #1414 SMP Fri Apr 30 13:20:47 BST 2021 armv7l GNU/Linux
pi@raspberrypi:~ $ balena os download raspberry-pi -o balena.img --debug
[debug] new argv=[/home/pi/.nvm/versions/node/v12.22.3/bin/node,/home/pi/.nvm/versions/node/v12.22.3/bin/balena,os:download,raspberry-pi,-o,balena.img] length=6
Getting device operating system for raspberry-pi
OS version not specified: using latest stable version
Downloading Device OS [========================] 100% eta 26s
Error: unexpected end of file
at Zlib.zlibOnError [as onerror] (zlib.js:182:17)
I suspect this is related to arm32 and not containers specifically, since I reproduced it in Raspberry Pi OS. I haven't found a root cause yet, but I'm still digging.
I can reproduce this with balena-image-manager but not with balena-sdk directly, so at least I've narrowed it down some more.
const { getSdk } = require('balena-sdk');
const balena = getSdk({
apiUrl: "https://api.balena-cloud.com/",
dataDirectory: "/opt/local/balena"
});
const fs = require('fs');
const manager = require('balena-image-manager');
// works
// balena.models.os.download('raspberrypi3').then(function(stream) {
// stream.pipe(fs.createWriteStream('balena.img'));
// });
// fails
manager.get('raspberrypi3', 'default').then(function(stream) {
stream.pipe(fs.createWriteStream('balena.img'));
});
Error: unexpected end of file
at Zlib.zlibOnError [as onerror] (zlib.js:182:17)
Emitted 'error' event on Gunzip instance at:
at Gunzip.onerror (/home/pi/sdk-test/node_modules/readable-stream/lib/_stream_readable.js:640:52)
at Gunzip.emit (events.js:314:20)
at Zlib.zlibOnError [as onerror] (zlib.js:185:8) {
errno: -5,
code: 'Z_BUF_ERROR'
}
As far as I can tell, the issue arises with our use of PassThrough streams in balena-image-manager v7.0.3 (latest).
// Piping to a PassThrough stream is needed to be able
// to then pipe the stream to multiple destinations.
const pass = new stream.PassThrough();
// If we return `pass` directly, the client will not be able
// to read all data from it after a delay, since it will be
// instantly piped to `cacheStream`.
// The solution is to create yet another PassThrough stream,
// pipe to it and return the new stream instead.
const pass2 = new stream.PassThrough();
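As an aside, here is a minimal sketch of the pattern those comments describe: piping one source through a PassThrough so that the same data can reach more than one destination. The file paths below are made up purely for illustration.

const { PassThrough } = require('stream');
const fs = require('fs');

// Any readable stream could be the source, e.g. a network download.
const source = fs.createReadStream('/tmp/source.img');

// A single PassThrough can feed several destinations, because each pipe()
// registers its own consumer on the same readable side.
const pass = new PassThrough();
source.pipe(pass);
pass.pipe(fs.createWriteStream('/tmp/cache-copy.img'));    // e.g. the cache file
pass.pipe(fs.createWriteStream('/tmp/returned-copy.img')); // e.g. the stream handed back to the caller
pass.on('end', () => console.log('both destinations received all data'));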
These PassThrough streams rely on Node.js buffers, which have a fixed maximum size that is smaller on 32-bit systems:
# amd64
> (require('buffer').constants.MAX_LENGTH)
2147483647
# armv7
> (require('buffer').constants.MAX_LENGTH)
1073741823
In testing I found that even one of these transformers is enough to hit the limit, let alone the two connected pipes that we are using to download 800+ MB files.
If I just return imageStream from the function above and remove all the transformers, I can download the image using a sample script.
const { getSdk } = require('balena-sdk');
const balena = getSdk({
apiUrl: "https://api.balena-cloud.com/",
dataDirectory: "/opt/local/balena"
});
const fs = require('fs');
const manager = require('balena-image-manager');
manager.cleanCache();
manager.get('raspberrypi3', 'default').then(function(stream) {
stream.pipe(fs.createWriteStream('balena.img'));
});
The only solution I can see, as a Node.js novice, is to use a different streaming method that doesn't require buffering so much of the file.
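Not necessarily a fix for the buffering concern, but for what it's worth, Node's built-in stream.pipeline() is often suggested for chaining streams, because it forwards errors from any stage and destroys all stages on failure, which plain .pipe() chains do not do. A rough sketch, with made-up names (saveStream, destinationPath):

const { pipeline, PassThrough } = require('stream');
const fs = require('fs');

// Hypothetical helper: write a download stream to disk through an intermediate stream.
function saveStream(imageStream, destinationPath, callback) {
    pipeline(
        imageStream,
        new PassThrough(),                      // stands in for any intermediate transform
        fs.createWriteStream(destinationPath),
        callback                                // called with an Error on failure, or null on success
    );
}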
Great findings @klutchell 👍
I don't doubt that there is a bug in that doDownload function, but I am less clear that it is because PassThrough or Transform streams are hitting the MAX_LENGTH buffer size. My understanding is that PassThrough (and well-implemented streams in general) may perform buffering up to the highWaterMark level (an option at stream creation) that defaults to 16 KB -- see the documentation on stream buffering and the highWaterMark default. 16 KB is well clear of the 1 GB MAX_LENGTH limit that you've uncovered for 32-bit implementations of Node.js.
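For illustration, a small sketch showing that a PassThrough only buffers up to its highWaterMark before write() starts signalling backpressure; it uses only public Node.js stream APIs:

const { PassThrough } = require('stream');

const pass = new PassThrough();                 // default highWaterMark is 16 KiB for byte streams
console.log('highWaterMark:', pass.writableHighWaterMark);

// With no consumer attached, write() returns false once the internal
// buffers fill up, which is the stream signalling backpressure; the data
// is held in memory but never grows anywhere near buffer MAX_LENGTH.
const chunk = Buffer.alloc(4 * 1024);
let writes = 0;
while (pass.write(chunk)) {
    writes++;
}
console.log(`write() signalled backpressure after ~${(writes + 1) * 4} KiB`);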
For reference, the source code for the PassThrough and Transform streams can be found in the Node.js repository.
Perhaps we need to dig further in the implementation of doDownload. There are 4 streams there:
- imageStream - the network download source
- cacheStream - a stream that appears to create a file on disk
- pass
- pass2
When balena os download hangs at 83%, what is the status of each of those streams? Which of them caused backpressure, if any? At that point, did the cacheStream manage to write a file on disk with the .inprogress extension? Is the pass.on('end', cacheStream.persistCache) event ever triggered?
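A minimal sketch of the kind of logging that could help answer these questions; it only uses public stream properties and events, and the variable names (imageStream, cacheStream, pass, pass2) are just the ones from doDownload discussed above:

// Hypothetical instrumentation to drop into doDownload for debugging.
function logStreamStates(streams) {
    for (const [name, s] of Object.entries(streams)) {
        console.error(name, {
            readableLength: s.readableLength,   // bytes waiting to be read
            writableLength: s.writableLength,   // bytes waiting to be written
            readableEnded: s.readableEnded,
            writableFinished: s.writableFinished,
            destroyed: s.destroyed,
        });
    }
}

// e.g. inside doDownload:
// const timer = setInterval(() => logStreamStates({ imageStream, cacheStream, pass, pass2 }), 5000);
// pass.on('end', () => { console.error('pass ended'); clearInterval(timer); });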
@pdcastro I'm having trouble reproducing the issue with my vscode debugger attached to a remote Raspberry Pi OS device. I hope the debugger isn't changing something in the environment that would change the result?
vscode debugger attached to a remote Raspberry Pi OS device
I've never tried it. :-) If you can reproduce the error in a production installation, what I think I would do is the old-school console.error() logging by editing JS files directly in a global production install of the CLI. E.g.:
docker run -it debian:10
apt-get update && apt-get -y install curl python3 git make g++ vim
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash
/bin/bash
nvm install 12.22.4
npm install -g --production --unsafe-perm balena-cli
Then add console.error() calls strategically, e.g. in doDownload:
vim /root/.nvm/versions/node/v12.22.4/lib/node_modules/balena-cli/node_modules/balena-image-manager/build/manager.js
balena os download raspberrypi3 -o /tmp/raspberrypi3.img --version v2.58.3+rev1.prod
Of course, any editing made with vim as suggested above is not version controlled and would be lost if the CLI is reinstalled. But it is certainly flexible. :-)
By the way, to deeply inspect objects, Node.js provides an inspect function:
console.error(require('util').inspect(someObject));
I've spent far too long on this already, but here's an update:
- cpulimit slows the download, but it still fails at the same spot
- throttling imageStream with an artificial delay of 2-10ms also slows the download a lot, but it still fails in the same spot
So CPU and memory on the device seem to be under control, yet the download fails at 83-85%.
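For reference, a delay of that kind can be approximated with a tiny Transform stream; this is just an illustration of the experiment described above, not the exact code used:

const { Transform } = require('stream');

// A throttling Transform: holds each chunk for a few milliseconds before
// passing it on, to simulate a slow producer or consumer.
function makeThrottle(delayMs) {
    return new Transform({
        transform(chunk, _encoding, callback) {
            setTimeout(() => callback(null, chunk), delayMs);
        },
    });
}

// hypothetical usage: imageStream.pipe(makeThrottle(5)).pipe(pass);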
When balena os download hangs at 83%, what is the status of each of those streams? Which of them caused backpressure, if any?
I was dumping all the stream states to the logs, but none of them pointed to an issue. Sometimes they were corked/uncorked and sometimes they were drained, but everything appeared to be operating per spec. It was also too much information to see an obvious pattern with my simple monkey brain.
At that point, did the cacheStream manage to write a file on disk with the .inprogress extension?
Yeah, the .inprogress file is always created and usually nearly the full size of the image at the time of failure, so data is streaming to disk for sure.
Is the pass.on('end', cacheStream.persistCache) event ever triggered?
No, in the case of failure I haven't seen it be called, and the .inprogress file does not get renamed.
This is not 100% reproducible, but in certain environments I can get it almost every test.
I strongly suspect that slow disk write speeds, like those seen on SD cards, are a big part of the problem. I had hoped the artificial throttling would help with this, but it seems not.
Is there anything specific I can look for in the stream states during failure?
Also, why does it always fail between 83-85%? What happens at that point?
@vipulgupta2048 What do I need to have in .npmrc? I'm getting the error BuildSecretMissingError: .npmrc
@zwhitchcox I don't understand in which context you mean. What component are you trying to reproduce this with when you get the error?
From my experience (and a pure guess), this error comes up when private npm packages are being installed and you need an NPM API key in your .npmrc / working environment to authorize their download. This was the case before, when one of the packages we used (the testbot SDK) was private, but that is no longer true: we removed the need for .npmrc by making the SDK public.
I am running balenaCLI v12.37.0 (installed globally through NPM) inside a container with fincm3-alpine-node as the base image. On our testing runs, we see the balena os download command flaking out about 8 out of 10 times.
Expected Behavior
The image requested should download without errors.
Actual Behavior
On running the command inside the container,
balena os download raspberrypi3 -o /tmp/raspberrypi3.img --version v2.58.3+rev1.prod
the download proceeds normally and gets stuck strangely at 83% (every time). The ETA keeps increasing until the command eventually fails with the error mentioned below. We did some troubleshooting of our own: tried different connections, CLI versions (previously v12.32), and base images (previously it was Debian). balena os download works perfectly on x86 (my system, popOS 18.04), but we are not sure why it is having trouble inside a container. Anything you suggest, I would be most happy to try out. We have tried using the --debug flag as well, but there is no change in output.
Steps to Reproduce the Problem
This is the most important and helpful part of a bug report. If we cannot reproduce the problem, it is difficult to tell what the fix should be, or whether code changes have fixed it.
balena os download raspberrypi3 -o /tmp/raspberrypi3.img --version v2.58.3+rev1.prod
Specifications
Please do comment if other folks face this problem or are able to reproduce this issue, so that the team can start diagnosing it. Thanks!