balena-io / balena-cli

The official balena CLI tool.
Apache License 2.0
454 stars 139 forks source link

balena os download hangs on 83%, errors out with unexpected end of file at Zlib.zlibOnError #2152

Open vipulgupta2048 opened 3 years ago

vipulgupta2048 commented 3 years ago

I am running balenaCLI v12.37.0 (Installed globally through NPM) inside a container with fincm3-alpine-node as the base image. On our testing runs, we see the balena os download command flaking out about 8 out of 10 times.

Expected Behavior

The image requested should download without errors.

Actual Behavior

On running the command inside the container, balena os download raspberrypi3 -o /tmp/raspberrypi3.img --version v2.58.3+rev1.prod the download proceeds normally and gets stuck strangely at 83% (Every time). The ETA time does keep increasing after eventually failing with the error mentioned below.

bash-5.0# balena os download raspberrypi3 -o /tmp/raspberrypi3.img --version v2.58.3+rev1.prod
Getting device operating system for raspberrypi3
Downloading Device OS 2.58.3+rev1.prod [========================] 100% eta 5s  
Error: unexpected end of file
    at Zlib.zlibOnError [as onerror] (zlib.js:180:17)

We did some troubleshooting of our own. Tried different connections, cli versions (previously v12.32), and base images (previously it was debian) balena os download works perfectly on x86 (my system, popOS 18.04) but not sure why it is having troubles inside a container. Anything you suggest, I be most happy to try out. We have tried using --debug flag as well but no change in output.

Steps to Reproduce the Problem

This is the most important and helpful part of a bug report. If we cannot reproduce the problem, it is difficult to tell what the fix should be, or whether code changes have fixed it.

  1. Dockerfile.template and docker-compose.yml here for the container
  2. Deployed on a provisioned balenaFin v1.1.1 with balena push
  3. Run the command on the container, balena os download raspberrypi3 -o /tmp/raspberrypi3.img --version v2.58.3+rev1.prod

Specifications

Please do comment if other folks face this problem or are able to reproduce this issue for the team to start working on diagnosing this issue. Thanks!

pdcastro commented 3 years ago

Tried different connections, cli versions (previously v12.32), and base images (previously it was debian) balena os download works perfectly on x86 (my system, popOS 18.04) but not sure why it is having troubles inside a container.

@vipulgupta2048, these additional attempts you mentioned (other CLI versions and the Debian base image) -- did they also result in the download failing at around 83%? And, did you happen to try running the container on your laptop / desktop computer with Docker, instead of the Fin? I have just tested the following on my laptop (macOS):

$ docker run -it balenalib/intel-nuc-alpine-node:12-build /bin/bash

bash-5.0# npm install balena-cli@12.37.0 -g --production --unsafe-perm

bash-5.0# balena version -a
balena-cli version "12.37.0"
Node.js version "12.19.1"

bash-5.0# balena os download raspberrypi3 -o /tmp/raspberrypi3.img --version v2.58.3+rev1.prod
Getting device operating system for raspberrypi3
Downloading Device OS 2.58.3+rev1.prod [========================] 100% eta 0s
The image was downloaded successfully

bash-5.0# ls -l ~/.balena/cache
total 913412
-rw-r--r--    1 root     root     935329792 Jan 12 11:54 raspberrypi3-v2.58.3+rev1.prod.img

By the way, check the ls -l ~/.balena/cache command above. Any downloaded file is cached (so it does not need to be downloaded again). I wonder if it could happen that initial download failed at 83%, then the file was kept instead of deleted. Although, probably not, as otherwise subsequent commands would simply print "The image was downloaded successfully" instead of failing at any percentage.

Anyway, I have tried os download command 5 times in a balenalib/intel-nuc-alpine-node:12-build container on my laptop, and they all succeeded. It's not clear what the problem would be.

vipulgupta2048 commented 3 years ago

And, did you happen to try running the container on your laptop / desktop computer with Docker, instead of the Fin?

No, I haven't since balena os download did work on a desktop. But, didn't try container on desktop. This is a bit randomly occurring as well. ~So, I will try it on my desktop container for about 5+ times (deleting the cache each time around) to see the results for sure.~ I see you already did the trials. Did you delete the cache every time too?

wonder if it could happen that initial download failed at 83%, then the file was kept instead of deleted.

Yes, after about running and retrying balena os download 50 or so times :sweat_smile: We have done the following and have some observations:

  1. balena os download is smart, does check the integrity of the download. So it does end up downloading again if it failed last time.
  2. If the download was successful before then it checks the image if it's correct with the integrity check. Takes a bit of time but in the end, does show a download successfully message.
  3. To reduce the chances of this behavior occurring, we have created named volumes bound to our download path and balenaCLI's cache path. So once we get the download done right, we aren't downloading again. Here CLI's check comes in handy again as it checks the image on every test run but isn't downloading again.

This has been working quite well for us at this point. But, yeah we aren't clear what the problem is as well. It's weird, very specific, and strangely elusive. Opened if anyone comes across this, googles and finds this issue.

pdcastro commented 3 years ago

Did you delete the cache every time too?

Yes. It sounds like the issue was only reproducible on the device.

vipulgupta2048 commented 3 years ago

@pdcastro I got someone else deploying the same image this week. Completely new, let's see if they face the same issue as well.

klutchell commented 3 years ago

I'm running into this with a fresh install of CLI v12.44.23 on Raspberry Pi OS (no container). Will see if I can find a root cause.

pi@raspberrypi:~ $ node -v
v12.22.3
pi@raspberrypi:~ $ npm -v
6.14.13
pi@raspberrypi:~ $ docker version
Client: Docker Engine - Community
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        f0df350
 Built:             Wed Jun  2 11:57:27 2021
 OS/Arch:           linux/arm
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:55:25 2021
  OS/Arch:          linux/arm
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
pi@raspberrypi:~ $ uname -a
Linux raspberrypi 5.10.17-v7l+ #1414 SMP Fri Apr 30 13:20:47 BST 2021 armv7l GNU/Linux
pi@raspberrypi:~ $ balena os download raspberry-pi -o balena.img --debug
[debug] new argv=[/home/pi/.nvm/versions/node/v12.22.3/bin/node,/home/pi/.nvm/versions/node/v12.22.3/bin/balena,os:download,raspberry-pi,-o,balena.img] length=6
Getting device operating system for raspberry-pi
OS version not specified: using latest stable version
Downloading Device OS [========================] 100% eta 26s 
Error: unexpected end of file
    at Zlib.zlibOnError [as onerror] (zlib.js:182:17)
klutchell commented 3 years ago

I suspect this is related to arm32 and not containers specifically, since I reproduced in Raspberry Pi OS. I haven't found a root cause yet but I'm still digging.

klutchell commented 3 years ago

I can reproduce this with balena-image-manager but not with balena-sdk directly, so at least I've narrowed it down some more.

const { getSdk } = require('balena-sdk');

const balena = getSdk({
    apiUrl: "https://api.balena-cloud.com/",
    dataDirectory: "/opt/local/balena"
});

const fs = require('fs');
const manager = require('balena-image-manager');

// works
// balena.models.os.download('raspberrypi3').then(function(stream) {
//     stream.pipe(fs.createWriteStream('balena.img'));
// });

// fails
manager.get('raspberrypi3', 'default').then(function(stream) {
    stream.pipe(fs.createWriteStream('balena.img'));
});
Error: unexpected end of file
    at Zlib.zlibOnError [as onerror] (zlib.js:182:17)
Emitted 'error' event on Gunzip instance at:
    at Gunzip.onerror (/home/pi/sdk-test/node_modules/readable-stream/lib/_stream_readable.js:640:52)
    at Gunzip.emit (events.js:314:20)
    at Zlib.zlibOnError [as onerror] (zlib.js:185:8) {
  errno: -5,
  code: 'Z_BUF_ERROR'
}
klutchell commented 3 years ago

As far as I can tell, the issue arises with our use of PassThrough streams in balena-image-manager v7.0.3 (latest).

https://github.com/balena-io-modules/balena-image-manager/blob/2a4e905f7b125396303b69bff37bc901038c25d5/lib/manager.js#L28-L57

    // Piping to a PassThrough stream is needed to be able
    // to then pipe the stream to multiple destinations.
    const pass = new stream.PassThrough();
    // If we return `pass` directly, the client will not be able
    // to read all data from it after a delay, since it will be
    // instantly piped to `cacheStream`.
    // The solution is to create yet another PassThrough stream,
    // pipe to it and return the new stream instead.
    const pass2 = new stream.PassThrough();

These transformers use buffers that have a fixed size, smaller on 32-bit systems.

# amd64
> (require('buffer').constants.MAX_LENGTH)
2147483647

# armv7
> (require('buffer').constants.MAX_LENGTH)
1073741823

In testing I found that even one of these transformers is enough to hit the limit, let alone the two connected pipes that we are using to download +800mb files.

If I just return imageStream from the function above and remove all the the transformers I can download the image using a sample script.

const { getSdk } = require('balena-sdk');

const balena = getSdk({
    apiUrl: "https://api.balena-cloud.com/",
    dataDirectory: "/opt/local/balena"
});

const fs = require('fs');
const manager = require('balena-image-manager');

manager.cleanCache();

manager.get('raspberrypi3', 'default').then(function(stream) {
    stream.pipe(fs.createWriteStream('balena.img'));
});

The only solution I can see as a node.js novice is to use a different streaming method that doesn't require buffering so much of the file.

pdcastro commented 3 years ago

Great findings @klutchell 👍

I don't doubt that there is a bug in that doDownload function, but I am less clear that it is because PassThrough or Transform streams are hitting the MAX_LENGTH buffer size. My understanding is that PassThrough (and well implemented streams in general) may perform buffering up to the highWaterMark level (an option at stream creation) that defaults to 16 KB -- see documentation on stream buffering and the highWaterMark default. 16KB is well clear of the 1GB MAX_LENGTH limit that you've uncovered for 32-bit implementations of Node.js.

For reference, the source code for PassThrough and Transform streams can be found at:

Perhaps we need to dig further in the implementation of doDownload. There 4 streams there:

When balena os download hangs at 83%, what is the status of each of those streams? Which of them caused backpressure, if any? At that point, did the cacheStream manage to write a file on disk with the .inprogress extension? Is the pass.on('end', cacheStream.persistCache) event ever triggered?

klutchell commented 3 years ago

@pdcastro I'm having trouble reproducing the issue with my vscode debugger attached to a remote Raspberry Pi OS device. I hope the debugger isn't changing something in the environment that would change the result?

pdcastro commented 3 years ago

vscode debugger attached to a remote Raspberry Pi OS device

I've never tried it. :-) If you can reproduce the error in a production installation, what I think I would do is the old school console.error() logging by editing JS files directly in a global production install of the CLI. E.g.:

Of course, any editing made with vim as suggested above is not version controlled and would be lost if the CLI is reinstalled. But it is certainly flexible. :-)

pdcastro commented 3 years ago

By the way, to deeply inspect objects, Node.js provides an inspect function:

console.error(require('util').inspect(someObject));
klutchell commented 3 years ago

I've spent far too long on this already, but here's an update:

So CPU and memory on the device seem to be under control, yet the download fails at 83-85%.

When balena os download hangs at 83%, what is the status of each of those streams? Which of them caused backpressure, if any?

I was dumping all the stream states to the logs but none of them pointed to an issue. Sometimes they were corked/uncorked and sometimes they were drained but everything appeared to be operating per spec. It was also too much information to see an obvious pattern with my simple, monkey brain.

At that point, did the cacheStream manage to write a file on disk with the .inprogress extension?

Yeah, the .inprogress file is always created and usually nearly the full size of the image at the time of failure, so data is streaming to disk for sure.

Is the pass.on('end', cacheStream.persistCache) event ever triggered?

No, in the case of failure I haven't seen be called, and the .inprogress file does not get renamed.

This is not 100% reproducible, but in certain environments I can get it almost every test.

I strongly suspect that slow disk write speeds, like seen on SD cards, is a big part of the problem. I had hoped the artificial throttling would help with this but it seems not.

Is there anything specific I can look for in the stream states during failure?

Also why is it always between 83-85% when it fails, what happens at that time?

zwhitchcox commented 3 years ago

@vipulgupta2048 What do I need to have in .npmrc? I'm getting the error BuildSecretMissingError: .npmrc

vipulgupta2048 commented 3 years ago

@zwhitchcox Don't understand in which context do you mean? What component are you trying to reproduce this with to get the error?

From my experience and pure guess, this error comes along when private npm packages are being installed and you would be needing a NPM API key in your npmrc working env to authorize their downloads. This was the case before when one of the packages we used was private called testbotSDK but that is no longer the case, we removed the need for .npmrc by making the SDK public.