balena-io / balena-cli

The official balena CLI tool.

Node buffer too small for large images #1031


zackbright commented 5 years ago

I can provide the CLI/Node/npm versions as needed, but this issue should be independent of them.

When manipulating large images, i.e. images larger than Node's maximum buffer size, some of the files are loaded into memory in one piece instead of as a stream.

For example, I run the command below on a Dockerfile that builds from another codebase and compiles for my device. For reference, I'm building ROS 2.

sudo balena build --deviceType rasperrypi3 --arch armhf --emulated .

After about 16 hours of compiling... yup... the image generated is quite large: the Docker image is ~4.2 GB. If I attempt to deploy, I get:

sudo balena deploy talker3 talker3.tar 
[Info]    Creating default composition with image: talker3.tar
[Info]    Building for armv7hf/raspberrypi3
[Build]   Built 1 service in 0 seconds
[Build]   main Preparing...
[Error]   Deploy failed
ERR_FS_FILE_TOO_LARGE: File size (4192567296) is greater than possible Buffer: 2147483647 bytes
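
(For reference, 2147483647 bytes is 2^31 - 1, the maximum length of a single Node Buffer. Assuming Node >= 8.2, you can check the ceiling directly:)

// Node's hard cap for a single Buffer; on 64-bit Node 10 this prints
// 2147483647, exactly the limit quoted in the error above.
const { constants } = require('buffer');
console.log(constants.MAX_LENGTH);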

I managed to get everything on my device by pushing to my docker hub repo and pulling on the device.

However, I noticed my entrypoint needed a small tweak (rookie mistake: didn't chmod +x first). Luckily the copy happens after compiling, so I can try to rebuild the image and leverage the previous layers where the compilation happened:

sudo balena build --deviceType rasperrypi3 --arch armhf --emulated .
[Info]    Creating default composition with source: /Users/zackbright/resin_dev/ros2_app/talker3
[Info]    Building for armhf/rasperrypi3
[Build]   Built 1 service in 0 seconds
[Build]   main Preparing...
[Error]   Build failed
ERR_FS_FILE_TOO_LARGE: File size (4192567296) is greater than possible Buffer: 2147483647 bytes

If you need help, don't hesitate in contacting us at:

  GitHub: https://github.com/balena-io/balena-cli/issues/new
  Forums: https://forums.balena.io

So, somewhere in the read of the existing local image, fs.readFile or something similar is used and is throwing ERR_FS_FILE_TOO_LARGE.

I'm not sure exactly where the call is being made without going through the entire codebase, so I figured I would post here to see whether it can be patched, or whether this is an upstream error from Docker/Node that cannot be avoided.
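
For illustration, the difference as I understand it (a rough sketch with a hypothetical path, not the CLI's actual code):

const fs = require('fs');

// What appears to happen now: the whole file must fit into a single Buffer,
// which throws ERR_FS_FILE_TOO_LARGE for anything over 2**31 - 1 bytes.
fs.readFile('talker3.tar', (err, buf) => { /* a 4.2 GB file cannot fit */ });

// Reading as a stream sidesteps the limit: data arrives in small chunks.
fs.createReadStream('talker3.tar')
  .on('data', (chunk) => { /* each chunk is a small Buffer */ })
  .on('end', () => console.log('done'));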

dfunckt commented 5 years ago

Can you please try running the same command with DEBUG=1 BLUEBIRD_DEBUG=1 balena build ... and paste the output here?

zackbright commented 5 years ago

Here you are.

It looks like it is definitely the case that Node.js is not able to handle a file this large and thus needs to load it in as a stream. A quick Google search pulls up some ways others have done this, but I don't want to be telling you all how to build your tool. In particular, I'm unsure how reading in as a stream may affect the actual processing of the images.

sudo DEBUG=1 BLUEBIRD_DEBUG=1 balena build --deviceType rasperrypi3 --arch armhf --emulated .
[Debug]   Parsing input...
[Debug]   Loading project...
[Debug]   Resolving project...
[Debug]   Failed to resolve project: 
[Debug]   AggregateError of:
[Debug]       Error: ENOENT: no such file or directory, open '/Users/zackbright/resin_dev/ros2_app/talker3/docker-compose.yml'
[Debug]       Error: ENOENT: no such file or directory, open '/Users/zackbright/resin_dev/ros2_app/talker3/docker-compose.yaml'
[Info]    Creating default composition with source: /Users/zackbright/resin_dev/ros2_app/talker3
[Debug]   Creating project...
[Info]    Building for armhf/rasperrypi3
[Build]   Built 1 service in 0 seconds
[Build]   main Preparing...
[Error]   Build failed
RangeError [ERR_FS_FILE_TOO_LARGE]: File size (4192567296) is greater than possible Buffer: 2147483647 bytes
    at FSReqWrap.readFileAfterStat [as oncomplete] (fs.js:262:11)
From previous event:
    at Object.readFile (eval at thenify.withCallback (/usr/local/lib/node_modules/balena-cli/node_modules/thenify/index.js:33:10), <anonymous>:10:8)
    at /usr/local/lib/node_modules/balena-cli/build/utils/compose.js:139:52
From previous event:
    at tarDirectory (/usr/local/lib/node_modules/balena-cli/build/utils/compose.js:136:28)
    at /usr/local/lib/node_modules/balena-cli/build/utils/compose.js:192:12
From previous event:
    at Object.exports.buildProject (/usr/local/lib/node_modules/balena-cli/build/utils/compose.js:191:6)
    at /usr/local/lib/node_modules/balena-cli/build/actions/build.js:28:20
From previous event:
    at buildProject (/usr/local/lib/node_modules/balena-cli/build/actions/build.js:22:88)
    at /usr/local/lib/node_modules/balena-cli/build/actions/build.js:94:16
From previous event:
    at /usr/local/lib/node_modules/balena-cli/build/actions/build.js:93:22
From previous event:
    at Command.action (/usr/local/lib/node_modules/balena-cli/build/actions/build.js:90:8)
    at /usr/local/lib/node_modules/balena-cli/node_modules/capitano/build/command.js:98:37
    at Command.module.exports.Command.applyPermissions (/usr/local/lib/node_modules/balena-cli/node_modules/capitano/build/command.js:43:14)
    at /usr/local/lib/node_modules/balena-cli/node_modules/capitano/build/command.js:92:24
    at Command.module.exports.Command._checkElevation (/usr/local/lib/node_modules/balena-cli/node_modules/capitano/build/command.js:63:14)
    at /usr/local/lib/node_modules/balena-cli/node_modules/capitano/build/command.js:76:22
    at /usr/local/lib/node_modules/balena-cli/node_modules/capitano/build/signature.js:176:14
    at /usr/local/lib/node_modules/balena-cli/node_modules/capitano/node_modules/async/lib/async.js:52:16
    at /usr/local/lib/node_modules/balena-cli/node_modules/capitano/node_modules/async/lib/async.js:269:32
    at /usr/local/lib/node_modules/balena-cli/node_modules/capitano/node_modules/async/lib/async.js:44:16
    at /usr/local/lib/node_modules/balena-cli/node_modules/capitano/build/signature.js:170:16
    at /usr/local/lib/node_modules/balena-cli/node_modules/capitano/node_modules/async/lib/async.js:181:20
    at Immediate.iterate (/usr/local/lib/node_modules/balena-cli/node_modules/capitano/node_modules/async/lib/async.js:262:13)
    at runCallback (timers.js:705:18)
    at tryOnImmediate (timers.js:676:5)
    at processImmediate (timers.js:658:5)
    at process.topLevelDomainCallback (domain.js:121:23)
zackbright commented 5 years ago

From reviewing the output above, it was clear that the image being read was not the image stored in Docker (i.e. the one living in Docker's local image store), but the .tar of the image that I had saved locally. I moved the .tar so that it would not be found, and that seems to have fixed things.

Probably still worth looking into, as I was not able to deploy from the .tar image.

I will see if I can deploy straight from the image in Docker's local store (i.e. the images returned by docker image ls) rather than from the saved tar archive.

zackbright commented 5 years ago

Forcing the rebuild caused Docker to use the locally stored version of the image.

Got my happy unicorn to show up.

I'm adding the details so that anyone else who has this issue can resolve it.

Build locally, and as long as Docker is still running you can rebuild; because it is a Docker image, the rebuild doesn't actually force a re-compile or rebuild of the code (this may be obvious to most people familiar with Docker, but as a neophyte I hesitated and didn't fully understand it).

So by re-invoking the build as shown below, you can get the built image from the local Docker cache and deploy it.

sudo DEBUG=1 BLUEBIRD_DEBUG=1 balena deploy talker3 --build --deviceType rasperrypi3 --arch armhf --emulated .
Password:
[Debug]   Parsing input...
[Debug]   Loading project...
[Debug]   Resolving project...
[Debug]   Failed to resolve project: 
[Debug]   AggregateError of:
[Debug]       Error: ENOENT: no such file or directory, open '/Users/zackbright/resin_dev/ros2_app/talker3/docker-compose.yml'
[Debug]       Error: ENOENT: no such file or directory, open '/Users/zackbright/resin_dev/ros2_app/talker3/docker-compose.yaml'
[Info]    Creating default composition with source: /Users/zackbright/resin_dev/ros2_app/talker3
[Debug]   Creating project...
[Info]    Building for armv7hf/raspberrypi3
[Debug]   Prepared tasks; building...
[Build]   Built 1 service in 0 seconds
[Build]   main Image size: 3.79 GB
[Info]    Creating release...
[Debug]   Tagging images...
[Debug]   Authorizing push...
[Info]    Pushing images to registry...
[Debug]   Saving image registry2.balena-cloud.com/v2/9a83c2c628ef523b5fea9f45d98637df
[Debug]   Untagging images...
[Info]    Saving release...
[Success] Deploy succeeded!
[Success] Release: 3fb40a2603cb98c61a31d10743c67c57

                \
                 \
                  \\
                   \\
                    >\/7
                _.-(6'  \
               (=___._/` \
                    )  \ |
                   /   / |
                  /    > /
                 j    < _\
             _.-' :      ``.
             \ r=._\        `.
            <`\\_  \         .`-.
             \ r-7  `-. ._  ' .  `\
              \`,      `-.`7  7)   )
               \/         \|  \'  / `-._
                          ||    .'
                           \\  (
                            >\  >
                        ,.-' >.'
                       <.'_.''
                         <'
CameronDiver commented 5 years ago

Hey @zackbright, thanks for bringing this up! This has actually revealed quite a few architectural problems in the way that build sources are packaged and sent to the Docker daemon (whether local or remote). Basically, even though we're using a streaming interface for the tar creation, the other end of the stream isn't read until the entire tar archive has been packed into it.
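
Roughly the shape of the problem (a simplified sketch, not the actual CLI code): if nothing consumes the pack stream until packing has finished, every byte accumulates in memory, whereas attaching the consumer first lets back-pressure do its job.

const tar = require('tar-stream'); // streaming tar packer, used here for illustration
const fs = require('fs');

const pack = tar.pack();

// Attach the consumer before packing, so chunks are written out (here to a
// file; in the CLI's case, to the Docker daemon) as fast as they are produced.
pack.pipe(fs.createWriteStream('build-context.tar'));

pack.entry({ name: 'Dockerfile' }, 'FROM balenalib/raspberrypi3\n');
pack.finalize();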

There are a few different places where this occurs, so I'm going to work on fixing these up; it should have the advantage of making balena build and balena push (both the cloud and local variants) much more stable.

I'll reference this issue in the PRs if you want to track the progress of this.

srlowe commented 4 years ago

Closing as fixes have been merged. Please reopen if you still encounter problems.

Langhalsdino commented 4 years ago

I encountered the problem while building an image containing CUDA, TF and OpenCV for the Nvidia Jetson Nano.

It works with balena CLI v11.24.0 but does not work with v11.26.0; I will try it with v11.28.11 on Linux :)

 [Debug]   Parsing input...
 [Debug]   Loading project...
 [Debug]   Resolving project...
 [Info]    Compose file detected
 [Debug]   Creating project...
 [Info]    Building for aarch64/jetson-nano
 [Build]   Building services...
 [Build]   jetson-nano-base          Preparing...
 [Info]    Installing qemu for aarch64 emulation...
 [Info]    Emulation is enabled
 [Build]   Built 1 service in 0:01
 [Error]   Build failed
ERR_FS_FILE_TOO_LARGE: File size (2380747015) is greater than possible Buffer: 2147483647 bytes
 RangeError [ERR_FS_FILE_TOO_LARGE]: File size (2380747015) is greater than possible Buffer: 2147483647 bytes
     at FSReqWrap.readFileAfterStat [as oncomplete] (fs.js:264:11)
Langhalsdino commented 4 years ago

The limit still seems to be present in v11.28.11:

 [Debug]   Parsing input...
 [Debug]   Loading project...
 [Debug]   Resolving project...
 [Debug]   docker-compose.yml file found at "/builds/.../..."
 [Debug]   Creating project...
 [Info]    Building for aarch64/jetson-nano
 [Build]   Building services...
 [Build]   jetson-nano-base          Preparing...
 [Info]    Installing qemu for aarch64 emulation...
 [Info]    Emulation is enabled
 [Build]   Built 1 service in 0:01
 [Error]   Build failed
 ERR_FS_FILE_TOO_LARGE: File size (2380747015) is greater than possible Buffer: 2147483647 bytes
 RangeError [ERR_FS_FILE_TOO_LARGE]: File size (2380747015) is greater than possible Buffer: 2147483647 bytes
     at FSReqWrap.readFileAfterStat [as oncomplete] (fs.js:264:11)
 From previous event:
     at readFile (eval at thenify.withCallback (/snapshot/versioned-source/node_modules/thenify/index.js:33:10), <anonymous>:10:8)
     at /snapshot/versioned-source/build/utils/compose.js:156:49
 From previous event:
     at tarDirectory (/snapshot/versioned-source/build/utils/compose.js:153:28)
     at /snapshot/versioned-source/build/utils/compose.js:216:12
 From previous event:
     at Object.exports.buildProject (/snapshot/versioned-source/build/utils/compose.js:215:6)
     at /snapshot/versioned-source/build/actions/build.js:51:20
 From previous event:
     at buildProject (/snapshot/versioned-source/build/actions/build.js:45:60)
     at /snapshot/versioned-source/build/actions/build.js:133:16
 From previous event:
     at /snapshot/versioned-source/build/actions/build.js:132:22
 From previous event:
     at Command.action (/snapshot/versioned-source/build/actions/build.js:129:8)
     at /snapshot/versioned-source/node_modules/capitano/build/command.js:98:37
     at Command.module.exports.Command.applyPermissions (/snapshot/versioned-source/node_modules/capitano/build/command.js:43:14)
     at /snapshot/versioned-source/node_modules/capitano/build/command.js:92:24
     at Command.module.exports.Command._checkElevation (/snapshot/versioned-source/node_modules/capitano/build/command.js:63:14)
     at /snapshot/versioned-source/node_modules/capitano/build/command.js:76:22
     at /snapshot/versioned-source/node_modules/capitano/build/signature.js:177:14
     at _toString (/snapshot/versioned-source/node_modules/capitano/node_modules/async/lib/async.js:52:16)
     at /snapshot/versioned-source/node_modules/capitano/node_modules/async/lib/async.js:269:32
     at /snapshot/versioned-source/node_modules/capitano/node_modules/async/lib/async.js:44:16
     at /snapshot/versioned-source/node_modules/capitano/build/signature.js:171:16
     at _setImmediate (/snapshot/versioned-source/node_modules/capitano/node_modules/async/lib/async.js:181:20)
     at Immediate.iterate (/snapshot/versioned-source/node_modules/capitano/node_modules/async/lib/async.js:262:13)
     at runCallback (timers.js:705:18)
     at tryOnImmediate (timers.js:676:5)
     at processImmediate (timers.js:658:5)
     at process.topLevelDomainCallback (domain.js:126:23)
 If you need help, don't hesitate in contacting our support forums at
 https://forums.balena.io
 For CLI bug reports or feature requests, have a look at the GitHub issues or
 create a new one at: https://github.com/balena-io/balena-cli/issues/
pdcastro commented 4 years ago

@Langhalsdino, thank you for reporting it and including the stack trace. I am re-opening this issue.

In pre-transpiled code, I think the error comes from this line: https://github.com/balena-io/balena-cli/blob/v11.28.11/lib/utils/compose.coffee#L137

Probable bug explanation: the CLI is loading the whole file into an in-memory buffer before creating the tar stream that it sends to the Docker daemon over HTTP, and a single Node Buffer cannot hold a file that large. The fix will probably be to load the file in smaller chunks, or to use Node's streams to do it for us.
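
A minimal sketch of that fix (assuming the tar-stream package that tarDirectory appears to use; the file name is hypothetical): stream each file into its tar entry instead of reading it whole.

const tar = require('tar-stream');
const fs = require('fs');

const pack = tar.pack();
pack.pipe(fs.createWriteStream('context.tar')); // downstream consumer attached up front

// Instead of pack.entry({ name }, fs.readFileSync(file)), which needs the whole
// file in one Buffer, stream the entry body; only the header needs the size.
const file = 'talker3.tar';
const { size } = fs.statSync(file);
const entry = pack.entry({ name: file, size }, (err) => {
  if (err) throw err;
  pack.finalize(); // no more entries
});
fs.createReadStream(file).pipe(entry);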

I gather that it was already like that in CLI v11.24.0 though, so perhaps your file was a bit smaller when you were using that version.

Langhalsdino commented 4 years ago

> I gather that it was already like that in CLI v11.24.0 though, so perhaps your file was a bit smaller when you were using that version.

I figured out why my build works with CLI v11.24.0 and not with v11.26.0 or v11.28.11.

I added the --cache-from option to the build, and therefore Docker tries to push the large layers to the daemon. Since v11.24.0 did not have the --cache-from option, the build succeeded there, because it never pushed the prebuilt layers to the daemon :)

argv=[/usr/local/bin/balena,/snapshot/versioned-source/bin/balena,build,--application,jetson-app-name,--logs,--source,.,--emulated,--build-arg,BUILDKIT_INLINE_CACHE=1,--cache-from,registry.my.com/jetson:latest]
Langhalsdino commented 4 years ago

Hmm, that's strange. I tried to run the same pipeline on different servers, and it succeeded on a big server (24 threads, 64 GB RAM) but fails on a smaller one (12 threads, 14 GB RAM).

But 2380747015 bytes should be about 2.3 GB, right? So this should not be a problem for either of them. I will run it once more and investigate the system resources. Since I dockerized the build process, there should not be any difference between the systems except their specs.
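
(A quick check of the numbers, on the assumption that the limit is Node's per-Buffer cap rather than total RAM:)

const fileSize = 2380747015;       // bytes, taken from the error message
const bufferCap = 2 ** 31 - 1;     // 2147483647, Node's maximum Buffer length
console.log(fileSize / 1024 ** 3); // ~2.22 GiB
console.log(fileSize > bufferCap); // true: over the cap regardless of system RAM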

Langhalsdino commented 4 years ago

There seems to be no real difference in resource usage, so I have no clue why the two behave differently. Maybe they cache different images, run different dind versions, or do caching differently. I don't know :/

For now I will stick with the server that works, even though I do not know the reason.

balena-ci commented 4 years ago

[jtonello] This issue has attached support thread https://jel.ly.fish/#/6ed1f99a-a3cf-4447-bc84-d649a73aa2b3

Langhalsdino commented 4 years ago

I just encountered the issue again, and this time I needed to solve it.

Here are the steps that helped me resolve the issue, even though I am not sure whether all of them are needed.

  1. create a new empty folder
  2. copy our docker-compose.yml file into it
  3. create an empty Dockerfile that just builds from the upstream Docker image, e.g. FROM balenalib/jetson-nano-ubuntu-python:3.6-bionic
  4. run the build process with cache enabled: balena build --application appName --logs --source . --emulated --build-arg BUILDKIT_INLINE_CACHE=1
  5. copy in all the files from our repository that contains the project and replace the Dockerfile with the correct one
  6. run the build process with cache enabled again: balena build --application appName --logs --source . --emulated --build-arg BUILDKIT_INLINE_CACHE=1

A strange observation while figuring out how to work around the symptoms: just replacing the Dockerfile in our source repository with an almost empty one did not change anything. The error ERR_FS_FILE_TOO_LARGE: File size (2380749827) is greater than possible Buffer: 2147483647 bytes was still raised, even though every COPY had been removed from the Dockerfile (presumably because the whole source directory is tarred up as build context before the Dockerfile is even consulted).

If you need more information to debug this further, I am happy to assist.

It's now working with the following tool versions: