OpenFn / apollo

GNU Lesser General Public License v2.1

Don't include node in the docker image #43

Closed josephjclark closed 6 months ago

josephjclark commented 7 months ago

This is basically an optimisation of the build environment. I am not terribly excited about fixing it urgently.

Background

The docker image which runs the gen server currently requires node to be installed in order to build. This is silly because otherwise we do not depend on node at all (we use bun).

Why is node needed

deep breath

node-calls-python must be compiled after installation to build its native binaries.

This compilation process uses node-gyp.

node-gyp requires a node.js source build in order to generate a bunch of build variables. node-calls-python depends on at least one of these variables to build (llm_version, which has a value of "0_0", which looks like a placeholder to me, but hey ho).

If node.js is not installed in the environment, node-gyp does not generate the correct build variables, and the build fails.

Interestingly, node-gyp will actually call out and download the headers it needs. So I don't think it needs an actual node installation, it might just need a version number in the env.

By the way, I don't think the build variables are material at all. I think some of the low level compilation scripts make strict assumptions about node, but the value of these variables matters little. After all, this is config for a node environment, but the binaries run quite happily in a bun environment.

Solutions

1. Use a builder image

A very simple approach is to use a builder image, with node installed, to install and build the server, then copy the node_modules into a new image which does not have node.

It feels like a lot of overhead for little gain, but it would work.
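
For illustration, a builder setup could look roughly like the sketch below. It is written as a shell heredoc so it can be pasted into a terminal to produce the file. The base image (oven/bun:1), the apt packages, the /app path and the file name Dockerfile.sketch are all assumptions, not what the repo actually uses, and runtime dependencies such as Python are left out.

```shell
# Write a two-stage Dockerfile sketch.
# Every image name, package and path below is an assumption for illustration.
cat > Dockerfile.sketch <<'EOF'
# Stage 1: has node installed, so node-gyp can compile node-calls-python.
FROM oven/bun:1 AS builder
RUN apt-get update && apt-get install -y nodejs python3-dev build-essential
WORKDIR /app
COPY package.json ./
RUN bun install

# Stage 2: no node here; reuse the compiled node_modules from the builder.
FROM oven/bun:1
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
# (copy the application source and set the start command as the real image does)
EOF
```

The final image only ever sees the already-compiled node_modules, which is why node never has to be installed in it.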

2. Build manually from a tarball

You can pass a tarball of node headers directly into node-gyp.

So one approach would be to download the headers for any modern node version, save them into the repo, and use a postinstall step to manually call node-gyp with the --tarball flag (see below). Right now we use node 20.x for no particular reason, so we may as well download the headers for that version.

We'd need to create a postinstall script for this to work and de-list node-gyp from the package dependencies, which actually could be a pain for local development.
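
A rough sketch of what the tarball route might involve. The helper names (headers_url, fetch_headers, build_with_tarball) are hypothetical, and the example version and paths in the usage comment are assumptions. The URL follows the layout nodejs.org uses for its header tarballs. Nothing runs until one of the functions is called.

```shell
# Build the URL of the official headers tarball for a given node version.
headers_url() {
  echo "https://nodejs.org/dist/v$1/node-v$1-headers.tar.gz"
}

# One-off step: download the headers so they can be saved into the repo.
# $1 = node version (e.g. 20.11.0), $2 = destination file.
fetch_headers() {
  curl -fsSL -o "$2" "$(headers_url "$1")"
}

# Postinstall step: rebuild node-calls-python against the saved tarball.
# $1 = path to the tarball. Pass an absolute path, because -C makes
# node-gyp change directory before it resolves anything.
build_with_tarball() {
  bun run node-gyp -C node_modules/node-calls-python rebuild --tarball="$1"
}

# Example usage (not executed here):
#   fetch_headers 20.11.0 "$PWD/vendor/node-headers.tar.gz"
#   build_with_tarball "$PWD/vendor/node-headers.tar.gz"
```

How node-gyp itself gets invoked once it is de-listed from the package dependencies is left open here; the sketch just reuses the bun run form from the note on manual builds.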

3. Pre-build config.gypi and manually build with it

I don't see why this wouldn't work.

We could save a pre-built config.gypi into the repo, copy it into node_modules/node-calls-python/build, and then run the final build step manually.

This should bypass all the node headers stuff and just build the module with the pre-configured environment.

4. Let node-gyp download the headers

node-gyp will download its own headers. I think we probably only need to configure the env to tell node-gyp which node version to use, then it'll go off and download the files it needs. I don't think it actually needs the node runtime.

I suspect we can just hack the env to enable node-gyp to work.

We may want to delete the headers after building because they'll just bloat the image, but this doesn't feel important to me.
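
One way this might look, as an untested sketch: node-gyp accepts --target to name the node version to build for and --devdir to choose where the downloaded headers are stored, so the headers can be thrown away afterwards. Whether this is enough when no node runtime is present is exactly the open question above. The function names, version and directory are hypothetical.

```shell
# Print the command instead of running it when DRY_RUN is set
# (handy for checking the flags without bun or node-gyp installed).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = node version whose headers node-gyp should download (assumed value)
# $2 = directory node-gyp should store the downloaded headers in
rebuild_for_target() {
  version="$1"
  devdir="$2"
  run bun run node-gyp -C node_modules/node-calls-python rebuild \
    --target="$version" --devdir="$devdir"
  # The headers are only needed at build time, so drop them to keep the image small.
  run rm -rf "$devdir"
}

# Example usage (not executed here):
#   rebuild_for_target 20.11.0 /tmp/node-gyp-headers
```

Pointing --devdir at a known directory is a design choice that makes the cleanup step trivial; relying on node-gyp's default cache location would work too but is harder to remove reliably.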

A note on manual builds

We can manually run node-gyp like this:

bun run node-gyp -C node_modules/node-calls-python configure

This generates the config.gypi file which has the build context.

You can then pass --silly for detailed log output, or --nodedir or --tarball or whatever to customise the build. See the list of node-gyp options.

Note that if node-gyp is installed to the package, the build runs automatically on install. If we remove node-gyp, the build does not run, and we can intervene in a postinstall script.

josephjclark commented 6 months ago

Irrelevant now that we're not using node-calls-python