DeployQL / LintDB

Vector Database with support for late interaction and token level embeddings.
https://www.lintdb.com/
Apache License 2.0

Proposal: Create Reproducible builds #30

Open mtbarta opened 2 weeks ago

mtbarta commented 2 weeks ago

We've noticed that builds differ between conda, CMake, and GitHub Actions. This undermines confidence in the process, and it's important for LintDB to understand what it's linking against: for example, whether it uses OpenBLAS or MKL, and which OpenMP runtime is linked. A changing build environment makes it difficult to pinpoint which dependencies we're actually using.
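One way to answer the "what are we linking against" question on Linux is to classify the output of `ldd` for the built shared object. The following is only a sketch: the marker substrings are assumptions about common library names and are not exhaustive, and `classify_linkage` and `inspect_library` are hypothetical helpers, not part of LintDB.

```python
import subprocess

# Substrings used to classify libraries. These are assumptions about
# common library file names, not an exhaustive list.
BLAS_MARKERS = {"openblas": "OpenBLAS", "mkl": "MKL"}
OPENMP_PREFIXES = {
    "libgomp": "GNU libgomp",
    "libiomp5": "Intel libiomp5",
    "libomp": "LLVM libomp",
}


def classify_linkage(ldd_output: str) -> dict:
    """Group the libraries named in `ldd` output by BLAS and OpenMP vendor."""
    found = {"blas": set(), "openmp": set()}
    for line in ldd_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        # The first field is the library name (or a full path for the loader).
        name = parts[0].rsplit("/", 1)[-1].lower()
        for marker, label in BLAS_MARKERS.items():
            if marker in name:
                found["blas"].add(label)
        for prefix, label in OPENMP_PREFIXES.items():
            if name.startswith(prefix):
                found["openmp"].add(label)
    return found


def inspect_library(path: str) -> dict:
    """Run `ldd` on a shared object (Linux only) and classify the result."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=True
    )
    return classify_linkage(result.stdout)
```

Running `inspect_library` on the shared object produced by each build path (conda, plain CMake, GitHub Actions) and comparing the results would show whether they agree. More than one entry under `"openmp"` for a single library would be a warning sign, since it means two OpenMP runtimes are loaded together.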

Background

LintDB primarily relies on conda to publish binaries. This follows Faiss' example: performance depends heavily on linking correctly to C++ binaries, and conda helps clients do this successfully.

Conda creates the C++ shared object and then independently creates the Python SWIG library. This involves two calls to CMake, and each step has its own environment specified in conda.

We've noticed differences between building locally and in CI, and also between building through conda and calling CMake directly. This has led to a lot of time spent aligning the build processes to be as similar as possible and hunting for changes that happened somewhere under the hood.

CMake and vcpkg

Within CMake, we use vcpkg as the package manager for C++ dependencies. vcpkg has required a few custom ports (faiss, MKL, and onnxruntime) in order to be useful. Even so, the tokenizers-cpp dependency bypasses vcpkg and is pulled in as a submodule.

vcpkg allows versioning by specifying the git hash of the vcpkg commit that should be used. Specifying this hash has led to some odd errors from MKL, and it's not clear what is changing when the versions should all remain the same.
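To make the pinning explicit, a small check over the vcpkg manifest can report what is and is not fixed. This sketch assumes manifest mode with a `vcpkg.json` that uses the `builtin-baseline` and `overrides` fields; whether LintDB pins the commit this way or through another mechanism is not stated here, and `check_manifest_pinning` is a hypothetical helper.

```python
import json
import re

# A full git commit hash: 40 lowercase hex characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def check_manifest_pinning(manifest_text: str) -> list:
    """Return human-readable notes about how tightly a vcpkg.json pins versions."""
    manifest = json.loads(manifest_text)
    notes = []

    baseline = manifest.get("builtin-baseline")
    if baseline is None:
        notes.append(
            "no builtin-baseline: versions follow whatever vcpkg checkout is in use"
        )
    elif not SHA_RE.match(baseline):
        notes.append("builtin-baseline is not a full 40-character commit hash")

    # Dependencies may be plain strings or objects with a "name" key.
    overridden = {entry["name"] for entry in manifest.get("overrides", [])}
    for dep in manifest.get("dependencies", []):
        name = dep if isinstance(dep, str) else dep["name"]
        if name not in overridden:
            notes.append(
                f"{name}: version comes from the baseline, not an explicit override"
            )
    return notes
```

This does not explain the MKL errors. It only lists which dependencies are resolved through the baseline rather than an explicit override, which narrows down where a version could have moved.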

Proposal: Replace CMake with Bazel

I'm wary of Bazel due to its complexity, but it could make sense.

My next comment outlines some of the thinking around Bazel.

mtbarta commented 2 weeks ago

Release 0.4.0 builds with clang and LLVM OpenMP. The linking looks right, but it's not ideal that conda and CMake each look for a different MKL installation. Some of our dependency complexity is attributable to the combination of conda and CMake, and that is what makes this difficult.

Conda creates its own environment for builds, so theoretically we can "reproduce" a build with a yaml file specifying the conda environment. However, I also want to be able to build without conda.
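For checking whether two conda environments really match (for example local vs CI), the output of `conda list --export` from each can be compared. A minimal sketch, assuming the usual `name=version=build` lines with `#` comment headers; `parse_conda_export` and `diff_environments` are hypothetical helper names.

```python
def parse_conda_export(text: str) -> dict:
    """Map package name -> 'version=build' from `conda list --export` output."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        name, _, rest = line.partition("=")
        packages[name] = rest
    return packages


def diff_environments(local: str, ci: str) -> dict:
    """Return {package: (local_spec, ci_spec)} for every package that differs.

    A spec of None means the package is missing from that environment.
    """
    a = parse_conda_export(local)
    b = parse_conda_export(ci)
    changes = {}
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            changes[name] = (a.get(name), b.get(name))
    return changes
```

An empty result means both environments resolved to the same packages and build strings. Any entry for MKL or an OpenMP package would be the first place to look when the two builds behave differently.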

Calling CMake within conda seems to be a weak point. Our CMake setup isn't that robust, and the vcpkg configuration within it isn't well specified. For example, setting a baseline version for vcpkg causes MKL errors for reasons I don't understand.

To reduce build complexity and get guaranteed reproducible builds, I unfortunately keep coming back to Bazel.

Historical context

CMake was initially chosen expressly to avoid the complexity of Bazel. I'm a big fan of Bazel, but it adds a lot of overhead. I thought CMake would make it quicker to set up a project and would not require as much time fiddling with dependencies.

Pros of Bazel

Cons of Bazel

Alternatives

  1. We could stay with CMake and replace vcpkg, for example with Conan. I have no experience with Conan or the other alternatives, however. This also doesn't change how CMake resolves dependencies, which is the main issue.

  2. We could remove vcpkg completely and use a Dockerfile to get a clean build environment. Installing new dependencies would be the same as downloading them locally. The downside is that the build process now depends on Docker, and likely becomes slower.

Questions to answer

Are hermetic and reproducible builds important enough to justify investing a lot of time?

Is Bazel our only option?