TileDB-Inc / TileDB-VCF

Efficient variant-call data storage and retrieval library using the TileDB storage library.
https://tiledb-inc.github.io/TileDB-VCF/
MIT License
89 stars 16 forks source link

The new scikit-build-core setup copies external shared objects into Python wheel #717

Open jdblischak opened 7 months ago

jdblischak commented 7 months ago

There are various situations where we want to be able to build tiledbvcf-py against an existing external libtiledbvcf.so:

This is the same situation that we previously addressed for tiledbsoma-py in https://github.com/single-cell-data/TileDB-SOMA/pull/1937 and https://github.com/single-cell-data/TileDB-SOMA/pull/2221. Unfortunately tiledbsoma-py uses setup.py, so I can't directly apply the previous solution to the scikit-build-core setup we are now using for tiledbvcf-py.

I think two things need to happen:

  1. Do not copy shared objects into wheel
  2. Do not edit RUNPATH (so that libtiledbvcf.cpython-3XX-x86_64-linux-gnu.so can still find the external libtiledbvcf.so at runtime)

Here is a reprex to demonstrate the current shared object copying behavior:

docker run --rm -it ubuntu:22.04

# Setup
apt-get update
apt-get install --yes git python-is-python3 python3 python3-pip python3-venv unzip wget
cd
mkdir downloads install-libtiledb install-libtiledbvcf

# Install nightly binaries
the_date=2024-04-30
wget --quiet -P downloads/ \
  https://github.com/jdblischak/centralized-tiledb-nightlies/releases/download/$the_date/libtiledb-$the_date.tar.gz
wget --quiet -P downloads/ \
  https://github.com/jdblischak/centralized-tiledb-nightlies/releases/download/$the_date/libtiledbvcf-$the_date.tar.gz
tar -C install-libtiledb -xzf downloads/libtiledb-$the_date.tar.gz
tar -C install-libtiledbvcf -xzf downloads/libtiledbvcf-$the_date.tar.gz
export LD_LIBRARY_PATH=$(pwd)/install-libtiledb/lib:$(pwd)/install-libtiledbvcf/lib
ldd install-libtiledbvcf/lib/libtiledbvcf.so | grep libtiledb.so
##         libtiledb.so.2.21 => /root/install-libtiledb/lib/libtiledb.so.2.21 (0x00007fdffbde6000)

# Clone TileDB-VCF
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/

# Build and install tiledbvcf-py
export LIBTILEDBVCF_PATH=${HOME}/install-libtiledbvcf/lib/
python -m pip install -v apis/python
##   -- Searching for libtiledbvcf in /root/TileDB-VCF/apis/python/../../dist/lib;/root/install-libtiledbvcf/lib/
##   -- Found libtiledbvcf: /root/install-libtiledbvcf/lib/libtiledbvcf.so
##
##   *** Installing project into wheel...
##   -- Install configuration: "Release"
##   -- Installing: /tmp/tmp7pmlfypj/wheel/platlib/tiledbvcf/libtiledbvcf.cpython-310-x86_64-linux-gnu.so
##   -- Set non-toolchain portion of runtime path of "/tmp/tmp7pmlfypj/wheel/platlib/tiledbvcf/libtiledbvcf.cpython-310-x86_64-linux-gnu.so" to "$ORIGIN"
##   -- Installing: /tmp/tmp7pmlfypj/wheel/platlib/tiledbvcf/libtiledbvcf.so
##   -- Installing: /tmp/tmp7pmlfypj/wheel/platlib/tiledbvcf/libhts.so.1.19.1
##   -- Installing: /tmp/tmp7pmlfypj/wheel/scripts/tiledbvcf
python -c "import tiledbvcf; print(tiledbvcf.version)"
## 0.31.1.dev7+g7cc19c10

# External shared objects are installed into Python package
ls /usr/local/lib/python3.10/dist-packages/tiledbvcf/*.so*
## /usr/local/lib/python3.10/dist-packages/tiledbvcf/libhts.so.1.19.1
## /usr/local/lib/python3.10/dist-packages/tiledbvcf/libtiledbvcf.cpython-310-x86_64-linux-gnu.so
## /usr/local/lib/python3.10/dist-packages/tiledbvcf/libtiledbvcf.so

# And the RUNPATH is edited to no longer point to external shared object
readelf -d /usr/local/lib/python3.10/dist-packages/tiledbvcf/libtiledbvcf.cpython-310-x86_64-linux-gnu.so | grep RUNPATH
##  0x000000000000001d (RUNPATH)            Library runpath: [$ORIGIN]
readelf -d apis/python/build/cp310-cp310-linux_x86_64/libtiledbvcf.cpython-310-x86_64-linux-gnu.so | grep RUNPATH
##  0x000000000000001d (RUNPATH)            Library runpath: [/root/install-libtiledbvcf/lib:]

# Build wheel
python -m pip wheel -v --wheel-dir=apis/python/dist apis/python
##   -- Searching for libtiledbvcf in /root/TileDB-VCF/apis/python/../../dist/lib;/root/install-libtiledbvcf/lib/
##   -- Found libtiledbvcf: /root/install-libtiledbvcf/lib/libtiledbvcf.so
##
##   *** Installing project into wheel...
##   -- Install configuration: "Release"
##   -- Installing: /tmp/tmp34k3_ajz/wheel/platlib/tiledbvcf/libtiledbvcf.cpython-310-x86_64-linux-gnu.so
##   -- Set non-toolchain portion of runtime path of "/tmp/tmp34k3_ajz/wheel/platlib/tiledbvcf/libtiledbvcf.cpython-310-x86_64-linux-gnu.so" to "$ORIGIN"
##   -- Installing: /tmp/tmp34k3_ajz/wheel/platlib/tiledbvcf/libtiledbvcf.so
##   -- Installing: /tmp/tmp34k3_ajz/wheel/platlib/tiledbvcf/libhts.so.1.19.1
##   -- Installing: /tmp/tmp34k3_ajz/wheel/scripts/tiledbvcf

# The external shared objects are copied into the wheel
unzip -l apis/python/dist/tiledbvcf-0.31.1.dev7+g7cc19c10-cp310-cp310-linux_x86_64.whl | grep '.so'
##   1536208  2024-05-01 02:18   tiledbvcf/libhts.so.1.19.1
##    254392  2024-05-01 18:20   tiledbvcf/libtiledbvcf.cpython-310-x86_64-linux-gnu.so
##   3585408  2024-05-01 02:18   tiledbvcf/libtiledbvcf.so

xref: #701, #702

jdblischak commented 6 months ago

TileDB-Vector-Search is also being updated to not always copy the shared objects into the wheel

https://github.com/TileDB-Inc/TileDB-Vector-Search/pull/361 https://github.com/TileDB-Inc/tiledb-vector-search-feedstock/pull/35

jdblischak commented 6 months ago

Now that tiledb-vcf-feedstock was updated to 0.32.0 (https://github.com/TileDB-Inc/tiledb-vcf-feedstock/pull/120), which was the first to use the new scikit-build-core setup, the tiledbcf-py conda binaries ballooned in size since they now vendor libtiledb, libtiledbvcf, and htslib.

As a concrete example, linux-64/tiledbvcf-py-0.31.1-py39h1dd0e15_0.conda is 2.0 MB and linux-64/tiledbvcf-py-0.32.0-py39h59b0bc9_0.conda is 9.4 MB.

image

jdblischak commented 6 months ago

Not only does this duplication increase the size of our cloud Docker images, but it will complicate future libtiledb updates. If we release libtiledb 2.23.1, all the other conda binaries will automatically use the new libtiledb 2.23.1, but presumably tiledbvcf-py will continue to use its vendored libtiledb 2.23.0.

Shelnutt2 commented 6 months ago

but presumably tiledbvcf-py will continue to use its vendored libtiledb 2.23.0.

That is problematic. We need to do our best to ensure a single libtiledb is used and loaded. This simplifies for passing different structures back and forth in python (i.e creating a tiledb.Config and using it as a parameter to tiledbvcf.ReadConfig(tiledb_config)

jdblischak commented 3 months ago

I suspect this is the cause of the user reported error in https://forum.tiledb.com/t/tiledbvcf-installation-error-on-macos/710

When tiledbvcf-py is built, it builds against whatever libgoogle-cloud version that upstream tiledb is currently pinned to. It then vendors this libtiledb.dylib inside the Python package. Then when the upstream tiledb conda binary migrates to a new version of libgoogle-cloud, it gets installed into the conda env, but the bundled libtiledb.dylib still expects the libgoogle-cloud symbols from whatever previous version it happened to be built against.

Hence the short-term solution is to downgrade libgoogle-cloud until you find the compatible one that your verison of tiledbvcf-py was built against (Azure doesn't keep old build logs, so trial and error is the only option).

Long-term we need to stop copying the shared objects into the Python package, like we've already done for TileDB-Vector-Search (https://github.com/TileDB-Inc/TileDB-Vector-Search/pull/361) and TileDB-Py (https://github.com/TileDB-Inc/TileDB-Py/pull/1988). Maybe @dudoslav can help with this


x-ref: https://app.shortcut.com/tiledb-inc/story/54633