blue-yonder / turbodbc

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.
http://turbodbc.readthedocs.io/en/latest
MIT License

Compiling from source fails to find pyarrow #314

Open pecigonzalo opened 3 years ago

pecigonzalo commented 3 years ago

As already reflected in https://github.com/blue-yonder/turbodbc/issues/276 compiling from source fails to find pyarrow outside of conda.

This happens with pyarrow installed from wheels.

To reproduce:

FROM python:3.8-slim-buster
RUN apt-get update \
    && apt-get install --no-install-recommends -y \
    g++ \
    ninja-build cmake git-core wget \
    libboost-all-dev \
    unixodbc unixodbc-dev \
    python-dev \
    && apt-get clean

RUN pip install --user pybind11==2.6.2 pyarrow==3.0.0

# Attempt to make the container find the pyarrow lib.
ENV LD_LIBRARY_PATH=/root/.local/lib/python3.8/site-packages/pyarrow:$LD_LIBRARY_PATH
RUN pip install --user turbodbc==4.2.0

I don't understand why https://github.com/blue-yonder/turbodbc/issues/276 was closed, as many users are reporting the exact same issue. The issue is likely due to the pyarrow .so files being suffixed with the version (.300 for version 3.0.0, and so on), so the linker cannot find the unversioned names it looks for.
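One way to check this hypothesis is to list which versioned libraries in the pyarrow directory have no unversioned counterpart. The following is a minimal sketch, not part of turbodbc or pyarrow: `missing_unversioned_libs` is a hypothetical helper name, and the only pyarrow call it relies on is `pyarrow.get_library_dirs()`.

```python
import re
from pathlib import Path

# Matches versioned shared objects such as "libarrow.so.300".
VERSIONED_SO = re.compile(r"^(lib.+\.so)\.\d+$")


def missing_unversioned_libs(lib_dir):
    """Return unversioned names (e.g. "libarrow.so") that have a versioned
    file in lib_dir but do not exist there themselves."""
    lib_dir = Path(lib_dir)
    if not lib_dir.is_dir():
        return []
    missing = []
    for entry in sorted(lib_dir.iterdir()):
        match = VERSIONED_SO.match(entry.name)
        if match and not (lib_dir / match.group(1)).exists():
            missing.append(match.group(1))
    return missing


if __name__ == "__main__":
    try:
        import pyarrow
    except ImportError:
        print("pyarrow is not installed")
    else:
        # Report the gaps for every directory pyarrow advertises to the linker.
        for directory in pyarrow.get_library_dirs():
            print(directory, missing_unversioned_libs(directory))
```

If `libarrow.so` and `libarrow_python.so` show up in the output, that would be consistent with the `cannot find -larrow` and `cannot find -larrow_python` linker errors in the log below.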

In the following comment (which links to the actual comments), compiling from source is mentioned because symlinking the names will not work, but it's not clear what needs to be compiled.

Sample error output:

[...]
#7 416.6     src/turbodbc_arrow/set_arrow_parameters.cpp: In member function ‘void turbodbc_arrow::{anonymous}::string_converter::rebind_to_maximum_length(const arrow::BinaryArray&, std::size_t, std::size_t)’:
#7 416.6     src/turbodbc_arrow/set_arrow_parameters.cpp:101:33: warning: comparison of integer expressions of different signedness: ‘int64_t’ {aka ‘long int’} and ‘std::size_t’ {aka ‘long unsigned int’} [-Wsign-compare]
#7 416.6                for (int64_t i = 0; i != elements; ++i) {
#7 416.6                                    ~~^~~~~~~~~~~
#7 416.6     src/turbodbc_arrow/set_arrow_parameters.cpp: In member function ‘void turbodbc_arrow::{anonymous}::string_converter::set_batch_utf16(std::size_t, std::size_t)’:
#7 416.6     src/turbodbc_arrow/set_arrow_parameters.cpp:140:31: warning: comparison of integer expressions of different signedness: ‘int64_t’ {aka ‘long int’} and ‘std::size_t’ {aka ‘long unsigned int’} [-Wsign-compare]
#7 416.6              for (int64_t i = 0; i != elements; ++i) {
#7 416.6                                  ~~^~~~~~~~~~~
#7 416.6     src/turbodbc_arrow/set_arrow_parameters.cpp: In function ‘std::shared_ptr<arrow::Table> turbodbc_arrow::unwrap_pyarrow_table(const pybind11::object&)’:
#7 416.6     src/turbodbc_arrow/set_arrow_parameters.cpp:427:64: warning: ‘arrow::Status arrow::py::unwrap_table(PyObject*, std::shared_ptr<arrow::Table>*)’ is deprecated: Use Result-returning version [-Wdeprecated-declarations]
#7 416.6          if (not arrow::py::unwrap_table(pyarrow_table.ptr(), &table).ok()) {
#7 416.6                                                                     ^
#7 416.6     In file included from src/turbodbc_arrow/set_arrow_parameters.cpp:3:
#7 416.6     /root/.local/lib/python3.8/site-packages/pyarrow/include/arrow/python/pyarrow.h:54:30: note: declared here
#7 416.6        ARROW_PYTHON_EXPORT Status unwrap_##FUNC_SUFFIX(PyObject*,                           \
#7 416.6                                   ^~~~~~~
#7 416.6     /root/.local/lib/python3.8/site-packages/pyarrow/include/arrow/python/pyarrow.h:54:30: note: in definition of macro ‘DECLARE_WRAP_FUNCTIONS’
#7 416.6        ARROW_PYTHON_EXPORT Status unwrap_##FUNC_SUFFIX(PyObject*,                           \
#7 416.6                                   ^~~~~~~
#7 416.6     src/turbodbc_arrow/set_arrow_parameters.cpp:427:64: warning: ‘arrow::Status arrow::py::unwrap_table(PyObject*, std::shared_ptr<arrow::Table>*)’ is deprecated: Use Result-returning version [-Wdeprecated-declarations]
#7 416.6          if (not arrow::py::unwrap_table(pyarrow_table.ptr(), &table).ok()) {
#7 416.6                                                                     ^
#7 416.6     In file included from src/turbodbc_arrow/set_arrow_parameters.cpp:3:
#7 416.6     /root/.local/lib/python3.8/site-packages/pyarrow/include/arrow/python/pyarrow.h:54:30: note: declared here
#7 416.6        ARROW_PYTHON_EXPORT Status unwrap_##FUNC_SUFFIX(PyObject*,                           \
#7 416.6                                   ^~~~~~~
#7 416.6     /root/.local/lib/python3.8/site-packages/pyarrow/include/arrow/python/pyarrow.h:54:30: note: in definition of macro ‘DECLARE_WRAP_FUNCTIONS’
#7 416.6        ARROW_PYTHON_EXPORT Status unwrap_##FUNC_SUFFIX(PyObject*,                           \
#7 416.6                                   ^~~~~~~
#7 416.6     src/turbodbc_arrow/set_arrow_parameters.cpp: In instantiation of ‘void turbodbc_arrow::{anonymous}::string_converter::set_batch_of_type(std::size_t, std::size_t) [with String = std::__cxx11::basic_string<char>; std::size_t = long unsigned int]’:
#7 416.6     src/turbodbc_arrow/set_arrow_parameters.cpp:173:57:   required from here
#7 416.6     src/turbodbc_arrow/set_arrow_parameters.cpp:121:33: warning: comparison of integer expressions of different signedness: ‘int64_t’ {aka ‘long int’} and ‘std::size_t’ {aka ‘long unsigned int’} [-Wsign-compare]
#7 416.6                for (int64_t i = 0; i != elements; ++i) {
#7 416.6                                    ~~^~~~~~~~~~~
#7 416.6     In file included from /root/.local/lib/python3.8/site-packages/pyarrow/include/arrow/python/platform.h:28,
#7 416.6                      from /root/.local/lib/python3.8/site-packages/pyarrow/include/arrow/python/pyarrow.h:20,
#7 416.6                      from src/turbodbc_arrow/set_arrow_parameters.cpp:3:
#7 416.6     /usr/local/include/python3.8/datetime.h: At global scope:
#7 416.6     /usr/local/include/python3.8/datetime.h:189:25: warning: ‘PyDateTimeAPI’ defined but not used [-Wunused-variable]
#7 416.6      static PyDateTime_CAPI *PyDateTimeAPI = NULL;
#7 416.6                              ^~~~~~~~~~~~~
#7 416.6     gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Iinclude/ -I/root/.local/lib/python3.8/site-packages/pybind11/include -I/root/.local/lib/python3.8/site-packages/pyarrow/include -I/usr/local/include/python3.8 -c src/turbodbc_arrow/arrow_result_set.cpp -o build/temp.linux-x86_64-3.8/src/turbodbc_arrow/arrow_result_set.o --std=c++11 -fvisibility=hidden
#7 416.6     In file included from /root/.local/lib/python3.8/site-packages/pyarrow/include/arrow/python/platform.h:28,
#7 416.6                      from /root/.local/lib/python3.8/site-packages/pyarrow/include/arrow/python/pyarrow.h:20,
#7 416.6                      from src/turbodbc_arrow/arrow_result_set.cpp:7:
#7 416.6     /usr/local/include/python3.8/datetime.h:189:25: warning: ‘PyDateTimeAPI’ defined but not used [-Wunused-variable]
#7 416.6      static PyDateTime_CAPI *PyDateTimeAPI = NULL;
#7 416.6                              ^~~~~~~~~~~~~
#7 416.6     g++ -pthread -shared -Wl,--strip-all build/temp.linux-x86_64-3.8/src/turbodbc_arrow/python_bindings.o build/temp.linux-x86_64-3.8/src/turbodbc_arrow/set_arrow_parameters.o build/temp.linux-x86_64-3.8/src/turbodbc_arrow/arrow_result_set.o -Lbuild/lib.linux-x86_64-3.8 -L/root/.local/lib/python3.8/site-packages/pyarrow -L/usr/local/lib -lodbc -larrow -larrow_python -lturbodbc.cpython-38-x86_64-linux-gnu -o build/lib.linux-x86_64-3.8/turbodbc_arrow_support.cpython-38-x86_64-linux-gnu.so -Wl,-rpath,$ORIGIN -Wl,-rpath,$ORIGIN/pyarrow
#7 416.6     /usr/bin/ld: cannot find -larrow
#7 416.6     /usr/bin/ld: cannot find -larrow_python
#7 416.6     collect2: error: ld returned 1 exit status
#7 416.6     error: command 'g++' failed with exit status 1
#7 416.6     ----------------------------------------
#7 416.6 ERROR: Command errored out with exit status 1: /usr/local/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-ji0uyzrh/turbodbc_7775e8dbafcc47e48727f140b05fac07/setup.py'"'"'; __file__='"'"'/tmp/pip-install-ji0uyzrh/turbodbc_7775e8dbafcc47e48727f140b05fac07/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-dinks7e7/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /root/.local/include/python3.8/turbodbc Check the logs for full command output.
#7 ERROR: executor failed running [/bin/sh -c pip install --user turbodbc==4.2.0]: exit code: 1

pecigonzalo commented 3 years ago

This build works:

FROM python:3.8-slim-buster as deps
RUN apt-get update \
    && apt-get install --no-install-recommends -y \
    g++ \
    ninja-build cmake git-core wget \
    libboost-all-dev \
    unixodbc unixodbc-dev \
    python-dev \
    && apt-get clean

RUN pip install --user pybind11==2.6.2 pyarrow==3.0.0

FROM deps as turbodbc

RUN ln -rvs \
  /root/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.300 \
  /root/.local/lib/python3.8/site-packages/pyarrow/libarrow.so
RUN ln -rvs \
  /root/.local/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.300 \
  /root/.local/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so
RUN ln -rvs \
  /root/.local/lib/python3.8/site-packages/pyarrow/libarrow_flight.so.300 \
  /root/.local/lib/python3.8/site-packages/pyarrow/libarrow_flight.so
RUN ln -rvs \
  /root/.local/lib/python3.8/site-packages/pyarrow/libarrow_python.so.300 \
  /root/.local/lib/python3.8/site-packages/pyarrow/libarrow_python.so
RUN ln -rvs \
  /root/.local/lib/python3.8/site-packages/pyarrow/libarrow_python_flight.so.300 \
  /root/.local/lib/python3.8/site-packages/pyarrow/libarrow_python_flight.so
RUN ln -rvs \
  /root/.local/lib/python3.8/site-packages/pyarrow/libparquet.so.300 \
  /root/.local/lib/python3.8/site-packages/pyarrow/libparquet.so
RUN ln -rvs \
  /root/.local/lib/python3.8/site-packages/pyarrow/libplasma.so.300 \
  /root/.local/lib/python3.8/site-packages/pyarrow/libplasma.so
RUN pip install --user turbodbc==4.2.0

But I don't know if the software will actually work, because a commenter in the linked issue wrote the following:

You will end up with random segmentation faults otherwise.

in reference to symlinking.

This also means we can't list turbodbc==4.2.0 in a requirements.txt together with pyarrow, because a manual step is needed between the two installs.
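The seven `ln -rvs` steps in the Dockerfile above could be collapsed into one loop. This is only a sketch of the same workaround: `link_unversioned_libs` is a hypothetical helper name, and it does nothing to address the segmentation-fault concern quoted above. pyarrow also ships its own helper for this, `pyarrow.create_library_symlinks()`.

```python
import os
import re
from pathlib import Path

# Matches versioned shared objects such as "libarrow.so.300".
VERSIONED_SO = re.compile(r"^(lib.+\.so)\.\d+$")


def link_unversioned_libs(lib_dir):
    """Create "libX.so" -> "libX.so.NNN" symlinks inside lib_dir.

    Returns the names of the links that were created. Names that already
    exist (as files or symlinks) are left untouched.
    """
    lib_dir = Path(lib_dir)
    created = []
    for entry in sorted(lib_dir.iterdir()):
        match = VERSIONED_SO.match(entry.name)
        if not match:
            continue
        link = lib_dir / match.group(1)
        if link.exists() or link.is_symlink():
            continue
        # Relative target, like `ln -rs` for two files in the same directory.
        os.symlink(entry.name, link)
        created.append(link.name)
    return created
```

Calling `link_unversioned_libs("/root/.local/lib/python3.8/site-packages/pyarrow")` between the two `pip install` steps would stand in for the individual `ln` commands, so the manual step between installing pyarrow and installing turbodbc remains.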

pecigonzalo commented 3 years ago

The fix that was mentioned in the previous issue is likely the one described in this doc, https://arrow.apache.org/docs/python/extending.html#building-extensions-against-pypi-wheels, and referenced in this comment: https://github.com/blue-yonder/turbodbc/issues/276#issuecomment-839689005.

I think it's a bad call from pyarrow to ask consumers to modify the installation.
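For context, the linked Arrow page describes asking pyarrow itself for the include and library locations at build time. A rough sketch of how a build script might gather them, assuming the `pyarrow.get_include()`, `pyarrow.get_libraries()` and `pyarrow.get_library_dirs()` helpers; `arrow_build_settings` is a hypothetical name and this is not turbodbc's actual setup.py:

```python
def arrow_build_settings(pa):
    """Collect compiler and linker settings from a pyarrow-like module.

    The result uses the keyword names that setuptools.Extension accepts.
    Linking against wheels still needs the unversioned library names, so
    pyarrow.create_library_symlinks() has to have been run once beforehand.
    """
    return {
        "include_dirs": [pa.get_include()],
        "libraries": list(pa.get_libraries()),
        "library_dirs": list(pa.get_library_dirs()),
    }


if __name__ == "__main__":
    try:
        import pyarrow
    except ImportError:
        print("pyarrow is not installed")
    else:
        print(arrow_build_settings(pyarrow))
```

The helper takes the module as an argument so it can be exercised without pyarrow installed.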

ldacey commented 3 years ago

The documentation you linked was helpful for me. I am now able to get turbodbc up and running without conda for the first time. I am installing pyarrow in a separate RUN command with some other dependencies, then I have a line which runs the create_library_symlinks() command. Finally, the rest of my requirements (including turbodbc and airflow-providers-odbc) are installed.

RUN pip install --user --upgrade pip \
    && pip install --no-cache --user \
    python-snappy \
    pybind11 \
    numpy \
    pyarrow==5.0.0 \
    apache-airflow[password,crypto]==${AIRFLOW_VERSION}

RUN python -c "import pyarrow; pyarrow.create_library_symlinks()"

RUN pip install --no-cache --user -r requirements.txt

ldacey commented 3 years ago

Well, the build worked, but then turbodbc was not able to find pyarrow during actual tasks. Both libraries are installed in the same environment. I will try @pecigonzalo's approach with symlinks.
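A failure like this can be checked right after the image is built rather than during a task. A minimal sketch, assuming the Arrow extension module is named `turbodbc_arrow_support` as in the linker command of the build log above; `try_import` is a hypothetical helper:

```python
import importlib


def try_import(module_name):
    """Return (True, "") if module_name imports, else (False, error text)."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # A missing shared library also surfaces as ImportError.
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


if __name__ == "__main__":
    # Check pyarrow on its own first so a failure there is reported separately.
    for name in ("pyarrow", "turbodbc", "turbodbc_arrow_support"):
        ok, error = try_import(name)
        print(f"{name}: {'ok' if ok else error}")
```

Running this as a final build step would show whether the problem is a missing module or a shared library that cannot be loaded at runtime.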

I know this works with conda, but I want to move towards using the official apache/airflow image which does not use conda. The only failure is turbodbc right now.

DevangB9 commented 2 years ago

I am facing the same issue. @ldacey, did you find any solution?

@xhochy I went through https://github.com/blue-yonder/turbodbc/issues/276 and https://github.com/blue-yonder/turbodbc/issues/227.

I'm using Ubuntu 20.04 on a Windows system. Any help would be great. Thanks a lot.

ldacey commented 2 years ago

Negative. I ended up installing with mamba instead and used a package called conda-pack to avoid having conda installed in my final image.

COPY ${ENV_FILE} /conda-env.yml

# Create the conda environment from conda-env.yml, pack it with conda-pack, and unpack it into ${VIRTUAL_ENV} so it can be copied from the /venv folder
RUN mamba env create -f /conda-env.yml \
    && /opt/conda/envs/airflow/bin/conda-pack --name airflow --ignore-missing-files --output /tmp/env.tar.gz \
    && mkdir -p ${VIRTUAL_ENV} \
    && cd ${VIRTUAL_ENV} \
    && tar -xvf /tmp/env.tar.gz \
    && rm /tmp/env.tar.gz \
    && ${VIRTUAL_ENV}/bin/conda-unpack \
    && conda clean -afy

WORKDIR ${VIRTUAL_ENV}

My final image copies my venv folder, which results in a working pyarrow without Anaconda installed.

COPY --chown=airflow:root --from=python-dependencies /venv /venv

I am still hoping for the day when I can pip install everything since a chunk of my most important libraries are not on conda at all.

david-engelmann commented 2 years ago

I am facing the same issue. @ldacey, did you find any solution?

@xhochy I went through #276 and #227.

I'm using Ubuntu 20.04 on a Windows system. Any help would be great. Thanks a lot.

@DevangB9 I recently was able to solve this issue and posted it in this comment.