apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Build in Amazon Linux 2023 fails #38810

Closed bascheibler closed 7 months ago

bascheibler commented 11 months ago

Describe the bug, including details regarding any error messages, version, and platform.

I'm trying to build a slim version of PyArrow so that it fits in an AWS Lambda function. The base Docker image is public.ecr.aws/lambda/python:3.12, which runs Amazon Linux 2023 (based on Fedora).

The build from the Dockerfile below fails while creating the wheel file. The error message is:

/var/task/arrow/python/setup.py:34: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
/var/lang/lib/python3.12/site-packages/setuptools/__init__.py:80: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
!!

        ********************************************************************************
        Requirements should be satisfied by a PEP 517 installer.
        If you are using pip, you can try `pip install --use-pep517`.
        ********************************************************************************

!!
  dist.fetch_build_eggs(dist.setup_requires)
/var/lang/lib/python3.12/site-packages/setuptools_scm/git.py:135: UserWarning: "/var/task/arrow" is shallow and may cause errors
  warnings.warn(f'"{wd.path}" is shallow and may cause errors')
running build_ext
creating /var/task/arrow/python/build
creating /var/task/arrow/python/build/temp.linux-x86_64-cpython-312
-- Running cmake for PyArrow
cmake -DCMAKE_INSTALL_PREFIX=/var/task/arrow/python/build/lib.linux-x86_64-cpython-312/pyarrow -DPYTHON_EXECUTABLE=/var/lang/bin/python3 -DPython3_EXECUTABLE=/var/lang/bin/python3 -DPYARROW_CXXFLAGS= -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_SUBSTRAIT=off -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off -DPYARROW_BUILD_ACERO=on -DPYARROW_BUILD_DATASET=on -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=on -DPYARROW_BUILD_PARQUET_ENCRYPTION=off -DPYARROW_BUILD_GCS=off -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off -DPYARROW_BUNDLE_ARROW_CPP=on -DPYARROW_BUNDLE_CYTHON_CPP=off -DPYARROW_GENERATE_COVERAGE=off -DCMAKE_BUILD_TYPE=release /var/task/arrow/python
-- The C compiler identification is GNU 11.4.1
-- The CXX compiler identification is GNU 11.4.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- System processor: x86_64
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Success
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Success
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Success
-- Arrow build warning level: PRODUCTION
-- Using ld linker
-- Build Type: RELEASE
-- CMAKE_C_FLAGS:  -Wall -fno-semantic-interposition -msse4.2  -fdiagnostics-color=always  -fno-omit-frame-pointer -Wno-unused-variable -Wno-maybe-uninitialized
-- CMAKE_CXX_FLAGS:  -Wno-noexcept-type  -Wall -fno-semantic-interposition -msse4.2  -fdiagnostics-color=always  -fno-omit-frame-pointer -Wno-unused-variable -Wno-maybe-uninitialized
-- Generator: Unix Makefiles
-- Build output directory: /var/task/arrow/python/build/temp.linux-x86_64-cpython-312/release
-- Found Python3: /var/lang/bin/python3 (found version "3.12.0") found components: Interpreter Development.Module NumPy 
-- Found Python3Alt: /var/lang/bin/python3  
CMake Error at CMakeLists.txt:268 (find_package):
  By not providing "FindArrow.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "Arrow", but
  CMake did not find one.

  Could not find a package configuration file provided by "Arrow" with any of
  the following names:

    ArrowConfig.cmake
    arrow-config.cmake

  Add the installation prefix of "Arrow" to CMAKE_PREFIX_PATH or set
  "Arrow_DIR" to a directory containing one of the above files.  If "Arrow"
  provides a separate development package or SDK, be sure it has been
  installed.

-- Configuring incomplete, errors occurred!
See also "/var/task/arrow/python/build/temp.linux-x86_64-cpython-312/CMakeFiles/CMakeOutput.log".
error: command '/usr/bin/cmake' failed with exit code 1
The command '/bin/sh -c pip3 install -r arrow/python/requirements-wheel-build.txt &&     pushd arrow/python &&     python3 setup.py build_ext --build-type=release --bundle-arrow-cpp         bdist_wheel --dist-dir /app/output &&     popd' returned a non-zero code: 1
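
One way to narrow down a failure like this is to check, right after the Arrow C++ install step, that ArrowConfig.cmake actually landed under the install prefix. A minimal sketch, assuming the ARROW_HOME variable and the lib install directory from the Dockerfile in this report; the guard step itself and its error text are hypothetical, not part of the original build:

```dockerfile
# Hypothetical guard step, placed after the Arrow C++ "make install" step.
# It fails the image build early if the CMake package config file is not
# under the prefix that CMAKE_PREFIX_PATH points find_package(Arrow) at.
RUN test -f "${ARROW_HOME}/lib/cmake/Arrow/ArrowConfig.cmake" || \
    { echo "ArrowConfig.cmake not found under ARROW_HOME=${ARROW_HOME}" >&2; exit 1; }
```

Printing the value of ARROW_HOME in the error message also shows whether the variable holds the path that was intended.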

Dockerfile:

FROM public.ecr.aws/lambda/python:3.12 AS build

RUN dnf upgrade && \
    dnf install -y \
      gcc-c++ \
      git ca-certificates \
      python-setuptools \
      cmake \
      pkg-config \
      python3-devel \
      python3-pip

RUN git clone --depth 1 -b apache-arrow-14.0.1 https://github.com/apache/arrow.git

# This is the folder where we will install the Arrow libraries during development
RUN mkdir dist
ENV ARROW_HOME=$(pwd)/dist
ENV LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
ENV CMAKE_PREFIX_PATH=$ARROW_HOME:$CMAKE_PREFIX_PATH

RUN mkdir arrow/cpp/build && \
    pushd arrow/cpp/build && \
    cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
        -DCMAKE_INSTALL_LIBDIR=lib \
        -DCMAKE_BUILD_TYPE=Release \
        -DARROW_BUILD_TESTS=OFF \
        -DARROW_COMPUTE=OFF \
        -DARROW_CSV=OFF \
        -DARROW_DATASET=ON \
        -DARROW_FILESYSTEM=ON \
        -DARROW_HDFS=OFF \
        -DARROW_JSON=OFF \
        -DARROW_PARQUET=ON \
        -DARROW_WITH_BROTLI=OFF \
        -DARROW_WITH_BZ2=OFF \
        -DARROW_WITH_LZ4=OFF \
        -DARROW_WITH_SNAPPY=ON \
        -DARROW_WITH_ZLIB=OFF \
        -DARROW_WITH_ZSTD=OFF \
        -DPARQUET_REQUIRE_ENCRYPTION=OFF \
        .. && \
    make -j4 && \
    make install && \
    popd

ENV PYARROW_WITH_PARQUET=1
ENV PYARROW_WITH_DATASET=1
ENV PYARROW_PARALLEL=4
ENV PYARROW_INSTALL_TESTS=0

# This is where it fails:
RUN pip3 install -r arrow/python/requirements-wheel-build.txt && \
    pushd arrow/python && \
    python3 setup.py build_ext --build-type=release --bundle-arrow-cpp \
        bdist_wheel --dist-dir /app/output && \
    popd

FROM public.ecr.aws/lambda/python:3.12

COPY --from=build /app/output /app/output
COPY . ${LAMBDA_TASK_ROOT}

RUN dnf install -y gcc-c++ && \
    pip install pyarrow --no-index --find-links file:////app/output && \
    pip install --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

CMD ["main.handler"]

Is there another way to deploy a Lambda function containing snowflake-connector-python==3.5.0, pandas and pyarrow without exceeding the size limit?

PS: I've tried building from PR #34234, as suggested in issue #34240, but got the same result.

Component(s)

Python

kou commented 11 months ago

Could you also show the build log of the following part?

RUN mkdir arrow/cpp/build && \
    pushd arrow/cpp/build && \
    cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
        -DCMAKE_INSTALL_LIBDIR=lib \
        -DCMAKE_BUILD_TYPE=Release \
        -DARROW_BUILD_TESTS=OFF \
        -DARROW_COMPUTE=OFF \
        -DARROW_CSV=OFF \
        -DARROW_DATASET=ON \
        -DARROW_FILESYSTEM=ON \
        -DARROW_HDFS=OFF \
        -DARROW_JSON=OFF \
        -DARROW_PARQUET=ON \
        -DARROW_WITH_BROTLI=OFF \
        -DARROW_WITH_BZ2=OFF \
        -DARROW_WITH_LZ4=OFF \
        -DARROW_WITH_SNAPPY=ON \
        -DARROW_WITH_ZLIB=OFF \
        -DARROW_WITH_ZSTD=OFF \
        -DPARQUET_REQUIRE_ENCRYPTION=OFF \
        .. && \
    make -j4 && \
    make install && \
    popd

bascheibler commented 11 months ago

Sure, here it is: https://pastebin.com/dvTAYhy9. The part you asked for generated over 1,000 lines, so I've shared the entire log through an external link.

Please let me know if there's any additional info that I could provide to support debugging this issue.

kou commented 11 months ago

Thanks.

-- Installing: /var/task/arrow/cpp/build/$(pwd)/dist/lib/cmake/Arrow/ArrowConfig.cmake

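is the problem. $(pwd) isn't expanded: a Dockerfile ENV instruction does not run shell command substitution, so the value is kept as the literal text. Could you use a static path instead of $(pwd)?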
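
A minimal sketch of that change, assuming the image's working directory is /var/task (as the paths in the build log suggest). It has not been verified in this thread, and the rest of the Dockerfile stays as reported:

```dockerfile
# Assumption: the working directory of the base image is /var/task.
# ENV values are stored literally, so the path is spelled out instead of
# relying on $(pwd). References such as ${ARROW_HOME} are fine, because ENV
# does substitute variables defined by earlier instructions.
ENV ARROW_HOME=/var/task/dist
ENV LD_LIBRARY_PATH=${ARROW_HOME}/lib:${LD_LIBRARY_PATH}
ENV CMAKE_PREFIX_PATH=${ARROW_HOME}:${CMAKE_PREFIX_PATH}

# Create the install prefix at the same absolute path.
RUN mkdir -p "${ARROW_HOME}"
```

With an absolute prefix, the C++ install step should place ArrowConfig.cmake under /var/task/dist/lib/cmake/Arrow, which is where the later PyArrow CMake run searches via CMAKE_PREFIX_PATH.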

kou commented 10 months ago

No update. Can we close this as stalled?

bascheibler commented 7 months ago

Sorry for the late response. Yes, please - feel free to close this issue. Thank you for pointing out the $(pwd) typo.