apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Failed to compile ARM64 PyArrow for Windows ARM #44310

Open zhanweiw opened 1 month ago

zhanweiw commented 1 month ago

Could I get some help with compiling PyArrow for ARM64? I'm trying to build it with the steps below on a Windows on ARM device. When I run the last command, 'python setup.py build_ext --inplace', in step 5 to compile PyArrow (the Python extension), I get many error messages like the one below:

  python_to_arrow.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: __cdecl arrow::LargeBinaryBuilder::LargeBinaryBuilder(class std::shared_ptr<class arrow::DataType> const &,class arrow::MemoryPool *)" (__imp_??0LargeBinaryBuilder@arrow@@QEAA@AEBV?$shared_ptr@VDataType@arrow@@@std@@PEAVMemoryPool@1@@Z) referenced in function "protected: virtual class arrow::Status __cdecl arrow::internal::PrimitiveConverter<class arrow::LargeBinaryType,class arrow::py::`anonymous namespace'::PyConverter>::Init(class arrow::MemoryPool *)" (?Init@?$PrimitiveConverter@VLargeBinaryType@arrow@@VPyConverter@?A0xc3172271@py@2@@internal@arrow@@MEAA?AVStatus@3@PEAVMemoryPool@3@@Z) [C:\source\arrow\python\build\temp.win-arm64-cpython-312\arrow_python.vcxproj]

Comparing the 'DUMPBIN' output of 'arrow.lib' for x64 and ARM64 shows that the ARM64 library is missing the constructor that takes the parameters 'class std::shared_ptr<class arrow::DataType> const &,class arrow::MemoryPool *'.

LargeBinaryBuilder derives from a template class and inherits its constructors (see below). Do you have any idea why the function is missing when compiling for ARM64, and how to fix it? Thanks in advance!

/// \class LargeBinaryBuilder
/// \brief Builder class for large variable-length binary data
class ARROW_EXPORT LargeBinaryBuilder : public BaseBinaryBuilder<LargeBinaryType> {
public:
  using BaseBinaryBuilder::BaseBinaryBuilder;

  /// \cond FALSE
  using ArrayBuilder::Finish;
  /// \endcond

  Status Finish(std::shared_ptr<LargeBinaryArray>* out) { return FinishTyped(out); }

  std::shared_ptr<DataType> type() const override { return large_binary(); }
};

For the reason described at the link below, I compiled the C++ library with clang-cl for ARM64, but compiled the x64 version with MSVC cl. https://arrow.apache.org/docs/developers/cpp/windows.html#building-on-windows-arm64-using-ninja-and-clang

The detailed steps:

1. Install ARM64 Python 3.12.6 and the necessary Python packages.

2. Open the "ARM64 Native Tools Command Prompt for VS 2022" and run the commands below:

cd C:\source
git clone https://github.com/apache/arrow.git

3. Apply the patches below:
Patch 1:
diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt
index c911f0f4e..ddd4dc0bb 100644
--- a/cpp/src/arrow/CMakeLists.txt
+++ b/cpp/src/arrow/CMakeLists.txt
@@ -955,7 +955,7 @@ if(CXX_LINKER_SUPPORTS_VERSION_SCRIPT)
 endif()

 if(ARROW_BUILD_STATIC AND ARROW_BUNDLED_STATIC_LIBS)
-  set(ARROW_BUILD_BUNDLED_DEPENDENCIES TRUE)
+  set(ARROW_BUILD_BUNDLED_DEPENDENCIES FALSE)
 else()
   set(ARROW_BUILD_BUNDLED_DEPENDENCIES FALSE)
 endif()

Patch 2:
diff --git a/python/setup.py b/python/setup.py
index 60b9a696d..6cde02919 100755
--- a/python/setup.py
+++ b/python/setup.py
@@ -165,7 +165,7 @@ class build_ext(_build_ext):
         _build_ext.initialize_options(self)
         self.cmake_generator = os.environ.get('PYARROW_CMAKE_GENERATOR')
         if not self.cmake_generator and sys.platform == 'win32':
-            self.cmake_generator = 'Visual Studio 15 2017 Win64'
+            self.cmake_generator = 'Visual Studio 17 2022'
         self.extra_cmake_args = os.environ.get('PYARROW_CMAKE_OPTIONS', '')
         self.build_type = os.environ.get('PYARROW_BUILD_TYPE',
                                          'release').lower()

4. Run the commands below to compile the Arrow C++ code:

set ARROW_HOME=C:\source\arrow\Install
set CMAKE_PREFIX_PATH=C:\source\arrow\Install

mkdir arrow\cpp\build
pushd arrow\cpp\build

set CC=clang-cl
set CXX=clang-cl

cmake -G "Ninja" -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% -DCMAKE_UNITY_BUILD=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_HDFS=ON -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON .. 
cmake --build . --target install --config Release

popd

5. Run the commands below to compile PyArrow (the Python extension):

pushd arrow\python
set PYARROW_BUNDLE_ARROW_CPP=1
python setup.py build_ext --inplace 

Component(s)

C++, Python

kou commented 1 month ago

Could you show the full build log for PyArrow?

FYI: You don't need the second patch if you set PYARROW_CMAKE_GENERATOR=Visual Studio 17 2022.
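
The lookup that makes the second patch unnecessary can be sketched in Python as follows. This is a simplified, standalone paraphrase of the `initialize_options` logic visible in the `setup.py` diff above, not the actual PyArrow code; the function name `pick_cmake_generator` and its parameters are hypothetical.

```python
import os
import sys

# Default taken from the unpatched setup.py shown in the diff above.
DEFAULT_WINDOWS_GENERATOR = "Visual Studio 15 2017 Win64"


def pick_cmake_generator(environ=None, platform=None):
    """Return the CMake generator name, or None to let CMake decide.

    Hypothetical helper mirroring the diff: the environment variable wins,
    and the hard-coded default only applies on Windows when it is unset.
    """
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform
    generator = environ.get("PYARROW_CMAKE_GENERATOR")
    if not generator and platform == "win32":
        generator = DEFAULT_WINDOWS_GENERATOR
    return generator
```

Because the environment variable is consulted first, running `set PYARROW_CMAKE_GENERATOR=...` before the build should have the same effect as editing the hard-coded default.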

zhanweiw commented 1 month ago

PyArrow.compile.fail.zip

Thanks @kou! I've attached the PyArrow build log together with the x64 and arm64 'arrow.lib' dump logs. It seems many functions have not been compiled into the arm64 'arrow.lib'.
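
One way to make such a comparison repeatable is to extract the decorated C++ names from each dump and take the set difference. The sketch below assumes the dumps are plain text in which each MSVC-decorated name appears as a whitespace-delimited token starting with `?` (optionally prefixed with `__imp_`). The helper names are hypothetical, and the exact dump layout depends on the DUMPBIN options used, so the pattern may need adjusting.

```python
import re

# Tokens that look like MSVC-decorated C++ names, e.g. "??0Foo@ns@@QEAA@XZ".
# An optional "__imp_" prefix (import thunk) is dropped so both forms compare equal.
_DECORATED = re.compile(r"(?<!\S)(?:__imp_)?(\?\S+)")


def decorated_symbols(dump_text):
    """Collect the set of decorated names found in a dump's text."""
    return set(_DECORATED.findall(dump_text))


def missing_symbols(reference_dump, candidate_dump):
    """Names present in the reference dump but absent from the candidate."""
    return sorted(decorated_symbols(reference_dump) - decorated_symbols(candidate_dump))
```

Calling `missing_symbols(x64_text, arm64_text)` on the contents of the two dump files would then list the names that only the x64 library provides.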

kou commented 1 month ago

It seems that Arrow C++ uses clang-cl but PyArrow doesn't use clang-cl. Can we use clang-cl for PyArrow too?

zhanweiw commented 1 month ago

Thanks for your suggestion.

I modified the code as below and compiled again:

diff --git a/python/setup.py b/python/setup.py
index 60b9a696d..b75adb0fa 100755
--- a/python/setup.py
+++ b/python/setup.py
@@ -165,7 +165,7 @@ class build_ext(_build_ext):
         _build_ext.initialize_options(self)
         self.cmake_generator = os.environ.get('PYARROW_CMAKE_GENERATOR')
         if not self.cmake_generator and sys.platform == 'win32':
-            self.cmake_generator = 'Visual Studio 15 2017 Win64'
+            self.cmake_generator = 'Ninja'

I then got the link error below. After removing the body of the function 'arrow::gdb::TestSession', I can compile PyArrow successfully. I'll test whether the basic functions work.

[53/72] Linking CXX shared library arrow_python.dll
FAILED: arrow_python.dll arrow_python.lib
C:\windows\system32\cmd.exe /C "cd . && "C:\Program Files\CMake\bin\cmake.exe" -E vs_link_dll --intdir=CMakeFiles\arrow_python.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100226~1.0\arm64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100226~1.0\arm64\mt.exe --manifests  -- C:\PROGRA~1\LLVM\bin\lld-link.exe /nologo CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\arrow_to_pandas.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\benchmark.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\common.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\datetime.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\decimal.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\deserialize.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\extension_type.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\gdb.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\helpers.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\inference.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\io.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\ipc.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\numpy_convert.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\numpy_init.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\numpy_to_arrow.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\python_test.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\python_to_arrow.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\pyarrow.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\serialize.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\udf.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\csv.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\filesystem.cc.obj  /out:arrow_python.dll /implib:arrow_python.lib /pdb:arrow_python.pdb /dll /version:0.0 /machine:ARM64  /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO  
C:\source\arrow\Install\lib\arrow_dataset.lib  C:\source\arrow\Install\lib\arrow_acero.lib  C:\source\arrow\Install\lib\parquet.lib  C:\source\arrow\Install\lib\arrow.lib  ws2_32.lib  C:\Programs\Python\Python312-arm64\libs\python312.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib && cd ."
LINK: command "C:\PROGRA~1\LLVM\bin\lld-link.exe /nologo CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\arrow_to_pandas.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\benchmark.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\common.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\datetime.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\decimal.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\deserialize.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\extension_type.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\gdb.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\helpers.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\inference.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\io.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\ipc.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\numpy_convert.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\numpy_init.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\numpy_to_arrow.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\python_test.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\python_to_arrow.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\pyarrow.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\serialize.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\udf.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\csv.cc.obj CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\filesystem.cc.obj /out:arrow_python.dll /implib:arrow_python.lib /pdb:arrow_python.pdb /dll /version:0.0 /machine:ARM64 /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO C:\source\arrow\Install\lib\arrow_dataset.lib C:\source\arrow\Install\lib\arrow_acero.lib C:\source\arrow\Install\lib\parquet.lib C:\source\arrow\Install\lib\arrow.lib ws2_32.lib C:\Programs\Python\Python312-arm64\libs\python312.lib kernel32.lib user32.lib gdi32.lib 
winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST:EMBED,ID=2" failed (exit code 1) with the following output:
lld-link: error: undefined symbol: __declspec(dllimport) public: __cdecl arrow::TimeScalar<class arrow::Time32Type>::TimeScalar<class arrow::Time32Type>(int, enum arrow::TimeUnit::type)
>>> referenced by CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\gdb.cc.obj:(void __cdecl arrow::gdb::TestSession(void))
>>> referenced by CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\gdb.cc.obj:(void __cdecl arrow::gdb::TestSession(void))

lld-link: error: undefined symbol: __declspec(dllimport) public: __cdecl arrow::TimeScalar<class arrow::Time64Type>::TimeScalar<class arrow::Time64Type>(__int64, enum arrow::TimeUnit::type)
>>> referenced by CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\gdb.cc.obj:(void __cdecl arrow::gdb::TestSession(void))
>>> referenced by CMakeFiles\arrow_python.dir\pyarrow\src\arrow\python\gdb.cc.obj:(void __cdecl arrow::gdb::TestSession(void))
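
Logs like the one above get long quickly, so it can help to reduce them to the distinct undefined symbols and where they are referenced. The following is a minimal sketch that assumes the two line shapes seen in this log (`lld-link: error: undefined symbol: ...` followed by `>>> referenced by <object>:(<function>)`); the function name `summarize_undefined` is hypothetical and other lld-link message shapes are ignored.

```python
import re

_UNDEFINED = re.compile(r"^lld-link: error: undefined symbol: (.+)$")
_REFERENCED = re.compile(r"^>>> referenced by (.+?):\((.+)\)$")


def summarize_undefined(log_text):
    """Map each undefined symbol to the set of (object file, function) referrers."""
    summary = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        match = _UNDEFINED.match(line)
        if match:
            current = match.group(1)
            # Register the symbol even if no "referenced by" line follows.
            summary.setdefault(current, set())
            continue
        match = _REFERENCED.match(line)
        if match and current is not None:
            # A set collapses the duplicated "referenced by" lines.
            summary[current].add((match.group(1), match.group(2)))
    return summary
```

Applied to the log above, this would report the two `TimeScalar` constructors, each referenced only from `arrow::gdb::TestSession` in `gdb.cc.obj`.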

kou commented 1 month ago

OK. Could you open a PR that updates https://arrow.apache.org/docs/developers/python.html ? We need to add a document there like https://arrow.apache.org/docs/developers/cpp/windows.html#building-on-windows-arm64-using-ninja-and-clang .

For the cmake_generator change: We can use set PYARROW_CMAKE_GENERATOR=Ninja as I mentioned at https://github.com/apache/arrow/issues/44310#issuecomment-2395321324 .

For the link error: Could you try the following?

diff --git a/cpp/src/arrow/scalar.h b/cpp/src/arrow/scalar.h
index 7a273c46c1..a4fd9453ef 100644
--- a/cpp/src/arrow/scalar.h
+++ b/cpp/src/arrow/scalar.h
@@ -464,7 +464,7 @@ struct ARROW_EXPORT Date64Scalar : public DateScalar<Date64Type> {
 };

 template <typename T>
-struct ARROW_EXPORT TimeScalar : public TemporalScalar<T> {
+struct TimeScalar : public TemporalScalar<T> {
   using TemporalScalar<T>::TemporalScalar;

   TimeScalar(typename TemporalScalar<T>::ValueType value, TimeUnit::type unit)

zhanweiw commented 1 month ago

@kou I can compile it successfully with the steps below. I need to modify the code to disable 'ARROW_BUILD_BUNDLED_DEPENDENCIES', and I also need to add the PyArrow version information 'version="17.0.0"' in 'setup.py'. If 'ARROW_BUILD_BUNDLED_DEPENDENCIES' is not disabled, I get the error below ('arrow_bundled_dependencies.lib' can't be found because it hasn't been built):

[190/191] Install the project...
-- Install configuration: "RELEASE"
-- Installing: C:/zhanweiw/source/Python/Src/arrow/Install/include/arrow/util/config.h
CMake Error at src/arrow/cmake_install.cmake:40 (file):
  file INSTALL cannot find
  "C:/zhanweiw/source/Python/Src/arrow/cpp/build/release/arrow_bundled_dependencies.lib":
  File exists.
Call Stack (most recent call first):
  cmake_install.cmake:37 (include)

Steps to compile PyArrow:

  1. Install compile environment:
     a. Install ARM64 Python 3.12.6 and necessary Python extension.
     b. Install LLVM (https://github.com/llvm/llvm-project/releases/download/llvmorg-18.1.8/LLVM-18.1.8-woa64.exe).
     c. Visual Studio (enable ARM64 support).
     d. CMake.

  2. Open "ARM64 Native Tools Command Prompt for VS 2022" command line and run below commands:

    cd C:\source
    git clone https://github.com/apache/arrow.git

Apply the patches below:

diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt
index c911f0f4e..ddd4dc0bb 100644
--- a/cpp/src/arrow/CMakeLists.txt
+++ b/cpp/src/arrow/CMakeLists.txt
@@ -955,7 +955,7 @@ if(CXX_LINKER_SUPPORTS_VERSION_SCRIPT)
 endif()

 if(ARROW_BUILD_STATIC AND ARROW_BUNDLED_STATIC_LIBS)
-  set(ARROW_BUILD_BUNDLED_DEPENDENCIES TRUE)
+  set(ARROW_BUILD_BUNDLED_DEPENDENCIES FALSE)
 else()
   set(ARROW_BUILD_BUNDLED_DEPENDENCIES FALSE)
 endif()

diff --git a/cpp/src/arrow/scalar.h b/cpp/src/arrow/scalar.h
index 7a273c46c..a4fd9453e 100644
--- a/cpp/src/arrow/scalar.h
+++ b/cpp/src/arrow/scalar.h
@@ -464,7 +464,7 @@ struct ARROW_EXPORT Date64Scalar : public DateScalar<Date64Type> {
 };

 template <typename T>
-struct ARROW_EXPORT TimeScalar : public TemporalScalar<T> {
+struct TimeScalar : public TemporalScalar<T> {
   using TemporalScalar<T>::TemporalScalar;

   TimeScalar(typename TemporalScalar<T>::ValueType value, TimeUnit::type unit)

diff --git a/python/setup.py b/python/setup.py
index 60b9a696d..36c3afa9f 100755
--- a/python/setup.py
+++ b/python/setup.py
@@ -400,6 +400,7 @@ setup(
     distclass=BinaryDistribution,
     # Dummy extension to trigger build_ext
     ext_modules=[Extension('__dummy__', sources=[])],
+    version="17.0.0",
     cmdclass={
         'build_ext': build_ext
    },

  3. Run commands below to compile arrow Cpp code:

    set ARROW_HOME=C:\source\arrow\Install
    set CMAKE_PREFIX_PATH=C:\source\arrow\Install
    set PYARROW_CMAKE_GENERATOR=Ninja

    mkdir arrow\cpp\build
    pushd arrow\cpp\build

    set CC=clang-cl
    set CXX=clang-cl

    cmake -G "Ninja" -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% -DCMAKE_UNITY_BUILD=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_HDFS=ON -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON ..
    cmake --build . --target install --config Release

    popd

  4. Run commands below to compile PyArrow:

    pushd arrow\python
    set PYARROW_BUNDLE_ARROW_CPP=1
    python setup.py build_ext --inplace
    python setup.py bdist_wheel
    popd
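
The long configure command in the C++ build step is easy to mistype, so one option is to assemble it from a table of flags. The sketch below only builds the argument list and runs nothing. The function name `cmake_configure_args` and the `ARROW_OPTIONS` dict are hypothetical, while the flag names are the ones used in the steps above.

```python
# Flags copied from the cmake invocation in the steps above.
ARROW_OPTIONS = {
    "CMAKE_UNITY_BUILD": True,
    "ARROW_COMPUTE": True,
    "ARROW_CSV": True,
    "ARROW_DATASET": True,
    "ARROW_FILESYSTEM": True,
    "ARROW_HDFS": True,
    "ARROW_JSON": True,
    "ARROW_PARQUET": True,
    "ARROW_WITH_LZ4": True,
    "ARROW_WITH_SNAPPY": True,
    "ARROW_WITH_ZLIB": True,
    "ARROW_WITH_ZSTD": True,
}


def cmake_configure_args(install_prefix, options, generator="Ninja", source_dir=".."):
    """Build the argument list for the CMake configure step (hypothetical helper)."""
    args = ["cmake", "-G", generator, f"-DCMAKE_INSTALL_PREFIX={install_prefix}"]
    for name, enabled in options.items():
        args.append(f"-D{name}={'ON' if enabled else 'OFF'}")
    args.append(source_dir)
    return args
```

The resulting list could be passed to `subprocess.run(args, check=True)` from inside the build directory, with `CC` and `CXX` set to `clang-cl` in the environment as in the steps above.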

kou commented 1 month ago

We can work on the second diff (cpp/src/arrow/scalar.h) in #44364.

Could you open a new issue for the first diff (cpp/src/arrow/CMakeLists.txt)? Let's work on it as a separate task.

Why do we need the third diff (python/setup.py)?

zhanweiw commented 1 month ago

Without the third diff, I'll get 'pyarrow-0-cp312-cp312-win_arm64.whl'. With the change, I'll get 'pyarrow-17.0.0-cp312-cp312-win_arm64.whl'.
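
To check which version a built wheel carries without installing it, the file name can be split into its components. This is a minimal sketch based on the standard wheel naming scheme (`name-version[-build]-python-abi-platform.whl`); the function name `parse_wheel_filename` is hypothetical.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated components."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(suffix)].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }
```

For the two file names quoted above, the `version` field would be `0` and `17.0.0` respectively, which makes the missing version information easy to detect in a script.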

zhanweiw commented 1 month ago

I've created a new ticket for the 'arrow_bundled_dependencies.lib' build issue: https://github.com/apache/arrow/issues/44368

kou commented 1 month ago

> Without the third diff, I'll get 'pyarrow-0-cp312-cp312-win_arm64.whl'. With the change, I'll get 'pyarrow-17.0.0-cp312-cp312-win_arm64.whl'.

Hmm. It's strange.

https://github.com/apache/arrow/blob/dcc1ee5b1d4851870724ab5e4cf475bcac007b56/python/pyproject.toml#L79-L84

should be used for version information.

@jorisvandenbossche Do you have any idea why this happened? ("the third diff" is a diff in https://github.com/apache/arrow/issues/44310#issuecomment-2402288125 .)

zhanweiw commented 1 month ago

@kou

I can't find the file 'pyarrow/_generated_version.py' in the build:

version_file = 'pyarrow/_generated_version.py'

And it seems no code is using "fallback_version" on ARM64 Windows: https://github.com/search?q=repo%3Aapache%2Farrow+fallback_version&type=code

kou commented 1 month ago

I think that pyarrow/_generated_version.py is generated automatically.

I think that fallback_version is used by "setuptools" not PyArrow.

Could you try python -m pip install . instead of python setup.py build_ext --inplace and python setup.py bdist_wheel?

zhanweiw commented 1 month ago

> I think that pyarrow/_generated_version.py is generated automatically.
>
> I think that fallback_version is used by "setuptools" not PyArrow.
>
> Could you try python -m pip install . instead of python setup.py build_ext --inplace and python setup.py bdist_wheel?

I got the error below with that command:

           [114/334] Compiling C object numpy/_core/_multiarray_tests.cp312-win_arm64.pyd.p/meson-generated__multiarray_tests.c.obj
            FAILED: numpy/_core/_multiarray_tests.cp312-win_arm64.pyd.p/meson-generated__multiarray_tests.c.obj
            "clang-cl" "-Inumpy\_core\_multiarray_tests.cp312-win_arm64.pyd.p" "-Inumpy\_core" "-I..\numpy\_core" "-I..\numpy\_core\src\multiarray" "-I..\numpy\_core\src\npymath" "-Inumpy\_core\include" "-I..\numpy\_core\include" "-I..\numpy\_core\src\common" "-IC:\Programs\Python\Python312-arm64\Include" "-IC:\Users\zhanw\AppData\Local\Temp\pip-install-l3cp_q1s\numpy_e774c5d2837a4788b968dbaa52f05202\.mesonpy-8gh30u70\meson_cpu" "-DNDEBUG" "/MD" "/nologo" "/showIncludes" "/utf-8" "/W2" "/clang:-std=c11" "/O2" "/Gw" "-fno-strict-aliasing" "/clang:-ftrapping-math" "-DNPY_HAVE_CLANG_FPSTRICT" "-DNPY_HAVE_NEON_VFPV4" "-DNPY_HAVE_NEON_FP16" "-DNPY_HAVE_NEON" "-DNPY_HAVE_ASIMD" "-DNPY_INTERNAL_BUILD" "-DHAVE_NPY_CONFIG_H" "-D_FILE_OFFSET_BITS=64" "-D_LARGEFILE_SOURCE=1" "-D_LARGEFILE64_SOURCE=1" "/Fdnumpy\_core\_multiarray_tests.cp312-win_arm64.pyd.p\meson-generated__multiarray_tests.c.pdb" /Fonumpy/_core/_multiarray_tests.cp312-win_arm64.pyd.p/meson-generated__multiarray_tests.c.obj "/c" numpy/_core/_multiarray_tests.cp312-win_arm64.pyd.p/_multiarray_tests.c
            ..\numpy\_core\src\multiarray\_multiarray_tests.c.src(1883,17): error: invalid operand in inline asm: 'fstcw ${0:w}'
             1883 |         __asm__("fstcw %w0" : "=m" (cw));
                  |                 ^
            ..\numpy\_core\src\multiarray\_multiarray_tests.c.src(1883,17): error: unrecognized instruction mnemonic
            <inline asm>(1,2): note: instantiated into assembly here
                1 |         fstcw
                  |         ^
            2 errors generated.