apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Getting reference not found with ORC enabled pyarrow #18439

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Built pyarrow with ORC enabled on Power (PPC64LE) using the following steps:


export ARROW_HOME=$CONDA_PREFIX
mkdir cpp/build
cd cpp/build
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
      -DCMAKE_INSTALL_LIBDIR=lib \
      -DARROW_WITH_BZ2=ON \
      -DARROW_WITH_ZLIB=ON \
      -DARROW_WITH_ZSTD=ON \
      -DARROW_WITH_LZ4=ON \
      -DARROW_WITH_SNAPPY=ON \
      -DARROW_WITH_BROTLI=ON \
      -DARROW_PARQUET=ON \
      -DARROW_PYTHON=ON \
      -DARROW_BUILD_TESTS=ON \
      -DARROW_CUDA=ON \
      -DCUDA_CUDA_LIBRARY=/usr/local/cuda/lib64/stubs/libcuda.so \
      -DARROW_ORC=ON \
      ..

make -j
make install

cd ../../python
python setup.py build_ext --bundle-arrow-cpp --with-orc --with-cuda --with-parquet bdist_wheel

 

 

With the generated whl package installed, I ran the cuDF tests and observed the following error:

ERROR cudf - ImportError: /conda/envs/rmm/lib/python3.7/site-packages/pyarrow/_orc.cpython-37m-powerpc64le-linux-gnu.so: undefined symbol: _ZN5arrow8adapters3orc13OR...

Please find the whole error log below:

================================ ERRORS ================================

____ ERROR collecting test session _____
/conda/envs/rmm/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
:1006: in _gcd_import
    ???
:983: in _find_and_load
    ???
:953: in _find_and_load_unlocked
    ???
:219: in _call_with_frames_removed
    ???
:1006: in _gcd_import
    ???
:983: in _find_and_load
    ???
:953: in _find_and_load_unlocked
    ???
:219: in _call_with_frames_removed
    ???
:1006: in _gcd_import
    ???
:983: in _find_and_load
    ???
:967: in _find_and_load_unlocked
    ???
:677: in _load_unlocked
    ???
:728: in exec_module
    ???
:219: in _call_with_frames_removed
    ???
cudf/cudf/__init__.py:60: in
    from cudf.io import (
cudf/cudf/io/__init__.py:8: in
    from cudf.io.orc import read_orc, read_orc_metadata, to_orc
cudf/cudf/io/orc.py:6: in
    from pyarrow import orc as orc
/conda/envs/rmm/lib/python3.7/site-packages/pyarrow/orc.py:24: in
    import pyarrow._orc as _orc
E   ImportError: /conda/envs/rmm/lib/python3.7/site-packages/pyarrow/_orc.cpython-37m-powerpc64le-linux-gnu.so: undefined symbol: _ZN5arrow8adapters3orc13ORCFileReader4ReadEPSt10shared_ptrINS_5TableEE
======================== short test summary info ========================
ERROR cudf - ImportError: /conda/envs/rmm/lib/python3.7/site-packages/pyarrow/_orc.cpython-37m-powerpc64le-linux-gnu.so: undefined symbol: _ZN5arrow8adapters3orc13OR...
!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!
=========================== 1 error in 1.54s ===========================
Fatal Python error: Segmentation fault

**Environment**: PPC64LE
**Reporter**: [Kandarpa](https://issues.apache.org/jira/browse/ARROW-11075)

#### Original Issue Attachments:
- [arrow_cpp_build.log](https://issues.apache.org/jira/secure/attachment/13018472/arrow_cpp_build.log)
- [arrow_python_build.log](https://issues.apache.org/jira/secure/attachment/13018470/arrow_python_build.log)
- [conda_list.txt](https://issues.apache.org/jira/secure/attachment/13018473/conda_list.txt)
- [cudf_buildscrip.sh](https://issues.apache.org/jira/secure/attachment/13020173/cudf_buildscrip.sh)

**Note**: *This issue was originally created as [ARROW-11075](https://issues.apache.org/jira/browse/ARROW-11075). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*
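
For readers unfamiliar with mangled names: the undefined symbol in the traceback follows the Itanium C++ ABI scheme, in which each identifier of a nested name is prefixed by its length. The sketch below is a deliberately simplified reader for that prefix only (it is not a full demangler, and the helper name `nested_name_components` is made up for illustration); in practice `c++filt` does the complete job.

```python
from typing import List


def nested_name_components(mangled: str) -> List[str]:
    """Return the length-prefixed identifiers of an Itanium-ABI nested name.

    Simplification: only handles symbols of the form _ZN<len><id>...; it stops
    at the first non-digit and ignores qualifiers, templates and argument types.
    """
    if not mangled.startswith("_ZN"):
        raise ValueError("expected a mangled nested name starting with _ZN")
    pos = 3
    parts: List[str] = []
    while pos < len(mangled) and mangled[pos].isdigit():
        # Read the decimal length, then slice out that many characters.
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        parts.append(mangled[end:end + length])
        pos = end + length
    return parts


# The full symbol from the ImportError in the log.
symbol = "_ZN5arrow8adapters3orc13ORCFileReader4ReadEPSt10shared_ptrINS_5TableEE"
print("::".join(nested_name_components(symbol)))
# prints: arrow::adapters::orc::ORCFileReader::Read
```

So the extension module `_orc` is looking for an `ORCFileReader::Read` overload that the `libarrow.so` it was loaded against does not export.
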
asfimport commented 3 years ago

Uwe Korn / @xhochy: Can you post the output of conda list and the build logs for the C++ and Python part of Arrow? Without these three it will be hard to debug.

asfimport commented 3 years ago

Kandarpa: Hello @xhochy,

Please find the following attached:

- Conda list: conda_list.txt
- Arrow C++ build logs (cmake, make, make install): arrow_cpp_build.log
- Arrow Python build logs: arrow_python_build.log

Please let me know if you need any further information.

Regards,

Kandarpa

asfimport commented 3 years ago

Kandarpa: @xhochy

Any update on this? We are blocked by this issue.

asfimport commented 3 years ago

Uwe Korn / @xhochy: I would guess that the issue is related to -DORC_SOURCE=BUNDLED and having orc installed as a conda package at the same time. Can you remove the -DORC_SOURCE=BUNDLED flag and do a clean build? Do you know why you have set that?
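
One way to confirm which value `ORC_SOURCE` actually had in a given build tree is to read `CMakeCache.txt` from `cpp/build`. A minimal sketch, assuming the usual `KEY:TYPE=VALUE` cache layout; the excerpt below is hypothetical (including the entry types) and not taken from the attached logs, and `read_cmake_cache` is an illustrative helper name.

```python
from typing import Dict


def read_cmake_cache(text: str) -> Dict[str, str]:
    """Parse CMakeCache.txt content into a {key: value} mapping."""
    entries: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the two comment styles used in the cache file.
        if not line or line.startswith(("#", "//")):
            continue
        key_and_type, sep, value = line.partition("=")
        if not sep:
            continue
        key = key_and_type.split(":", 1)[0]  # drop the :TYPE suffix
        entries[key] = value
    return entries


# Hypothetical cache excerpt, for illustration only.
sample_cache = """\
// Build the Arrow ORC adapter
ARROW_ORC:BOOL=ON
ORC_SOURCE:STRING=BUNDLED
CMAKE_INSTALL_PREFIX:PATH=/conda/envs/rmm
"""

cache = read_cmake_cache(sample_cache)
print(cache.get("ORC_SOURCE", "<not set>"))
```

Against a real tree, the text would come from something like `open("cpp/build/CMakeCache.txt").read()`.
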

asfimport commented 3 years ago

Wes McKinney / @wesm: ORC is supposed to be statically linked, so this would be unusual.

[~kandarpamalipeddi] can you show what ORC symbols are in your shared library?


nm -D /path/to/libarrow.so | c++filt | grep orc

Also check which libarrow.so the pyarrow libraries are linking to, if you can (with ldd).
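
The `ldd` check can be scripted if it needs to be repeated across environments. A rough sketch that parses `ldd` text output and reports where each `libarrow*` dependency resolves; the sample output, including the soname `libarrow.so.200`, is hypothetical, and `parse_ldd` / `arrow_libraries` are illustrative helper names.

```python
import re
from typing import Dict, Optional

# Matches lines such as "libfoo.so.1 => /path/libfoo.so.1 (0x...)"
# and "libfoo.so.1 => not found".
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(not found|\S+)")


def parse_ldd(output: str) -> Dict[str, Optional[str]]:
    """Map each dependency name to its resolved path (None if not found)."""
    resolved: Dict[str, Optional[str]] = {}
    for line in output.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            name, target = match.groups()
            resolved[name] = None if target == "not found" else target
    return resolved


def arrow_libraries(resolved: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Keep only the dependencies whose name starts with 'libarrow'."""
    return {n: p for n, p in resolved.items() if n.startswith("libarrow")}


# Hypothetical ldd output for the _orc extension module.
sample_ldd = """\
\tlinux-vdso64.so.1 (0x00007fff80000000)
\tlibarrow.so.200 => /conda/envs/rmm/lib/libarrow.so.200 (0x00007fff7e000000)
\tlibarrow_python.so.200 => not found
\tlibc.so.6 => /lib64/libc.so.6 (0x00007fff7d000000)
"""

for name, path in arrow_libraries(parse_ldd(sample_ldd)).items():
    print(name, "->", path or "NOT FOUND")
```

If the resolved path is not the freshly built library (for example, a conda-installed `libarrow.so` elsewhere on the loader path), that would be consistent with an undefined-symbol error at import time.
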

asfimport commented 3 years ago

Uwe Korn / @xhochy: The latest ORC release supports shared linkage, and the conda toolchain has been reworked to link dynamically: https://github.com/conda-forge/arrow-cpp-feedstock/blob/1.0.x/recipe/meta.yaml. The major issue here is probably that ORC 1.6.2 is built as part of the Arrow thirdparty toolchain but the 1.6.6 headers are used during the build. I am not sure how this links, but that feels like the most likely cause to me.

asfimport commented 3 years ago

Kandarpa: Hello @xhochy, @wesm, thanks for looking into this issue.

I ran cmake as follows:


cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
      -DCMAKE_INSTALL_LIBDIR=lib \
      -DARROW_WITH_BZ2=ON \
      -DARROW_WITH_ZLIB=ON \
      -DARROW_WITH_ZSTD=ON \
      -DARROW_WITH_LZ4=ON \
      -DARROW_WITH_SNAPPY=ON \
      -DARROW_WITH_BROTLI=ON \
      -DARROW_PARQUET=ON \
      -DARROW_PYTHON=ON \
      -DARROW_BUILD_TESTS=ON \
      -DARROW_CUDA=ON \
      -DCUDA_CUDA_LIBRARY=/usr/local/cuda/lib64/stubs/libcuda.so \
      -DARROW_ORC=ON \
      -DARROW_JEMALLOC=ON \
      -DARROW_DATASET=ON \
      ..
make -j

nm -D ./release/libarrow.so | c++filt | grep orc

0000000000bf21d0 u guard variable for arrow::adapters::orc::ArrowInputFile::getName[abi:cxx11]() const::filename
                 U orc::ParseError::ParseError(char const*)
                 U orc::ParseError::ParseError(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
                 U orc::ParseError::~ParseError()
                 U orc::InputStream::~InputStream()
                 U orc::createReader(std::unique_ptr<orc::InputStream, std::default_delete >, orc::ReaderOptions const&)
                 U orc::ReaderOptions::ReaderOptions()
                 U orc::ReaderOptions::~ReaderOptions()
                 U orc::RowReaderOptions::includeTypes(std::__cxx11::list<unsigned long, std::allocator > const&)
                 U orc::RowReaderOptions::range(unsigned long, unsigned long)
                 U orc::RowReaderOptions::RowReaderOptions(orc::RowReaderOptions const&)
                 U orc::RowReaderOptions::RowReaderOptions()
                 U orc::RowReaderOptions::~RowReaderOptions()
0000000000474110 T arrow::io::internal::LibHdfsShim::BuilderSetForceNewInstance(hdfsBuilder*)
00000000009cfda0 T arrow::adapters::orc::AppendBatch(orc::Type const*, orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009cb720 T arrow::adapters::orc::GetArrowType(orc::Type const*, std::shared_ptr*)
00000000009c2ad0 T arrow::adapters::orc::ORCFileReader::ReadSchema(std::shared_ptr*)
00000000009c4140 T arrow::adapters::orc::ORCFileReader::ReadStripe(long, std::shared_ptr*)
00000000009c4690 T arrow::adapters::orc::ORCFileReader::ReadStripe(long, std::vector<int, std::allocator > const&, std::shared_ptr*)
00000000009c2a80 T arrow::adapters::orc::ORCFileReader::NumberOfRows()
00000000009c2a50 T arrow::adapters::orc::ORCFileReader::NumberOfStripes()
00000000009c4e50 T arrow::adapters::orc::ORCFileReader::NextStripeReader(long, std::shared_ptr*)
00000000009c4f60 T arrow::adapters::orc::ORCFileReader::NextStripeReader(long, std::vector<int, std::allocator > const&, std::shared_ptr*)
00000000009c2ba0 T arrow::adapters::orc::ORCFileReader::Open(std::shared_ptr const&, arrow::MemoryPool*, std::unique_ptr<arrow::adapters::orc::ORCFileReader, std::default_delete >*)
00000000009c32a0 T arrow::adapters::orc::ORCFileReader::Read(std::shared_ptr*)
00000000009c3630 T arrow::adapters::orc::ORCFileReader::Read(std::shared_ptr const&, std::shared_ptr*)
00000000009c3d80 T arrow::adapters::orc::ORCFileReader::Read(std::shared_ptr const&, std::vector<int, std::allocator > const&, std::shared_ptr*)
00000000009c3760 T arrow::adapters::orc::ORCFileReader::Read(std::vector<int, std::allocator > const&, std::shared_ptr*)
00000000009c2810 T arrow::adapters::orc::ORCFileReader::Seek(long)
00000000009c2600 T arrow::adapters::orc::ORCFileReader::ORCFileReader()
00000000009c2600 T arrow::adapters::orc::ORCFileReader::ORCFileReader()
00000000009c2770 T arrow::adapters::orc::ORCFileReader::~ORCFileReader()
00000000009c2770 T arrow::adapters::orc::ORCFileReader::~ORCFileReader()
00000000009d0fc0 T arrow::adapters::orc::AppendMapBatch(orc::Type const*, orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009ca990 T arrow::adapters::orc::AppendBoolBatch(orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009d0770 T arrow::adapters::orc::AppendListBatch(orc::Type const*, orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009d1610 W arrow::Status arrow::adapters::orc::AppendBinaryBatch(orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009d1df0 W arrow::Status arrow::adapters::orc::AppendBinaryBatch(orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009d0490 T arrow::adapters::orc::AppendStructBatch(orc::Type const*, orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009cfa60 T arrow::adapters::orc::AppendDecimalBatch(orc::Type const*, orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009cb160 T arrow::adapters::orc::AppendTimestampBatch(orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009cf7d0 T arrow::adapters::orc::AppendFixedBinaryBatch(orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009d46c0 W arrow::Status arrow::adapters::orc::AppendNumericBatchCast<arrow::NumericBuilder, int, orc::LongVectorBatch, long>(orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009d3770 W arrow::Status arrow::adapters::orc::AppendNumericBatchCast<arrow::NumericBuilder, signed char, orc::LongVectorBatch, long>(orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009d3fe0 W arrow::Status arrow::adapters::orc::AppendNumericBatchCast<arrow::NumericBuilder, float, orc::DoubleVectorBatch, double>(orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009d3050 W arrow::Status arrow::adapters::orc::AppendNumericBatchCast<arrow::NumericBuilder, short, orc::LongVectorBatch, long>(orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009d29a0 W arrow::Status arrow::adapters::orc::AppendNumericBatchCast<arrow::NumericBuilder, int, orc::LongVectorBatch, long>(orc::ColumnVectorBatch*, long, long, arrow::ArrayBuilder*)
00000000009c50f0 W std::_Sp_counted_ptr<arrow::adapters::orc::OrcStripeReader*, (__gnu_cxx::_Lock_policy)2>::_M_destroy()
00000000009c5020 W std::_Sp_counted_ptr<arrow::adapters::orc::OrcStripeReader*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
00000000009c5080 W std::_Sp_counted_ptr<arrow::adapters::orc::OrcStripeReader*, (__gnu_cxx::_Lock_policy)2>::_M_get_deleter(std::type_info const&)
00000000009c72c0 W std::vector<arrow::adapters::orc::StripeInformation, std::allocator >::_M_default_append(unsigned long)
                 U typeinfo for orc::ParseError
                 U typeinfo for orc::InputStream
0000000000be66c8 V typeinfo for arrow::adapters::orc::ArrowInputFile
0000000000be66e0 V typeinfo for arrow::adapters::orc::OrcStripeReader
0000000000be66f8 V typeinfo for std::_Sp_counted_ptr<arrow::adapters::orc::OrcStripeReader*, (__gnu_cxx::_Lock_policy)2>
0000000000aa8a58 V typeinfo name for arrow::adapters::orc::ArrowInputFile
0000000000aa8a80 V typeinfo name for arrow::adapters::orc::OrcStripeReader
0000000000aa8aa8 V typeinfo name for std::_Sp_counted_ptr<arrow::adapters::orc::OrcStripeReader*, (__gnu_cxx::_Lock_policy)2>
0000000000be6710 V vtable for arrow::adapters::orc::ArrowInputFile
0000000000be6750 V vtable for arrow::adapters::orc::OrcStripeReader
0000000000be6780 V vtable for std::_Sp_counted_ptr<arrow::adapters::orc::OrcStripeReader*, (__gnu_cxx::_Lock_policy)2>
0000000000bf21d8 u arrow::adapters::orc::ArrowInputFile::getName[abi:cxx11]() const::filename

 

Could this be a namespace issue?
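
To compare exactly, it may be easier to look up the raw (mangled) symbol from the ImportError in plain `nm -D` output, without `c++filt`, for whichever `libarrow.so` the installed pyarrow really loads. A minimal sketch of that lookup; the sample listing, its addresses and the helper names `parse_nm` / `is_defined` are hypothetical, and treating every type letter other than `U` as "defined" is a simplification.

```python
from typing import Dict


def parse_nm(output: str) -> Dict[str, str]:
    """Map each mangled symbol name to its nm type letter."""
    symbols: Dict[str, str] = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 3:      # address, type, name
            _address, kind, name = fields
        elif len(fields) == 2:    # type, name (no address, e.g. undefined)
            kind, name = fields
        else:
            continue
        symbols[name] = kind
    return symbols


def is_defined(symbols: Dict[str, str], name: str) -> bool:
    """True if the symbol is listed with any type other than 'U'."""
    return symbols.get(name, "U") != "U"


# Symbol reported as undefined in the ImportError.
missing = "_ZN5arrow8adapters3orc13ORCFileReader4ReadEPSt10shared_ptrINS_5TableEE"

# Hypothetical `nm -D` excerpt (addresses invented for illustration).
sample_nm = """\
0000000000001000 T _ZN5arrow8adapters3orc13ORCFileReader4ReadEPSt10shared_ptrINS_5TableEE
                 U _ZN3orc10ParseErrorD1Ev
"""

table = parse_nm(sample_nm)
print("exported by this library:", is_defined(table, missing))
```

If the symbol is exported by the freshly built library but still reported missing at import time, the loader is probably picking up a different `libarrow.so`, which is what the `ldd` check is meant to reveal.
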

asfimport commented 3 years ago

Kandarpa: @xhochy, @wesm

Any pointers on this? I am completely blocked by this issue. Any workaround would be really appreciated.

Regards,

Kandarpa

 

asfimport commented 3 years ago

Kandarpa: Hello @xhochy @wesm,

Any update on this?

 

Regards,

Kandarpa

asfimport commented 3 years ago

Uwe Korn / @xhochy: Can you provide a reproducible dockerfile or similar? I fail to see anything obvious here.

asfimport commented 3 years ago

Kandarpa: @xhochy

Please find the build steps in the attachment cudf_buildscrip.sh.

Let me know if you need any further details.

 

Kandarpa

asfimport commented 3 years ago

Kandarpa: Hello  @xhochy @wesm,

 

Any updates or observations on this?

 

Regards,

Kandarpa