apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python][Skyhook] the code not include the "libarrow_skyhook_client.so" library #35357

Open shou123 opened 1 year ago

shou123 commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

According to the paper, the API is used as follows:

    import pyarrow.dataset as ds
    format_ = ds.SkyhookFileFormat("parquet", "/ceph.conf")

But when I build with "ARROW_SKYHOOK=ON", no "libarrow_skyhook_client.so" library is generated, and the SkyhookFileFormat API cannot be used.
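One quick way to confirm this symptom is to check whether the installed pyarrow exposes the name at all. The following is a minimal sketch: the helper name `has_skyhook_file_format` is hypothetical, and it only tests that the attribute is present. It does not call the attribute or contact Ceph.

```python
import importlib


def has_skyhook_file_format() -> bool:
    """Return True if pyarrow.dataset exposes a SkyhookFileFormat attribute."""
    try:
        ds = importlib.import_module("pyarrow.dataset")
    except ImportError:
        # pyarrow (or its dataset module) is not importable in this environment.
        return False
    return hasattr(ds, "SkyhookFileFormat")


if __name__ == "__main__":
    print("SkyhookFileFormat available:", has_skyhook_file_format())
```

If this prints `False` on a build configured with ARROW_SKYHOOK=ON, the Python bindings do not expose the class, regardless of which C++ libraries were produced.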

Component(s)

Packaging, Python

kou commented 1 year ago

Skyhook's library name is libarrow_skyhook.so, not libarrow_skyhook_client.so. Why do you think that the name is libarrow_skyhook_client.so?
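To see which Skyhook libraries a build actually produced, one option is to search the build or install tree for matching file names. This is a sketch that assumes Linux-style `lib*.so` naming. The function name `find_skyhook_libs` and the example directory `cpp/build` are hypothetical.

```python
from pathlib import Path


def find_skyhook_libs(root: str) -> list[str]:
    """Return sorted, de-duplicated file names under root that look like Skyhook shared libraries."""
    base = Path(root)
    if not base.is_dir():
        return []
    # Matches libarrow_skyhook.so as well as suffixed or versioned variants.
    return sorted({p.name for p in base.rglob("libarrow_skyhook*.so*")})


if __name__ == "__main__":
    # Hypothetical build directory; replace with your own build or install prefix.
    for name in find_skyhook_libs("cpp/build"):
        print(name)
```

The pattern is deliberately loose, so it would also list a `libarrow_skyhook_client.so` if one existed in the tree.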

Cc: @JayjeetAtGithub

shou123 commented 1 year ago

> Skyhook's library name is libarrow_skyhook.so, not libarrow_skyhook_client.so. Why do you think that the name is libarrow_skyhook_client.so?
>
> Cc: @JayjeetAtGithub

'libarrow_skyhook_client.so' is supposed to be required for the 'SkyhookFileFormat' API: according to the article below, a program using it links against the 'arrow_dataset', 'arrow', and 'arrow_skyhook_client' shared libraries at compile time.

Reference: [https://jayjeetc.medium.com/skyhookdm-is-now-a-part-of-apache-arrow-e5d7b9a810ba]

kou commented 1 year ago

The article was written by @JayjeetAtGithub . So we should wait for a response from @JayjeetAtGithub . :-)

shou123 commented 1 year ago

> The article was written by @JayjeetAtGithub. So we should wait for a response from @JayjeetAtGithub. :-)

For Sure.

Rajneesh2223 commented 1 year ago

The libarrow_skyhook_client.so library is generated by the arrow_skyhook_client target in the pyarrow/cpp/build/BUILD.gn file. This target is only enabled when ARROW_SKYHOOK=ON is set.

The SkyhookFileFormat API is implemented in the arrow/ipc/skyhook.cc file. This file only includes the libarrow_skyhook_client.so library if it is available.

When ARROW_SKYHOOK=ON is not set, the libarrow_skyhook_client.so library is not generated, and the SkyhookFileFormat API is not available.

shou123 commented 1 year ago

> The libarrow_skyhook_client.so library is generated by the arrow_skyhook_client target in the pyarrow/cpp/build/BUILD.gn file. This target is only enabled when ARROW_SKYHOOK=ON is set.

Sorry, I couldn't find the "pyarrow/cpp/build/BUILD.gn" file in the Apache Arrow source code. Could you please provide a link to it?

shou123 commented 1 year ago

> The libarrow_skyhook_client.so library is generated by the arrow_skyhook_client target in the pyarrow/cpp/build/BUILD.gn file. This target is only enabled when ARROW_SKYHOOK=ON is set.
>
> The SkyhookFileFormat API is implemented in the arrow/ipc/skyhook.cc file. This file only includes the libarrow_skyhook_client.so library if it is available.
>
> When ARROW_SKYHOOK=ON is not set, the libarrow_skyhook_client.so library is not generated, and the SkyhookFileFormat API is not available.

PS: I did set 'ARROW_SKYHOOK=ON'. According to the paper (https://arxiv.org/pdf/2204.06074.pdf), pyarrow should provide 'SkyhookFileFormat', but the 'pyarrow' library does not include it.

westonpace commented 1 year ago

How are you installing Arrow today? I think we might not be enabling skyhook in the wheels that we publish to pypi / conda-forge. So you will have to build wheels from source. Directions on how to do this are here: https://arrow.apache.org/docs/developers/python.html

shou123 commented 1 year ago

> How are you installing Arrow today? I think we might not be enabling skyhook in the wheels that we publish to pypi / conda-forge. So you will have to build wheels from source. Directions on how to do this are here: https://arrow.apache.org/docs/developers/python.html

Thank you for providing the information. I'll try for that.

drin commented 3 months ago

@shou123 , I know this is quite late, but did you manage to figure out this particular issue?

Based on #37866 being opened, I am assuming so.