apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
13.97k stars 3.41k forks

Enable a smaller build of just libparquet #38679

Open the80srobot opened 8 months ago

the80srobot commented 8 months ago

Describe the enhancement requested

Hello, I'm trying to add parquet output support to github.com/google/santa and github.com/wowsignal-io/pedro, and it's proving to be a difficult dependency. In Pedro's case, the static build adds ~15 MB to the binary size, and I'd like to reduce that for Santa. A second problem is the number of build-time dependencies, including boost.

For comparison, the Rust crate parquet2 builds in 10 seconds and the resulting dylib is only 2 MB. That seems like it should be an achievable goal, if parquet can be separated from arrow and from external build-time dependencies.

For build size, unfortunately, it seems like cpp/parquet/ has a lot of dependencies on cpp/arrow/ and it's almost impossible not to build 90% of arrow. For example, column_writer.cc depends on stuff in cpp/arrow/compute/.

For dependency management, it seems difficult to get rid of boost, because thrift depends on it. Given thrift can codegen, it's not obvious to me if it's really needed at build time, but I'm not too familiar with using it in real projects.

I think to get libparquet builds to be reasonably compact and self-contained, I'd need to accomplish the following:

Is a smaller, hermetic build of libparquet something that's on your roadmap?

If not, I would be happy to maintain a reasonable patchset as part of Santa or Pedro, but it's not clear to me how Arrow is internally structured and if there's modularization such that I could try to patch out cross-dependencies. Would you be willing to point me in the right direction?

Component(s)

C++

kou commented 8 months ago

Could you start a discussion about this on dev@arrow.apache.org to gather as many opinions as possible? See also: https://arrow.apache.org/community/#mailing-lists

pitrou commented 8 months ago

Several things:

1) I'm not sure making Parquet C++ independent of Arrow is a realistic goal nowadays. We actually moved Parquet C++ into the Arrow monorepo years ago because maintaining it separately was too cumbersome.

2) We include Thrift-generated files in the repo, you don't need to regenerate them.

3) I don't think Thrift headers still have a Boost dependency nowadays, or perhaps only when including https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/processor/TMultiplexedProcessor.h ?

4) "Reduce the amount of arrow code that needs to be built for libparquet" sounds worthwhile if it doesn't introduce too much complexity. This probably implies working on https://github.com/apache/arrow/issues/25025 first.