apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.41k stars 3.51k forks source link

[C++][Parquet] Detach thirdparty code from build configuration. #42263

Closed asfimport closed 8 years ago

asfimport commented 9 years ago

The existing repo has source code for third party dependencies checked into the repo. The build system expects those dependencies in a certain place. This enforces that the built library conform to those exact dependencies without customization.

Managing third party dependencies is better handled through a build environment. It allows the library builder more flexibility over dependency versions and locations. It also cleans up the repo from this third party code.

Reporter: Kalon Mills / @kalaxy Assignee: Kalon Mills / @kalaxy

Related issues:

Note: This issue was originally created as PARQUET-267. Please see the migration documentation for further details.

asfimport commented 9 years ago

Kalon Mills / @kalaxy: Github Pull Request: https://github.com/apache/parquet-cpp/pull/10

This pull request builds off of the pull request submitted for https://issues.apache.org/jira/browse/PARQUET-259.

asfimport commented 9 years ago

Kalon Mills / @kalaxy: @nongli I updated the pull request per your comments last week. Did you have any more feedback?

asfimport commented 8 years ago

Kalon Mills / @kalaxy: @nongli or whoever might be maintaining this now days. I think the first hurdle to approaching parquet-cpp right now is the build config. There are now three pull requests on the github repo updating this. It would be great to get one of the merged so that others don't keep reimplementing the same workarounds.

asfimport commented 8 years ago

Nong Li / @nongli: @kalaxy I agree, this is a big barrier to getting started. I think there are few things we need to decide on before we pick one implementation.

I think the high level requirements are:

  1. Easy to build the project out of the box. We want to minimize the steps required from no dev environment to repo cloned, built and tests passing. This needs to work on a large variety of environments.
  2. Easy to update or add new dependencies.

This makes me lean towards not having the developer have maintain the build environment by default. For someone who does a lot of c++ development and has an existing environment they like, they should be able to use that but for someone getting started, we should provide a "sandboxed" environment that just works. Furthermore, the sandbox environment IMO, is the only supported one. In other words, we can for example, upgrade to a new version of glog just by upgrading the sandbox version. If you chose to run a custom glog that is too old, too bad, you have to figure out how to manage your custom env.

asfimport commented 8 years ago

Wes McKinney / @wesm: Per discussion on PARQUET-416; I'll split out the code reorganization into its own patch so we can figure out a plan here. I'm happy using https://github.com/apache/parquet-cpp/pull/10 as a starting point and adding ${LIBNAME_HOME} variables so that one can easily set up a custom toolchain with environment variables.

asfimport commented 8 years ago

Kalon Mills / @kalaxy: @nongli, I agree that a sandboxed env would be very useful for quickstart development. I think, however, that it should be a script which initializes the environment for use, rather than part of the build configuration. Right now, apache/parquet-cpp:master requires download_thirdparty.sh and build_thirdparty.sh to be run as prerequisites to running cmake because cmake relies on specifics.

I'd rather see a cmake configuration that is customizable for determining library locations and then an ./initialize_sandbox.sh <path_to_build_dir> which can download and install everything required by the sandbox as well as configure cmake in that directory with the sandbox libraries. (In fact this would make the travisci build configuration simpler too because travisci could leverage the sandbox env.)

I think this is what @wesm is getting at with his comment above. My pull request here does three, I think important, things.

  1. It removes third party code from the repo.
  2. It makes library discovery more generic.
  3. It decouples the build configuration from the build environment.

With this as a foundation, I think that one can script sandbox creation in a straightforward and decoupled manner.

I guess your point is that with this pull request I'm removing an existing sandbox solution, but I'm hoping that this would be a good stepping stone for a better solution. I don't want to duplicate work that Wes is doing, but if you'd like I can implement a version of initialize_sandbox.sh so that the sandbox feature still exists. @wesm is that sort of the direction you were already headed?

asfimport commented 8 years ago

Wes McKinney / @wesm: That makes sense to me; if we can avoid putting thirdparty libraries in the git repo for the sandbox that would be great. If you want to rebase your cmake changes on my PR https://github.com/apache/parquet-cpp/pull/14 plus an initialize_sandbox.sh that sounds ideal if @nongli agrees. I can do some of the legwork, too, if you prefer.

The only thing we probably shouldn't build in the sandbox is boost.

asfimport commented 8 years ago

Nong Li / @nongli: @kalaxy @wesm Sounds good to me. I'm not particularly tied to how we make the dev env work as long as it meets the goals which I think we all agree on. If the set up scripts exist in some reasonable form, I'm good with that.

wrt to boost, I think if we upgrade to c++ 11, we might not need boost at all.

asfimport commented 8 years ago

Wes McKinney / @wesm: Thrift requires boost so we are probably stuck with it. I'll wait to see what @kalaxy wants to do with the existing patch and we can go from there.

asfimport commented 8 years ago

Kalon Mills / @kalaxy: I was about say the say thing about boost and thrift.

I'll spend some time today trying to rebase my stuff off of Wes's. Hopefully it isn't too difficult and then I can update my pull request.

asfimport commented 8 years ago

Kalon Mills / @kalaxy: I have a new branch where I rebased my cmake changes from the original pull request off of Wes' changes. I also removed the thirdparty lz4 code and added support for downloading it in the download_thirdparty.sh script. This basically keeps the original sandbox in place while removing and specifics for it from the build system.

For the final commit, I also pulled thrift into the thirdparty setup scripts. But I got to wondering, do you think thrift should be provided by the sandbox scripts? Thrift is pretty finicky to build. So it may to a constant thorn to try and maintain a script that can build and install it on multiple platforms. Thrift also has a fair number of dependencies so it complicates describing how to get a dependencies in order. I wonder what your thoughts are. At a minimum, I can pull the last commit off and we'd be back to the same sandbox capabilities as before but with a cleaner build config.

A couple of other things:

Anyway let me know what you think and if you think this is going in a good direction.

asfimport commented 8 years ago

Wes McKinney / @wesm: I can look into getting gcc-4.9 working on Travis; sorry about the hassle there.

The trouble with installing thirdparty in the build directory is that there may be multiple build directories (in the case of out-of-source builds – this makes branch-switching less painful) – but people doing significant development are probably going to want to set the install locations of all the dependencies outside of the sandbox.

I agree that maintaining a Thrift build recipe may be onerous (see an example one here: https://github.com/cloudera/native-toolchain/blob/master/source/thrift/build.sh). As long as we have a consistent pattern for configuring each library install prefix, then that's already a big improvement.

asfimport commented 8 years ago

Kalon Mills / @kalaxy: No worries about gcc-4.9. I was trying to do too much as once, so it may have been a compounding set of problems.

Ah so that's what native-toolchain is. That's pretty nice. With the changes I've made to the build config it would be easy to set up a build that points to a native-toolchain initialized build dir.

Is it good enough to just have instructions for setting up dependencies through native-toolchain and not maintain build recipes here? Or is it better to have build recipes specific to this repo?

asfimport commented 8 years ago

Kalon Mills / @kalaxy: So it looks like native-toolchain doesn't build individual packages on mac so leveraging that isn't doable for my dev env. However I had another alternative thought. I think the instructions could go something like this.

  1. create a build directory
  2. cd
  3. cp -r /thirdparty .
  4. ./thirdparty/download_thirdparty.sh
  5. ./thirdparty/build_thirdparty.sh
  6. THRIFT_HOME=$PWD/thirdparty/installed SNAPPY_HOME=$PWD/thirdparty/installed LZ4_HOME=$PWD/thirdparty/installed cmake
  7. make

This could also be thrown into a script which takes as a parameter. This puts all of the thirdparty libraries right in the build dir so that build envs can be independent in which libraries they use.

Here we still maintain build recipes but that can evolve over time as support for more envs is requested. (It might also be nice to eventually separate the build recipes like done in native-toolchain.)

asfimport commented 8 years ago

Kalon Mills / @kalaxy: I went ahead and added the instructions above to my branch and reworked the travis ci build to follow those instructions.

I've opened pull request 16 and will close 10 shortly. Let me know what you think.

asfimport commented 8 years ago

Wes McKinney / @wesm: I have a patch out to build a subset of the packages in the native-toolchain; you might have some thoughts: https://github.com/cloudera/native-toolchain/pull/8. Basically looking for the equivalent of requirements.txt (for pip) for C++ development.

We should definitely add instructions for using a custom toolchain like the native-toolchain. Personally, I'm going to have 3 projects sharing a common toolchain:

asfimport commented 8 years ago

Kalon Mills / @kalaxy: Do you mean something more specific then the readme in the pull request? I think the readme is pretty clear about how to guide cmake to find dependencies. It should work quite well with pointing at any toolchain. I'm not really sure what to add that wouldn't be super specific.

asfimport commented 8 years ago

Nong Li / @nongli: Issue resolved by pull request 16 https://github.com/apache/parquet-cpp/pull/16