apache / incubator-heron

Apache Heron (Incubating) is a realtime, distributed, fault-tolerant stream processing engine from Twitter
https://heron.apache.org/
Apache License 2.0
3.65k stars 598 forks source link

Adding Bazel Platforms support #3779

Closed nicknezis closed 2 years ago

nicknezis commented 2 years ago

Some build script cleanup leveraging the Bazel platforms feature.

https://github.com/aosp-riscv/platform-build-bazel

Context below:

While working through a bug we ran into with MacOS failing to compile, I got a better understanding of some of the Bazel settings. In another thread with Josh, who is compiling on the Apple M1 for the first time, he ran into issues with Linux vs OSX.

It dawned on me that we were explicitly setting the config to modify a number of settings. Some of them include the following items:

  1. Platform libraries (Linux C libraries)
  2. Platform + CPU decision on which binary to use (Helm binary)
  3. OS decision (MacOS unique steps vs Linux as the default)
  4. Optimization setting (We just hardcoded to C03, but Bazel reported as "fastbuild")
  5. Stylecheck toggle (darwin vs darwin_nostyle)

In a recent PR, I collapsed all of the Linux flavors to a standard --config=linux. This was born out of my frustration with the build scripts not jiving with Bazel standards. But it was not actually solving the problem. I'm hoping that this PR will help resolve some of the issues. 

Changes: 

  1. No longer need to set --config. Bazel will detect the OS automatically.
  2. Default is fastbuild, but you can add -c opt to build optimized build.  a. You can even build both side by side with Bazel making a folder like darwin-fastbuild and darwin-opt3. If you want to run stylecheck, you can add --config=stylecheck
  3. The Linux libraries (-lm, -lpthread, -lrt) are included in tools/platform/BUILD
  4. As we add support for ARM, we can add more options in tools/platform/BUILD6. Scripts outside of Bazel were updated to remove --config.
  5. Scripts outside of Bazel will specify  -c opt or --config=stylecheck when needed.
nicknezis commented 2 years ago

I decided to remove the Travis CI special settings because some of them were outdated settings which were needed to use a modern C++ compiler. We are now on a more modern version of Ubuntu, so it's not needed anymore. Also With the other .bazelrc cleanup, I thought it would be a good opportunity to clean it up.

One of the benefits I see is that the build completed in 1 hr 45 min. This is down from the nearly 3 hour long builds. I'm not sure what the difference was. One theory is maybe it was due to us re-running the style check on each stage of the process (build, non-flaky unit test, flaky unit test). Another theory is that by limiting the memory used and also the jobs to 25, maybe that slowed things down. But I'm taking this opportunity to test TravisCI and it seems like it's working.

surahman commented 2 years ago

One of the benefits I see is that the build completed in 1 hr 45 min.

Yes, this is awesome! 🥳

2. Default is fastbuild, but you can add -c opt to build optimized build.

We should document this in the build script as well as the online documentation?

@joshfischer1108 has this alleviated your build issues on Apple Silicone?

nicknezis commented 2 years ago

We should document this in the build script as well as the online documentation?

Updated the Compiling overview, compiling linux and compiling MacOS pages in the docs. Also found some other issues with old version of Bazel being referenced (0.26). I updated our docs to use Bazelisk, which automatically manages the verzion of Bazel. This will save us from having to update the docs each time we bump the version. This was especially annoying each time because of the old versioned docs in the repo that should not be updated.

nicknezis commented 2 years ago

Some other bazel.rc related cleanup. I removed the tools/docker/bazel.rc file because it was no longer needed. It was limiting the cpu/memory of Bazel when running in Docker. This was primarily due to an issue with JDK locking up due to memory bloat. This is no longer an issue with JDK 11, so we don't need it anymore. Removed the file and removed references to it in the Docker scripts.

Here is an example linked off of the Bazel issue in which they document the fact that JDK 11 resolved many Docker related issues. https://github.com/bazelbuild/bazel/issues/6592

nicknezis commented 2 years ago

I also removed mention of Applatix CI scripts. Applatix was a Kubernetes related hosting platform which was acquired by Intuit in 2018. They seemed to be a copy of our other CI scripts, but were not used by Heron and were not maintained. So they are now purged.

surahman commented 2 years ago

I was just looking through the build logs and I am not sure if I am interpreting what I am seeing correctly but the only set of tests that seem to be running are the integration tests.

#10424 passed ```bash =========================================================== heron build 0:59:59 heron test non-flaky 0:04:48 heron test flaky 0:01:18 heron build tarpkgs 0:01:10 heron build binpkgs 0:00:37 heron build docker images 0:01:12 ===> Finished scripts/travis/build.sh at 2022-02-27 22:33:10 (1:09:06) ===> Starting scripts/travis/test.sh at 2022-02-27 22:33:14 ===> Starting heron build integration_test at 2022-02-27 22:33:14 bazel build integration_test/src/... > heron_build_integration_test.txt 64 seconds 30 log lines `bazel build integration_test/src/...` finished with errcode: 0 ===> Finished heron build integration_test at 2022-02-27 22:34:18 (0:01:04) ===> Starting heron install at 2022-02-27 22:34:18 bazel run -- scripts/packages:heron-install.sh --user > heron_install.txt 18 seconds 34 log lines `bazel run -- scripts/packages:heron-install.sh --user` finished with errcode: 0 ===> Finished heron install at 2022-02-27 22:34:37 (0:00:19) ===> Starting heron tests install at 2022-02-27 22:34:37 bazel run -- scripts/packages:heron-tests-install.sh --user > heron_tests_install.txt 6 seconds 34 log lines `bazel run -- scripts/packages:heron-tests-install.sh --user` finished with errcode: 0 ===> Finished heron tests install at 2022-02-27 22:34:43 (0:00:06) ===> Starting heron integration_test local at 2022-02-27 22:34:43 ``` ```bash ===> Finished heron integration_topology_test java at 2022-02-27 23:12:55 (0:04:36) ===> Task duration summary for scripts/travis/test.sh =========================================================== heron build integration_test 0:01:04 heron install 0:00:19 heron tests install 0:00:06 heron integration_test local 0:05:55 heron integration_test http-server initialization 0:00:00 heron integration_test scala 0:02:45 heron integration_test java 0:18:30 heron integration_test python 0:06:25 heron integration_topology_test java 0:04:36 ===> Finished scripts/travis/test.sh at 2022-02-27 23:12:55 (0:39:41) ===> Task duration summary for scripts/travis/ci.sh =========================================================== scripts/travis/build.sh 1:09:06 scripts/travis/test.sh 0:39:41 ```

Are the "regular"/non-integration battery of tests still run? I may have missed the output in the logs. I remember heron test taking at least 30mins before and logs indicate heron test flaky/non-flaky are taking a combined ~6mins.

nicknezis commented 2 years ago

Are the "regular"/non-integration battery of tests still run? I may have missed the output in the logs. I remember heron test taking at least 30mins before and logs indicate heron test flaky/non-flaky are taking a combined ~6mins.

Interesting. I see them mentioned, but it's hard to tell with the way to log output is hidden away. I'll do some local testing.

surahman commented 2 years ago

I think everything is okay with the tests. I did not see anything in the diff which would turn it off and this PR run log has the same sort of output without bazel test being specifically noted.

surahman commented 2 years ago

Great work on this PR, there are some really important changes. I know some of these changes are cascading and build on/require each other but it would be better to introduce them as multiple small PRs. This way if there is an issue with some of the changes, we can revert the culprits without losing all changes.

I would get @nwangtw to once over and approve the changes before merging. It looks good to me. I have a few comments/questions - none of which would hold anything up on my end.

It would be nice if @joshfischer1108, or someone else with an Apple Silicone machine, could let us know if they are encountering any other build issues.

thinker0 commented 2 years ago
# BigSur-11.6.5 (20G527)
%  clang -v
Apple clang version 12.0.5 (clang-1205.0.22.11)
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

% ./docker/scripts/test-unittest.sh darwin 0.20.5
================================================================================
INFO: Elapsed time: 954.681s, Critical Path: 305.35s
INFO: 3328 processes: 1144 internal, 2184 local.
INFO: Build completed successfully, 3328 total actions
Test cases: finished with 1189 passing and 0 failing out of 1189 test cases

Executed 247 out of 247 tests: 247 tests pass.
INFO: Build completed successfully, 3328 total actions
Cleaning up scratch dir
windhamwong commented 2 years ago

Just tested on the ubuntu22.04, and found that, If docker version is lower than (<20.10.10), the apt update will fail. Should we mention in somewhere or in doc to remind people to upgrade?

nicknezis commented 2 years ago

@windhamwong I added a comment above the apt update for both the compile Dockerfile.ubuntu22.04 and the dist Dockerfile.dist.ubuntu22.04.

thinker0 commented 2 years ago
% ./docker/scripts/test-unittest.sh centos7 0.20.5
================================================================================
INFO: Elapsed time: 1583.309s, Critical Path: 234.97s
INFO: 3608 processes: 1160 internal, 2448 local.
INFO: Build completed successfully, 3608 total actions
Test cases: finished with 1189 passing and 0 failing out of 1189 test cases

Executed 247 out of 247 tests: 247 tests pass.
INFO: Build completed successfully, 3608 total actions
Cleaning up scratch dir
thinker0 commented 2 years ago
-ADD bazelrc /root/.bazelrc
./docker/scripts/test-unittest.sh rocky8 0.20.5
================================================================================
INFO: Elapsed time: 1673.469s, Critical Path: 259.39s
INFO: 3608 processes: 1160 internal, 2448 local.
INFO: Build completed successfully, 3608 total actions
Test cases: finished with 1189 passing and 0 failing out of 1189 test cases

Executed 247 out of 247 tests: 247 tests pass.
INFO: Build completed successfully, 3608 total actions
Cleaning up scratch dir
thinker0 commented 2 years ago

+1

Hmm I don't have a Reviewer, so I don't have an Approved button.

joshfischer1108 commented 2 years ago

+1

Hmm I don't have a Reviewer, so I don't have an Approved button.

We need to get you added to the Apache Foundation Github Org as a contributor. Have you done the necessary steps to get this enabled after you were given committer access to the repo?

joshfischer1108 commented 2 years ago

Great work on this PR, there are some really important changes. I know some of these changes are cascading and build on/require each other but it would be better to introduce them as multiple small PRs. This way if there is an issue with some of the changes, we can revert the culprits without losing all changes.

I would get @nwangtw to once over and approve the changes before merging. It looks good to me. I have a few comments/questions - none of which would hold anything up on my end.

It would be nice if @joshfischer1108, or someone else with an Apple Silicone machine, could let us know if they are encountering any other build issues.

I will attempt to build this weekend.

thinker0 commented 2 years ago

https://github.com/apache/incubator-heron/pull/3779#issuecomment-1079049387

Yes, I Want

surahman commented 2 years ago

I will attempt to build this weekend.

Thank you, it would be great to know if all is well on Apple Silicone. Choi Se confirmed everything is building on Apple x86_64.

We need to get you added to the Apache Foundation Github Org as a contributor. Have you done the necessary steps to get this enabled after you were given committer access to the repo?

Let us get this set up so he can merge PRs.

nicknezis commented 2 years ago

I think this is now ready to merge. For the Apple M1 support. I think this currently won't yet support it. But primarily because we need to add some missing ARM dependencies. For example the helm binary is currently x86. But we can add M1 support in a future PR. This PR definitely gets us much closer to being able to support it.

surahman commented 2 years ago

I think this is now ready to merge.

Completely agree and Apple Silicon support should not block this. It has a lot of very broad and required changes.

surahman commented 2 years ago

Resolved a silly merge conflict in the WORKSPACE related to extra whitespace.