alibaba / BladeDISC

BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
Apache License 2.0

Build all with Bazel. #196

Open qiuxiafei opened 2 years ago

qiuxiafei commented 2 years ago

Background

When open-sourcing pytorch_blade and tensorflow_blade, BladeDISC's project structure became more complex. We already have the following essential directories:

Our Goal and Current Status

As we've discussed many times, we're moving to a monorepo holding both the open-source and internal code, using Bazel to build everything. Making everything in this repo buildable with Bazel would make the dependency structure explicit, standard, and clean, and help new developers ramp up smoothly. Ideally, one could run a universal preparation script once with the necessary arguments, and then bazel build or bazel test any target of any component.

But there are some obstacles in the way:

  1. Components are built independently, even with different build tools. For example:
    • tao is built with CMake, while tao_compiler is built with Bazel. It's worth noting that the RAL code is built on both sides.
    • pytorch_blade is built with Bazel wrapped by Python setuptools, while tensorflow_blade has Bazel calling setuptools.
  2. Components usually have their own build scripts, shell or Python, which cut off the Bazel dependency chain.
  3. Free-style preparation steps in these scripts make Bazelization even harder:

Approaches

1. Build tao with Bazel.

The tao directory is currently built with CMake. Converting CMake to Bazel is non-trivial but still possible. The RAL code, however, complicates things, because it is built on both the bridge side and the compiler side: with CMake on the bridge side, and with Bazel on the compiler side under the tf_community directory. The BUILD file of the RAL code loads tf_community's rules, which won't be available on the bridge side, because the bridge only has the include files and shared libraries of a given host TensorFlow. https://github.com/alibaba/BladeDISC/blob/a0d60f9f258052c13f1e45f365e451075a7db937/tao_compiler/mlir/xla/ral/BUILD#L3-L5 There may be several solutions:

  1. Make that BUILD file neutral, loading only standard Bazel rules, so it can be used on both the bridge and compiler sides. The same source files can be compiled into different targets for each side. It's also possible to set up an option specifying which side is being built, and use select to switch between dependencies from tf_community and the host TensorFlow (see the sketch after this list).
  2. Only expose filegroup targets from the RAL directory, and let each side write its own cc_library target in a BUILD file under its own directory.
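
For the select-based option, here is a minimal BUILD sketch; the config_setting, the @local_tensorflow repository, and all target names are illustrative assumptions, not existing BladeDISC targets:

```python
# Hypothetical RAL BUILD file that loads only standard (native) Bazel rules.
# Building with --define=ral_side=bridge picks the host-TensorFlow deps;
# the default picks the tf_community deps on the compiler side.
config_setting(
    name = "build_for_bridge",
    define_values = {"ral_side": "bridge"},
)

cc_library(
    name = "ral",
    srcs = glob(["*.cc"]),
    hdrs = glob(["*.h"]),
    deps = select({
        ":build_for_bridge": ["@local_tensorflow//:headers_and_libs"],  # assumed host-TF repo
        "//conditions:default": ["//tensorflow/core:framework"],
    }),
)
```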

2. Extend common_setup.py

If tao is built with Bazel, all DISC components could expose Bazel targets (whether or not they live in a single workspace)! Upper-level components like pytorch_blade and tf_blade could reference those targets and move on with their own builds.

common_setup.py does the preparations before a DISC build, such as symbolic linking and OneDNN installation. So when building any component that depends on DISC, common_setup.py should be called in advance: https://github.com/alibaba/BladeDISC/blob/70ecc07449ade0450fbd0f0f58494c38e683c5b3/pytorch_blade/ci_build/build_pytorch_blade.sh#L45 If we extend common_setup.py a little, setting environment variables in build_pytorch_blade.py, pytorch_blade will be free from extra build scripts (for the relationship between Python setuptools and Bazel, see the open questions). If so, why not just make common_setup.py a global setup step for the whole project, like the configure script in TensorFlow?

3. Make DISC a Bazel workspace outside tf_community

We have quite a few Bazel workspaces now; from an architecture point of view, it's natural to have a single Bazel workspace for all of tao_compiler/tao/mhlo, which together make up DISC. Pulling tao_compiler out of tf_community's workspace is the key to achieving this. Admittedly it's not a very urgent task, and we may hit challenges if tao_compiler references many tf_community-internal targets. IREE has done similar work; maybe that can help. A minimal sketch of such a layout is below.
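
A hypothetical WORKSPACE for a standalone DISC workspace; the workspace name and the sibling-checkout path are assumptions:

```python
# Hypothetical top-level WORKSPACE: tf_community becomes an external
# repository instead of the enclosing workspace.
workspace(name = "org_disc")  # illustrative name

local_repository(
    name = "org_tensorflow",   # how tao_compiler targets would reference tf_community
    path = "../tf_community",  # assumes tf_community is checked out as a sibling directory
)
```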

These are just my own early thoughts; your comments please ~

Open Questions

  1. What's the relationship between Python setuptools and Bazel? pytorch_blade is built with Bazel wrapped by Python setuptools, while tensorflow_blade has Bazel calling setuptools.
Orion34-lanbo commented 2 years ago

Regarding one of our concerns, that we may need to change the current visibility of Bazel targets, I did some research on the RAL part.

RAL's dependencies

As we can find in RAL's BUILD, the following targets are needed (test dependencies like //tensorflow/core:test_main are excluded):

        "//tensorflow/core:framework",
        "//tensorflow/core:lib",
        "//tensorflow/core:protos_all_cc",
        "//tensorflow/core:stream_executor_headers_lib",
        "//tensorflow/stream_executor",
        "//tensorflow/stream_executor/cuda:cuda_platform",
        "//tensorflow/stream_executor:cuda_platform",
        "//tensorflow/stream_executor:rocm_platform",
        "//tensorflow/stream_executor/rocm:rocm_driver",
        "//tensorflow/stream_executor/rocm:rocm_platform",

In the current tf community code, only //tensorflow/stream_executor/rocm:rocm_driver is not publicly visible. However, //tensorflow/stream_executor:rocm_platform depends on it and is a public target we already depend on, so we can replace the dependency on //tensorflow/stream_executor/rocm:rocm_driver with //tensorflow/stream_executor:rocm_platform.
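Concretely, the change in RAL's BUILD would look roughly like the sketch below; the target name is a hypothetical placeholder:

```python
# Illustrative excerpt of a RAL BUILD target after the swap: the private
# //tensorflow/stream_executor/rocm:rocm_driver dep is dropped in favor of the
# public rocm_platform target, which pulls in rocm_driver transitively.
cc_library(
    name = "ral_rocm_backend",  # hypothetical target name
    srcs = ["ral_rocm_backend.cc"],
    deps = [
        "//tensorflow/stream_executor:rocm_platform",
    ],
)
```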

As we can see from tf_community/tensorflow/stream_executor/rocm/BUILD:

cc_library(
    name = "rocm_platform",
    srcs = if_rocm_is_configured(["rocm_platform.cc"]),
    hdrs = if_rocm_is_configured(["rocm_platform.h"]),
    visibility = ["//visibility:public"],
    deps = if_rocm_is_configured([
        ":rocm_driver",
        # ... (remaining deps omitted)
    ]),
)

tao_compiler dependencies

As for the tao_compiler part, most of the dependencies come from the tensorflow/compiler/mlir/hlo directory, and all targets in that directory are public.

However, the targets under tensorflow/compiler/xla are only visible to their friends package group. This is the part where we may need to change visibility. For now, carrying a visibility-change patch is necessary, or we can search for a public target that wraps the friends-visible targets. A sketch of such a patch is below.
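
A visibility patch could be as small as extending the friends package group inside tf_community; the file location follows upstream TensorFlow, and the added package entry is an assumption for illustration:

```python
# Sketch of a patched package_group in tensorflow/compiler/xla/BUILD.
# Adding our package makes the xla targets visible to tao_compiler; the
# "//tao_compiler/..." entry is hypothetical, not upstream code.
package_group(
    name = "friends",
    packages = [
        "//tensorflow/compiler/...",
        "//tao_compiler/...",  # added for BladeDISC
    ],
)
```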

Orion34-lanbo commented 2 years ago

As we did when building the tao bridge with Bazel for the cu110 device (#231), we can successfully build tao_bridge for tensorflow-gpu versions. However, for CPU, and even ARM CPU (aka aarch64), we have hit a problem with mkldnn and acl. Currently we download and build mkldnn, acl, and related pieces in common_setup.py: https://github.com/alibaba/BladeDISC/blob/d5f085b099a7cfa3dbcdd83d125fcd6c211d69ec/scripts/python/common_setup.py#L257-L313

After this, the built mkldnn and acl are used by tao_bridge (via CMake) and tao_compiler (via Bazel). When used in tao_compiler, the tao directory is linked under tf_community.

When trying to support a Bazel build for mkldnn, the newly added mkldnn Bazel rules under third_party/bazel/mkldnn cannot be used in the tf_community directory without patching code in tf_community. So for now, as long as we only support a Bazel build for the tao_bridge part, the download-and-compile step for mkldnn will not be removed. However, once tao_compiler becomes a standalone Bazel workspace, we can use our own mkldnn Bazel rules without the extra build actions in common_setup.py (see the WORKSPACE sketch below). We have the following actions to follow:

Items 1 and 2 are ongoing actions.
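
Once tao_compiler is a standalone workspace, the mkldnn rules could be wired in directly. A hypothetical WORKSPACE excerpt; the repository name, BUILD-file name, URL, version, and sha256 are all placeholders:

```python
# Hypothetical WORKSPACE excerpt: fetch oneDNN (mkldnn) with http_archive and
# attach our own BUILD file from third_party/bazel/mkldnn, replacing the
# download-and-build step in common_setup.py.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "mkl_dnn",
    build_file = "//third_party/bazel/mkldnn:mkldnn.BUILD",  # assumed file name
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000",  # placeholder
    strip_prefix = "oneDNN-2.7",  # placeholder version
    urls = ["https://github.com/oneapi-src/oneDNN/archive/refs/tags/v2.7.tar.gz"],
)
```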

Orion34-lanbo commented 2 years ago

Update: In sprint 2204, we completed the work of building tao_bridge with Bazel, for both the internal and open-source builds. The internal version also uses the open-source tao_build.py now. As of now, all of DISC's targets can be built by Bazel. The remaining work items are as follows:

The last two items should be somewhat longer-term work, since the current Bazel build from multiple workspaces works fine for now. However, our final goal is still to make the entire repo build in one large workspace.

qiuxiafei commented 2 years ago

To refine the project structure of tensorflow_blade

  1. Currently, tensorflow_blade is fully managed by Bazel, while in pytorch_blade Bazel only controls the C++ code. After some discussion, we decided to let tensorflow_blade follow pytorch_blade's style. This will help free the Python project from miscellaneous Bazel details. For example:
    1. Python developers usually treat a whole directory as a module instead of using fine-grained Bazel targets.
    2. Making a .whl package is also not straightforward in Bazel, since it requires developers to specify dependencies carefully to make sure they're included in the final package, to write a wrapper shell script, and to make that script a Bazel target (see the sketch after this list). All of this is simple and easy in plain Python.
  2. Make the internal part of tensorflow_blade independent from the PAI-Blade project and build it on top of the public tensorflow_blade, just like torch_blade does. This will also help make PAI-Blade a pure Python project (free from Bazel, too).
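
For contrast, a minimal sketch of what wheel packaging looks like under Bazel, assuming rules_python is set up; the target and distribution names are illustrative, not actual tensorflow_blade targets:

```python
# Sketch of .whl packaging via rules_python: every dependency must be listed
# explicitly, and anything missing from the deps chain is silently left out of
# the final wheel, which is the friction described above.
load("@rules_python//python:defs.bzl", "py_library")
load("@rules_python//python:packaging.bzl", "py_package", "py_wheel")

py_library(
    name = "tf_blade_lib",  # illustrative
    srcs = glob(["tf_blade/**/*.py"]),
)

py_package(
    name = "tf_blade_pkg",
    deps = [":tf_blade_lib"],
)

py_wheel(
    name = "tf_blade_wheel",
    distribution = "tensorflow_blade",  # illustrative distribution name
    version = "0.0.1",
    deps = [":tf_blade_pkg"],
)
```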

Tasks:

update 2022-06-27: