NVIDIA / tensorflow

An Open Source Machine Learning Framework for Everyone
https://developer.nvidia.com/deep-learning-frameworks
Apache License 2.0

tfcompile fails to build #29

Open jchia opened 3 years ago

jchia commented 3 years ago

System information

Describe the problem

From within a Docker container running nvcr.io/nvidia/tensorflow:21.05-tf1-py3, tfcompile fails to build:

root@48f2340d016b:/opt/tensorflow/tensorflow-source# bazel build --config=opt --config=cuda //tensorflow/compiler/aot:tfcompile
...
INFO: Analysed target //tensorflow/compiler/aot:tfcompile (124 packages loaded, 11185 targets configured).
INFO: Found 1 target...
ERROR: /opt/tensorflow/tensorflow-source/tensorflow/compiler/aot/BUILD:190:1: C++ compilation of rule '//tensorflow/compiler/aot:embedded_protocol_buffers' failed (Exit 1)
tensorflow/compiler/aot/embedded_protocol_buffers.cc: In function ‘xla::StatusOr<std::__cxx11::basic_string<char> > tensorflow::tfcompile::CodegenModule(llvm::TargetMachine*, std::unique_ptr<llvm::Module>)’:
tensorflow/compiler/aot/embedded_protocol_buffers.cc:85:32: error: ‘CGFT_ObjectFile’ is not a member of ‘llvm::TargetMachine’
   85 |           llvm::TargetMachine::CGFT_ObjectFile)) {
      |                                ^~~~~~~~~~~~~~~
Target //tensorflow/compiler/aot:tfcompile failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 14.014s, Critical Path: 9.96s
INFO: 171 processes: 171 local.
FAILED: Build did NOT complete successfully

I have modified .tf_configure.bazelrc, but I think the changes are irrelevant to the failure:

build --action_env PYTHON_BIN_PATH="/usr/bin/python3.8"
build --action_env PYTHON_LIB_PATH="/usr/local/lib/python3.8/dist-packages"
build --python_path="/usr/bin/python3.8"
build:xla --define with_xla_support=true
build --config=xla
build --action_env TF_USE_CCACHE="0"
build --copt=-march=haswell
build:opt --define with_default_optimizations=true
build:v2 --define=tf_api_version=2
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_tag_filters=-benchmark-test,-no_oss,-oss_serial
test --build_tag_filters=-benchmark-test,-no_oss
test --test_tag_filters=-gpu
test --build_tag_filters=-gpu
build --action_env TF_CONFIGURE_IOS="0"

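I believe nvidia-tensorflow carries some LLVM-related changes relative to upstream, but the focus may have been on getting Python TensorFlow working on NVIDIA hardware, with little attention paid to less commonly used parts like tfcompile. The failure looks like tfcompile's code not matching the LLVM version it is built against: embedded_protocol_buffers.cc refers to llvm::TargetMachine::CGFT_ObjectFile, which this LLVM's TargetMachine does not declare. Upstream 1.15.5 builds with no problems.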
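
To gauge how widespread the mismatch is, a small script can list every place that still uses the nested spelling the compiler rejected. This is only a diagnostic sketch. It assumes the bundled LLVM declares the CGFT_* enumerators at llvm namespace scope instead of inside TargetMachine (as far as I know, upstream LLVM made that move around release 10), which has not been confirmed against this container's LLVM. The function name, the default file suffixes, and the example path are hypothetical choices for illustration.

```python
import re
from pathlib import Path

# Nested spelling rejected in the build log above, e.g.
# llvm::TargetMachine::CGFT_ObjectFile. Captures the enumerator name.
NESTED_CGFT = re.compile(r"\bTargetMachine::(CGFT_\w+)")


def find_nested_cgft_refs(root, suffixes=(".cc", ".h")):
    """Return (path, line_number, enumerator) for each nested-style reference.

    Only files under `root` whose suffix is in `suffixes` are scanned.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in NESTED_CGFT.finditer(line):
                hits.append((str(path), lineno, match.group(1)))
    return hits


# Example call (the path is an assumption about the source checkout layout):
#   for path, lineno, name in find_nested_cgft_refs("tensorflow/compiler"):
#       print(f"{path}:{lineno}: TargetMachine::{name}")
```

If the assumption about the bundled LLVM holds, each reported line is a candidate for the same kind of failure as embedded_protocol_buffers.cc:85. Whether the right fix is to change the spelling at those sites or to pin a different LLVM is a separate question this script does not answer.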