FloopCZ / tensorflow_cc

Build and install TensorFlow C++ API library.
MIT License

Linking step fails with undefined symbols. #299

Open CarloWood opened 11 months ago

CarloWood commented 11 months ago

After two hours of compiling, the linking step fails! :(

daniel:~/projects/machine-learning/tensorflow_cc/tensorflow_cc/tensorflow_cc/build>make
[ 12%] Performing build step for 'tensorflow_base'
CUDA support enabled
find: ‘/opt/chroots/linuxviewer20230118/root/var/db/sudo’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/cache/ldconfig’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/cache/private’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/log/audit’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/log/private’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/lib/machines’: Permission denied
find: ‘/opt/chroots/linuxviewer20230118/root/var/lib/portables’: Permission denied
...long list...
find: ‘/usr/local/lost+found’: Permission denied
find: ‘/usr/lost+found’: Permission denied
find: ‘/usr/share/polkit-1/rules.d’: Permission denied
TF_NCCL_VERSION=""   <-- I added these to show that the find doesn't even find anything.
TF_CUDNN_VERSION=""
You have bazel 6.3.2 installed.
Found CUDA 12.2 in:
    /opt/cuda/targets/x86_64-linux/lib
    /opt/cuda/targets/x86_64-linux/include
Found cuDNN 8 in:
    /usr/lib
    /usr/include

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
        --config=mkl            # Build with MKL support.
        --config=mkl_aarch64    # Build with oneDNN and Compute Library for the Arm Architecture (ACL).
        --config=monolithic     # Config for mostly static monolithic build.
        --config=numa           # Build with NUMA support.
        --config=dynamic_kernels        # (Experimental) Build kernels into separate shared objects.
        --config=v1             # Build with TensorFlow 1 API instead of TF 2 API.
Preconfigured Bazel build configs to DISABLE default on features:
        --config=nogcp          # Disable GCP support.
        --config=nonccl         # Disable NVIDIA NCCL support.
Configuration finished

and then

Starting local Bazel server and connecting to it...
WARNING: while reading option defaults file '/usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc':
  invalid command name 'startup:windows'.
WARNING: The following configs were expanded more than once: [cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=145
INFO: Reading rc options for 'build' from /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc:
  'build' options: --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --features=-force_no_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility
INFO: Reading rc options for 'build' from /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.tf_configure.bazelrc:
  'build' options: --action_env PYTHON_BIN_PATH=/usr/bin/python3 --action_env PYTHON_LIB_PATH=/usr/lib/python3.11/site-packages --python_path=/usr/bin/python3 --action_env TF_CUDA_VERSION=12.2 --action_env TF_CUDNN_VERSION= --action_env TF_NCCL_VERSION= --action_env TF_CUDA_PATHS=/opt/cuda-12.2,/opt/cuda,/usr/local/cuda-12.2,/usr/local/cuda,/usr/local,/usr/cuda-12.2,/usr/cuda,/usr --action_env CUDA_TOOLKIT_PATH=/opt/cuda --action_env NCCL_INSTALL_PATH=/usr --action_env TF_CUDA_COMPUTE_CAPABILITIES=sm_52,sm_53,sm_60,sm_61,sm_62,sm_70,sm_72,sm_75,sm_80,sm_86,compute_86 --action_env GCC_HOST_COMPILER_PATH=/usr/bin/gcc-11 --config=cuda
INFO: Found applicable config definition build:short_logs in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:cuda in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
INFO: Found applicable config definition build:opt in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.tf_configure.bazelrc: --copt=-march=haswell --host_copt=-march=haswell
INFO: Found applicable config definition build:monolithic in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --define framework_shared_object=false --define tsl_protobuf_header_only=false --experimental_link_static_libraries_once=false
INFO: Found applicable config definition build:cuda in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
INFO: Found applicable config definition build:linux in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --copt=-Wno-error=unused-but-set-variable --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
WARNING: while reading option defaults file '/usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/.bazelrc':
  invalid command name 'startup:windows'.
WARNING: The following configs were expanded more than once: [cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
INFO: Analyzed 2 targets (471 packages loaded, 38435 targets configured).
INFO: Found 2 targets...

I just ran it again, so everything was already compiled and we go straight to linking again:

ERROR: /usr/src/tensorflow_cc/tensorflow_cc/tensorflow_cc/build/tensorflow/tensorflow/BUILD:1291:21: Linking tensorflow/libtensorflow_cc.so.2.14.0 failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (from target //tensorflow:libtensorflow_cc.so.2.14.0) external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/k8-opt/bin/tensorflow/libtensorflow_cc.so.2.14.0-2.params
/opt/home_carlo/dot_cache/bazel/_bazel_carlo/01a1b20f96784390f57aac7671723885/execroot/org_tensorflow/external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc:44: DeprecationWarning: 'pipes' is deprecated and slated for removal in Python 3.13
  import pipes
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_nextafter_op.pic.lo(gpu_op_next_after.pic.o): in function `tensorflow::(anonymous namespace)::MlirNextAfterGPUDT_FLOATDT_FLOATOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_next_after.cc:(.text._ZN10tensorflow12_GLOBAL__N_134MlirNextAfterGPUDT_FLOATDT_FLOATOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x14): undefined reference to `_mlir_ciface_NextAfter_GPU_DT_FLOAT_DT_FLOAT'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_nextafter_op.pic.lo(gpu_op_next_after.pic.o): in function `tensorflow::(anonymous namespace)::MlirNextAfterGPUDT_DOUBLEDT_DOUBLEOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_next_after.cc:(.text._ZN10tensorflow12_GLOBAL__N_136MlirNextAfterGPUDT_DOUBLEDT_DOUBLEOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x14): undefined reference to `_mlir_ciface_NextAfter_GPU_DT_DOUBLE_DT_DOUBLE'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_nextafter_op.pic.lo(gpu_op_next_after.pic.o): in function `tensorflow::MLIROpKernel<(tensorflow::DataType)1, float, (tensorflow::DataType)1>::Compute(tensorflow::OpKernelContext*)':
gpu_op_next_after.cc:(.text._ZN10tensorflow12MLIROpKernelILNS_8DataTypeE1EfLS1_1EE7ComputeEPNS_15OpKernelContextE[_ZN10tensorflow12MLIROpKernelILNS_8DataTypeE1EfLS1_1EE7ComputeEPNS_15OpKernelContextE]+0x1bc): undefined reference to `_mlir_ciface_NextAfter_GPU_DT_FLOAT_DT_FLOAT'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_nextafter_op.pic.lo(gpu_op_next_after.pic.o): in function `tensorflow::MLIROpKernel<(tensorflow::DataType)2, double, (tensorflow::DataType)2>::Compute(tensorflow::OpKernelContext*)':
gpu_op_next_after.cc:(.text._ZN10tensorflow12MLIROpKernelILNS_8DataTypeE2EdLS1_2EE7ComputeEPNS_15OpKernelContextE[_ZN10tensorflow12MLIROpKernelILNS_8DataTypeE2EdLS1_2EE7ComputeEPNS_15OpKernelContextE]+0x1bc): undefined reference to `_mlir_ciface_NextAfter_GPU_DT_DOUBLE_DT_DOUBLE'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_relu_op.pic.lo(gpu_op_elu.pic.o): in function `tensorflow::(anonymous namespace)::MlirEluGPUDT_HALFDT_HALFOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_elu.cc:(.text._ZN10tensorflow12_GLOBAL__N_126MlirEluGPUDT_HALFDT_HALFOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x10): undefined reference to `_mlir_ciface_Elu_GPU_DT_HALF_DT_HALF'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_relu_op.pic.lo(gpu_op_elu.pic.o): in function `tensorflow::(anonymous namespace)::MlirEluGPUDT_FLOATDT_FLOATOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_elu.cc:(.text._ZN10tensorflow12_GLOBAL__N_128MlirEluGPUDT_FLOATDT_FLOATOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x10): undefined reference to `_mlir_ciface_Elu_GPU_DT_FLOAT_DT_FLOAT'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_relu_op.pic.lo(gpu_op_elu.pic.o): in function `tensorflow::(anonymous namespace)::MlirEluGPUDT_DOUBLEDT_DOUBLEOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_elu.cc:(.text._ZN10tensorflow12_GLOBAL__N_130MlirEluGPUDT_DOUBLEDT_DOUBLEOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x10): undefined reference to `_mlir_ciface_Elu_GPU_DT_DOUBLE_DT_DOUBLE'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_relu_op.pic.lo(gpu_op_elu.pic.o): in function `tensorflow::MLIROpKernel<(tensorflow::DataType)19, Eigen::half, (tensorflow::DataType)19>::Compute(tensorflow::OpKernelContext*)':
gpu_op_elu.cc:(.text._ZN10tensorflow12MLIROpKernelILNS_8DataTypeE19EN5Eigen4halfELS1_19EE7ComputeEPNS_15OpKernelContextE[_ZN10tensorflow12MLIROpKernelILNS_8DataTypeE19EN5Eigen4halfELS1_19EE7ComputeEPNS_15OpKernelContextE]+0x1b8): undefined reference to `_mlir_ciface_Elu_GPU_DT_HALF_DT_HALF'
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/libgpu_relu_op.pic.lo(gpu_op_relu.pic.o): in function `tensorflow::(anonymous namespace)::MlirReluGPUDT_HALFDT_HALFOp::Invoke(tensorflow::OpKernelContext*, llvm::SmallVectorImpl<tensorflow::UnrankedMemRef>&)':
gpu_op_relu.cc:(.text._ZN10tensorflow12_GLOBAL__N_127MlirReluGPUDT_HALFDT_HALFOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x10): undefined reference to `_mlir_ciface_Relu_GPU_DT_HALF_DT_HALF'
... and so on (very very long list)...
gpu_op_zeta.cc:(.text._ZN10tensorflow12_GLOBAL__N_131MlirZetaGPUDT_DOUBLEDT_DOUBLEOp6InvokeEPNS_15OpKernelContextERN4llvm15SmallVectorImplINS_14UnrankedMemRefEEE+0x14): undefined reference to `_mlir_ciface_Zeta_GPU_DT_DOUBLE_DT_DOUBLE'
collect2: error: ld returned 1 exit status
INFO: Elapsed time: 60.283s, Critical Path: 52.06s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
make[2]: *** [CMakeFiles/tensorflow_base.dir/build.make:87: tensorflow-stamp/tensorflow_base-build] Error 1
make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/tensorflow_base.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

Can you please give me a hint, or ask me to test something?

Note that I made the following change:

diff --git a/tensorflow_cc/PROJECT_VERSION b/tensorflow_cc/PROJECT_VERSION
index c8e38b6..edcfe40 100644
--- a/tensorflow_cc/PROJECT_VERSION
+++ b/tensorflow_cc/PROJECT_VERSION
@@ -1 +1 @@
-2.9.0
+2.14.0

This is the only thing I changed.

CarloWood commented 11 months ago

All 632 (unique) symbols that are undefined start with _mlir_ciface_*.

CarloWood commented 11 months ago

All 649 error lines containing 'undefined reference to' are of the following form:

^gpu_op_[a-z0-9_]*\.cc:(\.text\._Z[^+]*+0x[0-9a-f]*): undefined reference to `_mlir_ciface_[A-Za-z0-9_]*.$

This shows that all undefined references come from files named like gpu_op_[a-z0-9_]*\.cc, all of which exist exclusively in build/tensorflow/tensorflow/core/kernels/mlir_generated/.

196 of the errors come from gpu_op_cast.cc (the runner-up is gpu_op_relu.cc with 17 errors).

The only files with a single error are gpu_op_logical_and.cc, gpu_op_logical_not.cc and gpu_op_logical_or.cc. Each of these three files uses GENERATE_BINARY_GPU_KERNEL and REGISTER_GPU_KERNEL_NO_TYPE_CONSTRAINT exactly once.

From this it seems that either GENERATE_BINARY_GPU_KERNEL and GENERATE_UNARY_GPU_KERNEL, or REGISTER_GPU_KERNEL_NO_TYPE_CONSTRAINT, produce an error.

The files that generate two errors are: gpu_op_angle.cc, gpu_op_complex_abs.cc, gpu_op_complex.cc, gpu_op_conj.cc, gpu_op_imag.cc, gpu_op_polygamma.cc, gpu_op_real.cc and gpu_op_zeta.cc.

From this it seems that REGISTER_COMPLEX_GPU_KERNEL, GENERATE_AND_REGISTER_UNARY_GPU_KERNEL and GENERATE_AND_REGISTER_BINARY_GPU_KERNEL each produce an error.

To make a long story short, it seems that the problem comes from the use of macros that use the macro MLIR_FUNCTION defined in tensorflow/tensorflow/core/kernels/mlir_generated/base_op.h:

#define MLIR_FUNCTION(tf_op, platform, input_type, output_type) \
  _mlir_ciface_##tf_op##_##platform##_##input_type##_##output_type

and in particular GENERATE_UNARY_KERNEL3, GENERATE_BINARY_KERNEL3 and GENERATE_TERNARY_KERNEL3, which are more or less similar, so let's just look at one:

#define GENERATE_UNARY_KERNEL3(tf_op, platform, input_type, output_type, casted_input_type, casted_output_type)

which produces code like (I did some formatting):

extern "C" void MLIR_FUNCTION(tf_op, platform, input_type, output_type)  // <-- Undefined reference.
    (UnrankedMemRef* result, OpKernelContext* ctx, UnrankedMemRef* arg);

namespace {

class MLIR_OP(tf_op, platform, casted_input_type, casted_output_type) :
    public MLIROpKernel<output_type, typename EnumToDataType<output_type>::Type, casted_output_type>
{
 public:
  using MLIROpKernel::MLIROpKernel;

  UnrankedMemRef Invoke(OpKernelContext* ctx, llvm::SmallVectorImpl<UnrankedMemRef>& args) override
  {
    UnrankedMemRef result;
    MLIR_FUNCTION(tf_op, platform, input_type, output_type)(&result, ctx, &args[0]);  // <-- Undefined reference.
    return result;
  }
};

} // namespace

CarloWood commented 11 months ago

I found out it is an upstream problem. As of 2.14 they aren't linking with the (634 generated) bazel-out/k8-opt/bin/tensorflow/core/kernels/mlir_generated/lib*_kernel_generator.pic.a archives.

CarloWood commented 10 months ago

If you use bazel 6.1.0 it works. Then something else breaks, but this is a monologue anyway. Goodbye.