NVIDIA / tensorflow

An Open Source Machine Learning Framework for Everyone
https://developer.nvidia.com/deep-learning-frameworks
Apache License 2.0

Building CUDA 12.1 (nv23.02) from source fails #84

Open bergentruckung opened 1 year ago

bergentruckung commented 1 year ago

We're trying to build TensorFlow against CUDA 12.1 from source (from HEAD of r1.15.5+nv23.02) on both RHEL 7 and RHEL 8 (kernels 3.10.0-1160.88.1 and 4.18.0-425.13.1 respectively), but both builds fail at the same place towards the end.

CUDNN: 8.8.1.3
NCCL: 2.17.1
bazel: 0.25.3 (and a bunch of others as well)
gcc: 12.1.1 (from devtoolset-12)
Python: 3.10

Relevant trace:

ERROR: /local/apps/bergentruckung/nv-tensorflow/tensorflow/contrib/ignite/BUILD:146:21: Linking of rule '//tensorflow/contrib/ignite:gen_gen_dataset_ops_py_wrappers_cc' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/host/bin/tensorflow/contrib/ignite/gen_gen_dataset_ops_py_wrappers_cc-2.params
bazel-out/host/bin/tensorflow/python/libpython_op_gen_main.a(python_op_gen_main.o): In function `tensorflow::(anonymous namespace)::ReadOpListFromFile(std::string const&, std::vector<std::string, std::allocator<std::string> >*)':
python_op_gen_main.cc:(.text._ZN10tensorflow12_GLOBAL__N_118ReadOpListFromFileERKSsPSt6vectorISsSaISsEE+0xa3): undefined reference to `tensorflow::io::InputBuffer::InputBuffer(tensorflow::RandomAccessFile*, unsigned long)'
python_op_gen_main.cc:(.text._ZN10tensorflow12_GLOBAL__N_118ReadOpListFromFileERKSsPSt6vectorISsSaISsEE+0xd8): undefined reference to `tensorflow::io::InputBuffer::ReadLine(std::string*)'
python_op_gen_main.cc:(.text._ZN10tensorflow12_GLOBAL__N_118ReadOpListFromFileERKSsPSt6vectorISsSaISsEE+0x19f): undefined reference to `tensorflow::io::InputBuffer::ReadLine(std::string*)'
python_op_gen_main.cc:(.text._ZN10tensorflow12_GLOBAL__N_118ReadOpListFromFileERKSsPSt6vectorISsSaISsEE+0x330): undefined reference to `tensorflow::io::InputBuffer::~InputBuffer()'
python_op_gen_main.cc:(.text._ZN10tensorflow12_GLOBAL__N_118ReadOpListFromFileERKSsPSt6vectorISsSaISsEE+0x44d): undefined reference to `tensorflow::io::InputBuffer::~InputBuffer()'
bazel-out/host/bin/tensorflow/python/libpython_op_gen.lo(python_op_gen.o): In function `tensorflow::(anonymous namespace)::GenEagerPythonOp::Code()':
python_op_gen.cc:(.text._ZN10tensorflow12_GLOBAL__N_116GenEagerPythonOp4CodeEv+0x2d1): undefined reference to `tensorflow::_ApiDef_Attr_default_instance_'                                                                                                                                                                   
bazel-out/host/bin/tensorflow/python/libpython_op_gen.lo(python_op_gen_internal.o): In function `tensorflow::python_op_gen_internal::GenPythonOp::AddDocStringAttrs()':                                                                                                                                                      
python_op_gen_internal.cc:(.text._ZN10tensorflow22python_op_gen_internal11GenPythonOp17AddDocStringAttrsEv+0x170): undefined reference to `tensorflow::_ApiDef_Attr_default_instance_'                                                                                                                                       
python_op_gen_internal.cc:(.text._ZN10tensorflow22python_op_gen_internal11GenPythonOp17AddDocStringAttrsEv+0x3f1): undefined reference to `tensorflow::_ApiDef_Attr_default_instance_'                                                                                                                                       
bazel-out/host/bin/tensorflow/python/libpython_op_gen.lo(python_op_gen_internal.o): In function `tensorflow::python_op_gen_internal::GenPythonOp::Code()':                                                                                                                                                                   
python_op_gen_internal.cc:(.text._ZN10tensorflow22python_op_gen_internal11GenPythonOp4CodeEv+0x203): undefined reference to `tensorflow::_ApiDef_Attr_default_instance_'                                                                                                                                                     
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `std::__detail::_Map_base<std::string, std::pair<std::string const, tensorflow::ApiDef>, std::allocator<std::pair<std::string const, tensorflow::ApiDef> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)':
op_gen_lib.cc:(.text._ZNSt8__detail9_Map_baseISsSt4pairIKSsN10tensorflow6ApiDefEESaIS5_ENS_10_Select1stESt8equal_toISsESt4hashISsENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS2_[_ZNSt8__detail9_Map_baseISsSt4pairIKSsN10tensorflow6ApiDefEESaIS5_ENS_10_Select1stESt8equal_toISsESt4hashISsENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS2_]+0x98): undefined reference to `tensorflow::ApiDef::ApiDef()'
op_gen_lib.cc:(.text._ZNSt8__detail9_Map_baseISsSt4pairIKSsN10tensorflow6ApiDefEESaIS5_ENS_10_Select1stESt8equal_toISsESt4hashISsENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS2_[_ZNSt8__detail9_Map_baseISsSt4pairIKSsN10tensorflow6ApiDefEESaIS5_ENS_10_Select1stESt8equal_toISsESt4hashISsENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS2_]+0x13e): undefined reference to `tensorflow::ApiDef::~ApiDef()'
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `std::_Hashtable<std::string, std::pair<std::string const, tensorflow::ApiDef>, std::allocator<std::pair<std::string const, tensorflow::ApiDef> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()':
op_gen_lib.cc:(.text._ZNSt10_HashtableISsSt4pairIKSsN10tensorflow6ApiDefEESaIS4_ENSt8__detail10_Select1stESt8equal_toISsESt4hashISsENS6_18_Mod_range_hashingENS6_20_Default_ranged_hashENS6_20_Prime_rehash_policyENS6_17_Hashtable_traitsILb1ELb0ELb1EEEED2Ev[_ZNSt10_HashtableISsSt4pairIKSsN10tensorflow6ApiDefEESaIS4_ENSt8__detail10_Select1stESt8equal_toISsESt4hashISsENS6_18_Mod_range_hashingENS6_20_Default_ranged_hashENS6_20_Prime_rehash_policyENS6_17_Hashtable_traitsILb1ELb0ELb1EEEED5Ev]+0x38): undefined reference to `tensorflow::ApiDef::~ApiDef()'
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `tensorflow::ApiDefMap::ApiDefMap(tensorflow::OpList const&)':                                                                                                                                                                                 
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0xcc): undefined reference to `tensorflow::ApiDef::ApiDef()'                                                                                                                                                                                                  
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x13f): undefined reference to `tensorflow::ApiDef_Endpoint* google::protobuf::Arena::CreateMaybeMessage<tensorflow::ApiDef_Endpoint>(google::protobuf::Arena*)'                                                                                              
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x1f3): undefined reference to `tensorflow::ApiDef_Arg* google::protobuf::Arena::CreateMaybeMessage<tensorflow::ApiDef_Arg>(google::protobuf::Arena*)'                                                                                                        
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x3cb): undefined reference to `tensorflow::ApiDef_Arg* google::protobuf::Arena::CreateMaybeMessage<tensorflow::ApiDef_Arg>(google::protobuf::Arena*)'                                                                                                        
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x563): undefined reference to `tensorflow::ApiDef_Attr* google::protobuf::Arena::CreateMaybeMessage<tensorflow::ApiDef_Attr>(google::protobuf::Arena*)'                                                                                                      
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x76b): undefined reference to `tensorflow::ApiDef::CopyFrom(tensorflow::ApiDef const&)'                                                                                                                                                                      
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x777): undefined reference to `tensorflow::ApiDef::~ApiDef()'                                                                                                                                                                                                
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x9a4): undefined reference to `tensorflow::ApiDef::~ApiDef()'                                                                                                                                                                                                
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `tensorflow::ApiDefMap::LoadApiDef(std::string const&)':                                                                                                                                                                                       
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x56): undefined reference to `tensorflow::ApiDefs::ApiDefs()'                                                                                                                                                                                               
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x43a): undefined reference to `tensorflow::_ApiDef_Attr_default_instance_'                                                                                                                                                                                  
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x6b1): undefined reference to `tensorflow::ApiDefs::~ApiDefs()'                                                                                                                                                                                             
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x725): undefined reference to `tensorflow::ApiDef_Endpoint::Clear()'                                                                                                                                                                                        
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x789): undefined reference to `tensorflow::ApiDef_Endpoint* google::protobuf::Arena::CreateMaybeMessage<tensorflow::ApiDef_Endpoint>(google::protobuf::Arena*)'                                                                                             
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x7a7): undefined reference to `tensorflow::ApiDef_Endpoint::CopyFrom(tensorflow::ApiDef_Endpoint const&)'
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMap10LoadApiDefERKSs+0x103d): undefined reference to `tensorflow::ApiDefs::~ApiDefs()'
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::AlgorithmProto::CopyFrom(stream_executor::dnn::AlgorithmProto const&)'                             
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ExecutionPlanProto::~ExecutionPlanProto()'                                                         
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ExecutionPlanProto::InternalSwap(stream_executor::dnn::ExecutionPlanProto*)'                       
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::AlgorithmProto::AlgorithmProto(stream_executor::dnn::AlgorithmProto const&)'                       
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `tensorflow::ProtoDebugString(tensorflow::DeviceAttributes const&)'                                                       
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::InternalSwap(stream_executor::dnn::TensorDescriptorProto*)'                 
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `tensorflow::ProtoShortDebugString(tensorflow::ConfigProto const&)'                                                       
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ExecutionPlanProto::CopyFrom(stream_executor::dnn::ExecutionPlanProto const&)'                     
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::AlgorithmProto::~AlgorithmProto()'                                                                 
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ConvolutionDescriptorProto::~ConvolutionDescriptorProto()'                                         
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ExecutionPlanProto::ExecutionPlanProto(stream_executor::dnn::ExecutionPlanProto const&)'           
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::~TensorDescriptorProto()'                                                   
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::clear_layout_oneof()'                                                       
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ConvolutionDescriptorProto::ConvolutionDescriptorProto()'                                          
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::TensorDescriptorProto(stream_executor::dnn::TensorDescriptorProto const&)'  
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::TensorDescriptorProto()'                                                    
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::AlgorithmProto::InternalSwap(stream_executor::dnn::AlgorithmProto*)'                               
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::AlgorithmProto::AlgorithmProto()'                                                                  
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::ExecutionPlanProto::ExecutionPlanProto()'                                                          
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Udataset_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `stream_executor::dnn::TensorDescriptorProto::CopyFrom(stream_executor::dnn::TensorDescriptorProto const&)'               
collect2: error: ld returned 1 exit status

Please let us know how we can go ahead with the build process here.

Thanks in advance.
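
A trace this long is easier to read once the missing symbols are tallied. The following is a minimal sketch, assuming the linker output has been saved as text; the function name `count_undefined_symbols` and the regular expression are illustrative helpers, not part of TensorFlow or bazel. It only matches the GNU ld message format seen in the trace above.

```python
import re
from collections import Counter

# Matches the symbol name in GNU ld messages of the form:
#   file.cc:(.text+0x1f): undefined reference to `ns::Type::Method()'
UNDEFINED_REF = re.compile(r"undefined reference to `([^']+)'")


def count_undefined_symbols(log_text: str) -> Counter:
    """Count how many times each symbol is reported as undefined."""
    return Counter(UNDEFINED_REF.findall(log_text))


if __name__ == "__main__":
    # Three lines copied from the trace above, plus the final collect2 line.
    sample = """\
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0xcc): undefined reference to `tensorflow::ApiDef::ApiDef()'
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x777): undefined reference to `tensorflow::ApiDef::~ApiDef()'
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0x9a4): undefined reference to `tensorflow::ApiDef::~ApiDef()'
collect2: error: ld returned 1 exit status
"""
    # Print the most frequently missing symbols first.
    for symbol, hits in count_undefined_symbols(sample).most_common():
        print(f"{hits:3d}  {symbol}")
```

Running the same function over the full build log gives a short list of distinct symbols, which may make it easier to see which libraries are missing from the link line.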

nluehr commented 1 year ago

Those linker errors aren't red flags for any issue I'm familiar with. If you provide reproducer instructions similar to the Ubuntu build instructions here, I can take a look at what is going wrong.

bergentruckung commented 1 year ago

I see. In case it adds some colour, this is the last portion of the build output (after the trace above).

SUBCOMMAND: # //tensorflow/contrib/image:gen_image_ops_py_wrappers_cc [action 'Linking tensorflow/contrib/image/gen_image_ops_py_wrappers_cc [for host]', configuration: fc79f5a2b8c3ab837b6ff6617f003a26cc0d2bb81faca1c37ceff7494a54f2ff, execution platform: @local_config_platform//:host]
(cd /var/tmp/tf.aSaiDr/.cache/bazel/_bazel_bergentruckung/5c5196772a9ffda2bec0a4e93f813cd8/execroot/org_tensorflow && \
  exec env - \
    PATH=<redacted> \
    PWD=/proc/self/cwd \
  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/host/bin/tensorflow/contrib/image/gen_image_ops_py_wrappers_cc-2.params)
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 977.812s, Critical Path: 251.56s
INFO: 31971 processes: 15017 internal, 16954 local.
FAILED: Build did NOT complete successfully

bergentruckung commented 1 year ago

A reproducer will be similar to this:

  1. Install the CUDA toolkit and newer drivers from NVIDIA's official page (following the corresponding documentation there).
  2. Install the right versions of CUDNN and NCCL from NVIDIA's official pages. I used the local rpm method and made sure to install the devel packages for both, for the header files we need later on.
  3. Check out the r1.15.5+nv23.02 branch from github.com/NVIDIA/tensorflow, and v0.7.3 from github.com/NVIDIA/cudnn-frontend.
  4. Create a new Python 3.10 virtual environment.
  5. The rest of the steps are similar to the doc you linked above: figure out the required bazel version, set the configuration environment variables we need, run ./configure, and then run the bazel build commands.
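
Because the steps above depend on several tool versions, a reproducer is easier to compare across machines when those versions are recorded alongside it. Here is a minimal sketch, assuming the tools are on PATH; the `VERSION_COMMANDS` table and both function names are hypothetical, and the version flags shown are assumptions to adjust for the local setup.

```python
import shutil
import subprocess
import sys

# Hypothetical table of tools to record; the flags are assumptions and may
# need adjusting for the versions installed locally.
VERSION_COMMANDS = {
    "gcc": ["gcc", "--version"],
    "bazel": ["bazel", "version"],
    "python": [sys.executable, "--version"],
    "nvcc": ["nvcc", "--version"],
}


def first_line(command):
    """Return the first line a command prints, or None if it is not found."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    # Some tools report their version on stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


def collect_versions(commands=VERSION_COMMANDS):
    """Map each tool name to the first line of its version output."""
    return {name: first_line(command) for name, command in commands.items()}


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version if version is not None else 'not found'}")
```

The output can be pasted into the issue next to the build log, so that differences between the RHEL 7 and RHEL 8 environments are visible at a glance.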

nluehr commented 1 year ago

Have you tried a bazel clean --expunge as well as disabling any bazel caching / ccache options you might use?

bergentruckung commented 1 year ago

Sorry, I forgot to respond here. The build still fails the same way, even after running bazel clean --expunge.