Closed: yrwy closed this 5 years ago
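v2.0.0 / v1.14 need seven new patches. These are my raw modification notes; I have not had time to tidy them up, so please make do with them as they are:
CUDA compute capabilities used: 3.0, 3.5, 5.0, 5.2, 6.1, 7.0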
None of the libraries match their SONAME: /usr/local/cuda/lib64/libcudart.10.0.dylib
Workaround: simply comment out third_party/gpus/cuda_configure.bzl:554-560
def find_lib(repository_ctx, paths, check_soname = True):
    """
      Finds a library among a list of potential paths.

      Args:
        paths: List of paths to inspect.

      Returns:
        Returns the first path in paths that exist.
    """
    objdump = repository_ctx.which("objdump")
    mismatches = []
    for path in [repository_ctx.path(path) for path in paths]:
        if not path.exists:
            continue
        #if check_soname and objdump != None and not _is_windows(repository_ctx):
        #    output = repository_ctx.execute([objdump, "-p", str(path)]).stdout
        #    output = [line for line in output.splitlines() if "SONAME" in line]
        #    sonames = [line.strip().split(" ")[-1] for line in output]
        #    if not any([soname == path.basename for soname in sonames]):
        #        mismatches.append(str(path))
        #        continue
        return path
    if mismatches:
        auto_configure_fail(
            "None of the libraries match their SONAME: " + ", ".join(mismatches),
        )
    auto_configure_fail("No library found under: " + ", ".join(paths))
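Alternatively, keep the check and change the parsing lines to:
output = repository_ctx.execute([objdump, "-p", str(path)]).stdout
output = [line for line in output.splitlines() if "name @rpath/" in line]
sonames = [line.strip().split("/")[-1] for line in output]
sonames = [soname.strip().split(" ")[0] for soname in sonames]
The error occurs because objdump reports different information on macOS and Linux: the check looks for a SONAME line, while on macOS the library name shows up on a "name @rpath/..." line instead.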
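To make the difference concrete, here is a minimal Python sketch (not the Starlark from cuda_configure.bzl) that pulls library names out of both output styles. The two sample strings are illustrative assumptions about what `objdump -p` prints on each platform, and the function names are hypothetical.

```python
def library_names_from_objdump(output):
    """Return library names found in `objdump -p` style output."""
    names = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] == "SONAME":
            # ELF (Linux) style: "SONAME   libcudart.so.10.0"
            names.append(fields[1])
        elif fields[0] == "name" and fields[1].startswith("@rpath/"):
            # Mach-O (macOS) style: "name @rpath/libcudart.10.0.dylib (offset 24)"
            names.append(fields[1].split("/")[-1])
    return names


def soname_matches(output, basename):
    """True if any reported library name equals the file's basename."""
    return basename in library_names_from_objdump(output)


# Assumed sample outputs, trimmed to the relevant lines.
LINUX_SAMPLE = """\
Dynamic Section:
  NEEDED               libc.so.6
  SONAME               libcudart.so.10.0
"""

MACOS_SAMPLE = """\
Load command 3
          cmd LC_ID_DYLIB
      cmdsize 56
         name @rpath/libcudart.10.0.dylib (offset 24)
"""

print(library_names_from_objdump(LINUX_SAMPLE))
print(library_names_from_objdump(MACOS_SAMPLE))
```

A filter that only looks for "SONAME" finds nothing in the macOS sample, which is consistent with every candidate path being reported as a mismatch.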
ERROR: /private/var/tmp/_bazel_tomheaven/561821a038e9c8d51ab53646fb4bd33f/external/local_config_cuda/cuda/BUILD:168:1: Couldn't build file external/local_config_cuda/cuda/cuda/include/builtin_types.h: Executing genrule @local_config_cuda//cuda:cuda-include failed (Exit 1)
cp: the -H, -L, and -P options may not be specified with the -r option.
Cause: the macOS cp command does not accept the options -rLf together. Change line 935 of third_party/gpus/cuda_configure.bzl to:
#cmd = \"""cp -rLf "%s/." "%s/" \""",
#)""" % (name, "\n".join(outs), src_dir, out_dir)
cmd = \"""cp -r -f "%s/." "%s/" \""",
)""" % (name, "\n".join(outs), src_dir, out_dir)
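For reference, the intent of the original `cp -rLf` (copy recursively, copy what symlinks point to, overwrite existing files) can be sketched in Python. This is only an illustration of those semantics, assuming Python 3.8+ and a platform that allows symlinks; it is not what the genrule runs, and the helper names and file contents are hypothetical.

```python
import os
import shutil
import tempfile


def copy_tree_following_symlinks(src_dir, out_dir):
    # symlinks=False copies the file a link points to (like cp -L);
    # dirs_exist_ok=True allows writing into an existing directory,
    # overwriting files that are already there (Python 3.8+).
    shutil.copytree(src_dir, out_dir, symlinks=False, dirs_exist_ok=True)


def demo():
    """Copy a directory containing a symlinked header; report the result."""
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "src")
        out = os.path.join(root, "out")
        os.makedirs(src)
        os.makedirs(out)
        real = os.path.join(root, "real.h")
        with open(real, "w") as f:
            f.write("// header\n")
        # The source tree only holds a link to the real header.
        os.symlink(real, os.path.join(src, "builtin_types.h"))
        copy_tree_following_symlinks(src, out)
        copied = os.path.join(out, "builtin_types.h")
        with open(copied) as f:
            return os.path.islink(copied), f.read()


print(demo())
```

The copied file is a regular file with the header's contents rather than a link, which is the behaviour the -L flag asks for.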
Option 1: modify the find_lib call at third_party/gpus/cuda_configure.bzl:605 so that it no longer looks in the stubs directory:
#stub_dir = "" if _is_windows(repository_ctx) else "/stubs"
stub_dir = "" if _is_windows(repository_ctx) else ""
Option 2: copy libcuda.dylib into the stubs directory:
cd /usr/cuda/lib64/
sudo cp libcuda.dylib stubs/
./tensorflow/core/util/gpu_device_functions.h(144): error: identifier "__nvvm_read_ptx_sreg_laneid" is undefined
Change lines 142-147 to:
#if GOOGLE_CUDA
//#if __clang__
// return __nvvm_read_ptx_sreg_laneid();
//#else // __clang__
asm("mov.u32 %0, %%laneid;" : "=r"(lane_id));
//#endif // __clang__
external/com_google_absl/absl/container/internal/compressed_tuple.h:170:53: error: use 'template' keyword to treat 'Storage' as a dependent template name
return (std::move(*this).internal_compressed_tuple::Storage< CompressedTuple, I> ::get());
Edit the source at bazel-tensorflow/external/com_google_absl/absl/container/internal/compressed_tuple.h:168-178 and comment out the two offending functions:
/*template <int I>
ElemT<I>&& get() && {
return std::move(*this).internal_compressed_tuple::template Storage<CompressedTuple, I>::get();
}
template <int I>
constexpr const ElemT<I>&& get() const&& {
return absl::move(*this).internal_compressed_tuple::template Storage<CompressedTuple, I>::get();
}*/
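These two functions are a nightmare: whatever I changed, they still failed to compile, so the only option was to comment them out.
Reference: https://stackoverflow.com/questions/3786360/confusing-template-error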
tensorflow/core/kernels/tridiagonal_solve_op_gpu.cu.cc(46): error: calling a __host__ function("std::__1::operator ==<float> ") from a __global__ function("tensorflow::SolveForSizeOneOrTwoKernel< ::std::__1::complex<float> > ") is not allowed
tensorflow/core/kernels/tridiagonal_solve_op_gpu.cu.cc(55): error: calling a __host__ function("std::__1::operator ==<float> ") from a __global__ function("tensorflow::SolveForSizeOneOrTwoKernel< ::std::__1::complex<float> > ") is not allowed
Change __global__ to __device__:
//__global__ void SolveForSizeOneOrTwoKernel(const int m, const Scalar* diags,
__device__ void SolveForSizeOneOrTwoKernel(const int m, const Scalar* diags,
tensorflow/core/kernels/conv_grad_filter_ops.cc:736:18: error: constexpr variable 'kComputeInNHWC' must be initialized by a constant expression
constexpr auto kComputeInNHWC =
Modify several source files: conv_grad_filter_ops.cc, conv_grad_input_ops.cc and conv_ops.cc (the official v1.14.0 release still needs this change), removing the two constexpr occurrences in each.
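If you would rather script this edit than do it by hand, the following Python sketch rewrites `constexpr auto NAME` declarations for a given list of variable names. It is an assumption-laden illustration: the helper name is hypothetical, only kComputeInNHWC is taken from the error message above, the initializer in the sample is a placeholder, and constexpr is replaced by const rather than deleted outright.

```python
import re


def relax_constexpr(source, names):
    """Rewrite 'constexpr auto NAME' to 'const auto NAME' for each name."""
    for name in names:
        pattern = r"\bconstexpr(\s+auto\s+" + re.escape(name) + r"\b)"
        source = re.sub(pattern, r"const\1", source)
    return source


# Placeholder snippet; the real initializer is left out on purpose.
SNIPPET = (
    "constexpr auto kComputeInNHWC =\n"
    "    /* initializer unchanged */;\n"
    "constexpr int kOther = 1;\n"
)

print(relax_constexpr(SNIPPET, ["kComputeInNHWC"]))
```

Only the named declaration is touched; other constexpr uses in the file stay as they are.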
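After I replaced ABSL with the copy from r1.13, it compiled fine.
XLA does not build on r1.14: near the end of the build, even with --nonccl already set, it still looks for thirdparty/nccl/nccl.h.
Building XLA with Xcode 9.4.1 requires changing many constexpr to const, whereas with Xcode 10.1 many of those changes are unnecessary... Xcode is a real headache.
macOS 10.13 also still has a GPU memory leak.
On macOS 10.12, CUDA 10.1 and CUDA 10 can be supported by copying the driver library:
sudo cp /Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_378.10.10.10_mercury.dylib /Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_378.05.05_mercury.dylib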
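It looks like the problem has been solved.
I ran into several problems.
The first was that detecting the CUDA dylib failed: the library is clearly there, yet a SONAME error is reported. Modifying the bzl file gets past it. The second was an error when compiling absl, apparently caused by some keyword used in a template. Did you replace it with the older version?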