yrwy commented 5 years ago

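I ran into several problems.

The first is an error when detecting the CUDA dylib: the library is clearly there, yet I get a SONAME error. Modifying the bzl file gets past it. The second is an error when compiling absl; it looks like a template is missing some keyword. Did you swap in an older version of absl?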

TomHeaven commented 5 years ago

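v2.0.0 / v1.14 need seven new patches. These are my modification notes. I haven't had time to tidy them up, so please make do with them as they are: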

v2.0.0-beta

CUDA compute capabilities used: 3.0, 3.5, 5.0, 5.2, 6.1, 7.0

Build error 1: the CUDA libraries cannot be found. This is related to third_party/gpus.


None of the libraries match their SONAME: /usr/local/cuda/lib64/libcudart.10.0.dylib

Workaround: comment out the SONAME check at third_party/gpus/cuda_configure.bzl:554-560:


def find_lib(repository_ctx, paths, check_soname = True):
    """
      Finds a library among a list of potential paths.
      Args:
        paths: List of paths to inspect.
      Returns:
        Returns the first path in paths that exist.
    """
    objdump = repository_ctx.which("objdump")
    mismatches = []
    for path in [repository_ctx.path(path) for path in paths]:
        if not path.exists:
            continue
        # SONAME check disabled: objdump output on macOS has no SONAME line.
        # if check_soname and objdump != None and not _is_windows(repository_ctx):
        #     output = repository_ctx.execute([objdump, "-p", str(path)]).stdout
        #     output = [line for line in output.splitlines() if "SONAME" in line]
        #     sonames = [line.strip().split(" ")[-1] for line in output]
        #     if not any([soname == path.basename for soname in sonames]):
        #         mismatches.append(str(path))
        #         continue
        return path
    if mismatches:
        auto_configure_fail(
            "None of the libraries match their SONAME: " + ", ".join(mismatches),
        )
    auto_configure_fail("No library found under: " + ", ".join(paths))

Alternatively, change the check to:


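output = repository_ctx.execute([objdump, "-p", str(path)]).stdout
# On macOS the library name appears on lines such as "name @rpath/<lib> (offset N)".
output = [line for line in output.splitlines() if "name @rpath/" in line]
# Keep the part after the last "/", then drop the trailing "(offset N)".
sonames = [line.strip().split("/")[-1] for line in output]
sonames = [name.strip().split(" ")[0] for name in sonames]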

The cause is that the information reported by the objdump command differs between macOS and Linux.
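To illustrate the difference, here is a minimal Python sketch (not part of cuda_configure.bzl) that pulls library names out of `objdump -p` text in either style. The function name `library_names` and the two sample strings are assumptions made for illustration; real objdump output contains many more lines and its layout may vary between versions.

```python
def library_names(objdump_output):
    """Collect library names from `objdump -p` text.

    Handles ELF-style "SONAME <name>" lines (Linux) and
    Mach-O-style "name @rpath/<name> (offset N)" lines (macOS).
    """
    names = []
    for line in objdump_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] == "SONAME":
            # ELF: the second field is the SONAME itself.
            names.append(fields[1])
        elif fields[0] == "name" and fields[1].startswith("@rpath/"):
            # Mach-O: keep only the file name after the last "/".
            names.append(fields[1].rsplit("/", 1)[-1])
    return names


# Illustrative samples (assumed shapes, not captured from a real machine).
ELF_SAMPLE = "  SONAME               libcudart.so.10.0\n"
MACHO_SAMPLE = "         name @rpath/libcudart.10.0.dylib (offset 24)\n"

if __name__ == "__main__":
    print(library_names(ELF_SAMPLE))
    print(library_names(MACHO_SAMPLE))
```

A check written only against the first style finds nothing on macOS, which is consistent with the "None of the libraries match their SONAME" failure above.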

Build error 2:


ERROR: /private/var/tmp/_bazel_tomheaven/561821a038e9c8d51ab53646fb4bd33f/external/local_config_cuda/cuda/BUILD:168:1: Couldn't build file external/local_config_cuda/cuda/cuda/include/builtin_types.h: Executing genrule @local_config_cuda//cuda:cuda-include failed (Exit 1)
cp: the -H, -L, and -P options may not be specified with the -r option.

Cause: the macOS cp command does not accept the option combination -rLf (it rejects -L together with -r). Change line 935 of third_party/gpus/cuda_configure.bzl to:


    # old: cp -rLf
    cmd = \"""cp -r -f "%s/." "%s/" \""",
)""" % (name, "\n".join(outs), src_dir, out_dir)
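For comparison, what `cp -rLf` asks for (a recursive copy that follows symlinks and overwrites existing files) can be sketched portably in Python. This only illustrates the flag semantics and is not something the TensorFlow build uses; `copy_following_symlinks` and `demo` are hypothetical names, and `dirs_exist_ok` requires Python 3.8 or later.

```python
import os
import shutil
import tempfile


def copy_following_symlinks(src_dir, out_dir):
    """Recursively copy src_dir into out_dir, dereferencing symlinks."""
    # symlinks=False copies the files that links point to, not the links.
    # dirs_exist_ok=True lets out_dir already exist (Python 3.8+).
    shutil.copytree(src_dir, out_dir, symlinks=False, dirs_exist_ok=True)


def demo():
    """Copy a directory containing a symlink and inspect the copied link."""
    root = tempfile.mkdtemp()
    src = os.path.join(root, "src")
    dst = os.path.join(root, "dst")
    os.makedirs(src)
    with open(os.path.join(src, "real.h"), "w") as f:
        f.write("// header\n")
    os.symlink(os.path.join(src, "real.h"), os.path.join(src, "link.h"))
    copy_following_symlinks(src, dst)
    copied = os.path.join(dst, "link.h")
    # The copy of the symlink should be a regular file, not a link.
    result = (os.path.isfile(copied), os.path.islink(copied))
    shutil.rmtree(root)
    return result
```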

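Build error 3: libcuda cannot be found. Cause: cuda_configure.bzl looks for libcuda.dylib in /usr/cuda/lib64/stubs, whereas libcuda.dylib is actually in /usr/cuda/lib64/. There are two ways to fix this.

Option 1: modify the find_lib function at third_party/gpus/cuda_configure.bzl:605: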


    #stub_dir = "" if _is_windows(repository_ctx) else "/stubs"
    stub_dir = "" if _is_windows(repository_ctx) else ""

Option 2: copy libcuda.dylib into the stubs directory:


cd /usr/cuda/lib64/

sudo cp libcuda.dylib stubs/
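The lookup problem can be pictured with a small Python sketch: search a list of candidate directories in order and return the first one that holds the library. `first_existing` and `demo` are hypothetical helpers, not the Starlark code in cuda_configure.bzl, and the directory tree in the demo is made up.

```python
import os
import tempfile


def first_existing(lib_name, candidate_dirs):
    """Return the first path under candidate_dirs containing lib_name, or None."""
    for directory in candidate_dirs:
        path = os.path.join(directory, lib_name)
        if os.path.exists(path):
            return path
    return None


def demo():
    """Build a fake lib64 tree where the library sits beside stubs/, not in it."""
    with tempfile.TemporaryDirectory() as root:
        lib64 = os.path.join(root, "lib64")
        stubs = os.path.join(lib64, "stubs")
        os.makedirs(stubs)
        open(os.path.join(lib64, "libcuda.dylib"), "w").close()
        # Searching only stubs/ fails; adding lib64 as a fallback succeeds.
        only_stubs = first_existing("libcuda.dylib", [stubs])
        with_fallback = first_existing("libcuda.dylib", [stubs, lib64])
        return only_stubs is None, os.path.relpath(with_fallback, root)
```

Option 1 corresponds to changing which directory is searched, and option 2 to putting the file where the search already looks.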

Build error 4: ./tensorflow/core/util/gpu_device_functions.h(144): error: identifier "__nvvm_read_ptx_sreg_laneid" is undefined

Change lines 142-147 to:


#if GOOGLE_CUDA
//#if __clang__
// return __nvvm_read_ptx_sreg_laneid();
//#else // __clang__
  asm("mov.u32 %0, %%laneid;" : "=r"(lane_id));
//#endif // __clang__

Build error 5:


external/com_google_absl/absl/container/internal/compressed_tuple.h:170:53: error: use 'template' keyword to treat 'Storage' as a dependent template name
return (std::move(*this).internal_compressed_tuple::Storage< CompressedTuple, I> ::get());

Edit bazel-tensorflow/external/com_google_absl/absl/container/internal/compressed_tuple.h:168-178 and comment out the two offending functions:


/*template <int I>
  ElemT<I>&& get() && {
    return std::move(*this).internal_compressed_tuple::template Storage<CompressedTuple, I>::get();
  }
  template <int I>
  constexpr const ElemT<I>&& get() const&& {
    return absl::move(*this).internal_compressed_tuple::template Storage<CompressedTuple, I>::get();
  }*/

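These two functions are a nightmare: every change I tried still failed to compile, so commenting them out was the only option.

Reference: https://stackoverflow.com/questions/3786360/confusing-template-error

Build error 6: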


tensorflow/core/kernels/tridiagonal_solve_op_gpu.cu.cc(46): error: calling a __host__ function("std::__1::operator ==<float> ") from a __global__ function("tensorflow::SolveForSizeOneOrTwoKernel< ::std::__1::complex<float> > ") is not allowed

tensorflow/core/kernels/tridiagonal_solve_op_gpu.cu.cc(55): error: calling a __host__ function("std::__1::operator ==<float> ") from a __global__ function("tensorflow::SolveForSizeOneOrTwoKernel< ::std::__1::complex<float> > ") is not allowed

Change __global__ to __device__:


//__global__ void SolveForSizeOneOrTwoKernel(const int m, const Scalar* diags,

__device__ void SolveForSizeOneOrTwoKernel(const int m, const Scalar* diags,

Build error 7:


tensorflow/core/kernels/conv_grad_filter_ops.cc:736:18: error: constexpr variable 'kComputeInNHWC' must be initialized by a constant expression
  constexpr auto kComputeInNHWC =

Edit several source files: conv_grad_filter_ops.cc, conv_grad_input_ops.cc, and conv_ops.cc (the v1.14.0 release also needs this change), removing the two constexpr qualifiers in each.
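If the edit has to be repeated across files, it can be scripted. A minimal sketch, assuming the declarations look like the `constexpr auto kComputeInNHWC =` line in the error above; `drop_constexpr` and its `names` parameter are hypothetical, and the caller has to supply the variable names, since these notes name only one of them.

```python
import re


def drop_constexpr(source, names):
    """Remove `constexpr` from declarations of the given variable names.

    Other constexpr declarations in the same source are left untouched.
    """
    for name in names:
        # Match "constexpr " only when the declaration that follows
        # reaches `<name> =` before any ";" or "=".
        pattern = r"\bconstexpr\s+(?=[^;=]*\b" + re.escape(name) + r"\b\s*=)"
        source = re.sub(pattern, "", source)
    return source


SAMPLE = (
    "  constexpr auto kComputeInNHWC =\n"
    "      true;\n"
    "  constexpr int kOther = 1;\n"
)

if __name__ == "__main__":
    print(drop_constexpr(SAMPLE, ["kComputeInNHWC"]))
```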

Keep applying the earlier source patches, then build as usual.

yrwy commented 5 years ago

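After I replaced ABSL with the one from r1.13, it compiled fine.

XLA does not build on r1.14: near the end of the build, even though --nonccl was already passed, it still goes looking for thirdparty/nccl/nccl.h. Building XLA with Xcode 9.4.1 requires changing many constexpr to const, whereas with Xcode 10.1 many of those changes are unnecessary... Xcode is a real headache.

macOS 10.13 also still has a GPU memory leak. On macOS 10.12, CUDA 10.1 and CUDA 10 can be supported by copying the driver library:

sudo cp /Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_378.10.10.10_mercury.dylib /Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_378.05.05_mercury.dylib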

TomHeaven commented 5 years ago

It looks like the problem has been solved.

TomHeaven / tensorflow-osx-build

Could the OP explain how r1.14 was built? #13
