codeplaysoftware / computecpp-sdk

Collection of samples and utilities for using ComputeCpp, Codeplay's SYCL implementation
https://developer.codeplay.com/computecppce/latest/overview

Unhandled exception, "Failed to build program" error in TensorFlow sample #116

Closed torokati44 closed 5 years ago

torokati44 commented 6 years ago

I have a Radeon RX580, running Ubuntu 18.04, with the amdgpu driver (18.20-579836) from AMD. Plain OpenCL works fine on the GPU in Blender Cycles, LuxMark, and darktable-cltest. (clinfo output) ComputeCpp version is CE 0.8.0. (computecpp_info output)

After building TensorFlow (on the dev/amd_gpu branch, at https://github.com/tensorflow/tensorflow/commit/00b0040f8c8745473072d66182ff33eac1f5011e plus https://github.com/tensorflow/tensorflow/pull/17508 cherry-picked on top of it), by following the instructions here, the models/tutorials/image/mnist sample outputs this error:

2018-05-17 01:15:34.090983: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-05-17 01:15:34.091027: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
terminate called after throwing an instance of 'cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)7, cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)6, cl::sycl::exception> >'
Aborted (core dumped)

All I could figure out is that the actual exception is:

Error: [ComputeCpp:RT0100] Failed to build program

And it is thrown from somewhere inside PhiloxRandomOp::Compute, at least in this specific case. Trying to run the default generator from the textgenrnn project also results in the same error. I think I've also seen a similar exception coming from CastOpBase::Compute in a different sample.

I couldn't find out any more details about the error so far.

I think #77 might be related, but not quite the same.

torokati44 commented 6 years ago

Also, this is not with ROCm, but their older (legacy?) OpenCL driver. I'm not sure if this is a TensorFlow or a ComputeCpp issue, but a more useful and informational error message in the sycl::exception from ComputeCpp sure would help!

EDIT: I'm slightly confused now that AMD apparently has three different OpenCL stacks (plus Clover of course, but let's not talk about that). I'm absolutely certain that I'm not running ROCm, as I couldn't get that to work properly yet, however hard I tried. As for which one of the other two is actually running, I'm not sure, but I'd guess it's the PAL-based, not Orca. Maybe you can figure it out from the clinfo output, and once I get home in a couple hours, I'll be glad to provide any info about the system.

lukeiwanski commented 6 years ago

@torokati44 could you provide us with clinfo output?

torokati44 commented 6 years ago

@lukeiwanski It is already linked in the first line of the report, but here it is again: [link]

lukeiwanski commented 6 years ago

@torokati44 ah sorry, I missed that. cl_khr_spir is present! good first sign! On my set-up I experienced issues when using AMDGPU-PRO (17.50.511655). The last version that worked for me is amdgpu-pro-17.40-501128 - however that's for FirePro w8100.

At present we have no access to r5xx series - therefore debugging this issue might be challenging.

Have you tried running ComputeCpp SDK examples? Are they failing too?
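For anyone wanting to repeat the cl_khr_spir check programmatically: OpenCL reports device extensions as a single space-separated string, so a whole-token comparison is enough. The following is a minimal sketch; `has_extension` is a hypothetical helper name, and the string passed in is assumed to come from clinfo output or from a CL_DEVICE_EXTENSIONS query. Seeing the extension listed is only a first sign, not proof that a SPIR build will succeed.

```cpp
#include <sstream>
#include <string>

// Returns true if `name` appears as a whole token in a space-separated
// extension list, such as the string OpenCL reports for CL_DEVICE_EXTENSIONS.
// Comparing whole tokens avoids false positives from substring matches.
// (Hypothetical helper, for illustration only.)
bool has_extension(const std::string& extensions, const std::string& name) {
  std::istringstream tokens(extensions);
  std::string token;
  while (tokens >> token) {
    if (token == name) return true;
  }
  return false;
}
```

For example, `has_extension(ext_string, "cl_khr_spir")` would be called with the extension string copied from the clinfo output.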

torokati44 commented 6 years ago

@lukeiwanski It's okay. I'll try some ComputeCpp examples and report back, once I have access to that machine again, which will be about 3-4 hours from now.

torokati44 commented 6 years ago

All samples (that I've tried) stop with the same error: hello-world backtrace

Except some of them catch it, like images:

Image size: 4096
SYCL exception caught: Error: [ComputeCpp:RT0100] Failed to build program
SYCL Runtime closed with the following errors:
SYCL objects are still alive while the runtime is shutting down

This probably indicates that a SYCL object was created but not properly destroyed.

Can I get my hands on a debug build of libComputeCpp.so somehow, or is that a closely guarded secret?

DuncanMcBain commented 6 years ago

We don't distribute a debug libComputeCpp.so at the moment. If every sample is failing, it doesn't look good, but we might get more information if you apply a patch looking something like the following:

diff --git a/samples/simple-vector-add/simple-vector-add.cpp b/samples/simple-vector-add/simple-vector-add.cpp
index 8ebd7d5..89c3e39 100644
--- a/samples/simple-vector-add/simple-vector-add.cpp
+++ b/samples/simple-vector-add/simple-vector-add.cpp
@@ -49,6 +49,28 @@ void simple_vadd(const std::array<T, N>& VA, const std::array<T, N>& VB,
   cl::sycl::buffer<T, 1> bufferB(VB.data(), numOfItems);
   cl::sycl::buffer<T, 1> bufferC(VC.data(), numOfItems);

+  cl::sycl::program prog(deviceQueue.get_context());
+  try {
+    prog.build_with_kernel_type<SimpleVadd<T>>("");
+  } catch (cl::sycl::exception& e) {
+    // we expect this
+    std::cout << "Expected exception caught: " << e.what();
+  }
+
+  auto clprog = prog.get();
+  size_t logsize = 0;
+  auto err = clGetProgramBuildInfo(clprog, deviceQueue.get_device().get(),
+                                   CL_PROGRAM_BUILD_LOG, sizeof(size_t), NULL,
+                                   &logsize);
+  if (err)  std::cout << "Err was nonzero! " << err << "\n";
+  std::cout << logsize << " bytes long\n";
+  auto log = std::vector<char>(logsize + 1);
+  log[logsize] = '\0';
+  err = clGetProgramBuildInfo(clprog, deviceQueue.get_device().get(),
+                              CL_PROGRAM_BUILD_LOG, logsize, log.data(), NULL);
+  if (err)  std::cout << "Err was nonzero! " << err << "\n";
+  std::cout << "Device log: " << log.data() << "\n";
+
   deviceQueue.submit([&](cl::sycl::handler& cgh) {
     auto accessorA = bufferA.template get_access<sycl_read>(cgh);
     auto accessorB = bufferB.template get_access<sycl_read>(cgh);

This patch will make the sample "simple-vector-add" output the build log in all cases, hopefully catching the build error.

When I run this using Intel CPU, I get some useful output indicating that vectorisation was successful. Unfortunately when I run this on AMD GPU, it has no output whatsoever, but I am running a very different driver version to you. I am hoping that we get useful information from your driver. (You're lucky - when my AMD driver hits a build error, it just crashes, so I'd take an actual error status as an improvement!)
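The patch above relies on the usual OpenCL two-call idiom: ask for the size first, then fetch the data into a buffer of that size. Here is a generic sketch of that idiom which can be tried without an OpenCL installation. `fetch_string` and the shape of the `query` callable are assumptions made for illustration; with OpenCL, `query` would be a lambda forwarding its three arguments to clGetProgramBuildInfo as the last three parameters.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Generic form of the "query the size, then query the data" idiom.
// `query` is any callable shaped like
//   int query(std::size_t size, void* value, std::size_t* size_ret)
// that returns 0 on success. This mirrors the last three parameters of
// clGetProgramBuildInfo. (Hypothetical helper, for illustration only.)
template <typename Query>
bool fetch_string(Query query, std::string& out) {
  std::size_t size = 0;
  // First call: no buffer, only ask how many bytes are needed.
  if (query(0, nullptr, &size) != 0) return false;
  // One extra zeroed byte guarantees the buffer is NUL-terminated.
  std::vector<char> buffer(size + 1, '\0');
  // Second call: fetch the data into a buffer of the reported size.
  if (size > 0 && query(size, buffer.data(), nullptr) != 0) return false;
  out.assign(buffer.data());
  return true;
}
```

Returning `false` on either failing call keeps the error path separate from an empty log, which matters here because some drivers return a successful status with no log text at all.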

torokati44 commented 6 years ago

This is what I get with the patch:

Expected exception caught: Error: [ComputeCpp:RT0100] Failed to build program122 bytes long
Device log: Error: HSAIL doesn't support OpenCL extension spir.
An invalid option was specified.
Error: BRIG code generation failed.

terminate called after throwing an instance of 'cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)7, cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)6, cl::sycl::exception> >'
Aborted (core dumped)

Is there something wrong with my driver setup after all?

mirh commented 6 years ago

> I think #77 might be related, but not quite the same.

Ehrm, no, I mean.. Seriously, my crappy half assed setup couldn't be farther from yours.

> As for which one of the other two is actually running, I'm not sure, but I'd guess it's the PAL-based, not Orca.

Really, there's only AMDGPU-PRO driver providing top-notch compute performance for your card.. And it will already select whatever driver fits it best.

p.s. 5xx series is basically a rebranded 4xx one

> Device log: Error: HSAIL doesn't support OpenCL extension spir.

HSAIL is the HSA Intermediate Language, and HSA and ROCm are basically the same thing. And last time I checked, ROCm was still kinda on the sucky side. I don't know. Check your setup.

DuncanMcBain commented 6 years ago

Yeah it looks like your driver is the HSA driver, and it doesn't actually support SPIR. Unfortunately, at the moment the only IRs we can emit are SPIR 1.2, SPIR-V and PTX, which don't appear to be supported by your driver. It might be that this driver is too new (or SPIR is not supported on your hardware). You might be able to get more information from AMD directly.

mirh commented 6 years ago

There is this in that case btw.

torokati44 commented 6 years ago

Thanks, I could almost get that to work once, so it did run, on the GPU, except it produced garbage results (even more garbage than usual from a neural network :smile: ). I'm afraid I still have some fragments of ROCm in operation on this system somehow, as remains of these past experimentations, hence the HSAIL error (does this make sense?). What I think I'll try to do is install a fresh OS, and see what a simple installation of the recommended "legacy" driver will do.

DuncanMcBain commented 6 years ago

OK, let us know how that goes! Thanks!

torokati44 commented 6 years ago

I did just that. Got the exact same error.

So:

Expected exception caught: Error: [ComputeCpp:RT0100] Failed to build program122 bytes long
Device log: Error: HSAIL doesn't support OpenCL extension spir.
An invalid option was specified.
Error: BRIG code generation failed.

terminate called after throwing an instance of 'cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)7, cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)6, cl::sycl::exception> >'
Aborted (core dumped)

So, what now? :/

torokati44 commented 6 years ago

Maybe this should be reported to AMD as a bug in amdgpu-pro? The fact that it reports cl_khr_spir as a supported extension, but does not really implement it completely? Or should we investigate a bit further before doing that?

DuncanMcBain commented 6 years ago

If the device claims it supports the extension, but the runtime says it doesn't, I don't really know what to do in that case. As I understand it AMD doesn't want to support the legacy driver very much any more, which is understandable, but it might be worth highlighting that their driver seemingly has inconsistent information, given that you can query for an extension and then have a related operation fail...
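The inconsistency described above (extension advertised, SPIR build rejected) can be detected mechanically once both pieces of information are available. This is a rough sketch only: `classify_spir` and `SpirStatus` are hypothetical names, and the matched substring is copied from the device log quoted earlier in this thread, so other drivers or driver versions may word the failure differently.

```cpp
#include <string>

enum class SpirStatus {
  NotAdvertised,          // cl_khr_spir is not in the extension list
  NoRejectionSeen,        // advertised, and the log has no SPIR rejection
  AdvertisedButRejected   // advertised, yet the build log rejects SPIR
};

// Combines what the device advertises with what the build log says.
// The substring below is taken verbatim from the log in this thread; it is
// an assumption that other affected drivers print the same message.
// (Hypothetical helper, for illustration only.)
SpirStatus classify_spir(bool advertises_spir, const std::string& build_log) {
  if (!advertises_spir) return SpirStatus::NotAdvertised;
  const bool rejected =
      build_log.find("doesn't support OpenCL extension spir") !=
      std::string::npos;
  return rejected ? SpirStatus::AdvertisedButRejected
                  : SpirStatus::NoRejectionSeen;
}
```

`NoRejectionSeen` is deliberately not called "working": an empty or unrelated log does not prove that a SPIR build succeeded.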

torokati44 commented 6 years ago

I'd be as happy as them to drop anything called "legacy", but first I'd like to have something else to replace it, which actually works well. AFAIK ROCm won't work together with Mesa OpenGL (so neither the all-open, nor the "pro" amdgpu 3D solution), which I need, so I'm not going to use that anytime soon (at least not primarily). Clover is inadequate at the moment. So, either the "legacy" OpenCL, or the PAL based. The former behaves as described above, the latter can be installed, but with that, clinfo says Number of platforms: 0.

Does anyone know if there's a switch for the PAL driver to make it work with my GPU (even as "experimental")? Or should I just wait for the next Radeon Software release? Alternatively: where should I file this issue (the inconsistent info about SPIR support) with AMD, and ask them the questions above?

EDIT: I see there's a section titled "How to submit a bug report" on the Installation Instructions page for the driver, but that really only tells me how to gather all the useful information about my system. And once I have all that, what do I do with it? Where do I send it? ( <-- These are more rhetorical questions for AMD, than real ones for you, ComputeCpp developers. )

DuncanMcBain commented 6 years ago

I'm not really sure how to contact AMD for bug reports about their drivers. Maybe some of my colleagues have more experience with their hardware - I'm using quite an old one right now, which isn't particularly relevant to this issue.

RafalKonklewski commented 6 years ago

I've just managed to find a working solution to this in a different discussion.

Add this to your ~/.bashrc:
export GPU_FORCE_64BIT_PTR=1
export GPU_USE_SYNC_OBJECTS=1
export GPU_MAX_ALLOC_PERCENT=100
export GPU_SINGLE_ALLOC_PERCENT=100
export GPU_MAX_HEAP_SIZE=100
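When experimenting with these variables one at a time, it helps to confirm which of them actually reached the process, since ~/.bashrc is only read by new interactive shells. A small sketch follows, assuming a POSIX-like environment; `missing_workaround_vars` is a hypothetical helper, and nothing in this thread establishes that any of these variables affects the SPIR build failure.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Variable names copied from the comment above. Whether any of them has an
// effect on the build failure is not established in this thread.
const char* const kWorkaroundVars[] = {
    "GPU_FORCE_64BIT_PTR", "GPU_USE_SYNC_OBJECTS", "GPU_MAX_ALLOC_PERCENT",
    "GPU_SINGLE_ALLOC_PERCENT", "GPU_MAX_HEAP_SIZE"};

// Returns the names from kWorkaroundVars that are not set in the current
// process environment. (Hypothetical helper, for illustration only.)
std::vector<std::string> missing_workaround_vars() {
  std::vector<std::string> missing;
  for (const char* name : kWorkaroundVars) {
    if (std::getenv(name) == nullptr) missing.push_back(name);
  }
  return missing;
}
```

Calling this at the start of a sample and printing the result shows which variables the run was really tested with.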

mirh commented 6 years ago

Are all of them needed together, or is one of them alone just enough?

DuncanMcBain commented 6 years ago

Hi @RafalKonklewski, were you seeing the same issue as @torokati44, or a different one? Is this the same hardware, or different? Using the code snippet from earlier, what error did you see? I'm quite confused!

ChrisSwinchatt commented 6 years ago

Hi all,

I'm seeing the same error as @torokati44 on Linux Mint (Ubuntu 18.04) with RX 480 and AMDGPU 18.20, using ComputeCpp 0.9.1 with @lukeiwanski's TensorFlow. Tried @RafalKonklewski's solution to no effect. My clinfo also reports that cl_khr_spir is available.

When TensorFlow's configure asks for a SYCL bitcode target (default spir64) is it possible a different value would work? I don't know what, if any, other targets there are (can't find any in configure.py).

mirh commented 6 years ago

https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/blob/master/runtime/utils/flags.hpp These flags should be supported by both Orca and what I think is the legacy driver. Plus GPU_MAX_COMMAND_BUFFERS, GPU_MAX_SUBALLOC_SIZE and PAL_RGP_DISP_COUNT, after checking the ICD strings.

You are free to tinker with them. And bitcode targets should be these I think?

DuncanMcBain commented 6 years ago

The only bitcode target that would be relevant for AMD hardware is spir64, but even support for that is patchy (in that AMD no longer wanted to support it, and therefore it only works on some driver and hardware combinations).

torokati44 commented 6 years ago

But SPIR-V has to be supported in a "proper" Vulkan (and cutting-edge OpenGL) driver, right? Does anyone know what's the status of that in amdgpu-pro? And you said you can emit that from ComputeCpp. Why shouldn't this work?

Ruyk commented 6 years ago

Unfortunately, SPIR-V support for graphics is different from SPIR-V support for compute, and having one in the driver does not mean having support for the other. You can always ask AMD their plans to support SPIR-V for compute in their driver.

torokati44 commented 6 years ago

I see. Sorry for the silly questions. Let's hope AMD keeps properly supporting compute in amdgpu-pro, not just in ROCm...

Ruyk commented 6 years ago

It's not a silly question at all! It's actually quite confusing, and we have many people asking the same question :-)

GoogleBot42 commented 5 years ago

Sorry for commenting on an old issue. I found a rather terrible workaround.

AMD seems to have dropped SPIR support for OpenCL 1.2, as per this thread: https://community.amd.com/thread/232093. It looks like they dropped support in version 18.20 of the Linux driver. So if you revert to 18.10, it works. At least, "it works on my machine".

Here's a link to the driver for those who want to try. https://www.amd.com/en/support/kb/release-notes/rn-rad-pro-lin-18-10

Obviously, being locked into an old version is far from ideal but what else can we do... :/ It is possible to only install the opencl component from the driver to minimize the problem of being locked into old software. See the arch linux pkgbuild for what I'm talking about https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=opencl-amd. As such, on my current system, I'm using the open source mesa driver with the opencl component from AMD on top.

(I am using an RX 480 for anyone who might want to know)
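The version cutoff reported in this comment can be written down as a quick check on a driver version string such as "18.20-579836". This is a sketch under a loud assumption: the 18.20 boundary comes from a single report in this thread on an RX 480, and other cards or driver branches may behave differently. `parse_driver_version` and `is_older_than_18_20` are hypothetical helper names.

```cpp
#include <sstream>
#include <string>

// Parses the leading "major.minor" of a driver version string such as
// "18.20-579836". Returns false if the string does not start that way.
// (Hypothetical helper, for illustration only.)
bool parse_driver_version(const std::string& version, int& major, int& minor) {
  std::istringstream in(version);
  char dot = '\0';
  if (!(in >> major >> dot >> minor)) return false;
  return dot == '.';
}

// True if the version is strictly older than 18.20, the release in which
// SPIR support was reportedly dropped (an assumption based on this thread).
// Unparseable strings yield false.
bool is_older_than_18_20(const std::string& version) {
  int major = 0;
  int minor = 0;
  if (!parse_driver_version(version, major, minor)) return false;
  return major < 18 || (major == 18 && minor < 20);
}
```

A `true` result only means the driver predates the reported removal. It does not guarantee that SPIR builds work on that driver.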

mirh commented 5 years ago

Mhh.. the amd guy just says that applies to "Caribbean Islands"+ gpus (or maybe it meant CIK). In turn gcn 3rd gen (or maybe some 2nd) is suspiciously close to being the target of ROCm. (which I never really understood their stance on supporting SPIR...) Maybe they switched opencl to that?

Anyway... I guess this very intriguing fact goes to add to the big funny list of protips for AMD's CL linux drivers.

GoogleBot42 commented 5 years ago

@mirh

> the amd guy just says that applies to "Caribbean Islands"+ gpus

Yeah I skimmed over that part. Thanks for clarifying.

> which I never really understood their stance on supporting SPIR...

It would be nice if they had a more consistent implementation of compute. It would really improve their odds at going against cuda in the coming years.