GTkorvo / dill

DILL provides instruction-level code generation, register allocation and simple optimizations for generating executable code directly into memory regions for immediate use.
Other
5 stars 6 forks source link

dill build fails on crusher@olcf w/ clang-14.0.0 during ADIOS2 build #27

Open steindev opened 2 years ago

steindev commented 2 years ago

Building ADIOS2 2.8.0 on crusher (https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html) fails during step

[  3%] Building C object thirdparty/dill/dill/CMakeFiles/dill.dir/dill_util.c.o

I cannot draw any useful information from the error output. It is attached together with all output obtained during both configure and build phase.

configure.out.txt configure.err.txt build.out.txt build.err.txt

Disabling SST functionality of ADIOS2 by using option -DADIOS2_USE_SST=OFF when building ADIOS2 workarounds the issue for me (as I do not want to use SST anyways).

I am happy to provide more information, e.g. regarding used modules and compiler flags, if needed.

eisenhauer commented 2 years ago

Actually, yes, modules and compiler flags would be useful. Not sure we can fix this because it looks like a compiler problem, but I wasn’t able to reproduce this on first try.

steindev commented 2 years ago

Here is a build script. I had to add .txt to the filename in order to be able to upload it install-adios2.sh.txt .

eisenhauer commented 2 years ago

So, here's what I've found out. First, this looks to be a compiler bug. llvm seems to be faulting in llvm::BasicBlock::isEntryBlock(). While it does give us an llvm stack dump, there's no real clue as to what in the source code might be triggering the bug. Second, I have modified your script to build dill outside of ADIOS (just substituting the dill github repo for the ADIOS spec), and I don't see the failure in that circumstance. So, while the issue seems to be inside the compiler, the ADIOS configuration may be adding flags or something else that are impacting whether or not it's triggered with this source. This obviously complicates things somewhat.

This is all a bit odd. Dill does dynamic code generation, so has need to use asm() specs and other odd things that might confuse an optimizer. But that generally happens in the machine-specific source files, not dill_util.c, which contains relatively generic code.

I think there are a couple of possible paths forward. Probably one needs to be submitting a bug report on llvm. For that, it would probably be helpful to try to sort out more details, narrowing down to the problematic part of dill_util.c, and or figuring out why this happen during the ADIOS build and not outside of it. The second path might be to sort out a workaround that would let you build ADIOS on crusher. The same experimenting required to narrow down the circumstances of the failure might also lend a clue to how it might be avoided. I'm both the maintainer of Dill and involved in ADIOS, so I'm positioned to work on this, but I've also got a stack of other things to work on, so progress might be slow. To the extent that you can help with the experimentation necessary to see why this is failing sometimes and not others, that would help speed resolution.

khuck commented 2 years ago

Is there an update on this? I ran into the same issue, and I found that updating the ADIOS2/thirdparty/dill/CMakeLists.txt and forcing CMAKE_C_COMPILER to gcc (and CXX to g++ and Fortran to gfortran) allowed the build to continue just fine. Possibly a solution is for ADIOS2 to build dill with the GCC toolchain when configured with clang/clang++/flang and friends. Alternatively...is it possible this code in the main ADIOS2 CMakeLists.txt file could be the culprit?

if(CMAKE_C_COMPILER_ID MATCHES "^(GNU|Intel|Clang|AppleClang|MSVC)$")
  set(ADIOS2_C99_FEATURES c_restrict)
else()
  set(ADIOS2_C99_FEATURES c_std_99)
endif()
khuck commented 2 years ago

update: I changed the ADIOS2 cmake configuration to not force -std=gnu99, and that didn't make a difference. I think this is a clang/llvm bug.

eisenhauer commented 2 years ago

I tend to agree with that diagnosis... If anyone can identify a workaround, I'm happy to push it into dill and then ADIOS. But unfortunately I don't have cycles for it myself at the moment.

jychoi-hpc commented 2 years ago

It looks like "-O3" optimization is causing some problem. RelWithDebInfo or Debug option with cmake, I can compile.

eisenhauer commented 2 years ago

Thanks Jong, that's helpful. I can maybe get on crusher, replicate the problem and see if there's a way to rearrange the code so that it doesn't trigger this compiler bug. Also, a quick google doesn't show any reports of clang segfaulting in IsEntryBlock, so this may be an unreported problem. If I can narrow down the problem I can also likely get a test case to submit a bug report. (Of course, finding a code-based workaround is probably the best short-term solution. Bug reporting is a long play.

eisenhauer commented 2 years ago

OK, I have a code workaround: diff --git a/dill_util.c b/dill_util.c index e50c302..10e0f32 100644 --- a/dill_util.c +++ b/dill_util.c @@ -785,6 +785,7 @@ dill_start_proc(dill_stream s, char name, int ret_type, char arg_str) (s->j->proc_start)(s, name, arg_count, args, NULL); }

+attribute((optnone)) void init_code_block(dill_stream s) {

That is, just add "attribute((optnone))" to the init_code_block() function in dill_util.c. (The problem seems to be arising when that clang tries to inline that function in places where it's called. Adding the optnone disables that.)

I need to see how I can put this in the source so that it doesn't break on non-clang compilers, but that might not happen today as I have other things on my plate. But in the meantime, you have a workaround to compile ADIOS.

khuck commented 2 years ago

@eisenhauer you might be able to wrap it with #ifdef __clang__ / #endif?

eisenhauer commented 2 years ago

The recommendation seems to be like:

ifndef __has_attribute // Optional of course.

define __has_attribute(x) 0 // Compatibility with non-clang compilers.

endif

...

if __has_attribute(optnone)

define OPTNONE_ATTR attribute((optnone))

else

define OPTNONE_ATTR

endif

I'll give that a go and see how far it gets me...

chuckatkins commented 2 years ago

FWIW I can easily reproduce this issue on my local workstation. It fails with the both 5.0.2 and 5.1.2 of the rocm stack, both based on llvm 14, but succeeds with 4.5.2, based on llvm 12. The following patch resolved the issue from the build system end:

diff --git a/thirdparty/dill/dill/CMakeLists.txt b/thirdparty/dill/dill/CMakeLists.txt
index db8183439..60bb35604 100644
--- a/thirdparty/dill/dill/CMakeLists.txt
+++ b/thirdparty/dill/dill/CMakeLists.txt
@@ -56,6 +56,15 @@ if(WIN32)
   endif()
 endif()

+# -O3 is known to crash some flavors of clang 14.
+if(CMAKE_C_COMPILER_ID MATCHES "Clang" AND
+   CMAKE_C_COMPILER_VERSION VERSION_GREATER_EQUAL 14.0.0)
+  foreach(_var_suffix IN ITEMS "" _DEBUG _RELEASE _RELWITHDEBINFO _MINSIZEREL)
+    string(REGEX REPLACE "-O(2|3)" "-O" CMAKE_C_FLAGS${_var_suffix}
+      "${CMAKE_C_FLAGS${_var_suffix}}")
+  endforeach()
+endif()
+
 # Setup shared library defaults.  If explicitly specified somehow, then default 
 # to that.  Otherwise base the default on whether or not shared libs are even
 # supported.

I'm more partial to the build solution over an in-source patch since my gut tells me to not trust that opt level for the rest of dill, even if the compiler doesn't segfault. Up to @eisenhauer though as to which approach he prefers.