easybuilders / easybuild-easyconfigs

A collection of easyconfig files that describe which software to build using which build options with EasyBuild.
https://easybuild.io
GNU General Public License v2.0
378 stars 701 forks source link

GCC + OpenMP race conditions with inconsistent linker flags #12563

Open upsj opened 3 years ago

upsj commented 3 years ago

Note that this is not a bug in EasyBuild, but a note for how things can subtly break compared to system GCC

Due to small differences in how EasyBuild builds GCC compared to most packaged system GCC versions, the following program can fail due to race conditions in updates to the std::shared_ptr reference counters since the updates assume that they happen single-threaded, and as such don't use atomics.

testlib.cpp

#include <memory>
void foo(std::shared_ptr<int> f)
{
#pragma omp parallel for
    for (size_t g = 0; g < 100; g++) {
        auto other = f;
    }
}

tester.cpp

#include <memory>
void foo(std::shared_ptr<int> f);
int main() {
        foo(std::make_shared<int>(4));
}

compiling with

g++ -o tester.cpp.o -c tester.cpp
g++ -fPIC -fopenmp -o testlib.cpp.o -c testlib.cpp
g++ -fPIC -shared -o libtestlibd.so testlib.cpp.o -lgomp -lpthread
g++ tester.cpp.o -o tester -Wl,-rpath,`pwd` libtestlibd.so

can cause segfaults and other nice things when running ./tester with multiple threads.

I'll just copy @Flamefire's explanation in here for why this happens:

The problem is indeed a race condition. The c++ headers for threading contain a function gthread_active_p which returns whether the program is multihreaded, and if that returns false, then the atomics decay to regular updates. The problem here comes from the program thinking it is single-threaded, while it is not. This can be checked by running the tester in gdb and b gthread_active_p if gthread_active_ptr==0 I even found it to differ between threads. gcc determines the single/multithread by checking for the symbol pthread_key_create or more explicitely: &gthrw_pthread_key_create which is static typeof(pthread_key_create) _gthrwpthread_key_create attribute ((weakref("__pthread_key_create")));I remember explanations for -pthread vs -lpthread where the former does more than the latter (e.g. on PPC) which sounds similar to this issue. The different values between threads are caused by that variable which is used to check for pthread_key_create beeing a static variable, hence that is local to each C++ file and initialized on program start. So its not different values between threads, but between different calls originating from the main cpp or the sublib cpp From what I can tell the fact that it works on system GCC is likely due to the dynamic linker used to load the linked program. However there are differences. I've created an even simpler test file and linked with system and EB and on system the libraries reported by ldd are a lot less (e.g. libpthread, libstdc++ and more is missing)From -dumpspecs I see a default -as-needed, so I guess that is the difference.

A workaround/fix for these issues is adding -fopenmp, -pthread or -lpthread to your linker flags across all levels of your application that use the OpenMP-enabled library.

Related failure in real code: https://github.com/ginkgo-project/ginkgo/issues/732

boegel commented 3 years ago

@upsj Thanks a lot for getting this "on record".

As far as I can tell, there's nothing to change on the EasyBuild side for this? It's just a weird problem that popped up for (to me at least) a very technical implementation issue?

@Flamefire, @zao: Anything to add? Can we close this now we have it on record?

Flamefire commented 3 years ago

Just a minor clarification:

since the updates assume that they happen single-threaded, and as such don't use atomics.

--> Users are required to link to pthreads if any dependency uses it.

upsj commented 3 years ago

Just a minor comment on the clarification 😄 I think the ODR violation happens because of a pthread symbol, not the std::shared_ptr constructor/destructor, since during object compilation, there is no knowledge about whether the code will be executed sequentially or in concurrently. There is also this related SO answer, which most likely has the same underlying cause and discusses the differences in linker behavior https://stackoverflow.com/a/47241854/8217180

Flamefire commented 3 years ago

Yes and no. The relevant __gthread_active_p called by the constructor/destructor is different due to that being static. So the shared_ptr functions are indeed different and behave differently. So well... Somewhere is an ODR violation ;)

What I found interesting: The answer refers to --as-needed and ld.gold. In EB we use ld-gold by default and at least my system GCC uses --no-as-needed by default, while EB does not. That would likely be something we could change in EB: Build GCC with --no-as-needed too.