devinamatthews / tblis

TBLIS is a library and framework for performing tensor operations, especially tensor contraction, using efficient native algorithms.
BSD 3-Clause "New" or "Revised" License

Support for multiple OpenMP runtimes in the same process #47

Open rohany opened 2 years ago

rohany commented 2 years ago

I'm using TBLIS in a system that has support for running multiple OpenMP runtimes within the same process, which is somewhat unusual. I'm tracking down some weird performance issues (https://github.com/StanfordLegion/legion/issues/1266) when using TBLIS in this situation, and am wondering if there are some architectural issues within TBLIS (such as global state / locks) that could cause interference between independent TBLIS calls on these different OpenMP runtimes.

devinamatthews commented 2 years ago

During normal computation, the only global locking is done when checking out a block of memory from the global pool. This should scale roughly the same as if you were using one OpenMP runtime across all of the cores in the first place. I'm not exactly sure what "running multiple OpenMP runtimes within the same process" even means, though. Everything else operates only among the threads spawned by #pragma omp parallel, so if those thread groups are distinct then there shouldn't be any performance impact.
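To make the point about the single global lock concrete, here is a minimal sketch of a block pool whose only shared state is a mutex-guarded free list, exercised from an OpenMP parallel region. The names BlockPool and run_checkouts are hypothetical and chosen for illustration; this is not TBLIS's actual pool implementation, only the general pattern described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

// Hypothetical pool (not TBLIS's class): a free list guarded by one global mutex.
class BlockPool {
public:
    explicit BlockPool(std::size_t block_size) : block_size_(block_size) {}
    ~BlockPool() {
        for (void* p : free_) std::free(p);
    }
    BlockPool(const BlockPool&) = delete;
    BlockPool& operator=(const BlockPool&) = delete;

    // Check out a block: the lock is held only while touching the free list.
    void* acquire() {
        {
            std::lock_guard<std::mutex> guard(lock_);
            if (!free_.empty()) {
                void* p = free_.back();
                free_.pop_back();
                return p;
            }
        }
        // Free list was empty: allocate outside the lock.
        return std::malloc(block_size_);
    }

    // Return a block to the free list for later reuse.
    void release(void* p) {
        std::lock_guard<std::mutex> guard(lock_);
        free_.push_back(p);
    }

    // Number of blocks currently cached in the free list.
    std::size_t cached() {
        std::lock_guard<std::mutex> guard(lock_);
        return free_.size();
    }

private:
    std::size_t block_size_;
    std::mutex lock_;
    std::vector<void*> free_;
};

// Each thread of the parallel region performs `iterations` checkout/return
// cycles. Returns the total number of cycles completed across all threads.
// Without OpenMP enabled the pragma is ignored and the loop runs once, serially.
std::size_t run_checkouts(BlockPool& pool, int iterations) {
    std::size_t completed = 0;
    #pragma omp parallel reduction(+ : completed)
    {
        for (int i = 0; i < iterations; ++i) {
            void* block = pool.acquire();
            assert(block != nullptr);
            static_cast<char*>(block)[0] = 1;  // touch the block: thread-private work
            pool.release(block);
            ++completed;
        }
    }
    return completed;
}
```

In this sketch, threads from any number of thread groups contend only inside acquire() and release(); the work on the checked-out block itself involves no shared state. Calling run_checkouts(pool, n) from two independent thread groups would therefore serialize only on the free-list mutex.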

jeffhammond commented 2 years ago

Multiple OpenMP runtimes in a single process is not a legal use case for OpenMP. Nothing in the specification requires it to work, and there are good reasons why it cannot work.

You'll find that if you use the KMP runtime in Intel and LLVM, it supports the GOMP symbols required to interoperate with code compiled against GCC's OpenMP runtime (libgomp). Just make sure only KMP is in the library load path.

You might need to set an Intel compiler flag to force the use of GOMP symbols instead of IOMP5 for this to work perfectly.

NVHPC (formerly PGI) also has a GOMP compatible runtime. The same caveat about a compiler flag for the OpenMP runtime ABI may apply.

You'll find interoperability is more robust for common OpenMP features. Some parts of OpenMP 4+ tasking and target offload are less reliable when mixing runtimes, but I don't think that's relevant to TBLIS.

jeffhammond commented 2 years ago

Related:

https://www.openmp.org/spec-html/5.0/openmpsu152.html#x189-8940003.2.43

https://link.springer.com/chapter/10.1007/978-3-319-45550-1_14