ROCm / ROCm-CompilerSupport

The compiler support repository provides various Lightning Compiler related services.
47 stars 31 forks source link

comgr lacks MT safety #27

Closed atamazov closed 1 year ago

atamazov commented 4 years ago

It seems like comgr is not working properly in multithreaded contexts. This prevents it from full-blown usage in MIOpen.

How to reproduce

Prerequisites: A linux machine with gfx900/906/908 GPU card installed (Radeon VII preferred), vanilla ROCm 3.3.

:red_circle: There are two problems:

You can disable MT builds in MIOpen by the following env setting:

export MIOPEN_COMPILE_PARALLEL_LEVEL=1

and see that driver works normally in this case.

scott-linder commented 4 years ago

As an update on this, the first step will be to define the MT safety of all non-trivially-MT-safe APIs in Comgr to be MT-unsafe. Then future work to give APIs the most meaningful/useful safety guarantees can be backwards-compatible.

For now, assume all Comgr APIs are de-facto MT-unsafe.

atamazov commented 1 year ago

@scott-linder I heard that there are some updates wrt this ticket, Can you please shed some light on this?

kzhuravl commented 1 year ago

CC @lamb-j

atamazov commented 1 year ago

CC @averinevg

lamb-j commented 1 year ago

We recently added two patches to Comgr related to this:

Currently we're enforcing thread safety by enclosing the LLVM calls that have unexpected side effects with mutex locks. As more updates are made to upstream LLVM related to these unexpected side effects, we hope to remove these locks to allow more parallelism in a multi-threaded Comgr context.

However there have been a couple of reports of potential bugs with the new Comgr MT approach. I haven't had a chance to fully investigate these yet.

Could you retest MIOpen with a newer Comgr including the above patches?

atamazov commented 1 year ago

@lamb-j

Currently we're enforcing thread safety by enclosing the LLVM calls that have unexpected side effects with mutex locks... However there have been a couple of reports of potential bugs with the new Comgr MT approach.

Just in case, have you taken into account that compilation is often a series of COMgr actions? For example, if you build two HIP sources in parallel, then there are two series of actions:

Without proper serialization these two series or actions will likely run in parallel, which may lead to unexpected side effects.

In MIOpen we can use mutex at higher level (and we did that), which guarantees proper serialization of COMgr calls.

lamb-j commented 1 year ago

@atamazov Good question. If I understand correctly, you're asking about a possible situation like this:

Thread 1 --> AMD_COMGR_ACTION_COMPILE_SOURCE_TO_BC Thread 2 --> AMD_COMGR_ACTION_COMPILE_SOURCE_TO_BC Thread 1 --> AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES Thread 2 --> AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES

Will the interleaving of the two thread's actions have unexpected side effects?

Short answer: They shouldn't.

Long answer: The locks are added at the action level. Comgr doesn't maintain any sort of global state for completed actions, and the LLVM global state is cleared for each action (see COMGR::clearLLVMOptions()). Therefore each action should be completely independent, and it shouldn't make any difference how they're interleaved.

atamazov commented 1 year ago

@lamb-j Good to know that, Jacob. Thanks.

lamb-j commented 1 year ago

@atamazov Closing this for now. Feel free to re-open if you have any other multi-threading issues with Comgr! Hopefully we can move to actual parallel compilation with Comgr in the future (that is, remove the Comgr locks around actions without causing unexpected side effects).

atamazov commented 1 year ago

Basically, this is about actual parallel compilation, because we need it for performance. I recommend re-opening this ticket (I can't).

lamb-j commented 1 year ago

Good point. We can track the performance/parallel issue here:

https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/issues/57