`SimpleRWLock::EnterRead` shows up high in CPU usage in some cases

dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

https://docs.microsoft.com/dotnet/core/

MIT License

15.14k stars 4.71k forks source link

`SimpleRWLock::EnterRead` shows up high in CPU usage in some cases #65199

Open kouvel opened 2 years ago

kouvel commented 2 years ago

SimpleRWLock seems to be a purely spin-wait-based data structure, with long spin-waits between attempts. In some cases during contention the spin-waiting CPU time is showing up near the top of methods in exclusive CPU time spent. It should be possible to tweak the spin-waits to be shorter, and maybe also back-off to a full wait in some cases.

ghost commented 2 years ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

Issue Details

`SimpleRWLock` seems to be a purely spin-wait-based data structure, with long spin-waits between attempts. In some cases during contention the spin-waiting CPU time is showing up near the top of methods in exclusive CPU time spent. It should be possible to tweak the spin-waits to be shorter, and maybe also back-off to a full wait in some cases.

Author:	kouvel
Assignees:	-
Labels:	`area-System.Threading`
Milestone:	7.0.0

jkotas commented 1 year ago

What are the hot stacktraces? Would it be better to get rid of this lock in the offending places instead?

kouvel commented 1 year ago

From the history I have, it was apparently contention in the type loader, under JIT_GenericHandleClass / JIT_GenericHandle_Framed, probably during startup. I don't have any history on whether an RW lock would benefit over a simpler lock in this case. It may be possible to create a small test case to show the issue and measure.

jkotas commented 1 year ago

JIT_GenericHandleClass / JIT_GenericHandle_Framed

Hmm, g_pJitGenericHandleCache that backs implementation of these APIs is lock-free for read and it does not use SimpleRWLock.

kouvel commented 1 year ago

Here's a longer caller stack that I have from before:

- SimpleRWLock::EnterRead
- LoaderAllocator::GetDispatchToken
- VirtualCallStubManager::GetCallStub
- Dictionary::PopulateEntry
- JIT_GenericHandle_Framed
- JIT_GenericHandleClass

kouvel commented 1 year ago

SimpleRWLock::EnterRead in that case was using about half the CPU time of CLRLifoSemaphore::Wait, which seemed to be high

jkotas commented 1 year ago

If the dispatch token hashtable is hot in some scenarios, we should be able to switch it to be lock-free on read hashtable.

davidwrighton commented 1 year ago

I believe we only use a SimpleRWLock in those cases when creating/working with a FatDispatchToken, and I removed the need for fat dispatch tokens on 64 bit platforms with PR #65045. My guess is that we probably don't have this particular problem with this callstack on X64 or Arm64 platforms anymore.