frerich / clcache

A compiler cache for MSVC, much like ccache for gcc

Optimistic locking to improve concurrent cache misses #244

Closed frerich closed 8 years ago

frerich commented 8 years ago

This PR attempts to address the issue observed in #229. The theory is that lock timeouts were triggered because -- in direct mode -- clcache holds two locks while invoking the real compiler: one manifest section lock and one artifact section lock. So even though there are 256 sections of each category, if a compiler execution takes sufficiently long, it is entirely possible that a concurrent clcache invocation tries to access the same manifest or artifact section -- and times out.

The patch supplied by this PR attempts to address this by implementing a more optimistic locking strategy: when a cache miss is detected, clcache releases all locks, invokes the real compiler, and only then re-acquires the locks and checks whether the cache still needs updating. This means there is a minuscule chance that clcache compiles the same object twice (if two concurrent invocations are exactly equivalent), but it optimizes the common case of concurrent invocations being independent.

This behaviour is implemented by inlining a fair amount of code in processDirect and processNoDirect and then reshuffling things so that no locks are held while the real compiler is invoked; a rough sketch of the resulting flow follows below.
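To make the change easier to follow, here is a minimal sketch of that flow. It is not the actual clcache code; names such as manifestLock, artifactLock, cacheHasEntry, getFromCache and addToCache are placeholders invented for illustration.

```python
# Hypothetical sketch of the optimistic locking strategy; the cache API used
# here (manifestLock, artifactLock, cacheHasEntry, getFromCache, addToCache)
# is made up for illustration and is not clcache's real interface.

def compileWithOptimisticLocking(cache, key, invokeRealCompiler):
    # First look-up under the locks: cheap, and the common hit case stays fast.
    with cache.manifestLock(key), cache.artifactLock(key):
        if cache.cacheHasEntry(key):
            return cache.getFromCache(key)

    # Cache miss: release all locks before running the (slow) real compiler,
    # so concurrent invocations hashing to the same lock section no longer
    # time out while waiting for us.
    compilerResult = invokeRealCompiler()

    # Re-acquire the locks and re-check: another invocation may have stored
    # the entry in the meantime, so the cache is updated at most once.
    with cache.manifestLock(key), cache.artifactLock(key):
        if not cache.cacheHasEntry(key):
            cache.addToCache(key, compilerResult)

    return compilerResult
```

In the worst case two exactly equivalent invocations both run the real compiler, as described above, but neither blocks on the other's lock while doing so.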

codecov-io commented 8 years ago

Current coverage is 89.25% (diff: 98.11%)

Merging #244 into master will increase coverage by 0.28%

@@             master       #244   diff @@
==========================================
  Files             1          1          
  Lines          1006       1014     +8   
  Methods           0          0          
  Messages          0          0          
  Branches        164        169     +5   
==========================================
+ Hits            895        905    +10   
  Misses           83         83          
+ Partials         28         26     -2   

Powered by Codecov. Last update 005e0f7...17bc495

siu commented 8 years ago

This helps in my use case. The version on master requires a huge timeout (600000 ms); with these changes it compiles successfully with the default (10000 ms). 👍

About the compilation times: the cold cache case is a bit faster and the hot cache case is very similar.

I will have a look at the code too.

siu commented 8 years ago

Here is the result of a run with CLCACHE_PROFILE enabled: coldcache.zip
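For anyone who wants to dig into the numbers themselves: the archive contains cProfile dumps that can be aggregated with the standard pstats module. The glob pattern below is only an assumption about how the profile files are named after unpacking.

```python
# Aggregate cProfile dumps produced with CLCACHE_PROFILE and show hot spots.
# "coldcache/*.prof" is an assumed path/naming scheme for the unpacked files.
import glob
import pstats

profileFiles = glob.glob("coldcache/*.prof")
stats = pstats.Stats(*profileFiles)
stats.sort_stats("cumulative").print_stats(20)   # 20 most expensive entries
stats.print_callers("WaitForSingleObject")       # who spends time waiting?
```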

inorton commented 8 years ago

Nice idea. About your thought that it "might compile the same thing twice": I doubt that is a real worry, since even without clcache, invoking cl concurrently on the same files with the same settings and arguments would cause one of the invocations to fail anyway.

frerich commented 8 years ago

@siu Thanks a lot for your thorough review, despite the fact that this patch is not exactly trivial.

Also, thanks a lot for actually trying it on a real project and providing the CLCACHE_PROFILE output. I was mostly worried that I somehow missed a case where two cache updates might clash, but that appears not to be the case? Alas, these concurrency-related tickets are very hard for me to debug since the strongest Windows machine I have access to has only two cores, so there is only very limited 'real' concurrency.

The profile you uploaded was quite interesting.

My impression is that there is room for improvement -- but not much, at least not for cache misses. I think it's worthwhile to start working on a background process which takes care of hashing files and memorising those hashes; cache misses would probably benefit a little bit, but it would be a major improvement for cache hits.
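To sketch what that memoisation could look like (purely hypothetical, not part of this PR): hashes could be keyed on the file's path, size and modification time, so unchanged headers are only hashed once per build session.

```python
# Hypothetical sketch of memoising file hashes; clcache's real hashing helpers
# are not shown here, and a background process would keep this map alive
# across compiler invocations.
import hashlib
import os

_hashCache = {}   # (path, size, mtime) -> hex digest

def memoisedFileHash(path):
    st = os.stat(path)
    key = (path, st.st_size, st.st_mtime_ns)
    digest = _hashCache.get(key)
    if digest is None:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        _hashCache[key] = digest
    return digest
```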

siu commented 8 years ago

Note that WaitForSingleObject is not only called from subprocess.wait; it is also the call used to acquire the locks. In this case, though, that entry in the profile seems to be dominated by the subprocess call.

In fact, I tried applying some of the changes I was suggesting locally and could not see any relevant change in the profiler. The time spent reading manifests and deciding what to do is negligible compared to the time it takes to compile, and it didn't help with a hot cache either.

I think we should profile other use cases before drawing any conclusions. Maybe a mixed execution with only 50% of the files in the cache would lead to more manifest or artifact section clashes and could benefit more from reducing the time the manifest section is locked. Anyway, all this can be done later; for the moment it would just add complexity.

I had a second look at the code. It looks safe in my opinion.

frerich commented 8 years ago

It's true that WaitForSingleObject is used in multiple places. However, in the profile report you generated, I see these lines further down (in the section which shows where each function was called from):

Function                                                        was called by...
                                                                    ncalls  tottime  cumtime
[...]
subprocess.py:1154(wait)                                        <-    1214    0.078 9460.795  clcache.py:1159(invokeRealCompiler)
{built-in method WaitForSingleObject}                           <-    1214 9460.709 9460.709  subprocess.py:1154(wait)

I.e. in this case, it appears that the only caller of WaitForSingleObject was subprocess.py:1154(wait), and the only caller of that was invokeRealCompiler. Quite frankly, it's not clear why the WaitForSingleObject calls made by CacheLock.acquire don't show up; in fact, there's no mention of the CacheLock calls at all. Maybe they were just so fast that the profiler decided not to list them? Strange...
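For context, acquiring one of the cache locks boils down to the same Win32 primitive as waiting for a child process, which is why both would be attributed to WaitForSingleObject. A simplified, Windows-only sketch of such a named-mutex lock (not the actual CacheLock code) might look like this:

```python
# Simplified sketch of a named-mutex lock via ctypes; clcache's real CacheLock
# differs in detail, but both lock acquisition and subprocess.wait() end up in
# kernel32.WaitForSingleObject, the call that dominates the profile.
import ctypes

_kernel32 = ctypes.windll.kernel32          # Windows only
WAIT_OBJECT_0 = 0x00000000
WAIT_ABANDONED = 0x00000080

class NamedMutexLock:
    def __init__(self, mutexName, timeoutMs):
        self._timeoutMs = timeoutMs
        self._handle = _kernel32.CreateMutexW(None, False, mutexName)
        if not self._handle:
            raise ctypes.WinError()

    def acquire(self):
        result = _kernel32.WaitForSingleObject(self._handle, self._timeoutMs)
        if result not in (WAIT_OBJECT_0, WAIT_ABANDONED):
            raise OSError("timed out acquiring mutex after %d ms" % self._timeoutMs)

    def release(self):
        _kernel32.ReleaseMutex(self._handle)
```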

siu commented 8 years ago

Ah, yes, that is true. The missing calls in the profile may be a limitation of cProfile?

frerich commented 8 years ago

Glad to hear that there's mostly positive feedback so far. I'm curious how it works for @akleber who reported #229 in the first place. :-)

akleber commented 8 years ago

Thank you very much, @frerich, for this PR, which I am eager to test; unfortunately, I will not be able to test it before 28.11.

frerich commented 8 years ago

@akleber Sure, no rush. For what it's worth, I also just submitted #248 which hopefully is an effective approach to tackling #239 :-)

ivaigult commented 8 years ago

Before these changes I wasn't able to compile our project due to the timeout. I played around with CLCACHE_OBJECT_CACHE_TIMEOUT_MS, but it didn't help at all. With these changes everything works fine and I've got a 4x build speedup :-). Thank you so much!

sasobadovinac commented 8 years ago

I can also confirm good results with this and the latest master for both #155 and #229 https://ci.appveyor.com/project/sasobadovinac/freecad/build/1.0.377

akleber commented 8 years ago

@frerich This PR solves my issue reported in #229. It did not improve the overall build time by a significant amount, but it gets rid of the timeout. So a big thumbs up for this PR. Thanks. Looking into #248 now...

akleber commented 7 years ago

@frerich I am sorry to report that I messed up my tests with this PR: by mistake, I did not use a clcache version that contained this change. Sorry about that. I re-ran my tests and still get a timeout. However, this PR improves my situation, as my build now passes with a timeout of "only" 20s instead of the 60s I previously had to use. I have not had time to analyze the timeout situation yet, but I will look into it.

frerich commented 7 years ago

@akleber No worries, it's of course perfectly possible that there are still other scenarios in which timeouts are triggered unexpectedly; the reports from you and @TiloW in #229 suggest as much. I think there is some consensus that this PR implements a gentler locking strategy, so it moves things in the right direction. Remaining problems can be tracked in separate issues.