frerich / clcache

A compiler cache for MSVC, much like ccache for gcc

CacheLockException: Failed to acquire lock #251

Open TiloW opened 7 years ago

TiloW commented 7 years ago

Here's some general info about our system:

Windows Server 2012 R2 x64 6 GB RAM E5-2650

We are using this machine for CI. Build duration is usually 5 to 8 minutes and mostly spent on compilation. Every build triggers up to 4 compiler calls in parallel and there can be two parallel builds (up to 8 concurrent compiler calls).

Since trying out clcache for a few days we have disabled it again because of cache errors still existing after #244.

Here's the error message we are receiving every now and then:

  Traceback (most recent call last):
    File "clcache.py", line 1672, in <module>
    File "clcache.py", line 1509, in main
    File "clcache.py", line 1542, in processCompileRequest
    File "clcache.py", line 1622, in processDirect
    File "clcache.py", line 290, in __enter__
    File "clcache.py", line 314, in acquire
  __main__.CacheLockException: Failed to acquire lock Local\C--Users-ogdf-clcache-manifests-45 after 30000ms; try setting CLCACHE_OBJECT_CACHE_TIMEOUT_MS environment variable to a larger value.
  Failed to execute script clcache

As suggested in #229 I've upped CLCACHE_OBJECT_CACHE_TIMEOUT_MS to 120,000 ms, which seems to make our builds succeed. I've noted that the error seems to occur only when there are cache misses. Our cache is pretty much full.

 current cache dir         : C:\Users\ogdf\clcache
 cache size                : 10,281,914,308 bytes
 maximum cache size        : 10,737,418,240 bytes
 cache entries             : 9229
 cache hits                : 63053
 cache misses
   total                      : 33459
   evicted                    : 383
   header changed             : 22131
   source changed             : 10945
 passed to real compiler
   called w/ invalid argument : 0
   called for preprocessing   : 0
   called for linking         : 0
   called for external debug  : 692
   called w/o source          : 4
   called w/ multiple sources : 0
   called w/ PCH              : 0

I've re-enabled clcache in our CI now and will report if any errors arise with CLCACHE_OBJECT_CACHE_TIMEOUT_MS=120000.

I don't know how exactly your locking mechanism works, but even the default timeout of 10 seconds seems extremely high to me. You stated that "it might be the (unlikely) case that multiple clcache invocations are accessing the same manifest section". How is this section determined?

TiloW commented 7 years ago

I've re-enabled clcache in our CI now and will report if any errors arise with CLCACHE_OBJECT_CACHE_TIMEOUT_MS=120000.

I've just taken a look at our CI builds and seen that our build time jumps from 7 to 23 minutes when enabling clcache and there are no hits. If the build is fully cached, it's roughly as fast as not using clcache. This might be related to #239.

Sadly, this is too much of a performance impact and forces us to disable clcache. If you need more data to track down these problems I could occasionally enable clcache for testing.

frerich commented 7 years ago

Clearly, a factor-of-three slowdown when using clcache with a cold cache is unacceptable, but it's also a hint that something is not working as intended in your specific use case - I'd love to learn what's going on!

And indeed, I agree that even the default timeout is very large (I never needed a timeout that large for my own builds).

For what it's worth, synchronising concurrent accesses to the cache is implemented by dividing the cache into 256 'sections' and then maintaining a lock for each section. The 'name' of a section is just the first two characters of the hash sum computed for the invocation by the ManifestRepository.getManifestHash function: given the path to the compiler binary, the command line and the source file being compiled, it computes a hash sum for this invocation. The assumption is that this hashing results in evenly distributed values, reducing the chance of two concurrent invocations locking the same section.
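For illustration, here is a minimal sketch of how such a section name might be derived (hypothetical helper; the real getManifestHash hashes more data and may use a different scheme):

    import hashlib

    def manifest_section(compiler_path, command_line, source_file):
        # Hash the invocation: compiler binary, command line and source file.
        # (Illustrative only; clcache's getManifestHash prepares the data differently.)
        data = "|".join([compiler_path, " ".join(command_line), source_file])
        digest = hashlib.md5(data.encode("utf-8")).hexdigest()
        # The first two hex characters pick one of 256 sections, each guarded
        # by its own lock (e.g. the "45" in the lock name from the error above).
        return digest[:2]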

In your error message it's visible that the lock being acquired (this is implemented in terms of the CreateMutex function) has the name

    Local\C--Users-ogdf-clcache-manifests-45

...which means that the path being locked is C:\Users\ogdf\clcache\manifests\45. For some reason, one of the clcache invocations fails to acquire the lock for a long time.

TiloW commented 7 years ago

For what it's worth, synchronising concurrent accesses to the cache is implemented by dividing the cache into 256 'sections' and then maintaining a lock for each section.

Okay, so the chance of exactly n parallel builds all acquiring different locks is 256!/((256-n)! 256^n), assuming sections are chosen uniformly. As you can see, already for 10 parallel calls there is a chance of about 20% of requesting the same section at least once. I assume this occurs quite often in practice and doesn't matter as long as CLCACHE_OBJECT_CACHE_TIMEOUT_MS is larger than your maximum compile time for a single object.
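For reference, here is a quick way to compute that collision probability (not part of clcache, just the birthday-problem maths):

    from math import prod

    def collision_probability(n, sections=256):
        # Chance that at least two of n concurrent invocations pick the same
        # of 256 uniformly chosen manifest sections.
        return 1.0 - prod((sections - i) / sections for i in range(n))

    for n in (2, 4, 8, 10):
        print(n, round(collision_probability(n), 3))
    # 2 -> 0.004, 4 -> 0.023, 8 -> 0.105, 10 -> 0.163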

frerich commented 7 years ago

As you can see, already for 10 parallel calls there is a chance of about 20% for requesting the same section at least once.

Indeed, for 10 concurrent calls, there is a ~17% chance that at least two of them access the same of 256 uniformly distributed sections.

However, each section lock should only be held for a very short time (typically on the order of < 150 ms). In particular, note that no lock is held while the compiler runs. The relevant code here is the processDirect function: it first acquires a manifest lock to see whether there's an existing entry. If not, it releases all locks again (which should happen automatically due to the with statement), invokes the real compiler, then acquires the lock again and adds the entry (unless it's there already due to another clcache invocation).
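Schematically, that pattern looks roughly like this (a sketch with made-up helper names, not the actual processDirect code):

    def process_direct_sketch(cache, manifest_hash, invoke_real_compiler, make_entry):
        # Look up the manifest under the section lock; the lock is released immediately.
        with cache.manifestLockFor(manifest_hash):
            entry = cache.lookup(manifest_hash)
        if entry is not None:
            return entry  # cache hit -- no compiler run needed

        # No lock is held while the real compiler runs.
        exit_code, object_file = invoke_real_compiler()

        # Re-acquire the lock only to publish the result.
        with cache.manifestLockFor(manifest_hash):
            if cache.lookup(manifest_hash) is None:
                # Another concurrent invocation may have added it in the meantime.
                cache.setEntry(manifest_hash, make_entry(object_file))
        return exit_code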

If you find that the default timeout of 10 seconds does not help with 8 concurrent compilers, then I suspect there is something else fishy.

My gut feeling is that there may be a deadlock due to a wrong order in which the locks are acquired and released (though I could not find any such place in the code after a review), or that there is some lock handle not being released properly, or maybe some operation which is performed under the lock is a lot slower for you than expected.

You don't happen to have the cache directory on a network drive or some aggressive virus scanner which tracks file accesses, do you? :-)

frerich commented 7 years ago

In case anybody would be willing to somehow get me access to a Windows machine which can run eight or more jobs in parallel, that would be great. I always have a hard time testing this as the only machines I have access to are dual cores.

I hoped that maybe Microsoft Azure features something simple for the sake of developing or testing open source projects, but couldn't find anything.

TiloW commented 7 years ago

you don't happen to have the cache directory on a network drive or some aggressive virus scanner which tracks file accesses, do you? :-)

Good remark. Our virus scanner was indeed configured rather aggressively. Disabling this behavior improves runtime when using clcache. I am currently running some benchmarks (also regarding the caching of header files) and will report back soon.

frerich commented 7 years ago

Good remark. Our virus scanner was indeed configured rather aggressively. Disabling this behavior improves runtime when using clcache.

Interesting -- what virus scanner is that (and what setting did you use)? This might be something worth adding to the Caveats section of the Wiki.

TiloW commented 7 years ago

what virus scanner is that (and what setting did you use)?

Even though this is probably me being overly cautious, I would prefer not to talk about our security setup in public. :) It should generally help to keep virus scanners from inspecting object files when they are touched by clcache.

frerich commented 7 years ago

Did reconfiguring your anti virus software resolve the cache lock exceptions?

frerich commented 7 years ago

@TiloW ping? I wonder what to do about this ticket. :-)

TiloW commented 7 years ago

Can't give any new info yet. I'll have to do some more testing and will report back.

TiloW commented 7 years ago

I've run some tests but didn't come up with any issues. I've even switched to the old state (fe99fa6), used excessive on-access virus scanning, and lowered the timeout to 500ms without reproducing this problem. Could be that it only occurs when the cache is full or maybe only with corrupted caches (even though I wouldn't know how that could have happened).

Seems I've got some more testing to do.

TiloW commented 7 years ago

Happened again, might be related to un-setting CLCACHE_SERVER.

dengste commented 7 years ago

Same problem here on a dual-Xeon machine with 8 (real) cores and msbuild running with /m:8, even with CLCACHE_OBJECT_CACHE_TIMEOUT_MS set to 120000. It usually happens when the cache is cold, which I guess makes sense. The virus scanner is disabled for all clcache directories. Unfortunately these are internal build machines, so I cannot give access to them. Is there some way that I could patch clcache to output which other process is currently holding the lock?

frerich commented 7 years ago

Is there some way that I could patch clcache to output which other process is currently holding the lock?

Indeed, there was a branch at some point which introduced some additional logging specifically for the locking. Alas, I cannot find it anymore and now I regret not having integrated that as a real feature. 😢

A quick patch might be to introduce calls to the printTraceStatement function in the CacheLock.acquire and CacheLock.release methods; these two are responsible for all the locking. Printing the name of the lock (available via the _mutexName field) and the PID (available via os.getpid()) makes it easy to disambiguate the calls.

I'll hopefully get around to building a patch which implements this idea soon, unless you beat me to it.
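As a rough illustration of that idea (printTraceStatement, CacheLock and _mutexName are the names mentioned above; the exact code in clcache.py may look different):

    import os

    def logLockEvent(event, mutexName):
        # Stand-in for printTraceStatement(), which is normally gated by CLCACHE_LOG.
        print("clcache: PID {} {} lock {}".format(os.getpid(), event, mutexName))

    # Inside CacheLock.acquire():
    #     logLockEvent("waiting for", self._mutexName)
    #     ... existing wait-for-mutex code ...
    #     logLockEvent("acquired", self._mutexName)
    #
    # Inside CacheLock.release():
    #     logLockEvent("releasing", self._mutexName)
    #     ... existing release code ...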

frerich commented 7 years ago

@dengste @TiloW I now got around to adding a patch which adds some fine-grained debug output that will hopefully help narrow down locking issues. It's in the debug_mutex_locking branch - nothing fancy, just some print statements in a few critical places. In case you hit a reproducible case in which the locks trigger a timeout, using that branch and then uploading the generated output somewhere would be great; I think it should be enough to do some sensible post-mortem debugging.

dengste commented 7 years ago

Thanks! I've deployed this version to our build slaves and I see lots of mutex-related debug output. I guess I'll also clear the cache regularly to provoke the error.

TiloW commented 7 years ago

@frerich good job!

uploading the generated output somewhere would be great

Will do!

dengste commented 7 years ago

I reduced the timeout again to 10s, which I think is already plenty, and I got the error. I cleaned up the log and reduced it to the one problematic compiler run, which has the id '59':

https://gist.github.com/dengste/763fb957a6806e984ac6bf556cde5629

As you can see, the thread TID=3648 goes wild and acquires locks for all manifests and objects, holding them for about 30 seconds.

frerich commented 7 years ago

@dengste Thanks! I think what's happening is that the automatic cleaning of the cache (in case the cache exceeds the defined maximum size) is taking very long (maybe too long), triggering this.

TID 3648 is the main thread of process 7632. The invocation builds 18 C++ files and thus launches 18 threads to handle the individual source files concurrently (see the scheduleJobs function). Each of the jobs returns (among other things) a flag indicating whether the cache size was exceeded by running the job.

After all jobs have finished, clcache considers (in the main thread, TID 3648) whether any job indicated that a cleanup is required and, if so, goes ahead and does the same thing as running clcache.py -c: it deletes the oldest cached files (this part is apparently defective on more recent Windows versions, see #253) until the size of the cache is 90% of the maximum size. Trimming the cache requires locking all manifests and all sections (as well as the statistics file) because it's not known up front which sections will be affected by the trimming. That's where this "going wild with acquiring locks" comes from - it's the effect of the Cache.lock method.
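A rough sketch of what such a trimming step amounts to (illustrative only; the real implementation first locks every section via Cache.lock, which is exactly what shows up in the log):

    import os

    def trim_cache(cache_dir, current_size, maximum_size, keep_fraction=0.9):
        target_size = maximum_size * keep_fraction
        # Collect every cached file with its access time -- one stat() per file,
        # which is expensive on slow I/O and large directory trees.
        entries = []
        for root, _dirs, files in os.walk(cache_dir):
            for name in files:
                path = os.path.join(root, name)
                st = os.stat(path)
                entries.append((st.st_atime, st.st_size, path))
        # Delete the least recently used files until the target size is reached.
        entries.sort()
        for _atime, size, path in entries:
            if current_size <= target_size:
                break
            os.remove(path)
            current_size -= size
        return current_size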

If this theory is sound, it would explain the lock timeouts you are seeing.

Furthermore, if this theory makes sense, then it would be good to figure out why cleaning the cache is so slow.

dengste commented 7 years ago

Ooh, it's the cache cleanup. That makes sense, the object cache is indeed close to 1GB. I completely forgot about that.

As for why it is taking so long: first and foremost, the I/O on those servers is heavily geared towards reliability, not speed, meaning we're dealing with RAIDs that either have no write cache at all or at least a pretty small one that is regularly flushed. I assume that cleaning up the cache involves a lot of 'stat' calls on small files in a huge directory tree, which is pretty much a worst-case scenario for NTFS and the underlying hardware. Also, you must keep in mind that this job I snipped out is just one of many that are running in parallel, so while the cache cleanup is happening, many other clcache jobs are trying to access the object cache at the same time.

Since I have lots of HD space, I will probably just increase the cache size and maybe clean the cache once a month while the servers are idle. Let's see if the problem is gone then.

frerich commented 7 years ago

@dengste Any news on this - did increasing the cache size and cleaning the cache when the machines are idle help? If so, this might be something for the Caveats page; cleaning the cache explicitly when there is no build going on would also be my proposed solution.

dengste commented 7 years ago

@frerich I haven't seen the error since, so this definitely fixed it for me. I still had some problems that clcache.exe would sometimes return with error code -1, but without any visible error message. I suspect that this is some problem with the exe generated by pyinstaller and the different Python versions we have installed on our build servers, so I went back to use a simple batch file that directly calls python on clcache.py, and now everything works great! Thanks a lot for your help!

frerich commented 7 years ago

@dengste Thanks for the positive feedback! I wonder how to 'fix' this issue... I'm not sure I can avoid locking the entirety of the cache for the sake of cleaning. I suppose one improvement would be to yield better diagnostics, or to use a larger timeout when it is detected that a cache cleaning is in progress. I could also add an entry to the Caveats page. What do you think?

Seeing that the atime is apparently not updated by default on recent Windows versions, the whole approach of using a filesystem-based cache is a bit absurd anyway. An interesting experiment might be to use some sort of database (SQLite? Or maybe PostgreSQL or so, for the sake of sharing caches among multiple machines). I believe those databases have the whole locking issue sorted out already.

dengste commented 7 years ago

Not sure how to really fix this. Maybe you could just disable the timeout while the cache is being cleaned? But then I would suggest reducing the cache size on each clean by much more than just 10% - more like 50% - otherwise you'd get long hangs during compilation pretty frequently. Maybe make it configurable?

Regarding using a database: couldn't you pretty easily upgrade the hash server with some database-like capabilities? I guess it could easily maintain a list of least-used objects, and it could also do the cleaning - maybe in one big sweep when the machine is idle, or maybe incrementally (you'd need an inter-process mutex for this, of course). Unfortunately, I currently cannot use the server because of issue #264...

My "solution" will probably be to use a RAM disk for the object cache. Our build servers have huge amounts of RAM, so it's no problem for me to carve out 8 GB or so. And it will automatically be cleared when the machine reboots, which usually happens once a month on patch day.

camelcoding commented 7 years ago

Hi, can this problem be avoided if I use this new feature: "clcache now supports using a memcached setup as the storage mechanism in addition to storing cached files in the file system"?

Kojoley commented 6 years ago

I recently hit this problem on Appveyor (only 2 threads, CLCACHE_SERVER enabled).

I think this could be solved as follows: the pruning process should signal other processes that it is working (alive and progressing), and the waiters should not time out on the lock while the pruner is still running.

ivaigult commented 6 years ago

@frerich Can we just fall back to the real compiler as a temporary workaround?

frerich commented 6 years ago

@ivaigult I'm not sure, do you have a specific patch in mind?

marcelotrevisani commented 6 years ago

Does anyone have a solution or a workaround for this bug?

siu commented 6 years ago

Maybe the cache could be cleaned by locking each object and manifest section independently.

It could do a first pass collecting which files should be removed and a second pass acquiring the locks and removing the files. I'm not sure whether the first pass would need to acquire all the locks.

It would not provide the same guarantees as the current implementation, which ensures that the correct files are removed atomically with respect to other clcache processes. This is probably not too important: if the cached files change between the first and the second pass, the cleanup will be suboptimal, but it would not corrupt the cache.

This strategy could work if removing the files is what takes most of the time.
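A sketch of that two-pass idea (all helper and lock names here are hypothetical):

    def incremental_clean(cache, bytes_to_free):
        # Pass 1: pick removal candidates without holding any section lock.
        candidates = cache.objectsSortedByAge()      # hypothetical helper

        # Pass 2: remove files under the corresponding per-section lock only.
        freed = 0
        for obj in candidates:
            if freed >= bytes_to_free:
                break
            with cache.lockForSection(obj.section):  # hypothetical per-section lock
                if cache.contains(obj):              # may have vanished since pass 1
                    freed += cache.removeObject(obj)
        return freed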

frerich commented 6 years ago

Another crazy idea: when cleaning the cache, we rename the cache directory to a temporary name, perform the cleaning in that temporary location, and then merge whatever cache entries concurrent clcache invocations may have created in the meantime into the cleansed repository. This might have a slightly negative impact on the cache hit rate (since during the cleaning the cache appears to be empty, so objects get recreated), but it would probably decouple the clcache instances quite nicely. I'm not sure how easy it would be to merge two caches though.

kayhayen commented 6 years ago

I am seeing this one pretty often now:

clcache.main.CacheLockException: Failed to acquire lock Local\/c/Users/kayha-clcache-manifests-58 after 10000ms; try setting CLCACHE_OBJECT_CACHE_TIMEOUT_MS environment variable to a larger value.

Actually, I have had single Nuitka builds where it happened for 8 files. Now I am getting these sporadically in my CI, and not just occasionally: my CI is often running all day, and it happens multiple times. I have seen instances where 6 calls in the same Scons compilation all failed with that error, which felt pretty remarkable - all for different hash codes.

Nuitka is meant to use a libnuitka to avoid re-compiling the same files, but right now it compiles the same C files (maybe 8 of them) over and over for every compilation. Compilations use as many jobs as there are CPU cores, which is 8 on my machine, and typically there are 2 builds running at the same time, so collisions are likely.

I was reading about the clcache daemon and wondered whether it makes the locking problem go away. I didn't employ it immediately because of the additional setup. Do you think it would avoid the issue?

frerich commented 6 years ago

@kayhayen The clcachesrv caching daemon only avoids unneeded re-hashing of header files. In general, since this is a timing-dependent issue, it could make things better or worse. I don't expect any significant improvement though (but I'd love to be mistaken).

kayhayen commented 6 years ago

@frerich I suspect that the default clcache folder that I use lives outside of what my Buildbot worker uses, and that Windows Defender is causing the timeouts. I have had similar issues with other files and attempts to unlink them.

I will now set the environment variable to point inside the already-ignored folder. Even if it only cleans up that user directory, my hope is to avoid contention with Windows Defender.

9a4gl commented 5 years ago

Can we have CLCACHE_DISABLE_AUTOCLEAN? We could run clcache -c before the build, and if the cache grows a bit over 100%, who cares - it is much better than a crash during the build. I will test the commits more in the following days.

Also, if we had clcache -c 70 to reduce the cache size to something other than 90%, we could easily avoid cleaning during a build.

9a4gl commented 5 years ago

I have created 3 commits that implement my above ideas, feel free to use it:

  1. Clean storage by arbitrary percentage https://github.com/9a4gl/clcache/commit/9fd70d59de9602ea4d64c7da21ba68c7e6453e82

  2. Disable auto clean during build if CLCACHE_DISABLE_AUTOCLEAN defined https://github.com/9a4gl/clcache/commit/d03ff7d679a8255d1a313f974e3c66f1ac6ca447

  3. Force storage cleaning if cleaning from command line https://github.com/9a4gl/clcache/commit/030993b44dbac29d92e14186605b8f0cf0ece4fd

Artalus commented 4 years ago

We ran into this issue almost immediately after enabling clcache in all our CI builds instead of a select few. We have a maximum cache size of about 16 GB, with a single clean build producing about 4 GB of object files. The cache folder is reused between builds on several branches.

I am not sure, but if the hypothesis about the long cache cleanup is true, then I don't understand why it didn't happen even once when I had the maximum cache size at the default of 1 GB and the total size of the object files was even larger.

Artalus commented 4 years ago

Might be of interest:

clcache.__main__.CacheLockException: Failed to acquire lock Local\E--jenkins-workspace-build-phab-phab-builder-trassir120-__build_cmake_trassir120-debug.clcache-manifests-13 after 10000ms

In the 13 folder there is 13ef76f673240777fb01afc6e562abab.json, 9 MB in size with 82146 lines. The next time it failed on the 17 folder, where there is 177c13a32b186e4c73db46d53c905999.json, 12 MB in size. It contains 108058 lines, mostly various headers. Parsing it shows that there are 69 objects in entries, each having 1560 strings in includeFiles.

Could it be that acquiring the lock fails because processing the data in such an oversized JSON file takes too long?

Artalus commented 4 years ago

Okay, so it seems that our issue is more related to #347.

We managed to mitigate the problem by increasing the timeout up to 60000 ms. Profiling with ninjatracing revealed that multiple source files took up to 40-60 seconds to compile. Further profiling with CLCACHE_PROFILE (involving #363 and #364) showed that such compilation units spend most of their time in processDirect -> getIncludesContentHashForFiles -> getFileHashes, which supports the theory that large JSON manifests are the problem.

[Screenshot: CLCACHE_PROFILE profiling output, 2019-11-18]

I believe that clcache accumulates too many additional entries in the manifest file (corresponding to the command line) each time one of the "dependency headers" of the compiled file changes. It also seems that there is some inefficiency due to the hash being recalculated multiple times for the same headers mentioned in different entries.
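A minimal illustration of how per-header hashing could be memoized within a single invocation (hypothetical helper, not clcache's actual code; clcachesrv serves a similar purpose across invocations):

    import hashlib
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def header_hash(path):
        # Each header is hashed at most once per invocation, even when it is
        # referenced from many manifest entries.
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()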

rasjani commented 4 years ago

EDIT: Not just the cache stats.txt - I just got a few CacheLockExceptions from other parts, so my "theory" can be thrown out. Sorry for the noise.

I've been running my own fork of clcache for a few months, with a few patches mentioned in this issue (disable auto cleaning), and I'm still getting these CacheLockExceptions. @frerich mentioned the following a few years back:

In your error message it's visible that the lock being acquired (this is implemented in terms of the CreateMutex function) has the name

    Local\C--Users-ogdf-clcache-manifests-45

Now, this is not what happens in my case; for me the lock is on stats.txt:

clcache.__main__.CacheLockException: Failed to acquire lock Local\J--j-workspace-CompanyNameSetup_master_gerrit-ccache-stats.txt after 30000ms; try setting CLCACHE_OBJECT_CACHE_TIMEOUT_MS environment variable to a larger value.

The past 3-5 failures have all been on the stats.txt file, and the lock is being acquired at https://github.com/rasjani/clcache/blob/alt-release/clcache/__main__.py#L1867

Our build is driven with Ninja, so there's heavy parallelism. Stats are not that important - would it make sense (as a hack) to just skip the lock when updating the stats and perhaps issue a warning that the stats are out of date? Essentially, I don't need this sort of change to land in this repo; I can just add it to my own fork and wait for proper fixes if and when they come.
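For illustration, such a best-effort stats update could look like this (hypothetical wrapper; CacheLockException here stands in for clcache's own exception class):

    class CacheLockException(Exception):
        pass  # stand-in for clcache's exception

    def update_stats_best_effort(stats_lock, update):
        try:
            with stats_lock:  # the Local\...-stats.txt mutex
                update()
        except CacheLockException:
            # Statistics are only informational -- warn instead of failing the build.
            print("clcache: warning: could not lock stats file; statistics may be stale")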

Kojoley commented 4 years ago

@rasjani Do you use clcache-server? I was expecting it to make things better.

rasjani commented 4 years ago

@Kojoley - I've been running with the server enabled for a few weeks and am still seeing random cache lock exceptions. However, the last 3 were all about updating stats.txt.