TiloW opened this issue 7 years ago
I've re-enabled clcache in our CI now and will report if any errors arise with CLCACHE_OBJECT_CACHE_TIMEOUT_MS=120000.
I've just taken a look at our CI builds and seen that our build time jumps from 7 to 23 minutes when enabling clcache and there are no hits. If the build is fully cached, it's roughly equal to not using clcache. This might be related to #239.
Sadly, this is too much of a performance impact and forces us to disable clcache. If you need more data to track down these problems I could occasionally enable clcache for testing.
Clearly, a factor three slowdown when using clcache with a cold cache is unacceptable, but it's also a hint that something is not working as intended in your specific use case - I'd love to learn what's going on!
And indeed, I agree that even the default timeout is very large (I never needed a timeout that large for my own builds).
For what it's worth, synchronising concurrent accesses to the cache is implemented by dividing the cache into 256 'sections' and then maintaining a lock for each section. The 'name' of a section is just the first two characters of the hash sum computed for the invocation, a task performed by the ManifestRepository.getManifestHash function: given the path to the compiler binary, the command line and the source file being compiled, it computes a hash sum for this invocation. The assumption is that this hashing results in evenly distributed values to reduce the chance of two concurrent invocations locking the same section.
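To make that concrete, here is a rough sketch of how a section name can be derived (the hashing below is a simplified stand-in, not the actual getManifestHash implementation):

```python
import hashlib

def invocation_hash(compiler_path, command_line, source_file):
    # Simplified illustration: hash the compiler path, the command line and
    # the source file contents into one hex digest for this invocation.
    hasher = hashlib.md5()
    hasher.update(compiler_path.encode("utf-8"))
    hasher.update(" ".join(command_line).encode("utf-8"))
    with open(source_file, "rb") as f:
        hasher.update(f.read())
    return hasher.hexdigest()

def section_name(manifest_hash):
    # The first two hex characters pick one of 16 * 16 = 256 lock sections,
    # e.g. a hash starting with "45" maps to section (and lock) "45".
    return manifest_hash[:2]
```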
In your error message it's visible that the lock being acquired (this is implemented in terms of the CreateMutex function) has the name
Local\C--Users-ogdf-clcache-manifests-45
...which means that the path being locked is C:\Users\ogdf\clcache\manifests\45. For some reason, one of the clcache invocations fails to acquire that lock for a long time.
For what it's worth, synchronising concurrent accesses to the cache is implemented by dividing the cache into 256 'sections' and then maintaining a lock for each section.
Okay, so the chance of not acquiring the same lock when executing exactly n builds in parallel is 256! / ((256 - n)! * 256^n) (assuming sections are chosen uniformly). As you can see, already for 10 parallel calls there is a chance of about 20% for requesting the same section at least once. I assume this occurs quite often in practice and doesn't matter as long as CLCACHE_OBJECT_CACHE_TIMEOUT_MS is larger than your maximum compile time for a single object.
As you can see, already for 10 parallel calls there is a chance of about 20% for requesting the same section at least once.
Indeed, for 10 concurrent calls, there is a ~17% chance that at least two of them access the same of 256 uniformly distributed sections.
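For reference, that number is just the standard birthday-problem calculation over 256 sections:

```python
from math import prod

def collision_probability(n, sections=256):
    # Probability that at least two of n concurrent invocations pick the same
    # of `sections` uniformly chosen lock sections (birthday problem).
    all_distinct = prod((sections - i) / sections for i in range(n))
    return 1.0 - all_distinct

print(collision_probability(10))  # ~0.163, i.e. the ~17% mentioned above
```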
However, each section should only be held for a very short time (typically on the order of 150 ms or less). In particular, note that no lock is held while the compiler runs. The relevant code here is the processDirect function: it first acquires a manifest lock to see whether there's an existing entry. If not, it releases all locks again (which should happen automatically thanks to the with statement), invokes the real compiler, then acquires the lock again and adds the entry (unless it's already there due to another clcache invocation).
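In pseudo-Python, the pattern looks roughly like this (the cache methods used below are illustrative stand-ins, not the actual clcache API):

```python
def process_direct(cache, manifest_hash, run_real_compiler):
    section_lock = cache.manifest_lock_for(manifest_hash)  # one of the 256 section locks

    with section_lock:                       # held only briefly
        entry = cache.lookup_manifest(manifest_hash)
    if entry is not None:
        return cache.copy_object_from_cache(entry)

    result = run_real_compiler()             # no lock held while the compiler runs

    with section_lock:                       # re-acquire only to publish the result
        if cache.lookup_manifest(manifest_hash) is None:  # another clcache may have won the race
            cache.add_manifest(manifest_hash, result)
    return result
```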
If you find that the default timeout of 10 seconds does not help with 8 concurrent compilers, then I suspect there is something else fishy.
My gut feeling is that there may be a deadlock due to a wrong order in which the locks are acquired and released (though I could not find any such place in the code after a review), or that there is some lock handle not being released properly, or maybe some operation which is performed under the lock is a lot slower for you than expected.
You don't happen to have the cache directory on a network drive or some aggressive virus scanner which tracks file accesses, do you? :-)
In case anybody would be willing to somehow get me access to a Windows machine which can run eight or more jobs in parallel, that would be great. I always have a hard time testing this as the only machines I have access to are dual cores.
I hoped that maybe Microsoft Azure features something simple for the sake of developing or testing open source projects, but couldn't find anything.
you don't happen to have the cache directory on a network drive or some aggressive virus scanner which tracks file accesses, do you? :-)
Good remark. Our virus scanner was indeed configured rather aggressively. Disabling this behavior improves runtime when using clcache. I am currently running some benchmarks (also regarding the caching of header files) and will report back soon.
Good remark. Our virus scanner was indeed configured rather aggressively. Disabling this behavior improves runtime when using clcache.
Interesting -- what virus scanner is that (and what setting did you use)? This might be something worth adding to the Caveats section of the Wiki.
what virus scanner is that (and what setting did you use)?
Even though this is probably me being overly cautious, I would prefer not to talk about our security setup in public. :) It should generally help to discourage virus scanners from inspecting object files when they are touched by clcache.
Did reconfiguring your antivirus software resolve the cache lock exceptions?
@TiloW ping? I wonder what to do about this ticket. :-)
Can't give any new info yet. I'll have to do some more testing and will report back.
I've run some tests but didn't come up with any issues. I've even switched to the old state (fe99fa6), used excessive on-access virus scanning, and lowered the timeout to 500ms without reproducing this problem. Could be that it only occurs when the cache is full or maybe only with corrupted caches (even though I wouldn't know how that could have happened).
Seems I've got some more testing to do.
Happened again, might be related to un-setting CLCACHE_SERVER.
Same problem here on a dual-Xeon machine with 8 (real) cores and msbuild running with /m:8, even with CLCACHE_OBJECT_CACHE_TIMEOUT_MS set to 120000. It usually happens when the cache is cold, which I guess makes sense. The virus scanner is disabled for all clcache directories. Unfortunately these are internal build machines, so I cannot give access to them. Is there some way that I could patch clcache to output which other process is currently holding the lock?
Is there some way that I could patch clcache to output which other process is currently holding the lock?
Indeed, there was a branch at some point which introduced some additional logging specifically for the locking. Alas, I cannot find it anymore and now I regret not having integrated that as a real feature. 😢
A quick patch might be to introduce calls to the printTraceStatement function in the CacheLock.acquire and CacheLock.release methods. These two are responsible for all the locking. Printing the name of the lock (available via the _mutexName field) and the PID (available via os.getpid()) makes it easy to disambiguate the calls.
I'll hopefully get around to building a patch which implements this idea soon, unless you beat me to it.
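A sketch of what such a patch could look like (printTraceStatement and _mutexName are the existing clcache names mentioned above; the actual mutex handling is elided):

```python
import os

class CacheLock:
    def acquire(self):
        printTraceStatement("PID %d wants lock %s" % (os.getpid(), self._mutexName))
        # ... existing CreateMutex / wait logic ...
        printTraceStatement("PID %d got lock %s" % (os.getpid(), self._mutexName))

    def release(self):
        printTraceStatement("PID %d releases lock %s" % (os.getpid(), self._mutexName))
        # ... existing release logic ...
```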
@dengste @TiloW I now got around to adding a patch which adds some fine-grained debug output which hopefully helps with narrowing down locking issues. It's in the debug_mutex_locking branch -- nothing fancy, just some print statements in a few neuralgic places. In case you hit a reproducible case in which the locks trigger a timeout, using that branch and then uploading the generated output somewhere would be great; I think it should be enough to do some sensible post-mortem debugging.
Thanks! I've deployed this version to our build slaves and I see lots of mutex-related debug output. I guess I'll also clear the cache regularly to provoke the error.
@frerich good job!
uploading the generated output somewhere would be great
Will do!
I reduced the timeout again to 10s, which I think is already plenty, and I got the error. I cleaned up the log and reduced it to the one problematic compiler run, which has the id '59':
https://gist.github.com/dengste/763fb957a6806e984ac6bf556cde5629
As you can see, the thread TID=3648 goes wild and acquires locks for all manifests and objects, holding them for about 30 seconds.
@dengste Thanks! I think what's happening is that the automatic cleaning of the cache (in case the cache exceeds the defined maximum size) is taking very long (maybe too long), triggering this.
TID 3648 is the main thread of process 7632. The invocation builds 18 C++ files and thus launches 18 threads to handle the individual source files concurrently (see the scheduleJobs function). Each of the jobs returns (among other things) a flag indicating whether the cache size was exceeded by running the job.
After all jobs have finished, clcache considers (in the main thread, TID 3648) whether any job indicated that a cleanup is required and, if so, goes ahead and does the same thing as running clcache.py -c: it deletes the oldest cached files (this part is apparently defective on more recent Windows versions, see #253) until the size of the cache is 90% of the maximum size. Trimming the cache requires locking all manifests and all sections (as well as the statistics file) because it's not known up front which sections will be affected by the trimming. That's where the "going wild with acquiring locks" comes from; it's the effect of the Cache.lock method.
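A simplified sketch of that cleanup step (helper names are illustrative, not the real clcache API):

```python
def maybe_clean_cache(cache, job_results, max_size):
    # Only clean if at least one job reported that the cache limit was exceeded.
    if not any(result.cache_size_exceeded for result in job_results):
        return
    with cache.lock():  # locks every manifest/object section plus the statistics file
        entries = sorted(cache.all_entries(), key=lambda e: e.atime)  # oldest first
        current_size = sum(e.size for e in entries)
        target_size = 0.9 * max_size
        for entry in entries:
            if current_size <= target_size:
                break
            cache.remove(entry)          # lots of small file deletions -> can be slow
            current_size -= entry.size
```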
If this theory is sound, then I believe that is what is triggering the lock timeouts. Furthermore, if this theory makes sense, it would be good to figure out why clean is so slow.
Ooh, it's the cache cleanup. That makes sense, the object cache is indeed close to 1GB. I completely forgot about that.
As for why it is taking so long: first and foremost, the I/O on those servers is heavily targeted towards reliability, not speed, meaning we're dealing with RAIDs that either have no write cache at all, or at least a pretty small one that is regularly flushed. I assume that cleaning up the cache involves a lot of 'stat' calls on small files in a huge directory tree, which is pretty much a worst-case scenario for NTFS and the underlying hardware. Also, you must keep in mind that the job I snipped out is just one of many that are running in parallel, so while the cache cleanup is happening, many other clcache jobs try to access the object cache at the same time.
Since I have lots of HD space I will probably just increase the cache size and maybe clean the cache once a month while the servers are idle. Let's see if the problem is gone then.
@dengste Any news on this -- did increasing the cache size and cleaning the cache when it's idle help? If so, this might be something for the Caveats page and cleaning the cache explicitly when there is no build going on would also be my proposed solution.
@frerich I haven't seen the error since, so this definitely fixed it for me. I still had some problems that clcache.exe would sometimes return with error code -1, but without any visible error message. I suspect that this is some problem with the exe generated by pyinstaller and the different Python versions we have installed on our build servers, so I went back to use a simple batch file that directly calls python on clcache.py, and now everything works great! Thanks a lot for your help!
@dengste Thanks for the positive feedback! I wonder how to 'fix' this issue... I'm not sure I can avoid locking the entirety of the cache for the sake of cleaning. I suppose one improvement would be to yield better diagnostics, or to use a larger timeout when it is detected that a cache cleanup is in progress. I could also add an entry to the Caveats page. What do you think?
Seeing that the atime is apparently not updated by default on recent Windows versions, the whole approach of using a filesystem-based cache is a bit absurd anyway. An interesting experiment might be to use some sort of database (SQLite? Or maybe PostgreSQL or so, for the sake of sharing caches among multiple machines). I believe those databases have the whole locking stuff sorted out already.
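Just to illustrate the idea, a toy sketch of what an SQLite-backed bookkeeping table could look like (file name and schema are made up):

```python
import sqlite3
import time

db = sqlite3.connect("clcache.db", timeout=10)
db.execute("""CREATE TABLE IF NOT EXISTS objects (
                  hash TEXT PRIMARY KEY,
                  size INTEGER,
                  last_used REAL)""")

def record_use(object_hash, size):
    # SQLite serializes concurrent writers, so no hand-rolled mutexes are needed.
    with db:
        db.execute("INSERT OR REPLACE INTO objects VALUES (?, ?, ?)",
                   (object_hash, size, time.time()))

def eviction_candidates(bytes_to_free):
    # Least recently used entries first, until enough space would be reclaimed.
    freed, victims = 0, []
    for object_hash, size in db.execute(
            "SELECT hash, size FROM objects ORDER BY last_used"):
        if freed >= bytes_to_free:
            break
        victims.append(object_hash)
        freed += size
    return victims
```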
Not sure how to really fix this. Maybe you could just disable the timeout while the cache is being cleaned? But then I would suggest reducing the cache size on each clean by much more than just 10% - more like 50% - otherwise you'd get long hangs during compiles pretty frequently. Maybe make it configurable?
Regarding using a database: couldn't you pretty easily upgrade the hash server with some database-like capabilities? I guess it could easily maintain a list of least-used objects, and it could also do the cleaning - maybe in one huge sweep when the machine is idle, or maybe incrementally (you'd need an inter-process mutex for this, of course). Unfortunately, I currently cannot use the server because of issue #264...
My "solution" will probably be to use a RAM disk for the object cache. Our build servers have huge amounts of RAM, so it's no problem for me to snip away 8G or so. And it will automatically be cleaned when the machine reboots, which usually happens once a month on patch day.
Hi, can this problem be avoided if I use this new feature -> "Feature: clcache now supports using a memcached setup as the storage mechanism in addition to storing cached files in the file system"?
I recently hit this problem on AppVeyor (only 2 threads, CLCACHE_SERVER enabled).
I think this could be solved this way: the pruning process should signal the other processes that it is working (alive and progressing), and the waiters should not time out on the lock as long as the pruner is alive.
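A rough sketch of how that signalling could work (everything below is made up for illustration, not existing clcache code):

```python
import os
import time

HEARTBEAT = "cleaning.heartbeat"

def refresh_heartbeat(cache_dir):
    # Called periodically by the pruning process while it is still making progress.
    with open(os.path.join(cache_dir, HEARTBEAT), "w") as f:
        f.write(str(time.time()))

def cleaner_is_alive(cache_dir, max_age=5.0):
    try:
        return time.time() - os.path.getmtime(os.path.join(cache_dir, HEARTBEAT)) < max_age
    except OSError:
        return False  # no heartbeat file -> no active cleaner

def wait_for_lock(lock, timeout, cache_dir):
    # Waiters only give up once the timeout expired *and* the heartbeat is stale.
    deadline = time.time() + timeout
    while not lock.try_acquire(wait_ms=1000):  # hypothetical non-raising acquire
        if time.time() > deadline and not cleaner_is_alive(cache_dir):
            raise RuntimeError("lock timed out and no cleaner appears to be running")
```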
@frerich Can we just fall back to real compiler as temporary workaround?
@ivaigult I'm not sure, do you have a specific patch in mind?
Does anyone have a solution or a workaround for this bug?
Maybe the cache could be cleaned by locking each object and manifest section independently.
It could do a first pass collecting which files should be removed and a second pass acquiring the locks and removing the files. I'm not sure whether the first pass would need to acquire all the locks.
It would not provide the same guarantees as the current implementation, which ensures that the correct files are removed atomically with respect to other clcache processes. This is probably not too important: if the cached files change between the first and the second pass it will do a suboptimal cleanup, but it would not corrupt the cache.
This strategy could work if removing the files is what takes most of the time.
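A sketch of that two-pass approach (again with illustrative helper names, not the real clcache API):

```python
def incremental_clean(cache, max_size):
    # Pass 1: decide what to delete without holding all the section locks;
    # the listing may be slightly stale, which is acceptable as argued above.
    entries = sorted(cache.all_entries(), key=lambda e: e.atime)
    current_size = sum(e.size for e in entries)
    victims = []
    for entry in entries:
        if current_size <= 0.9 * max_size:
            break
        victims.append(entry)
        current_size -= entry.size

    # Pass 2: remove the victims, locking only the section each one lives in.
    for entry in victims:
        with cache.lock_for_section(entry.section):
            cache.remove_if_present(entry)  # the entry may already be gone
```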
Another crazy idea: when cleaning the cache, we rename the cache directory to a temporary name, perform the cleaning in that temporary location, and then merge whatever cache entries concurrent clcache invocations created while the cache was being cleaned into the cleansed repository. This might have a slightly negative impact on the cache hit rate (since during the cleaning the cache appears to be empty, so objects get recreated), but it would probably decouple the clcache instances quite nicely. I'm not sure how easy it is to merge two caches though.
I am seeing this one pretty often now:
clcache.main.CacheLockException: Failed to acquire lock Local\/c/Users/kayha-clcache-manifests-58 after 10000ms; try setting CLCACHE_OBJECT_CACHE_TIMEOUT_MS environment variable to a larger value.
Actually, I had single builds for Nuitka where it happened for 8 files. Now I am getting these spuriously in my CI, and not just occasionally: my CI is often running all day, and it happens multiple times. I have seen instances where 6 calls in the same SCons compilation all failed with that error, which felt pretty remarkable - all for different hash codes.
Nuitka means to use a libnuitka to avoid re-compiling the same files, but right now it compiles the same C files (8 maybe) over and over for every compilation. Compilations use as many jobs as there are CPU cores, which might be 8 on my machine, and typically there are 2 jobs running at the same time, so collisions are high.
I was reading about a daemon for clcache and wondering whether it makes the locking problem go away. I didn't immediately employ it because of the additional setup. Do you think it is going to avoid the issue?
@kayhayen The clcachesrv caching daemon only avoids unneeded re-hashing of header files. In general, since this is a timing-dependent issue, it could make things better or worse. I don't expect any significant improvement though (but I'd love to be mistaken).
@frerich I suspect that the clcache default folder I use lives outside of what my Buildbot worker uses, and that Windows Defender is causing the timeouts. I have had similar issues with other files and attempts to unlink them.
I will now set the environment variable to point inside the already ignored folder. Even if only to clean things up in terms of that user directory folder, my hope is to avoid contention with Windows Defender.
Can we have CLCACHE_DISABLE_AUTOCLEAN? We can run clcache -c before the build, and if the cache grows a bit over 100%, who cares - it is much better than a crash during the build. I will test the commits more over the following days.
Also, if we had clcache -c 70 to reduce the cache size to something other than 90%, we could easily avoid cleaning during the build.
I have created 3 commits that implement my above ideas, feel free to use them:
Clean storage by arbitrary percentage https://github.com/9a4gl/clcache/commit/9fd70d59de9602ea4d64c7da21ba68c7e6453e82
Disable auto clean during build if CLCACHE_DISABLE_AUTOCLEAN defined https://github.com/9a4gl/clcache/commit/d03ff7d679a8255d1a313f974e3c66f1ac6ca447
Force storage cleaning if cleaning from command line https://github.com/9a4gl/clcache/commit/030993b44dbac29d92e14186605b8f0cf0ece4fd
We ran into this issue almost immediately after enabling clcache in all our CI builds instead of a select few. We have a max cache size of about 16 GB, with a single clean build producing about 4 GB of object files. The cache folder is reused between builds on several branches.
I am not sure, but if the hypothesis about long cache cleanup is true, then I don't understand why it didn't happen even once when I had the max cache size at the default 1 GB and the total size of the object files was even larger.
Might be of interest:
clcache.__main__.CacheLockException: Failed to acquire lock Local\E--jenkins-workspace-build-phab-phab-builder-trassir120-__build_cmake_trassir120-debug.clcache-manifests-13 after 10000ms
In the 13 folder there is 13ef76f673240777fb01afc6e562abab.json of 9 MB and 82146 lines. The next time it failed on the 17 folder, where there is 177c13a32b186e4c73db46d53c905999.json of 12 MB. It contains 108058 lines, mostly of various headers. Parsing it shows that there are 69 objects in entries, each having 1560 strings in includeFiles.
Could it be that acquiring the lock fails because the excessively large JSON cannot be processed in time?
Okay, so it seems that our issue is more related to #347.
We managed to mitigate the problem by increasing the timeout to 60000 ms. Profiling with ninjatracing revealed that multiple source files took 40-60 seconds to compile. Further profiling with CLCACHE_PROFILE (involving #363 and #364) showed that such compilation units spend most of their time in processDirect -> getIncludesContentHashForFiles -> getFileHashes, which supports the theory about large JSON manifests being the problem.
I believe that clcache accumulates too many additional entries in the manifest file (corresponding to a command line) each time one of the "dependency headers" of the compiled file changes. It also seems that there is some inefficiency due to the hash being recalculated multiple times for the same headers mentioned in different entries (the header lists come from cl.exe /showIncludes). Caching the results of getFileHash would avoid unnecessary reading and hashing of file contents.
EDIT: Not just the cache stats.txt - I just got a few CacheLockExceptions from other parts, so my "theory" can be thrown out. Sorry for the noise.
EDIT: I've been running my own fork of clcache for a few months, with a few of the patches mentioned in this issue (disable auto cleaning), and I'm still getting these CacheLockExceptions. @frerich mentioned the following a few years back:
In your error message it's visible that the lock being acquired (this is implemented in terms of the CreateMutex function) has the name Local\C--Users-ogdf-clcache-manifests-45
clcache.__main__.CacheLockException: Failed to acquire lock Local\J--j-workspace-CompanyNameSetup_master_gerrit-ccache-stats.txt after 30000ms; try setting CLCACHE_OBJECT_CACHE_TIMEOUT_MS environment variable to a larger value.
The past 3-5 failures so far have always been about the stats.txt file, and the lock is being acquired at https://github.com/rasjani/clcache/blob/alt-release/clcache/__main__.py#L1867
Our build is driven with Ninja so there's heavy parallelism. Stats are not that important - would it make sense (as a hack) to just ignore the lock when updating the stats and maybe issue a warning that the stats are out of date? Essentially, I don't need this sort of change to land in this repo; I can just add it to my own fork and wait for proper fixes if and when those might come.
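Something along these lines is what I have in mind (a sketch only; CacheLockException is clcache's existing exception, the helper itself is made up):

```python
def update_stats_best_effort(stats_lock, update_stats):
    # update_stats is whatever statistics mutation would normally run under the lock.
    try:
        with stats_lock:
            update_stats()
    except CacheLockException:
        print("clcache: warning: could not lock stats.txt; statistics may be out of date")
```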
@rasjani Do you use clcache-server? I was expecting it to make things better.
@Kojoley - I've been running with the server enabled for a few weeks and am still seeing random CacheLockExceptions. However, the last 3 were all about updating stats.txt.
Here's some general info about our system:
We are using this machine for CI. Build duration is usually 5 to 8 minutes and mostly spent on compilation. Every build triggers up to 4 compiler calls in parallel and there can be two parallel builds (up to 8 concurrent compiler calls).
Since trying out clcache for a few days we have disabled it again because of cache errors still existing after #244.
Here's the error message we are receiving every now and then:
As suggested in #229 I've upped the CLCACHE_OBJECT_CACHE_TIMEOUT_MS to 120,000 ms, which seems to cause our builds to succeed. I've noted that the error seems to only occur when there are cache misses. Our cache is pretty much filled.
I've re-enabled clcache in our CI now and will report if any errors arise with CLCACHE_OBJECT_CACHE_TIMEOUT_MS=120000.
I don't know how exactly your locking mechanism works, but already a default timeout of 10 seconds seems extremely high to me. You stated that "it might be the (unlikely) case that multiple clcache invocations are accessing the same manifest section". How is this section determined?