Clcache performance issues with big projects

YngveNPettersen commented 5 years ago

When we started using clcache for our builds we encountered a number of performance issues that either "just" slowed down the builds, or caused them to fail.

Our project is based on Chromium, with a total of 30000+ ninja task items, about 20000+ of which are object files.

The main issue with such a big project is that clcache's file-based cache quickly accumulates so many files that the cleanup operation can never finish before the locks that other build items are waiting for time out, and the build fails, even with a 100-second timeout. The number of ninja processes used (30+) probably does not help.

The result is that the size of the cache keeps increasing, and I eventually had to clean up the hard way: by deleting hundreds of thousands of files (IIRC I have seen cases of more than one million files) amounting to multiples of the configured cache size (reaching sizes in excess of 100GB).

My conclusion, eventually, was to throw out the file based cache, and replace it with a database.

The first attempt was to use sqlite3. Unfortunately, the locking system, either clcache's or SQLite's, wasn't up to the task, so I eventually rewrote the new code to use a PostgreSQL database.

This has worked fairly well, although I had to move the cleanup operation out into a separate task item at the end of the build, as cleanup could still take a bit of time.
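To illustrate the idea, here is a minimal sketch of a database-backed cache with cleanup as a separate step. It is not the code described above: it uses Python's built-in sqlite3 module only so that the example is self-contained, and the table layout and the function names (`open_cache`, `put`, `get`, `cleanup`) are hypothetical.

```python
import sqlite3
import time

# Hypothetical schema: one row per cached object file, keyed by its hash.
SCHEMA = """
CREATE TABLE IF NOT EXISTS objects (
    key       TEXT PRIMARY KEY,
    payload   BLOB NOT NULL,
    size      INTEGER NOT NULL,
    last_used REAL NOT NULL
)
"""


def open_cache(path=":memory:"):
    """Open (or create) the cache database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def put(conn, key, payload, now=None):
    """Store a payload under key, replacing any existing entry."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)",
            (key, payload, len(payload), now),
        )


def get(conn, key, now=None):
    """Return the payload for key (or None) and record the time of use."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT payload FROM objects WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        return None
    with conn:
        conn.execute(
            "UPDATE objects SET last_used = ? WHERE key = ?", (now, key)
        )
    return row[0]


def cleanup(conn, max_bytes):
    """Evict least recently used entries until the total size fits max_bytes.

    Returns the number of entries removed. Intended to run as a separate
    step, for example once at the end of a build.
    """
    total = conn.execute(
        "SELECT COALESCE(SUM(size), 0) FROM objects"
    ).fetchone()[0]
    removed = 0
    with conn:
        rows = conn.execute(
            "SELECT key, size FROM objects ORDER BY last_used ASC"
        ).fetchall()
        for key, size in rows:
            if total <= max_bytes:
                break
            conn.execute("DELETE FROM objects WHERE key = ?", (key,))
            total -= size
            removed += 1
    return removed
```

Because `cleanup` is an ordinary function over the same table, it can be invoked from a dedicated build step instead of from every compiler invocation.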

A benefit of using a shared database is that, if you have multiple machines with the same configuration, they do not need to rebuild an object file if another machine has already built it.

However, I still kept seeing file lock timeouts (some may have been in-process deadlocks, but I never found out properly), this time with the stats file, as well as "stats file was bad" warnings. Eventually, after several attempts to fix the issue, I just said "Enough!", and disabled the stats read/write operations for the database mode, unless logging is enabled.

While I do have a database mode for our clcache system, I don't think it should be upstreamed.

The reason for not upstreaming our system is that it is hardcoded to use PostgreSQL, and would likely need a lot of tweaking for each different database backend it is used with, even if the queries I have been using are fairly elementary.

The "proper" way to implement a database mode would be to use a portable database framework, for example Django, that transparently manages the connection with the database server, reducing the need for customized code. (Optimized indexing might need customization, although I am not sure if/how Django has improved in this area since I last used it.) I am not sure Django is the right choice, given its emphasis on web page templates, but it has worked fine for me in non-web situations before, so it will do the job.

I did not use the "proper way" when I implemented the SQLite version, as I knew I would need to handle a lot of locking manually, and since the PostgreSQL version was "just" a tweak of the original SQLite implementation, the tailored code remained.

I would suggest that a database mode be explored for clcache, and that it should be based on a database framework.

frerich commented 5 years ago

I think a 'real' database might actually solve more than just the locking; the automatic cache cleaning (which depends on the atime of files being updated -- something which is not on by default!) might benefit, too.
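As a rough illustration of the atime point: one way to avoid depending on file system access-time updates, with or without a database, is to record each cache hit explicitly. The sketch below does this by setting a file's modification time on every hit and ordering cleanup candidates by it. The function names are hypothetical and this is not how clcache currently works.

```python
import os
import time


def mark_used(path, now=None):
    """Record a cache hit by setting the file's timestamps explicitly.

    Unlike relying on atime, this does not depend on whether the file
    system is configured to update access times on reads.
    """
    now = time.time() if now is None else now
    os.utime(path, (now, now))  # (atime, mtime)


def least_recently_used(cache_dir):
    """Return cache file paths ordered from least to most recently used."""
    paths = [os.path.join(cache_dir, name) for name in os.listdir(cache_dir)]
    files = [p for p in paths if os.path.isfile(p)]
    return sorted(files, key=lambda p: os.stat(p).st_mtime)
```

A cleanup pass would then delete entries from the front of the returned list until the cache fits its size limit.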

This idea has been discussed to some (small) extent in #251, in particular in https://github.com/frerich/clcache/issues/251#issuecomment-281916252

frerich commented 5 years ago

There is already a very humble attempt at abstracting the backend, in storage.py; this is used to implement the file-based as well as the memcache-based backends. Maybe only smaller extensions would be needed to accommodate, e.g., a (No)SQL-based database?
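For readers unfamiliar with the pattern, a backend abstraction of this kind can be sketched roughly as follows. The class and method names here are hypothetical and chosen for illustration only; they are not the actual interface in storage.py.

```python
from abc import ABC, abstractmethod


class CacheBackend(ABC):
    """Hypothetical minimal backend interface (not the storage.py API)."""

    @abstractmethod
    def has_entry(self, key):
        """Return True if an entry is stored under key."""

    @abstractmethod
    def get_entry(self, key):
        """Return the payload stored under key."""

    @abstractmethod
    def set_entry(self, key, payload):
        """Store payload under key."""


class InMemoryBackend(CacheBackend):
    """Stand-in backend; a file, memcache, or SQL backend would look alike."""

    def __init__(self):
        self._entries = {}

    def has_entry(self, key):
        return key in self._entries

    def get_entry(self, key):
        return self._entries[key]

    def set_entry(self, key, payload):
        self._entries[key] = payload


def get_or_build(backend, key, build):
    """Return the cached payload for key, calling build() only on a miss."""
    if backend.has_entry(key):
        return backend.get_entry(key)
    payload = build()
    backend.set_entry(key, payload)
    return payload
```

The caller only sees `CacheBackend`, so a database-backed class could be swapped in without touching the code that decides when to compile.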

YngveNPettersen commented 5 years ago

I did have to make some smaller API changes in the classes used for strategy, manifest, payload, etc., mostly because more information, especially about the manifest ID, was needed.

brupelo commented 5 years ago

@YngveNPettersen After googling a bit for tools similar to ccache for Visual Studio, I've landed here... The project I'd like to test this tool on isn't as big as Chromium, but it's also decently big: qtbase. A cold build takes about 20 minutes. Anyway, is this tool just a little experiment/toy intended for small projects, or would it bring real benefits on real-world projects? From what I've read so far it seems to be the former; could you talk about your experience with it?

If it's still not good enough, do you know of any alternative?

Thanks :)

apriori commented 5 years ago

@brupelo @YngveNPettersen Btw, did you maybe look into https://github.com/mozilla/sccache? So far I have used it successfully with a decently sized project and shared storage (minio-based).

frerich / clcache

Clcache performance issues with big projects #347