frerich / clcache

A compiler cache for MSVC, much like ccache for gcc

Switch file hashing to xxhash #243

Closed hubx closed 7 years ago

hubx commented 8 years ago

xxhash was mentioned in #217 and hashing is currently under discussion in #241. I wanted to show how trivial switching to xxhash is and throw it into the discussion again.

For me (SSD, massively parallel compilation) it decreased the compilation time by only 2.4%, but others might see more significant improvements.

// cc @heremaps/cci @Jimilian @sschuberth
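
For illustration, a minimal sketch of what the swap might look like, assuming a chunked file-hashing helper (the function name is hypothetical; clcache's actual hashing code may differ):

    import xxhash  # third-party: pip install xxhash

    def get_file_hash(path):
        # Hash the file in chunks so large sources need not fit in memory.
        hasher = xxhash.xxh64()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                hasher.update(chunk)
        return hasher.hexdigest()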

sschuberth commented 8 years ago

@hubx AppVeyor CI fails with

Traceback (most recent call last):
  File "clcache.py", line 23, in <module>
    import xxhash
ImportError: No module named 'xxhash'

Would you mind fixing this by adding pip install xxhash to appveyor.yml?
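
That would be a one-line addition to the install section, something along these lines (assuming the project's appveyor.yml already has an install section):

    install:
      - pip install xxhash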

webmaster128 commented 8 years ago

I don't think the minor speed improvement justifies the requirement of an external dependency. I'd leave this as an opt-in speed improvement for large-scale projects. For simple programs that take 5 or 10 minutes to compile, it is more important to get the cache up and running quickly.

frerich commented 8 years ago

it decreased the compilation time by only 2.4%, but others might see more significant improvements.

To be fair, others might see less significant improvements, too. ;-)

Joking aside, I suspect that xxhash might be an improvement -- I recall having come across it, too, when checking for ways to make cache hits faster (the runtime for cache hits seems to be dominated by reading files and hashing data). However, I'd rather like to see some benchmarks first before making a call. It's good to see my assumption confirmed that changing the hash algorithm is easy, though!

frerich commented 8 years ago

Reopening this, didn't actually mean to close the PR just yet.

Jimilian commented 8 years ago

@frerich Would a microbenchmark on "real" data (a random input sequence of realistic length) satisfy your requirements? Of course, it should be the same test for Python 2.7/3.3/3.4/3.5.
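
A minimal sketch of such a microbenchmark (the input size and iteration count are arbitrary assumptions):

    import hashlib
    import os
    import timeit

    import xxhash  # third-party: pip install xxhash

    # 1 MiB of random input, standing in for a typical source file.
    data = os.urandom(1 << 20)

    for name, func in [
        ("md5", lambda: hashlib.md5(data).hexdigest()),
        ("xxh64", lambda: xxhash.xxh64(data).hexdigest()),
    ]:
        seconds = timeit.timeit(func, number=1000)
        print("%s: %.3f s for 1000 x 1 MiB" % (name, seconds))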

frerich commented 8 years ago

For the record, it seems that xxhash is indeed a lot faster than hashlib.md5 (about twice as fast in my experiments), but it's also a non-trivial dependency: not only is it not part of the standard library, it also requires a very specific Visual Studio installation (one matching the Python version), since the module includes C code.

Right now, clcache supports Python 3.3, 3.4 and 3.5. The WindowsCompilers page explains that for 3.3 and 3.4, Visual Studio 2010 needs to be installed; for 3.5, Visual Studio 2015. I have neither (I use Visual Studio 2008 and Visual Studio 2013 for my daily work), so I needed to hunt a bit for the right downloads. Since I use Python 3.5, the Visual C++ Build Tools download appears to be enough: it gets you just the command line tools required for pip install xxhash to work.

The bottom line: I think documenting this alternative hashing algorithm probably makes sense, or making it optional at runtime -- but it does not seem to have enough impact to warrant complicating the installation.

frerich commented 8 years ago

@Jimilian I think the performance test included in performancetests.py would be a good start. It actually exercises the entire roundtrip. It would be nice to see how it changes with different hash algorithms.

For what it's worth, I did a quick test run in a Windows VM, and here's the output I get:

Compiling 30 source files sequentially, cold cache: 12.269236257898246 seconds
Compiling 30 source files sequentially, hot cache: 3.1958666312493627 seconds
Compiling 30 source files concurrently via /MP2, hot cache: 3.219691360815654 seconds

Compiling 30 source files sequentially, cold cache: 11.387133487436277 seconds
Compiling 30 source files sequentially, hot cache: 3.1238149097423946 seconds
Compiling 30 source files concurrently via /MP2, hot cache: 3.1892145836394654 seconds

In this particular scenario, the improvement is certainly measurable, but I'm not sure it's big enough to warrant changing the default hash algorithm and imposing the more complicated installation. My test setup also suffers from somewhat slow I/O (it's in a Windows VM), so with faster hard disks you might see much bigger improvements.

I suspect that a much larger effect can be achieved by hashing fewer files in the first place: in other reports (e.g. #239) it was found that, since clcache processes don't communicate, a large number of files is hashed repeatedly.
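
A sketch of that idea, assuming a per-process memoisation keyed by path and stat data (names hypothetical; note that #239 concerns the cross-process case, which this simplification does not cover):

    import hashlib
    import os

    _digest_cache = {}  # (path, mtime, size) -> digest, per process

    def memoized_file_hash(path):
        # Avoid re-hashing a file whose metadata has not changed.
        st = os.stat(path)
        key = (path, st.st_mtime, st.st_size)
        digest = _digest_cache.get(key)
        if digest is None:
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            _digest_cache[key] = digest
        return digest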

hubx commented 8 years ago

@sschuberth done. Let's see if this works, given the difficulties explained above with building native extensions for Python on Windows with the corresponding Visual Studio version.

@frerich I forgot about the build dependencies there; since I set up scandir for Python 3.4 recently, the installation of xxhash was transparent for me :/.

But I would refrain from making it configurable. Rather, check what is available and prefer the faster hashing algorithm.

frerich commented 8 years ago

But I would refrain from making it configurable. Rather, check what is available and prefer the faster hashing algorithm.

That sounds like a plan. If you can adjust this PR such that it optionally uses xxhash, I think there's no real reason not to merge it.
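
The detection could be as simple as an import-time fallback, e.g. (a sketch, not the PR's actual code):

    try:
        import xxhash
        HashAlgorithm = xxhash.xxh64  # faster hash, used when installed
    except ImportError:
        from hashlib import md5 as HashAlgorithm  # stdlib fallback

Callers then instantiate HashAlgorithm() regardless of which backend is active.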

sasobadovinac commented 8 years ago

I was not able to get any useful speedup in my tests when switching to xxhash https://ci.appveyor.com/project/sasobadovinac/freecad/build/1.0.360

frerich commented 7 years ago

Closing this; xxhash is indeed faster, but it seems not so much faster that it would justify the more difficult installation. Furthermore, it was shown that a much better optimisation is to avoid recomputing hash sums in the first place and cache them instead.

Thanks everyone for the constructive discussion!

sasobadovinac commented 7 years ago

Would it make sense to retest this with all the new updates in 4.0.0 release?