frerich / clcache

A compiler cache for MSVC, much like ccache for gcc

Use multi threading instead of re-invoking clcache #255

Closed TiloW closed 7 years ago

TiloW commented 7 years ago

This idea came up when we observed rather large increases in build time when comparing clcache on cold caches to compiling without clcache.

One thing supposedly contributing to this increase is that clcache re-invokes itself for every source file to be compiled. This is particularly costly when using pyinstaller's --onefile flag, as reported in #232.

This PR proposes to switch from re-invoking clcache to multi-threading. Some changes had to be made to where output is performed, and the command line analyzer was modified to return a list of object files when given multiple source files (which is why the unit tests changed).
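
For illustration, here is a minimal sketch of what the analyzer change amounts to; the function name and return shape are simplified stand-ins, not the actual clcache internals:

```python
import os

def analyzeCommandLine(cmdline, objectDir="."):
    # Simplified stand-in for clcache's analyzer: instead of assuming a
    # single source file per invocation, map every source file on the
    # command line to its corresponding object file.
    sources = [arg for arg in cmdline
               if arg.lower().endswith((".c", ".cc", ".cpp", ".cxx"))]
    objects = [os.path.join(objectDir,
                            os.path.splitext(os.path.basename(src))[0] + ".obj")
               for src in sources]
    return sources, objects

# analyzeCommandLine(["/c", "a.cpp", "b.cpp"])
# -> (["a.cpp", "b.cpp"], ["./a.obj", "./b.obj"])
```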

Initially, I also wanted to invoke the real compiler at most once per invocation of clcache, but that is not part of this PR.

Integration tests failed on my machine even without changing anything on the current master. I did not bother adjusting them. If this is an error on my side, please let me know.

What does this change do performance-wise?

The following times were obtained on a machine running Windows 10 where everything takes place on the same SSD and builds are performed in parallel using 4 cores @ 3.3 GHz. We are using CLCACHE_HARDLINK.

| clcache | this PR | --onefile | warm cache | build time |
|---------|---------|-----------|------------|------------|
| 0       | -       | -         | -          | 2:10       |
| 1       | 0       | 0         | 0          | 3:40       |
| 1       | 0       | 0         | 1          | 1:10       |
| 1       | 0       | 1         | 0          | 2:57       |
| 1       | 0       | 1         | 1          | 0:32       |
| 1       | 1       | 0         | 0          | 2:27       |
| 1       | 1       | 0         | 1          | 0:17       |

Edit: I just realized that this table is somewhat confusing; --onefile = 1 means that I used pyinstaller without the flag.

Note that with these changes the compiler is invoked only twice during our build, so it no longer matters to us whether --onefile is used. Apart from multi-threading being faster, this is clearly the more convenient solution for users who invoke the compiler with many source files (e.g., msbuild invokes the compiler once per target).

TiloW commented 7 years ago

Integration tests seem to be failing. I'll have to see whether I can get them to execute on my local machine before fixing this.

sasobadovinac commented 7 years ago

Here are two builds from me using this. I do see somewhat faster builds, but this was still with --onefile:

- https://ci.appveyor.com/project/sasobadovinac/freecad/build/1.0.481 (msbuild time 36:15.79)
- https://ci.appveyor.com/project/sasobadovinac/freecad/build/1.0.482 (msbuild time 08:52.62)

codecov-io commented 7 years ago

Current coverage is 89.42% (diff: 89.87%)

Merging #255 into master will increase coverage by 0.14%

```
@@             master       #255   diff @@
==========================================
  Files             1          1
  Lines          1035       1040     +5
  Methods           0          0
  Messages          0          0
  Branches        173        166     -7
==========================================
+ Hits            924        930     +6
  Misses           82         82
+ Partials         29         28     -1
```

Powered by Codecov. Last update 4b0ebec...6a68a7c

TiloW commented 7 years ago

@frerich thanks for the review! Integration tests are working now.

frerich commented 7 years ago

Great, thanks for getting the tests going! I'll have another careful look at the code later today. I'm a little scared of using multiple threads since nothing in the code base is prepared for that; e.g., I wonder whether there could be concurrent accesses to the Python dictionary holding the statistics or so. :-}

TiloW commented 7 years ago

I wonder whether there could be concurrent accesses

That is indeed something that should be double-checked. Are there any global variables? We could perform deep copies before passing arguments to processActualCompileRequest to avoid concurrent access. I believe most values are only read, so it would suffice to copy the cache parameter.
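
As a rough sketch of the deep-copy idea (processActualCompileRequest is the function from this PR; the surrounding scaffolding here is illustrative):

```python
import copy
from concurrent.futures import ThreadPoolExecutor

def compileAll(cache, jobs):
    # Hand every worker its own deep copy of the cache object so that no
    # mutable state is shared between threads; arguments that are only
    # read could be passed as-is. The copies are made in the main thread,
    # before any worker runs.
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(processActualCompileRequest,
                               copy.deepcopy(cache), job)
                   for job in jobs]
        return [f.result() for f in futures]
```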

TiloW commented 7 years ago

By the way: the error that just occurred on the CI also happens on master on my machine.

TiloW commented 7 years ago

Here are two builds from me using this, I do see a bit faster builds, but this was still with --onefile

Interesting, thanks for trying this. --onefile should not matter with this change unless you are invoking the compiler many times with few source files. Could you provide some numbers, i.e., how much faster exactly is this patch for you? How often do you invoke the compiler during your build, and with how many files per invocation? I would also be interested in build times for warm caches.

For example, during a typical build I have only two compiler invocations and about 360 source files in total.

frerich commented 7 years ago

Interesting, thanks for trying this. --onefile should not matter with this change unless you are invoking the compiler many times with few source files.

For the record, this is indeed one of the two common usage patterns: some build systems (e.g. ninja) invoke the compiler once per source file, while others (e.g. msbuild) pass many source files to a single compiler invocation.

frerich commented 7 years ago

By the way: the error that just occured on the CI is also happening on master on my machine.

Interesting - I did not see that before. Can you reproduce it reliably? I wonder whether there's a race condition by subprocess.check_output returning earlier than expected, i.e. the main.exe executable is still in use somehow and thus it cannot be deleted. If so, this could maybe be reproduced with a sample loop copying some .exe file, then launching it via check_output and then deleting it.
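
Along those lines, a minimal reproduction attempt might look like this (whether it actually triggers the suspected race is exactly what would need checking):

```python
import os
import shutil
import subprocess

for i in range(1000):
    shutil.copy("main.exe", "copy.exe")
    subprocess.check_output(["copy.exe"])
    try:
        # If check_output can return while the spawned process still holds
        # the file open, this delete should occasionally fail on Windows.
        os.remove("copy.exe")
    except PermissionError as error:
        print("iteration %d: file still in use: %s" % (i, error))
        break
```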

TiloW commented 7 years ago

I did not see that before. Can you reproduce it reliably?

Not really, it happens occasionally and doesn't seem to be tied to a specific invocation of _clean().

TiloW commented 7 years ago

the global stdout stream. I don't know whether it can safely be used from multiple threads

Neither do I.

calls sys.exit

I've changed this to raising an exception.

Did you already try how printBinary and print behave when invoked concurrently?

Not explicitly. My build log looks fine though.

Thanks for your remarks. It's been a while since I've last written something in Python.
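
For what it's worth, one way to make the concurrent output question moot would be to serialize all writes behind a lock; a sketch (the real printBinary in clcache may look different):

```python
import sys
import threading

outputLock = threading.Lock()

def printBinary(stream, data):
    # Serialize writes so that output from concurrent compile jobs cannot
    # interleave mid-line, regardless of the stream's own thread safety.
    with outputLock:
        stream.buffer.write(data)
        stream.flush()

# printBinary(sys.stdout, b"compiling example.cpp\n")
```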

TiloW commented 7 years ago

Regarding #248: for builds that invoke the compiler few times with many sources, this change makes it very easy (no server needed) to cache the computed file hashes during a single invocation. I've tested this and measured a minor speedup (~5s) that is independent of warm/cold caches. This brings the runtime for warm caches down to 12s for me.
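
A sketch of such a per-invocation hash cache (hash algorithm and names are illustrative):

```python
import hashlib
import threading

knownHashes = {}
knownHashesLock = threading.Lock()

def getFileHashCached(path):
    # The same header is typically hashed by many compile jobs within one
    # invocation, so memoizing the digest avoids repeated disk reads.
    with knownHashesLock:
        if path in knownHashes:
            return knownHashes[path]
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    with knownHashesLock:
        knownHashes[path] = digest
    return digest
```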

There are still some things that I consider worth discussing regarding my changes:

- ThreadPoolExecutor is used even for invocations with a single source file and for invocations without /MP. This might increase runtime in such scenarios (depending on whether it actually creates a new thread when there is no need for it).
- The way the jobCmdLine is constructed, we do not preserve the order of arguments. I am not entirely sure whether this might cause any issues.
- Tests for invoking the compiler with multiple sources and /MP should be added.

frerich commented 7 years ago

ThreadPoolExecutor is used even for invocations with a single source file and for invocations without /MP. This might increase runtime in such scenarios (depending on whether it actually creates a new thread when there is no need for it).

Right, this is something which simply should be benchmarked.
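
If benchmarking does show overhead, one obvious mitigation would be to bypass the pool for the single-source case; a sketch (function and parameter names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def processCompileRequests(cache, sources, maxJobs):
    # Run the common single-source case inline, without starting any
    # worker threads; only fan out for genuinely parallel work.
    if len(sources) == 1:
        return [processActualCompileRequest(cache, sources[0])]
    with ThreadPoolExecutor(max_workers=maxJobs) as pool:
        return list(pool.map(
            lambda src: processActualCompileRequest(cache, src), sources))
```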

The way the jobCmdLine is constructed, we do not preserve the order of arguments. I am not entirely sure whether this might cause any issues.

The only difference is that the /MP switch is omitted and the source file argument goes last, if I'm reading the code correctly? That should be fine.
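
In other words, something along these lines, assuming the shared options have already been separated from the source file arguments:

```python
def buildJobCmdLine(commonArguments, sourceFile):
    # Per-job command line: drop any /MP switch (each job compiles exactly
    # one file) and append that job's source file last.
    arguments = [arg for arg in commonArguments
                 if not arg.upper().startswith(("/MP", "-MP"))]
    return arguments + [sourceFile]

# buildJobCmdLine(["/c", "/MP4", "/O2"], "a.cpp") -> ["/c", "/O2", "a.cpp"]
```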

Tests for invoking the compiler with multiple sources and /MP should be added.

There are some integration tests already in integrationtests.py, namely TestRunParallel and TestMultipleSources. Maybe they would be good places to add any test cases you have in mind?
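
A test case of the kind meant here might look roughly like this (source file names and the clcache entry point are placeholders, not the actual test fixtures):

```python
import os
import subprocess
import unittest

class TestMultipleSources(unittest.TestCase):
    def testParallelCompilationWithMP(self):
        # Compile several sources in one clcache invocation with /MP and
        # check that an object file is produced for each of them.
        subprocess.check_call(["clcache.exe", "/nologo", "/c", "/MP2",
                               "fibonacci01.cpp", "fibonacci02.cpp"])
        for objectFile in ["fibonacci01.obj", "fibonacci02.obj"]:
            self.assertTrue(os.path.exists(objectFile))
```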

TiloW commented 7 years ago

I've added a test for the command line output in #258. This PR should be merged afterwards.

TiloW commented 7 years ago

The only difference is that the /MP switch is omitted and the source file argument goes last, if I'm reading the code correctly? That should be fine.

Right, everything else remains the same.

Right, this is something which simply should be benchmarked.

Will do.

TiloW commented 7 years ago

I've got some more numbers for the current version of the PR. As mentioned by @akleber, ninja invokes the compiler once per source file. Thus, comparing ninja builds with this PR against master should give a good estimate of the runtime impact of using threads even when they are not needed. I've run the tests on a different (non-SSD) drive. The numbers are averaged across 3 builds each.

| system  | clcache | warm | this PR | time |
|---------|---------|------|---------|------|
| ninja   | 0       | -    | -       | 2:27 |
| msbuild | 0       | -    | -       | 2:12 |
| ninja   | 1       | 0    | 0       | 2:59 |
| msbuild | 1       | 0    | 0       | 2:51 |
| ninja   | 1       | 1    | 0       | 0:41 |
| msbuild | 1       | 1    | 0       | 0:34 |
| ninja   | 1       | 0    | 1       | 2:56 |
| msbuild | 1       | 0    | 1       | 2:28 |
| ninja   | 1       | 1    | 1       | 0:43 |
| msbuild | 1       | 1    | 1       | 0:22 |

ninja seems to be slower than msbuild, independently of clcache. It is not clear from these numbers whether a runtime increase is to be expected with this PR when using single-file invocations. I've read that multithreading doesn't perform as well in CPython because of its global interpreter lock, which is claimed to affect CPU-bound threads more than threads limited by I/O. While this would be in line with the above numbers, it's rather speculative.
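
The GIL effect is easy to demonstrate in isolation: the snippet below runs a pure-Python workload on several threads and takes roughly as long as running it sequentially, whereas threads that mostly wait on file or process I/O release the GIL and do overlap.

```python
import threading
import time

def cpuBoundWork():
    # Pure Python computation; the GIL allows only one thread at a time
    # to execute this bytecode.
    sum(i * i for i in range(10 ** 7))

start = time.time()
threads = [threading.Thread(target=cpuBoundWork) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print("4 threads took %.1fs" % (time.time() - start))
```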

I'm not sure why the runtime improvement is not as big as for the initial version of this PR but it might be caused by a542629.

frerich commented 7 years ago

Thanks for the numbers!

It is not clear from these numbers whether there is a runtime increase to be expected with this PR when using single file invocation.

Maybe it's not crystal clear, but I think the timings do allow some qualitative statement about this PR if you regroup them a bit. Here are the timings for cold caches (i.e. dominated by cache misses):

| system  | clcache | warm | this PR | time |
|---------|---------|------|---------|------|
| ninja   | 1       | 0    | 0       | 2:59 |
| msbuild | 1       | 0    | 0       | 2:51 |
| ninja   | 1       | 0    | 1       | 2:56 |
| msbuild | 1       | 0    | 1       | 2:28 |

In the case of cold caches, single-file invocations (represented by ninja) perform as fast as before. Certainly no worse. msbuild, passing multiple files, benefits by being about 14% faster, which was the whole point of this PR.

Here are the timings for warm caches, i.e. mostly cache hits:

| system  | clcache | warm | this PR | time |
|---------|---------|------|---------|------|
| ninja   | 1       | 1    | 0       | 0:41 |
| msbuild | 1       | 1    | 0       | 0:34 |
| ninja   | 1       | 1    | 1       | 0:43 |
| msbuild | 1       | 1    | 1       | 0:22 |

Again, single-file invocations are basically unchanged (I think it's safe to say that two seconds are within the margin of measurement error), but msbuild benefits: the build is a nice 35% faster.

Hence, ninja does not suffer in either use case, while msbuild always benefits (more so for warm caches, where the runtime isn't dominated by invoking the real compiler).

Based on this, I think the PR looks really good and I really have to think hard whether I can think of compelling reasons not to merge it right away. It's really just the "The code was never written with multiple threads in mind" insight which makes me very cautious. :-]

TiloW commented 7 years ago

I really have to think hard whether I can think of compelling reasons not to merge it right away

Now that #248 and #258 are merged, a rebase is in order. I would also like to have a closer look at the runtime impact of the newly introduced locks and the instantiation of Cache objects for each thread. This could, of course, also be done in a subsequent PR.
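
One possible way to avoid per-job Cache construction would be thread-local instances, e.g. (a sketch; Cache as in the clcache code base, the helper name hypothetical):

```python
import threading

threadLocalData = threading.local()

def getThreadCache():
    # Create one Cache per worker thread rather than one per compile job,
    # amortizing the construction cost over all jobs the thread handles.
    if not hasattr(threadLocalData, "cache"):
        threadLocalData.cache = Cache()
    return threadLocalData.cache
```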

It's really just the "The code was never written with multiple threads in mind" insight which makes me very cautious.

:+1:

frerich commented 7 years ago

Now that #248 and #258 are merged a rebase is in order.

Really? The PR still seems to merge cleanly (i.e. not cause any conflicts) according to GitHub.

TiloW commented 7 years ago

The PR still seems to merge cleanly

It does, but we don't know whether the tests introduced in #258 pass with these changes.

TiloW commented 7 years ago

I'd say this is ready to be merged. The issue described in #261 is not affected by this PR.

frerich commented 7 years ago

I tend to agree, this looks good to me. Let's merge it and see how it goes.

Thanks a lot for your great work and constructive communication - would you mind me adding you to the README file in the section of who gets the blame (or fame?) for the program?

TiloW commented 7 years ago

would you mind me adding you to the README file in the section of who gets the blame (or fame?)

Sure, glad I could contribute. (: