Integration tests seem to be failing. I'll have to see whether I can get them to execute on my local machine before fixing this.
Here are two builds from me using this; I do see somewhat faster builds, but this was still with --onefile
https://ci.appveyor.com/project/sasobadovinac/freecad/build/1.0.481 (msbuild time 36:15.79)
https://ci.appveyor.com/project/sasobadovinac/freecad/build/1.0.482 (msbuild time 08:52.62)
 | master | #255 | diff |
---|---|---|---|
Files | 1 | 1 | |
Lines | 1035 | 1040 | +5 |
Methods | 0 | 0 | |
Messages | 0 | 0 | |
Branches | 173 | 166 | -7 |
Hits | 924 | 930 | +6 |
Misses | 82 | 82 | |
Partials | 29 | 28 | -1 |
Powered by Codecov. Last update 4b0ebec...6a68a7c
@frerich thanks for the review! Integration tests are working now.
Great, thanks for getting the tests going! I'll have another careful look at the code later today. I'm a little scared by using multiple threads since nothing in the code base is prepared for that, e.g. I wonder whether there could be concurrent accesses to the Python dictionary holding the statistics or so. :-}
I wonder whether there could be concurrent accesses
That is indeed something that should be double-checked. Are there any global variables?
We could perform deep copies before passing arguments to processActualCompileRequest to avoid concurrent access. I believe most values are only read and it would suffice to copy the cache parameter.
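A minimal sketch of that idea (the worker below merely stands in for processActualCompileRequest, whose real signature and behaviour may differ):

```python
import copy
from concurrent.futures import ThreadPoolExecutor

def _compileWorker(cache, cmdLine, sourceFile):
    # Placeholder standing in for processActualCompileRequest(); the real
    # function's signature and return value may differ.
    return (sourceFile, len(cmdLine), id(cache))

def compileSourcesConcurrently(cache, baseCmdLine, sourceFiles):
    # Hand every worker its own deep copy of the shared arguments so that
    # no two threads ever touch the same mutable object.
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(_compileWorker,
                                   copy.deepcopy(cache),
                                   list(baseCmdLine),
                                   src)
                   for src in sourceFiles]
        return [f.result() for f in futures]

if __name__ == '__main__':
    print(compileSourcesConcurrently({"stats": {}}, ["/nologo", "/c"],
                                     ["a.cpp", "b.cpp"]))
```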
By the way: the error that just occurred on the CI is also happening on master on my machine.
Here are two builds from me using this; I do see somewhat faster builds, but this was still with --onefile
Interesting, thanks for trying this. --onefile should not matter with this change unless you are invoking the compiler many times with few source files. Could you provide some numbers, i.e., how much faster exactly is this patch for you?
How often do you invoke the compiler during your build, and with how many files per invocation?
I would also be interested in build times for warm caches.
For example, during a typical build I have only two compiler invocations and about 360 source files in total.
Interesting, thanks for trying this. --onefile should not matter with this change unless you are invoking the compiler many times with few source files.
For the record, this is indeed one of the two common usage patterns:
Some build systems (apparently most commonly those which are portable to non-Microsoft platforms) appear to follow a very traditional GNU Make style, in which dependency tracking and concurrency are not managed by the compiler but rather by the build system: e.g. a library depends on many individual object files, each of which depends on a single source file. Generating the object file means invoking the compiler with a single source file. In fact, the only portable build tool I'm aware of which exploits the concurrency built into Visual Studio is qmake. For what it's worth, this is the scenario for my own personal use case which initially triggered the development of clcache. :-)
Build systems which are more specialized on Windows seem to depend on the concurrency built into cl.exe a lot more, e.g. anything built on top of msbuild.
By the way: the error that just occurred on the CI is also happening on master on my machine.
Interesting - I did not see that before. Can you reproduce it reliably? I wonder whether there's a race condition caused by subprocess.check_output returning earlier than expected, i.e. the main.exe executable is still in use somehow and thus cannot be deleted. If so, this could maybe be reproduced with a simple loop copying some .exe file, then launching it via check_output and then deleting it.
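A standalone reproducer along those lines could look roughly like this (Windows-only sketch; the source executable, the argument and the iteration count are arbitrary):

```python
import os
import shutil
import subprocess

# Copy some executable, run it via check_output(), then delete it right away.
# If check_output() ever returns while the process still holds the file, the
# delete should fail with a sharing violation.
SOURCE_EXE = os.path.join(os.environ["SystemRoot"], "System32", "cmd.exe")

for i in range(200):
    target = os.path.abspath("probe_%d.exe" % i)
    shutil.copyfile(SOURCE_EXE, target)
    subprocess.check_output([target, "/c", "echo hello"])
    try:
        os.remove(target)
    except OSError as error:
        print("iteration %d: %s" % (i, error))
        break
```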
I did not see that before. Can you reproduce it reliably?
Not really, it happens occasionally and doesn't seem to be tied to a specific invocation of _clean().
the global stdout stream. I don't know whether it can safely be used from multiple threads
Neither do I.
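If it turns out not to be safe, one option would be to serialize the writes with a lock. A minimal sketch (the printBinary name follows the discussion, the actual code in this PR may look different):

```python
import sys
import threading

_output_lock = threading.Lock()

def printBinary(stream, data):
    # Serialize writes so that output from concurrent compile jobs cannot
    # interleave mid-line.
    with _output_lock:
        stream.buffer.write(data)
        stream.flush()

printBinary(sys.stdout, b"hello from a worker thread\n")
```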
calls sys.exit
I've changed this to raising an exception.
Did you already try how printBinary and print behave when invoked concurrently?
Not explicitly. My build log looks fine though.
Thanks for your remarks. It's been a while since I last wrote something in Python.
Regarding #248, for builds that invoke the compiler only a few times with many sources, this change makes it very easy (no need for a server) to cache the computed file hashes within a single invocation. I've tested this and experienced a minor speedup (~5s) that is independent of warm/cold caches. This decreases runtime for warm caches to 12s for me.
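For illustration, such an in-invocation cache can be as simple as a dictionary keyed by path (the function and hash names here are made up, not necessarily what #248 uses):

```python
import hashlib

_hashCache = {}

def getCachedFileHash(path):
    # Within one clcache invocation the same header may be hashed many times;
    # memoizing the digest avoids re-reading the file.
    try:
        return _hashCache[path]
    except KeyError:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        _hashCache[path] = digest
        return digest

if __name__ == '__main__':
    print(getCachedFileHash(__file__))
```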
There are still some things that I consider discussion-worthy regarding my changes:
- ThreadPoolExecutor is used even for invocations with a single source file and for invocations without /MP. This might increase runtime in such scenarios (depending on whether it actually creates a new thread when there is no need for it).
- The way the jobCmdLine is constructed, we do not preserve the order of arguments. I am not entirely sure whether this might cause any issues.
- Tests for invoking the compiler with multiple sources and /MP should be added.
ThreadPoolExecutor is used even for invocations with a single source file and for invocations without /MP. This might increase runtime in such scenarios (depending on whether it actually creates a new thread when there is no need for it).
Right, this is something which simply should be benchmarked.
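A rough micro-benchmark of the pure executor overhead could look like this (a sketch, not part of the PR; the dummy job stands in for handling one source file):

```python
import timeit
from concurrent.futures import ThreadPoolExecutor

def job():
    # Dummy work standing in for handling one source file.
    return sum(range(1000))

def direct():
    return job()

def via_executor():
    # Mimics what a single-source invocation would do with this PR:
    # spin up an executor just to run one job.
    with ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(job).result()

print("direct      :", timeit.timeit(direct, number=1000))
print("via executor:", timeit.timeit(via_executor, number=1000))
```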
The way the jobCmdLine is constructed, we do not preserve the order of arguments. I am not entirely sure whether this might cause any issues.
The only difference is that the /MP switch is omitted and the source file argument goes last, if I'm reading the code correctly? That should be fine.
Tests for invoking the compiler with multiple sources and /MP should be added.
There are some integration tests already in integrationtests.py, namely TestRunParallel and TestMultipleSources. Maybe they would be good places to add any test cases you have in mind?
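A rough sketch of such a test case (the clcache command, compiler flags and fixture file names below are assumptions; the existing tests in integrationtests.py have their own helpers and fixtures):

```python
import subprocess
import unittest

class TestMultipleSourcesWithMP(unittest.TestCase):
    def testCacheHitsWithMP(self):
        # Compile two translation units in one invocation with /MP, twice:
        # the first run populates the cache, the second should hit it.
        cmd = ["clcache.exe", "/nologo", "/EHsc", "/MP2", "/c",
               "fibonacci01.cpp", "fibonacci02.cpp"]
        subprocess.check_call(cmd)
        subprocess.check_call(cmd)

if __name__ == '__main__':
    unittest.main()
```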
I've added a test for the command line output in #258. This PR should be merged afterwards.
The only difference is that the /MP switch is omitted and the source file argument goes last, if I'm reading the code correctly? That should be fine.
Right, everything else remains the same.
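For illustration, the behaviour described above amounts to something like the following (a sketch only; the actual jobCmdLine construction in the PR may differ in detail):

```python
def buildJobCmdLine(commonArguments, sourceFile):
    # Drop any /MP switch from the shared arguments and append the single
    # source file last; everything else is passed through unchanged.
    filtered = [arg for arg in commonArguments
                if not arg.upper().startswith("/MP")]
    return filtered + [sourceFile]

print(buildJobCmdLine(["/nologo", "/MP4", "/c", "/O2"], "main.cpp"))
# -> ['/nologo', '/c', '/O2', 'main.cpp']
```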
Right, this is something which simply should be benchmarked.
Will do.
I've got some more numbers for the current version of the PR.
As mentioned by @akleber, ninja invokes the compiler with a single source file at a time. Thus, comparing ninja builds with this PR to master should give a good estimate of the runtime impact of using threads even when they are not needed. I've run the tests on a different (non-SSD) drive.
The numbers are averaged across 3 builds each.
system | clcache | warm | this PR | time |
---|---|---|---|---|
ninja | 0 | - | - | 2:27 |
msbuild | 0 | - | - | 2:12 |
ninja | 1 | 0 | 0 | 2:59 |
msbuild | 1 | 0 | 0 | 2:51 |
ninja | 1 | 1 | 0 | 0:41 |
msbuild | 1 | 1 | 0 | 0:34 |
ninja | 1 | 0 | 1 | 2:56 |
msbuild | 1 | 0 | 1 | 2:28 |
ninja | 1 | 1 | 1 | 0:43 |
msbuild | 1 | 1 | 1 | 0:22 |
ninja seems to be slower than msbuild, independently of clcache.
It is not clear from these numbers whether there is a runtime increase to be expected with this PR when using single file invocation. I've read that multithreading in CPython doesn't perform as well because of the global interpreter lock. This is claimed to affect CPU-bound threads more than threads limited by IO. While this would be in line with the above numbers, it's rather speculative.
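A small experiment illustrating that claim (not measured as part of this PR): CPU-bound work barely benefits from more threads under the GIL, while sleep-based, IO-like work scales.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound():
    return sum(i * i for i in range(200000))

def io_bound():
    time.sleep(0.05)

def timed(func, workers):
    # Run eight tasks with the given number of worker threads and report
    # the wall-clock time.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(lambda _: func(), range(8)))
    return time.perf_counter() - start

for name, func in (("cpu", cpu_bound), ("io", io_bound)):
    print(name, "1 worker:", round(timed(func, 1), 3),
          "4 workers:", round(timed(func, 4), 3))
```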
I'm not sure why the runtime improvement is not as big as for the initial version of this PR but it might be caused by a542629.
Thanks for the numbers!
It is not clear from these numbers whether there is a runtime increase to be expected with this PR when using single file invocation.
Maybe it's not crystal clear, but I think the timings do allow some qualitative statement about this PR if you regroup them a bit. Here are the timings for cold caches (i.e. dominated by cache misses):
system | clcache | warm | this PR | time |
---|---|---|---|---|
ninja | 1 | 0 | 0 | 2:59 |
msbuild | 1 | 0 | 0 | 2:51 |
ninja | 1 | 0 | 1 | 2:56 |
msbuild | 1 | 0 | 1 | 2:28 |
In the case of cold caches, single-file invocations (represented by ninja) perform as fast as before, certainly no worse. msbuild, passing multiple files, benefits by being about 14% faster, which was the whole point of this PR.
Here are the timings for warm caches, i.e. mostly cache hits:
system | clcache | warm | this PR | time |
---|---|---|---|---|
ninja | 1 | 1 | 0 | 0:41 |
msbuild | 1 | 1 | 0 | 0:34 |
ninja | 1 | 1 | 1 | 0:43 |
msbuild | 1 | 1 | 1 | 0:22 |
Again, single-file invocations are basically unchanged (I think it's safe to say that two seconds are within the margin for measurement errors), but msbuild benefits - the build being a nice 36% faster.
Hence, ninja does not suffer in either use case, but msbuild always benefits (more so for warm caches, where the runtime isn't dominated by invoking the real compiler).
Based on this, I think the PR looks really good, and I have to think hard to come up with compelling reasons not to merge it right away. It's really just the "The code was never written with multiple threads in mind" insight which makes me very cautious. :-]
I have to think hard to come up with compelling reasons not to merge it right away
Now that #248 and #258 are merged, a rebase is in order.
I would also like to have a closer look at the runtime impact of the newly introduced locks and the instantiation of Cache objects for each thread. This could of course also be done in a subsequent PR.
It's really just the "The code was never written with multiple threads in mind" insight which makes me very cautious.
:+1:
Now that #248 and #258 are merged, a rebase is in order.
Really? The PR still seems to merge cleanly (i.e. not cause any conflicts) according to GitHub.
The PR still seems to merge cleanly
It does but we don't know whether the tests introduced in #258 are passing with these changes.
I'd say this is ready to be merged. The issue described in #261 is not affected by this PR.
I tend to agree, this looks good to me. Let's merge it and see how it goes.
Thanks a lot for your great work and constructive communication - would you mind me adding you to the README file in the section of who gets the blame (or fame?) for the program?
would you mind me adding you to the README file in the section of who gets the blame (or fame?)
Sure, glad I could contribute. (:
This idea was coined when we observed rather large increases in build time when comparing clcache on cold caches to compiling without clcache.

One thing supposedly contributing to this increase is that clcache re-invokes itself for every source file to be compiled. This is particularly costly when using pyinstaller's --onefile flag, as reported in #232.

This PR proposes to switch from re-invoking clcache to multi-threading. Some changes had to be made to where output is performed, and the command line analyzer was modified to return a list of object files when given multiple source files (this is why the unit tests changed).

Initially, I wanted to also invoke the real compiler at most once per invocation of clcache, but this is not contained in this PR.

Integration tests failed on my machine even without changing anything on the current master. I did not bother adjusting them. If this is an error on my side, please let me know.
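To sketch what the switch from re-invocation to multi-threading looks like in principle (heavily simplified, with placeholder names; the actual code in this PR differs):

```python
from concurrent.futures import ThreadPoolExecutor

def processSingleSource(sourceFile, objectFile):
    # Placeholder for the per-file cache lookup / compiler invocation.
    return "compiled %s -> %s" % (sourceFile, objectFile)

def handleCompileRequests(sourceFiles, objectFiles, jobCount):
    # Instead of re-invoking clcache once per source file, run one worker
    # thread per translation unit inside the current process.
    with ThreadPoolExecutor(max_workers=jobCount) as executor:
        return list(executor.map(processSingleSource, sourceFiles, objectFiles))

if __name__ == '__main__':
    print(handleCompileRequests(["a.cpp", "b.cpp"], ["a.obj", "b.obj"],
                                jobCount=2))
```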
What does this change do performance-wise?
The following times are obtained on a machine running Windows 10, where everything takes place on the same SSD and builds are performed in parallel using 4 cores @ 3.3 GHz. We are using CLCACHE_HARDLINK.

(build time table with an --onefile column)

Edit: I just realized that this table is somewhat confusing: --onefile = 1 means that I've used pyinstaller without the flag.

Note that with these changes the compiler is invoked only twice during our build. Thus, it no longer matters to us whether --onefile is used. Apart from multithreading being faster, this would clearly be the more convenient solution for users that do invoke the compiler with many source files (e.g., msbuild invokes the compiler once for every target).