crazyzhangcrazy / distcc

Automatically exported from code.google.com/p/distcc
GNU General Public License v2.0
0 stars 0 forks source link

--localslots not constraining the concurrency of local jobs that cannot run remotely #153

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Answering the following questions is a big help:

1. What version of distcc are you using (e.g. "2.7.1")?  You can run "distcc 
--version" to see.  If you got distcc from a distribution package rather than 
building from source, please say which one.

distcc-3.2rc1

2. What platform are you running on (e.g. "Red Hat 8.0", "HP-UX 11.11")?  What 
compilare are you using ("gcc 3.3")?  Run "uname -a" and "cc --version" to see.

RHEL 6.4, GCC 4.4.7

3. What were you trying to do (e.g. "install distcc", "build Mozilla")?

tweaking the --localslots hosts list option to optimize performance

4. What went wrong?  Did you get an error message, did it hang, did it build a 
program that didn't work, did it not distribute compilation to machines that 
ought to get it?

The flag does not appear to have any effect on the build. I expected tweaking 
this value to have an effect on the number of concurrently executed linking 
jobs, where setting the value to 1 meant no concurrent linking jobs would 
happen. However, I seem to be getting up to 4 concurrent local linking jobs 
regardless of what I set. This is further confirmed by the distcc logs, which 
shows that my --localslots=1 was being read in (with no parsing error) whenever 
the host list was being read. However, any of my distcc subprocesses that 
handled linking, after saying "called for link? i give up", would try to get 
slots 2+ after cpu_localhost_0 is found to be busy.

After reading the source code, it looks like --localslots is taken into account 
when the host list is read and parsed, where these -- flags are treated as 
special cases. Specifically, the --localslots and --localslots_cpp will update 
the n_slots field in global variables dcc_hostdef_local and 
dcc_hostdev_local_cpp. The n_slots fields are originally initialized to 4 and 
8, so if the host list wasn't read that's what they will be.

This issue happens near line 524 in compile.c. Here, distcc parses the 
arguments in dcc_scan_args and decides that it's a linking command that cannot 
be compiled remotely, and therefore skips straight to lock_local. The host list 
was never read prior to this logic, and the skip bypasses any later points 
where the host list would have been read (such as during 
dcc_pick_host_from_list_and_lock_it). As a result, the host list was never read 
in, dcc_hostdev_local->n_slots was never updated, and hence I would always see 
up to 4 concurrent linking commands regardless of what --localslots is set to.

So at the moment, the only effect --localslots would have is controlling the 
number of jobs that failed remotely (this is only a subset of "that cannot be 
run remotely") that can be run concurrently on the local machine.

It looks like this can be fixed easily by simply running dcc_get_hostlist once 
before the "goto lock_local" line. In the case where hostlist fails to be 
obtained, perhaps it can just issue a warning that mentions the consequence 
that --localslots restrictions cannot be applied (although if the hostlist 
can't be obtained I would imagine there are far bigger things to worry about).

I did some manual testing with this fix and saw that it's working as expected 
again. I would love it if someone can double-check my findings/code 
understanding to make sure they're correct, and if so, I would like to submit a 
patch (how do I do this?), and some guidance on where/how to add a unit test 
for this would be great.

Thanks,
Jack

P.S. checked with latest source and confirmed problem still exists, also 
couldn't find any existing open issues (searched for "localslots"), so I don't 
think this has been brought up before.

Original issue reported on code.google.com by jack.zha...@gmail.com on 5 Jun 2015 at 12:56

GoogleCodeExporter commented 8 years ago
Hello, 
Any ideas if this can be temporarly solved without editing distcc sources? It's 
quite painful when you have 40 cores, 40 binaries to link and linking using 4 
cores...
Thanks in advance!

Original comment by mattia.b...@gmail.com on 30 Jun 2015 at 11:22