next steps for parallel clang plugin

regehr commented 3 years ago

I just want to jot down a few thoughts before I forget them. I don't plan to work on these items immediately, there's more important stuff to do. anyway, here are what I think are the next items for parallel alive2:

write a third parallel manager along lines Nuno has previously suggested: using a global counting semaphore or a named pipe or whatever. this frees us from GNU make and its awkward jobserver protocol
starting with the current eager forking strategy in tv.cpp, implement batching where we queue up TV jobs in memory and then when we get N of them, fork off a child process to do the work. this reduces the fork overhead by an arbitrary factor, at cost of some potential load imbalance across cores.

the key thing here is that these implementation tasks are separate.

regehr commented 3 years ago

I'll just add that both of these tasks should be fairly easy, given the infrastructure already in place.

batching has two awkward aspects. first, it will be slightly tricky to make the TV report the same as the sequential version. probably the way to do this is for a child process that is going to do N TV jobs inherits N pipe ends. this would allow almost all of the current machinery to work as-is. second, batching will interact badly with a pull request I submitted today which destroys a child process after a timeout expires. I'm not sure what the solution is, but we can worry about it later.

regehr commented 3 years ago

Well, ugh, as I build more projects I'm starting to see deadlocks from the jobserver implementation. I knew there was potential for this, but I was hoping it wouldn't happen. I believe it is an unavoidable consequence of:

both parent and child processes wait on jobserver slots
parent processes wait for all children to exit, before they exit

For now I'm using the unrestricted parallel implementation, which doesn't try to kill my machine as much as I'd have expected :).

But I suppose I'll have to do the pipe/semaphore thing sooner rather than later. @nunoplopes please let me know what you think of this design. You run:

alive-manager -j10 make -j10 CC=alivecc

Now, alive-manager (let me know if anyone has a better name) creates the named pipe and lets make (and its sub-processes) know what it is called using an environment variable. The manager process can exit on its own once "make" finishes.

This isn't lovely, and it gives up global coordination between make-level parallelism and alive-level parallelism, but I think it removes deadlocks (by segregating the two parallel resources) and it does not seem horribly inconvenient to use.

nunoplopes commented 3 years ago

I would love if the manager could be run independently as well (without it having to fork the child processes). This enables scenarios where we compile distinct programs and still achieve global coordination without having a single Makefile for all programs. The advantage of typing the command as you did is that the manager is shutdown when no longer needed, while the "daemon" version need to be shutdown manually. I think we can support both. If you want to be fancy, you can pass the second -j10 to make automatically; no need to duplicate it.

regehr commented 3 years ago

I think this is all easy and I'll do it at some point. doesn't seem super high priority since for now the unrestricted parallel manager seems to work fine -- it does overload the machine sometimes, but in this case the scheduler just round-robins. I've not yet had my machine run out of RAM.

in standalone mode, we could either make the named pipe use a well-known name, in which case you can only run one manager at once, or else it could print some environment variables that you would need to set in other shells, so that new TV processes know who to talk to. nothing hard about any of it.

regehr commented 3 years ago

Just to update and summarize a few things that I've said in emails (plus a few new thoughts), I consider our current challenge to be TV for the MultiSource directory of the LLVM test-suite repo. It contains like 600k lines of C including some very tough targets like sqlite3.c. Even my 128 thread / 256 GB machine can't make a lot of headway at present.

Summarizing our lines of attack:

Nuno's new memory model should reduce the long vcgen times we sometimes see
alive sometimes uses a huge amount of RAM. I'd like to wait to attack this until we're on the new memory model
we don't have to verify after every pass, we can batch them up. I've not done any work along these lines but Juneyoung has. something I don't know is whether we want to try to be smart about when to run TV, or whether we should just run it after every N passes. we might even randomize this a bit to avoid risk of systematically hiding compiler bugs, for example run TV every 1..6 passes where the number is chosen uniformly.
we can amortize fork overhead in the plugin by forking for only every N functions (currently N==1). fork overhead is currently not really a problem, but it could be if alive2 becomes significantly faster
we probably need to call fork() earlier in our runOnFunction(). we currently have a fairly serious sequential bottleneck due to too much code running before we manage to fork. this only matters for files containing a huge number of functions.
~~I'll write the new jobserver described above to control parallelism better, this should fix the OOM problems I see while trying to run this TV job~~ (fixed by https://github.com/AliveToolkit/alive2/pull/638)
another thing that goes badly when processing gigantic files like sqlite3.c is that the main clang process stores a lot of data output by child processes. at worst this ends up being multiple GB, causing fork() to get slower and slower. I wrote it this way since children complete out of order. however, there's no reason why we can't dump this data to a file and free up the associated RAM when enough child processes have finished. I'll make this happen. (fixed by https://github.com/AliveToolkit/alive2/pull/635)

regehr commented 3 years ago

ok, I am asking three questions about our clang plugin:

What are the sequential bottlenecks besides llvm2alive?
Are we calling llvm2alive too many times?
How can we make llvm2alive faster?

Juneyoung's pull req https://github.com/AliveToolkit/alive2/pull/637 should address item 1, but I'll do some testing to make sure, once it lands.

Item 2 is interesting and a bit surprising. I modified tv.cpp so that it maintains a history of function strings it has seen. We call it a cache hit when llvm2alive produces a stringified function that it has produced before, and a cache miss otherwise. This is what I get for a smallish C file that I have sitting around:

regehr@home:~/rb_tree_demo$ grep CACHE out.txt  | sort | uniq -c
  16522 CACHE_HIT
     98 CACHE_MISS
regehr@home:~/rb_tree_demo$

So there is clearly an opportunity for a lot of speedup here. I'm not yet sure what is the best way to do this, though.

I have not looked into item 3 yet.

regehr commented 3 years ago

it's possible that we can use the code in llvm/lib/IR/StructuralHash.cpp to avoid repeating work

AliveToolkit / alive2

next steps for parallel clang plugin #614