dreamyguy / gitlogg

💾 🧮 🤯 Parse the `git log` of multiple repos to `JSON`
MIT License

Add support for parallel analysis of repositories #10

Closed · Inventitech closed this 7 years ago

Inventitech commented 7 years ago

When running on multiple large repositories, gitlogg used to analyze them sequentially. This is fine as long as disk speed is the bottleneck. Calling git log is a CPU-expensive operation, however, so even single-SSD machines can benefit from a performance boost by parallelizing the work.

By default, we start n-1 subprocesses, where n is the computer's number of CPUs. This ensures we do not overload the system in the case of a desktop computer. With the -n parameter, however, the user can specify an arbitrary number of processes to spawn.
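In shell terms, the default and the override could look roughly like this (a sketch, not the PR's code verbatim; `$repos_dir` is a hypothetical stand-in for wherever the repositories live):

```bash
# Default worker count: one fewer than the number of online CPUs,
# never below one. getconf works on both Linux and macOS.
cores=$(getconf _NPROCESSORS_ONLN)
workers=$(( cores > 1 ? cores - 1 : 1 ))

# Allow the user to override the pool size with -n.
while getopts "n:" opt; do
  [ "$opt" = "n" ] && workers="$OPTARG"
done

# Fan the repositories out to the pool: one xargs job per repo,
# up to $workers running at a time.
find "$repos_dir" -mindepth 1 -maxdepth 1 -type d -print0 |
  xargs -0 -n 1 -P "$workers" ./output-intermediate-gitlog.sh
```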

This mandated the following changes:

Inventitech commented 7 years ago

Hey @dreamyguy, I also fixed the merge conflicts for you (unfortunate overlap of our work). :+1:

This PR speeds up the processing of a large number of big repositories on machines where there is plenty of disk speed but single-CPU git log performance is the bottleneck. I get about a 10x speed improvement on our 16-core server with SSDs in RAID 5, and about a 2-3x speed improvement on my laptop.

When you compare this to the previous parallelization suggestion, it has several advantages:

There are quite a few more things to improve in the Bash part, but let's tackle them one at a time. :+1:

PS. You said you tested it with 470 repos and that it took ~3 min. This indicates the repositories are fairly small, and sequential git log execution is then typically not the bottleneck, so the speed improvement from parallelization should be negligible (if any). Things change when you try it with something like Chromium, LibreOffice, ... ;)

dreamyguy commented 7 years ago

Hi there @Inventitech, I've just tested your changes, and it's all shiny! 👍 ✨

(screenshot: parallelize-ftw)

It was about 42% faster on the repos I've tested, on my 4-core desktop (3.4 GHz). I'm in for any performance boost!

As you can see, there's only this one line I didn't get: -D doesn't seem to be legal. But that's not a show-stopper for now.

Cheers for this update! 🍻 🎉

dreamyguy commented 7 years ago

I'll do some updates to the README when I can, and a release to mark this milestone. 🏆

dreamyguy commented 7 years ago

...and just to answer it within context, yes, the 470 repos are indeed relatively small compared to the likes of LibreOffice. 😜

Inventitech commented 7 years ago

Awesome. More to come in the future :+1:

cspotcode commented 7 years ago

Does xargs guarantee that the output from multiple git processes won't be merged in strange ways? This Stack Overflow post suggests there's no guarantee that two lines from separate processes won't be interleaved in a strange way.
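A toy way to provoke the concern (a sketch: each printf below is a separate write call, so with -P the pieces of one job's line can land between another's; being timing-dependent, it may take several runs to observe):

```bash
# Each job emits its line in two separate writes; under -P 4 the
# "start-" of one job can be separated from its "end-" by another job.
seq 8 | xargs -P 4 -I{} sh -c 'printf "start-%s " "{}"; printf "end-%s\n" "{}"' > merged.txt

# Fewer than 8 well-formed lines means output got interleaved.
grep -c '^start-[0-9]* end-[0-9]*$' merged.txt
```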

Inventitech commented 7 years ago

@cspotcode Good catch. By running the actual command as a subprocess in output-intermediate-gitlog.sh, collecting all output for one project and only then returning, the potential for conflict is greatly reduced, but not totally eliminated. In practice, I have not observed such behavior yet.

The simplest solution would be to replace xargs with parallel. @dreamyguy: Would that be a suitable solution for you? However, parallel is not as widespread as xargs (yet).

An alternative solution would be to write the output and take a flock directly inside output-intermediate-gitlog.sh.
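Both alternatives are easy to sketch (hedged: this is not the PR's code; `$repos_dir` is a hypothetical stand-in, the git log format flags are elided, and flock(1) ships with util-linux on Linux but not on stock macOS):

```bash
# Option 1: GNU parallel buffers each job's output and prints it as
# a unit, so lines from different repos cannot interleave.
find "$repos_dir" -mindepth 1 -maxdepth 1 -type d |
  parallel ./output-intermediate-gitlog.sh {} >> gitlogg.tmp

# Option 2: keep xargs, but take an exclusive lock around the final
# append so only whole, already-buffered outputs hit the file.
buffer="$(git -C "$repo" log --no-merges)"   # real format flags elided
(
  flock -x 200                 # exclusive lock on gitlogg.tmp via fd 200
  printf '%s\n' "$buffer" >&200
) 200>>gitlogg.tmp
```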

dreamyguy commented 7 years ago

I have now tested this with many huge repos, and I'm thinking of running git log without --no-merges to see whether the total commit count matches the number of lines in gitlogg.tmp.

The numbers I've got so far with the --no-merges option seem realistic.
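A quick way to run that comparison (a sketch, assuming one repo per gitlogg.tmp run and one line per commit; rev-list counts the same commits that git log walks):

```bash
# Compare the repo's own commit count against the lines gitlogg emitted.
# Drop --no-merges from both sides to compare totals including merges.
expected=$(git -C "$repo" rev-list --count --no-merges HEAD)
actual=$(wc -l < gitlogg.tmp)
if [ "$expected" -ne "$actual" ]; then
  echo "mismatch: $expected commits vs $actual lines" >&2
fi
```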

We're finally past the buffer limitation in V8, which was solved by parsing the JSON through a node stream. I have successfully generated a JSON file for the linux & git repos combined (634,959 & 45,091 commits, respectively). The JSON file came out at 904.8 MB! 😜

This parallel processing of repos is so handy, I can't imagine not having it. I'm more for putting the xargs question on ice and revisiting it when/if issues arise...

dreamyguy commented 7 years ago

This really is too bad, but xargs seems to be too unreliable. In all my tests involving two or more huge repositories, I got corrupt gitlogg.tmp files, which led to corrupt JSON parsing. I made a script that checks the integrity of the data as it is streamed into the JSON file, making it stop at undefined values. With that in place, I couldn't get the parsing to succeed.

Having a closer look at the output in gitlogg.tmp, I see that two (or more) processes clash with each other while writing to the file: at some point one process takes over from another, mixing the beginning of one line with the end of another in the same line, and thereby "breaking" the data.

It doesn't happen with a single repository; running just one works every time.
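For anyone wanting a rough shell approximation of that integrity check (a sketch only; the real check runs inside the node stream, and both FIELD_SEP and EXPECTED_FIELDS below are hypothetical stand-ins for the intermediate format's actual delimiter and field count):

```bash
# Flag lines whose field count deviates from the expected value;
# such lines are candidates for cross-process corruption.
FIELD_SEP='|'        # hypothetical; substitute gitlogg's real delimiter
EXPECTED_FIELDS=20   # hypothetical; adjust to the intermediate format
awk -F"$FIELD_SEP" -v n="$EXPECTED_FIELDS" \
  'NF != n { printf "line %d has %d fields\n", NR, NF }' gitlogg.tmp
```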

I tested it with these repos ([commit count]):

[1,000,000] https://github.com/cirosantilli/test-many-commits-1m
[  634,959] https://github.com/torvalds/linux
[  424,810] https://github.com/CyanogenMod/android_kernel_yu_msm8916
[  399,784] https://github.com/LibreOffice/core
[  106,402] https://github.com/odoo/odoo
[   96,062] https://github.com/NixOS/nixpkgs
[   70,822] https://github.com/Homebrew/homebrew-core
[   60,224] https://github.com/rails/rails
[   45,091] https://github.com/git/git
[   23,493] https://github.com/django/django