dreamyguy / gitlogg

💾 🧮 🤯 Parse the `git log` of multiple repos to `JSON`
MIT License

Add support for parallel analysis of repositories #10

Closed · Inventitech closed this 7 years ago

Inventitech commented 7 years ago

When running on multiple large repositories, gitlogg used to analyze them sequentially. This is fine as long as disk speed is the bottleneck. Calling git log is a CPU-expensive operation, however, so even single-SSD machines can benefit from a performance boost by parallelizing the work.

By default, we start n-1 subprocesses, where n is the computer's number of CPUs. This ensures we do not overload the system in the case of a desktop computer. With the -n parameter, however, the user can specify an arbitrary number of processes to spawn.
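In shell terms, the default and the override could look roughly like this (a sketch, not the PR's code verbatim; `$repos_dir` is a hypothetical stand-in for wherever the repositories live):

```bash
# Default worker count: one fewer than the number of online CPUs,
# never below one. getconf works on both Linux and macOS.
cores=$(getconf _NPROCESSORS_ONLN)
workers=$(( cores > 1 ? cores - 1 : 1 ))

# Allow the user to override the pool size with -n.
while getopts "n:" opt; do
  [ "$opt" = "n" ] && workers="$OPTARG"
done

# Fan the repositories out to the pool: one xargs job per repo,
# up to $workers running at a time.
find "$repos_dir" -mindepth 1 -maxdepth 1 -type d -print0 |
  xargs -0 -n 1 -P "$workers" ./output-intermediate-gitlog.sh
```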

This mandated the following changes:

Inventitech commented 7 years ago

Hey @dreamyguy, I also fixed the merge conflicts for you (unfortunate overlap of our work). :+1:

This PR speeds up the processing of a large number of big repositories on machines where there is plenty of disk speed but single-CPU git log performance is the bottleneck. I get about a 10x speed improvement on our 16-core server with SSDs in RAID 5, and about a 2-3x speed improvement on my laptop.

When you compare this to the previous parallelization suggestion, it has several advantages:

There are quite a few more things to improve in the Bash part, but let's tackle them one at a time. :+1:

PS. You said you tested it with 470 repos and that it took ~3 min. This indicates the repositories are fairly small, and sequential git log execution is then typically not the bottleneck, so the speed improvement from parallelization should be negligible (if any). Things change when you try it with something like Chromium, LibreOffice, ... ;)

dreamyguy commented 7 years ago

Hi there @Inventitech, I've just tested your changes, and it's all shiny! 👍 ✨

(screenshot: parallelize-ftw)

It was about 42% faster on the repos I've tested, on my 4-core desktop (3.4 GHz). I'm in for any performance boost!

As you can see, there's only this one line I didn't get: -D doesn't seem to be legal. But that's not a show-stopper for now.

Cheers for this update! 🍻 🎉

dreamyguy commented 7 years ago

I'll do some updates to the README when I can, and a release to mark this milestone. 🏆

dreamyguy commented 7 years ago

...and just to answer it within context, yes, the 470 repos are indeed relatively small compared to the likes of LibreOffice. 😜

Inventitech commented 7 years ago

Awesome. More to come in the future :+1:

cspotcode commented 7 years ago

Does xargs guarantee that the output from multiple git processes won't be merged in strange ways? This Stack Overflow post suggests there's no guarantee that two lines from separate processes won't be interleaved in a strange way.
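A toy way to provoke the concern (a sketch: each printf below is a separate write call, so with -P the pieces of one job's line can land between another's; being timing-dependent, it may take several runs to observe):

```bash
# Each job emits its line in two separate writes; under -P 4 the
# "start-" of one job can be separated from its "end-" by another job.
seq 8 | xargs -P 4 -I{} sh -c 'printf "start-%s " "{}"; printf "end-%s\n" "{}"' > merged.txt

# Fewer than 8 well-formed lines means output got interleaved.
grep -c '^start-[0-9]* end-[0-9]*$' merged.txt
```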

Inventitech commented 7 years ago

@cspotcode Good catch. By running the actual command as a subprocess in output-intermediate-gitlog.sh, collecting all output for one project and only then returning, the potential for conflict is greatly reduced, but not totally eliminated. In practice, I have not observed such behavior yet.

The simplest solution would be to replace xargs with parallel. @dreamyguy: Would that be a suitable solution for you? However, parallel is not as widespread as xargs (yet).

An alternative solution would be to write the output and take a flock directly inside output-intermediate-gitlog.sh.
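Both alternatives are easy to sketch (hedged: this is not the PR's code; `$repos_dir` is a hypothetical stand-in, the git log format flags are elided, and flock(1) ships with util-linux on Linux but not on stock macOS):

```bash
# Option 1: GNU parallel buffers each job's output and prints it as
# a unit, so lines from different repos cannot interleave.
find "$repos_dir" -mindepth 1 -maxdepth 1 -type d |
  parallel ./output-intermediate-gitlog.sh {} >> gitlogg.tmp

# Option 2: keep xargs, but take an exclusive lock around the final
# append so only whole, already-buffered outputs hit the file.
buffer="$(git -C "$repo" log --no-merges)"   # real format flags elided
(
  flock -x 200                 # exclusive lock on gitlogg.tmp via fd 200
  printf '%s\n' "$buffer" >&200
) 200>>gitlogg.tmp
```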

dreamyguy commented 7 years ago

I have now tested this with many huge repos, and I'm thinking of running git log without --no-merges to see whether the total commit count matches the number of lines in gitlogg.tmp.

The numbers I've got so far with the --no-merges option seem realistic.
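A quick way to run that comparison (a sketch, assuming one repo per gitlogg.tmp run and one line per commit; rev-list counts the same commits that git log walks):

```bash
# Compare the repo's own commit count against the lines gitlogg emitted.
# Drop --no-merges from both sides to compare totals including merges.
expected=$(git -C "$repo" rev-list --count --no-merges HEAD)
actual=$(wc -l < gitlogg.tmp)
if [ "$expected" -ne "$actual" ]; then
  echo "mismatch: $expected commits vs $actual lines" >&2
fi
```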

We're finally past the buffer limitation in V8, which was solved by parsing the JSON through a node stream. I have successfully generated a JSON file for the linux & git repos combined (634,959 & 45,091 commits, respectively). The JSON file came out at 904.8 MB! 😜

This parallel processing of repos is so handy, I can't imagine not having it. I'm more for putting the xargs question on ice and revisiting it when/if issues arise...

dreamyguy commented 7 years ago

This really is too bad, but xargs seems to be too unreliable. In all my tests involving two or more huge repositories, I got corrupt gitlogg.tmp files, which led to corrupt JSON parsing. I made a script that checks the integrity of the data as it is streamed into the JSON file, making it stop at undefined values. With that in place, I couldn't get the parsing to succeed.

Having a closer look at the output in gitlogg.tmp, I see that two (or more) processes clash with each other while writing to the file: at some point one process takes over from another, mixing the beginning of one line with the end of another in the same line, and thereby "breaking" the data.

It doesn't happen with a single repository; running just one works every time.
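For anyone wanting a rough shell approximation of that integrity check (a sketch only; the real check runs inside the node stream, and both FIELD_SEP and EXPECTED_FIELDS below are hypothetical stand-ins for the intermediate format's actual delimiter and field count):

```bash
# Flag lines whose field count deviates from the expected value;
# such lines are candidates for cross-process corruption.
FIELD_SEP='|'        # hypothetical; substitute gitlogg's real delimiter
EXPECTED_FIELDS=20   # hypothetical; adjust to the intermediate format
awk -F"$FIELD_SEP" -v n="$EXPECTED_FIELDS" \
  'NF != n { printf "line %d has %d fields\n", NR, NF }' gitlogg.tmp
```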

I tested it with these repos ([commit count]):

[1,000,000] https://github.com/cirosantilli/test-many-commits-1m
[  634,959] https://github.com/torvalds/linux
[  424,810] https://github.com/CyanogenMod/android_kernel_yu_msm8916
[  399,784] https://github.com/LibreOffice/core
[  106,402] https://github.com/odoo/odoo
[   96,062] https://github.com/NixOS/nixpkgs
[   70,822] https://github.com/Homebrew/homebrew-core
[   60,224] https://github.com/rails/rails
[   45,091] https://github.com/git/git
[   23,493] https://github.com/django/django