Parse the 'git log' of one or several 'git' repositories into a sanitised and distributable 'JSON' file.
git log is a wonderful tool. However, its output can be not only surprisingly inconsistent, but also long, difficult to scan and to distribute. Gitlogg sanitises the git log and outputs it to JSON, a format that can easily be consumed by other applications. As long as the repositories being scanned are kept up to date, Gitlogg will return fresh data every time it runs.
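As a rough illustration of what "consumed by other applications" can mean, here is a minimal sketch that totals commits and impact per repository. It assumes the output is an array of commit objects carrying the repository and impact fields shown in the sample output further down; the helper name summariseByRepository and the inline sample data are made up for this example.

```javascript
// Hypothetical helper (not part of Gitlogg): count commits and sum impact per repository.
// Assumes `commits` is the array stored in Gitlogg's JSON output.
function summariseByRepository(commits) {
  const summary = {};
  for (const commit of commits) {
    const entry = summary[commit.repository] || { commits: 0, impact: 0 };
    entry.commits += 1;
    entry.impact += Number(commit.impact) || 0;
    summary[commit.repository] = entry;
  }
  return summary;
}

// Made-up sample data; in practice this would come from JSON.parse() of the generated file.
const sample = [
  { repository: 'repo1', impact: 10 },
  { repository: 'repo1', impact: 3 },
  { repository: 'repo2', impact: 5 },
];

console.log(summariseByRepository(sample));
```

For very large outputs, loading the whole file in one go may not be practical, so treat this as a starting point rather than a recipe.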
- git log can only be used on one repository at a time.
- git log can't be easily consumed by other applications in its original format.
- git log doesn't return impact, which is the cumulative change brought by a single commit. Very interesting graphs can be built with that data, as shown on sidhree.com.
- Fields that allow user input, like subject, need to be sanitised to be consumed.
- --stat or --shortstat are currently not available as placeholders under --pretty=format:<string>, and it is cumbersome to get commit logs to output neatly in single lines, with stats.

Gitlogg is not a very complex application, but I still made an effort to provide some feedback on what is happening under the hood. Below are some screenshots of dialogs one can expect to see while executing it:
Øh nøes! The path to the folder containing all repositories does not exist!
Øh nøes! The path to the folder containing all repositories exists, but is empty!
Success! JSON parsed, based on 9 different repositories with a total of 25,537 commits.
Note that I've included two huge repos (react & react-native, which have 7,813 & 10,065 commits respectively at the time of this writing) for the sake of demonstration. The resulting parsed JSON file has 715,040 lines. All that was done in less than 25 seconds.
I have successfully compiled 470 repositories at once (all repos under the organization I work for). Then I got these specs:
- gitlogg.tmp generated in 154s (~2.57mins)
- JSON output parsed in 2792ms
- JSON file size: 121.5MB
- Commits: 118,117
- JSON file, lines: 3,307,280
Gitlogg requires NodeJS and BabelJS.
- Install NodeJS (visit their page to find the right install for your system).
- Run npm run setup. That will:
  - Install BabelJS globally by running npm install babel-cli -g.
  - Run npm install.
  - (...) JSON will be at (only on Simple Mode).

JSON output
The output will look like this (first commit for Font Awesome):
[
{
"repository": "Font-Awesome",
"commit_nr": 1,
"commit_hash": "7ed221e28df1745a20009329033ac690ef000575",
"author_name": "Dave Gandy",
"author_email": "dave@davegandy.com",
"author_date": "Fri Feb 17 09:27:26 2012 -0500",
"author_date_relative": "4 years, 3 months ago",
"author_date_unix_timestamp": "1329488846",
"author_date_iso_8601": "2012-02-17 09:27:26 -0500",
"subject": "first commit",
"subject_sanitized": "first-commit",
"stats": " 1 file changed, 0 insertions(+), 0 deletions(-)",
"time_hour": 9,
"time_minutes": 27,
"time_seconds": 26,
"time_gmt": "-0500",
"date_day_week": "Fri",
"date_month_day": 17,
"date_month_name": "Feb",
"date_month_number": 2,
"date_year": "2012",
"date_iso_8601": "2012-02-17",
"files_changed": 1,
"insertions": 0,
"deletions": 0,
"impact": 0
},
{
(...)
},
{
(...)
}
]
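The time_* and date_* fields in the sample above can all be derived from author_date. The following is a minimal sketch of one way to do that, assuming the date always follows git's default layout as in the sample ("Fri Feb 17 09:27:26 2012 -0500"). The function name splitAuthorDate is hypothetical and this is not necessarily how gitlogg-parse-json.js does it.

```javascript
const MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

// Hypothetical helper: split a default-format git date into the derived fields.
function splitAuthorDate(authorDate) {
  // Expected layout: <weekday> <month> <day> <hh:mm:ss> <year> <utc offset>
  const [dayWeek, monthName, monthDay, time, year, gmt] = authorDate.trim().split(/\s+/);
  const [hour, minutes, seconds] = time.split(':').map(Number);
  const monthNumber = MONTHS.indexOf(monthName) + 1;
  const pad = (n) => String(n).padStart(2, '0');

  return {
    time_hour: hour,
    time_minutes: minutes,
    time_seconds: seconds,
    time_gmt: gmt,
    date_day_week: dayWeek,
    date_month_day: Number(monthDay),
    date_month_name: monthName,
    date_month_number: monthNumber,
    date_year: year, // kept as a string, as in the sample output
    date_iso_8601: `${year}-${pad(monthNumber)}-${pad(Number(monthDay))}`,
  };
}

console.log(splitAuthorDate('Fri Feb 17 09:27:26 2012 -0500'));
```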
Note that many git log fields were not printed here, but that's only because I've commented out some of them in the gitlogg-parse-json.js script. All the fields below are available. Fields marked with a * are either non-standard or not available as placeholders on --pretty=format:<string>:
* repository
* commit_nr
commit_hash
commit_hash_abbreviated
tree_hash
tree_hash_abbreviated
parent_hashes
parent_hashes_abbreviated
author_name
author_name_mailmap
author_email
author_email_mailmap
author_date
author_date_RFC2822
author_date_relative
author_date_unix_timestamp
author_date_iso_8601
author_date_iso_8601_strict
committer_name
committer_name_mailmap
committer_email
committer_email_mailmap
committer_date
committer_date_RFC2822
committer_date_relative
committer_date_unix_timestamp
committer_date_iso_8601
committer_date_iso_8601_strict
ref_names
ref_names_no_wrapping
encoding
subject
subject_sanitized
commit_notes
* stats
* time_hour
* time_minutes
* time_seconds
* time_gmt
* date_day_week
* date_month_day
* date_month_name
* date_month_number
* date_year
* date_iso_8601
* files_changed
* insertions
* deletions
* impact
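The last four starred fields can be read off the stats string. Below is a minimal sketch of that step, under two assumptions that are mine and not confirmed by this README: that stats follows the git --shortstat wording seen in the sample, and that impact is computed as insertions minus deletions. The function name parseStats is hypothetical.

```javascript
// Hypothetical helper: turn a --shortstat style summary into numeric fields.
function parseStats(stats) {
  // Return the first captured number for a pattern, or 0 when that part is absent
  // (git omits the insertions or deletions part when there are none).
  const pick = (pattern) => {
    const match = stats.match(pattern);
    return match ? Number(match[1]) : 0;
  };

  const files_changed = pick(/(\d+) files? changed/);
  const insertions = pick(/(\d+) insertions?\(\+\)/);
  const deletions = pick(/(\d+) deletions?\(-\)/);

  // ASSUMPTION: impact = insertions - deletions (not stated in this README).
  return { files_changed, insertions, deletions, impact: insertions - deletions };
}

console.log(parseStats(' 1 file changed, 0 insertions(+), 0 deletions(-)'));
```

With the sample commit above, all of insertions, deletions and impact come out as 0, which matches the printed output.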
JSON file
There are two modes and they are basically the same, except that the Simple Mode doesn't require configuration. The Advanced Mode requires one to set the absolute path to the directory containing all the repositories you'd like to parse to a single JSON file.
To simplify the generation process to a point that no configuration is required, follow this directory structure:
gitlogg/                    <== This repository's root
├── scripts/
│   ├── colors.sh
│   ├── gitlogg-generate-log.sh
│   ├── gitlogg-parse-json.js
│   └── gitlogg.sh
└── _repos/                 <== Copy/place/keep your repositories under the folder "_repos/"
    ├── repo1
    ├── repo2
    ├── repo3
    └── repo4
Copy all the repositories you wish to parse to JSON to the _repos/ folder, as shown above.
Provided that you are within the gitlogg folder (this repo's root), run:
$ npm run gitlogg
To generate the JSON file based on repositories in any other location, you'll have to define the path to the folder that contains all your repositories.
Open gitlogg-generate-log.sh with an editor of your choice and edit the yourpath variable:
# define the absolute path to the directory that contains all your repositories
yourpath=/absolute/system/path/to/directory/that/contains/all/your/repositories/
Tip: drag the folder that contains your repositories to a terminal window, and you'll get the absolute system path to that folder.
Provided that you are within the gitlogg folder (this repo's root), run:
$ npm run gitlogg
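If the run stops with one of the two path errors listed near the top (folder does not exist, or exists but is empty), the path can be checked by hand. The sketch below only illustrates those two checks in JavaScript. The real checks live in the shell scripts, and the function name checkReposPath and its return values are made up for this example.

```javascript
const fs = require('fs');

// Hypothetical helper: classify the folder that should contain all repositories.
// Returns 'missing', 'empty' or 'ok'.
function checkReposPath(reposPath) {
  if (!fs.existsSync(reposPath) || !fs.statSync(reposPath).isDirectory()) {
    return 'missing'; // the path does not exist (or is not a folder)
  }
  if (fs.readdirSync(reposPath).length === 0) {
    return 'empty'; // the folder exists, but there is nothing in it
  }
  return 'ok';
}

// Example: check the current working directory.
console.log(checkReposPath(process.cwd()));
```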
The parallel processing that was released in v0.1.8 had problems with xargs and was temporarily removed. The issue is being dealt with through pull-request #16.
JSON file
Two files will be generated when running npm run gitlogg: _tmp/gitlogg.tmp and _output/gitlogg.json.
gitlogg/                    <== This repository's root
├── scripts/
│   ├── colors.sh
│   ├── gitlogg-generate-log.sh
│   ├── gitlogg-parse-json.js
│   └── gitlogg.sh
├── _output/
│   └── gitlogg.json        <== The parsed 'JSON', what we're all after. It's parsed from 'gitlogg.tmp'
└── _tmp/
    └── gitlogg.tmp         <== The processed 'git log'
Two files are necessary because of the nature of the script, which loops through all subdirectories and outputs the git log for all valid git repositories. Once that loop is done, a valid JSON file (gitlogg.json) is generated out of gitlogg.tmp.
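The "loop through all subdirectories" step can be pictured with the sketch below. It is a JavaScript illustration only, since the actual loop lives in the shell scripts. Treating "has a .git entry" as the test for a valid repository is a simplifying assumption, and findGitRepos is a hypothetical name.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: list the direct subdirectories of `reposPath` that look like git repositories.
// ASSUMPTION: a folder counts as a repository when it contains a ".git" entry.
function findGitRepos(reposPath) {
  return fs.readdirSync(reposPath, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => path.join(reposPath, entry.name))
    .filter((dir) => fs.existsSync(path.join(dir, '.git')));
}

// Example: look for repositories directly under the current working directory.
console.log(findGitRepos(process.cwd()));
```

Folders without a .git entry are skipped, so stray files or plain folders next to the repositories do not get in the way.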
gitlogg.tmp is just a temporary file on which gitlogg.json is based. In case the parsing fails, gitlogg.tmp can come in handy for debugging.
I've created error messages with suggested solutions, to help you get past the most common issues.
However, git log's output can break while it's being processed. That's most likely caused by fields that allow user input, like commit messages. These fields may contain characters (like \r) that clash with those reserved for the generation of gitlogg.tmp, namely \n.
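To make the clash concrete: if each commit is meant to occupy a single line, any line break inside a user-provided field has to be neutralised first. Here is a minimal sketch of that idea; the function name flattenField and the choice of a space as the default replacement are assumptions for illustration, not what Gitlogg's scripts actually do.

```javascript
// Hypothetical helper: remove line breaks from a user-provided field so that
// the record it belongs to stays on one line.
function flattenField(value, replacement = ' ') {
  // Handle Windows (\r\n), old Mac (\r) and Unix (\n) line endings alike.
  return value.replace(/\r\n|\r|\n/g, replacement);
}

console.log(flattenField('fix bug\r\nmore details\nend'));
```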
Efforts have been made to mitigate errors by sanitizing characters that have caused errors before, but it might still happen in some edge cases. If it does happen, have a look at the generated gitlogg.tmp and see if the expected structure (which is obvious) breaks. Once you have identified the line, have a closer look at the commit and look for an unusual character.
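Spotting the broken line by eye gets tedious on large logs. A small script along these lines can help, under the assumption that every healthy line in the file splits into the same number of fields. The delimiter and the expected field count are parameters here because this README does not specify them, and findBrokenLines is a hypothetical name.

```javascript
// Hypothetical helper: report lines whose field count differs from the expected one.
// `delimiter` and `expectedFields` must be set to match your gitlogg.tmp; they are not defined in this README.
function findBrokenLines(text, delimiter, expectedFields) {
  const broken = [];
  text.split('\n').forEach((line, index) => {
    if (line === '') return; // ignore blank lines such as a trailing newline
    const count = line.split(delimiter).length;
    if (count !== expectedFields) {
      broken.push({ line: index + 1, fields: count });
    }
  });
  return broken;
}

// Made-up example with "|" as the delimiter and 3 fields per line.
console.log(findBrokenLines('a|b|c\na|b\nc\na|b|c\n', '|', 3));
```

The reported line numbers are 1-based, so they can be used directly in most editors.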
Post an issue with a link to a gist containing your broken gitlogg.tmp and I will try to reproduce the error.
Documentation is done either by:
- README.md files, like this one.

Some of the initial commits were done deliberately to show what one gets with short commands like $ git log. From that initial state, commits keep on introducing simplicity or complexity to the code, depending on the workflow. That in itself is a form of documentation. In other words, if you're really that interested in details, there are plenty to be had in the code itself and in its own progressive enhancement.
This project is by no means the smartest way to parse a git log to JSON, nor does it aim at becoming so. It is simply a learn-by-doing project in which I experiment with commands available on OSX's Terminal and whatever else I find along the way.
Gitlogg was built and tested on OSX. Though an effort has been made to make it cross-platform, there could be errors on other systems.
It's certainly not harmful to your repositories and it won't change any data in them. Having said that, it's served raw and 'as is'. You may get support, but don't expect it nor take it for granted.
There are no known issues at this point. The parallelization that was introduced in v0.1.8 had issues with xargs, so it was temporarily reverted until the problem has been dealt with through pull-request #16. v0.1.9 was released to revert those changes.
The javascript branch is a very fine piece of programming; you should definitely check it out. I haven't tested it extensively, but found a few issues, which are reported in the issue tracker.
The current version v0.2.1 is still quite stable after all these years, with no known issues. Try it! :sparkles:
- (...) ȝ instead of \0 when replacing \n during the extraction of git log. \0 is not as reliable as it seemed.
- (...) git context.
- ȝ (Yogh) is an old English character. If that gives problems, I'll try ƿ (Wynn), another abandoned English char.
- (...) JSON format.
- (...) JSON, in some scenarios quite dramatically
- (...) MongoDB, which is what is being used on gitlogg-api
- (...) \0 instead of ò when replacing \n during the extraction of git log.
- (...) git context.
- (...) xargs has been dealt with.
- (...) JSON through a read/write stream, so we get around the 268MB Node's buffer limitation.
- (...) git log for multiple repos, optionally passing number of processes as a CLI argument.
- (...) ISO-8859-1 characters not being properly encoded to UTF-8.
- (...) commit_nr, a commit count within each repo

Brought to you by Wallace Sidhrée.