Kentzo / git-archive-all

A python script wrapper for git-archive that archives a git superproject and its submodules, if it has any. Takes into account .gitattributes
MIT License
372 stars 81 forks

Optimize is_file_excluded with pygit2 if available #51

Closed: legoktm closed this 5 years ago

legoktm commented 6 years ago

For repositories with many files, is_file_excluded() is the biggest bottleneck since it has to be called once per file. git-check-attr itself is pretty fast, so most of the time is spent spawning a git process for each call.

We can use the pygit2 library (a wrapper around libgit2) if it's available for a much faster check. In my testing of MediaWiki tarball generation (a rather large case), is_file_excluded went from 117 seconds of wall clock time to 1.5 seconds!

Since pygit2 can be a bit tricky to install as you need to have a matching libgit2 version, only use it if it's already installed and fall back to the current behavior of shelling out if not.
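A minimal sketch of this "use it if it's there" pattern, not the code from this PR. The helper names `excluded_via_subprocess` and `make_excluded_check` and the injectable `run` parameter are hypothetical. It assumes the installed pygit2 provides `Repository.get_attr` and that "excluded" means the `export-ignore` attribute is set.

```python
import subprocess

try:
    import pygit2  # optional: only importable with a matching libgit2
except ImportError:
    pygit2 = None


def excluded_via_subprocess(repo_dir, path, run=subprocess.check_output):
    """Slow path: spawn one `git check-attr` process per file."""
    out = run(
        ["git", "check-attr", "-z", "export-ignore", "--", path],
        cwd=repo_dir,
    )
    # With -z the output is: <path> NUL <attribute> NUL <info> NUL
    fields = out.split(b"\0")
    return len(fields) >= 3 and fields[2] == b"set"


def make_excluded_check(repo_dir, run=subprocess.check_output):
    """Return a callable path -> bool, preferring pygit2 when available."""
    if pygit2 is not None:
        # Assumption: get_attr returns True when the attribute is set.
        repo = pygit2.Repository(repo_dir)
        return lambda path: repo.get_attr(path, "export-ignore") is True
    # Fallback: the existing behavior of shelling out to git.
    return lambda path: excluded_via_subprocess(repo_dir, path, run)
```

The repository is opened once and reused for every file, so the per-file cost on the fast path is a library call rather than a process spawn.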

There are some other calls to git that could also use pygit2. However, in my profiling none of them show up as hotspots, and the cost of shelling out is negligible compared to the time the command itself takes.


I hope it's OK to optionally depend upon an external library like this. It wasn't very straightforward for me to install (I had to manually install a slightly older version, since Fedora is not using the very latest libgit2), so I didn't think it would be that great to have a hard dependency on it.

legoktm commented 6 years ago

The test failure (https://travis-ci.org/Kentzo/git-archive-all/jobs/431358381) looks like an issue with Travis CI itself.

Kentzo commented 6 years ago

Alternatively, git-archive-all could aggregate all the files and then check them all at once.

Perhaps it should do both.
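A rough sketch of the "check them all at once" idea, assuming it means a single `git check-attr --stdin -z` call fed with every path. This is not necessarily what the check-attr branch does. The function names and the injectable `run` parameter are hypothetical.

```python
import os
import subprocess


def parse_check_attr_z(output):
    """Parse `git check-attr -z` output into {path: is_export_ignored}."""
    # Records are <path> NUL <attribute> NUL <info> NUL, repeated.
    fields = output.split(b"\0")
    records = zip(fields[0::3], fields[1::3], fields[2::3])
    return {
        os.fsdecode(path): info == b"set"
        for path, attr, info in records
        if attr == b"export-ignore"
    }


def excluded_files(repo_dir, paths, run=subprocess.run):
    """Ask git about every path with a single process spawn."""
    stdin = b"".join(os.fsencode(p) + b"\0" for p in paths)
    proc = run(
        ["git", "check-attr", "--stdin", "-z", "export-ignore"],
        input=stdin,
        stdout=subprocess.PIPE,
        cwd=repo_dir,
        check=True,
    )
    return parse_check_attr_z(proc.stdout)
```

Using `-z` on both input and output keeps paths containing spaces or newlines intact. The process-spawn cost is paid once per repository instead of once per file.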

Kentzo commented 6 years ago

@legoktm Please try the version from the check-attr branch.