denio7 / egit

Automatically exported from code.google.com/p/egit

JGit tree walk performance is low on large repositories #108

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I use the JGit tree walk (with a visitor implementation) to find the files
changed between two revisions, and in general it performs OK. However, when
the number of files in a repository becomes large (e.g. 30,000+ files), the
walk slows down roughly linearly with the file count. I compared it against an
external process call to the git command ("git diff-tree"). Our application
must parse a lot of change sets (~155,000, with 30,000 files in the repo).
With the JGit implementation it takes ~6 hours to get part of the way through,
while calling the git command reaches the same point in ~27 minutes.
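
For reference, an external "git diff-tree" call of the kind described above might look like the following Java sketch. The class and method names (DiffTreeSketch, changedFiles, parseNameStatus) are hypothetical, and the choice of the -r and --name-status options is an assumption about what such a caller would want; the thread does not show the actual invocation used.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: lists files changed between two revisions via the git CLI. */
final class DiffTreeSketch {

    /** Parses "STATUS<TAB>path" lines into a path -> status map; other lines are skipped. */
    static Map<String, String> parseNameStatus(String output) {
        Map<String, String> changes = new LinkedHashMap<>();
        for (String line : output.split("\n")) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue; // blank or unexpected line
            }
            changes.put(line.substring(tab + 1), line.substring(0, tab));
        }
        return changes;
    }

    /** Runs "git diff-tree -r --name-status rev1 rev2" in repoDir (requires git on the PATH). */
    static Map<String, String> changedFiles(File repoDir, String rev1, String rev2)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("git", "diff-tree", "-r", "--name-status", rev1, rev2)
                .directory(repoDir)
                .redirectErrorStream(true) // merge stderr so the pipe cannot fill up unread
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        if (p.waitFor() != 0) {
            throw new IOException("git diff-tree failed: " + out);
        }
        return parseNameStatus(out.toString());
    }
}
```

Note that git quotes paths containing unusual characters in this output format; the sketch does not undo that quoting.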

1. Is there any specific way to speed up the tree walk? I use the 3-tree
version (index, rev1, rev2), but it might be better if there were a version
that compares only the 2 revision trees.

Platform: JGit 0.5.0 on Ubuntu Linux (Java development under Eclipse).

Original issue reported on code.google.com by berkesa...@gmail.com on 3 Aug 2009 at 8:52

GoogleCodeExporter commented 8 years ago
I've put a lot of time into trying to optimize TreeWalk, but yeah, it doesn't
perform as well as C git does.  It's hard to get that level of performance in
Java.

Two things come to mind:

Are you pruning unchanged subdirectories?  A TreeWalk which contains a
DirCacheIterator and two canonical trees (rev1, rev2) should be able to take
advantage of TreeFilter.ANY_DIFF to skip subtrees which are identical between
all three.  Iterating into identical subtrees would significantly slow down
the walk.
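
To illustrate why pruning matters, here is a minimal Java sketch of a two-tree diff that skips any subtree whose id is the same on both sides. It is a toy in-memory model, not JGit's TreeWalk or TreeFilter.ANY_DIFF: the class PruningDiffSketch, the Node type, and the string ids standing in for SHA-1s are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model only: shows id-based subtree pruning, not JGit's implementation. */
final class PruningDiffSketch {

    /** An entry is a blob when children == null, otherwise a subtree. */
    static final class Node {
        final String id;                      // stand-in for a SHA-1
        final TreeMap<String, Node> children; // null for blobs

        Node(String id, TreeMap<String, Node> children) {
            this.id = id;
            this.children = children;
        }

        boolean isTree() {
            return children != null;
        }
    }

    static Node blob(String content) {
        return new Node("b:" + content, null);
    }

    /** Builds a subtree from (name, node) pairs; its id is derived from its entries. */
    static Node tree(Object... nameThenNode) {
        TreeMap<String, Node> kids = new TreeMap<>();
        for (int i = 0; i < nameThenNode.length; i += 2) {
            kids.put((String) nameThenNode[i], (Node) nameThenNode[i + 1]);
        }
        StringBuilder id = new StringBuilder("t:");
        for (Map.Entry<String, Node> e : kids.entrySet()) {
            id.append(e.getKey()).append('=').append(e.getValue().id).append(';');
        }
        return new Node(id.toString(), kids);
    }

    int visited; // number of entries compared, to make the saving visible

    /** Returns the paths of blobs that differ between the two root trees. */
    List<String> diff(Node a, Node b) {
        List<String> changed = new ArrayList<>();
        walk("", a, b, changed);
        return changed;
    }

    // a and b are each either a subtree or null (absent on that side).
    private void walk(String prefix, Node a, Node b, List<String> changed) {
        TreeSet<String> names = new TreeSet<>();
        if (a != null) names.addAll(a.children.keySet());
        if (b != null) names.addAll(b.children.keySet());
        for (String name : names) {
            Node x = (a == null) ? null : a.children.get(name);
            Node y = (b == null) ? null : b.children.get(name);
            visited++;
            if (x != null && y != null && x.id.equals(y.id)) {
                continue; // identical ids: skip without descending
            }
            String path = prefix + name;
            boolean xTree = x != null && x.isTree();
            boolean yTree = y != null && y.isTree();
            if ((x != null && !xTree) || (y != null && !yTree)) {
                changed.add(path); // a blob was added, removed, or modified
            }
            if (xTree || yTree) {
                walk(path + "/", xTree ? x : null, yTree ? y : null, changed);
            }
        }
    }
}
```

With two top-level directories of two files each and a single modified file, this walk compares 4 entries instead of all 6, because the unchanged directory is never entered. On a repository with tens of thousands of files and small change sets, that difference is what keeps the cost proportional to the change rather than to the repository size.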

Does the DirCache you are loading from contain the 'TREE' extension?  JGit
doesn't always produce the 'TREE' extension when it writes the .git/index, but
DirCacheIterator can really benefit from having it.  Without the 'TREE'
extension present, DirCacheIterator will always report a subtree as having
SHA-1 0{40}, which means the ANY_DIFF filter will always look into the subtree.
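
The effect of the all-zero id can be sketched in a few lines of Java. This models only the comparison described above, under the assumption that a subtree may be skipped when all three trees report the same id; ZeroIdSketch and canSkipSubtree are hypothetical names, not JGit code.

```java
import java.util.Arrays;

/** Toy model of the three-way "can this subtree be skipped?" check. */
final class ZeroIdSketch {

    /** 40 zero characters: the placeholder reported when no real subtree id is known. */
    static final String ZERO_ID = zeros(40);

    private static String zeros(int n) {
        char[] c = new char[n];
        Arrays.fill(c, '0');
        return new String(c);
    }

    /** A subtree can be skipped only when index, rev1, and rev2 all report the same id. */
    static boolean canSkipSubtree(String indexId, String rev1Id, String rev2Id) {
        return indexId.equals(rev1Id) && rev1Id.equals(rev2Id);
    }
}
```

For any real subtree id X, canSkipSubtree(X, X, X) is true, but canSkipSubtree(ZERO_ID, X, X) is false, so the walk descends into the subtree even though rev1 and rev2 agree on it.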

Original comment by sop+code@google.com on 8 Sep 2009 at 2:48

GoogleCodeExporter commented 8 years ago
I just assume we are a lot faster now. Performance is always high on Shawn's
agenda.

Original comment by robin.ro...@gmail.com on 12 Jul 2010 at 5:25

GoogleCodeExporter commented 8 years ago
Thank you, Robin, for your comment. Where can I download the latest JGit jar?
On eclipse.org I found the JGit project but no download link.

Regards,

Zsolt

Original comment by zkopp...@gmail.com on 13 Jul 2010 at 9:12

GoogleCodeExporter commented 8 years ago
On http://www.eclipse.org/jgit/download/ there is a link
to org.eclipse.jgit.jar (Raw API library), which will take you
to
http://download.eclipse.org/jgit/maven/org/eclipse/jgit/org.eclipse.jgit/0.8.4/org.eclipse.jgit-0.8.4.jar

But, that said, I doubt this is magically fixed.  We haven't
done anything here that would make a significant change.
You never answered my comment from Sep 08 2009, so
I just ignored this.

Original comment by spea...@spearce.org on 13 Jul 2010 at 9:25