eclipse-jgit / jgit

JGit, the Java implementation of git
https://www.eclipse.org/jgit/

Repack Command not found #8

Open sepatel opened 8 months ago

sepatel commented 8 months ago

Description

I have a need to repack a git repo after it has been cloned down. This is because some of the pack files are over 1 GB in size, which causes our systems to run out of memory. Locally I can achieve this with git repack --max-pack-size=100m -Ad, but our production systems do not have git installed, so dropping down to a shell environment is not an option. The jgit library is the only means I have to make this adjustment (unless there is an option with clone that I didn't find).

Thus I really need either the ability to repack the repo, or some kind of improved memory handling so that JGit does not run out of memory in production when walking the tree of a very large pack.

Motivation

It is a core bit of git functionality and would help prevent out of memory issues when working with poorly packed git repositories.

Alternatives considered

Locally I can use git repack --max-pack-size=100m -Ad to work around the memory problems, but running that by hand isn't an option in production, as git is not installed there. Only jgit can work with the repositories.

Additional context

Perhaps a way to clone it with a max pack size? Unsure how that would work as I didn't see a way to do that via the git cli.

msohn commented 6 months ago

JGit accesses objects in pack files via the WindowCache, loading the raw data in pages. It doesn't load complete pack files into memory, though it does fully cache pack indexes in memory.

Page size and cache size can be configured using the options core.packedGitWindowSize and core.packedGitLimit. See [1].
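
For example, a minimal sketch of setting these limits programmatically through JGit's WindowCacheConfig; the sizes below are illustrative placeholders, not recommendations:

```java
import org.eclipse.jgit.storage.file.WindowCacheConfig;

public class ConfigureWindowCache {
    public static void main(String[] args) {
        WindowCacheConfig cfg = new WindowCacheConfig();
        // core.packedGitLimit: total bytes of pack data kept in the window cache
        cfg.setPackedGitLimit(64 * 1024 * 1024);
        // core.packedGitWindowSize: size of each page read from a pack file
        cfg.setPackedGitWindowSize(8 * 1024);
        // Install the settings into the JVM-wide window cache; repositories
        // opened afterwards will use them.
        cfg.install();
    }
}
```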

Hence I don't understand how running out of memory and repacking pack files are directly related. At the moment JGit doesn't expose an API to run only the repack part of a full gc.
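
The closest thing available through the porcelain API is a full gc; a sketch, not a drop-in replacement for git repack --max-pack-size:

```java
import java.io.File;
import java.util.Properties;

import org.eclipse.jgit.api.Git;

public class RunGc {
    public static void main(String[] args) throws Exception {
        try (Git git = Git.open(new File("/path/to/repo"))) {
            // Runs JGit's full gc (prune, pack refs, repack); there is no
            // porcelain command that runs only the repack step.
            Properties stats = git.gc().call();
            System.out.println(stats);
        }
    }
}
```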

[1] https://github.com/eclipse-jgit/jgit/blob/master/Documentation/config-options.md

sepatel commented 6 months ago

Hence I don't understand how running out of memory and repacking pack files is directly related. At the moment JGit doesn't expose an API to only run the repack part of a full gc.

I can't really explain why, but I do know that for some of the repos with a single pack file of 800 MB or more, reading a file from the repo (not always, but usually with the older commit ids) leads to an out of memory error. If I repack by hand so that the largest pack file is 100 MB, the same commands run fine. It was very hard to track down: the heap usage of the system (2 GB RAM machines) jumps from around 120 MB to an OOM error within a second or two of some of the known file reads, and the stack traces (I don't have any to reference at the moment, as this was some time ago) pointed me towards the pack size as the root issue.

I was never able to tell how much memory was used once I shrank the pack size down to 100 MB, because I guess the JVM recovered, and I don't have a way to measure how much heap space is used in the middle of the jgit call, only before and after the read is done.

I'll take a look and see if the configuration options you've mentioned help; maybe they were really the problem and it only looked related to the pack size for other reasons.

Edit: @msohn a dumb question, but could core.packedIndexGitUseStrongRefs maybe be the culprit? It defaults to true, appears to relate to the pack index, and the docs say that references are only dropped when heap space is low if it is set to false, which is not the default. Could the caching of the pack indexes be the thing using excessive amounts of RAM?

msohn commented 6 months ago

If core.packedIndexGitUseStrongRefs=true, the jgit pack index cache holds pack index data with strong references. As a consequence, the JVM cannot free the memory used for cached pack indexes when it runs short on free heap space. You can try setting this option to false to use soft references instead, which allows the JVM to reclaim that memory. This may reduce memory consumption, but it will slow down access to pack index content, since indexes have to be reloaded from the filesystem if the JVM dropped the softly referenced objects from the heap.
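
If you want to try it from code, one way is to write the setting into the repository's config via JGit's StoredConfig; this is a sketch, assuming the option is picked up from that config (I haven't verified whether it is only read from the user/system config):

```java
import java.io.File;

import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.lib.StoredConfig;

public class UseSoftIndexRefs {
    public static void main(String[] args) throws Exception {
        try (Git git = Git.open(new File("/path/to/repo"))) {
            StoredConfig cfg = git.getRepository().getConfig();
            // Equivalent of: git config core.packedIndexGitUseStrongRefs false
            cfg.setBoolean("core", null, "packedIndexGitUseStrongRefs", false);
            cfg.save();
        }
    }
}
```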

If you need more details, you probably need to create heap dumps and analyze them, e.g. using the Eclipse Memory Analyzer.