eclipse-jgit / jgit

JGit, the Java implementation of git
https://www.eclipse.org/jgit/

Rework delta handling code in order to support large repositories #81

Open schrepfler opened 1 month ago

schrepfler commented 1 month ago

Description

Currently, libraries that use JGit, such as EGit, fail with a TooLargeObjectInPackException when the repository being accessed contains large files. Arguably this is not how source control should be used, but these things do happen.

Caused by: org.eclipse.jgit.errors.TooLargeObjectInPackException: Object too large (2,887,318,710 bytes), rejecting the pack. Max object size limit is 2,147,483,639 bytes.

I believe the default limit should mimic whatever limit C git has; if that limit is higher, the default should be raised to match it, and ultimately it should be possible to disable the check entirely.

As mentioned here, the delta handling code requires the target to fit into a single Java byte array. Perhaps an alternative implementation or code path could be found in order to support bigger repositories.

Motivation

Repositories with large files are unfortunately a fact of life; since hosted Git LFS solutions come at a premium, many people opt to host large files directly in git.

Alternatives considered

No response

Additional context

No response

tomaswolf commented 3 weeks ago

This is not trivial. The basic problem is that a delta is composed of COPY and INSERT instructions, and COPY instructions may copy data from the base out of order. See e.g. the comment at https://github.com/eclipse-jgit/jgit/blob/299a7348eb318a0199226c1e633cc46c659d76d3/org.eclipse.jgit/src/org/eclipse/jgit/util/io/BinaryDeltaInputStream.java#L21 So one needs efficient random access to the whole base. A COPY instruction has the format "COPY offset length" and means "copy length bytes from the base, starting at offset, to the output". The offset is a uint32, so limited to 4 GB, while the length is in the range [1 .. 2^24-1].
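For illustration, a minimal sketch of decoding one such COPY instruction, following the pack delta encoding described in Git's pack format documentation (opcode high bit set; bits 0-3 select the offset bytes, bits 4-6 the length bytes). The class and method names here are hypothetical, not JGit API:

```java
public final class DeltaCopyParser {

    /**
     * Decodes a single COPY instruction starting at index {@code p} of the
     * delta buffer. Returns {offset, length, nextIndex}.
     */
    static long[] parseCopy(byte[] delta, int p) {
        int opcode = delta[p++] & 0xFF;
        if ((opcode & 0x80) == 0)
            throw new IllegalArgumentException("not a COPY instruction");
        // Bits 0-3 select which of four little-endian offset bytes follow,
        // so the base offset is an unsigned 32-bit value (max ~4 GB).
        long offset = 0;
        if ((opcode & 0x01) != 0) offset |= (long) (delta[p++] & 0xFF);
        if ((opcode & 0x02) != 0) offset |= (long) (delta[p++] & 0xFF) << 8;
        if ((opcode & 0x04) != 0) offset |= (long) (delta[p++] & 0xFF) << 16;
        if ((opcode & 0x08) != 0) offset |= (long) (delta[p++] & 0xFF) << 24;
        // Bits 4-6 select up to three length bytes, so length < 2^24.
        int length = 0;
        if ((opcode & 0x10) != 0) length |= delta[p++] & 0xFF;
        if ((opcode & 0x20) != 0) length |= (delta[p++] & 0xFF) << 8;
        if ((opcode & 0x40) != 0) length |= (delta[p++] & 0xFF) << 16;
        if (length == 0) length = 0x10000; // encoded 0 means 64 KiB
        return new long[] { offset, length, p };
    }

    public static void main(String[] args) {
        // Opcode 0x91: COPY with one offset byte (10) and one length byte (32).
        long[] r = parseCopy(new byte[] { (byte) 0x91, 0x0A, 0x20 }, 0);
        System.out.println("offset=" + r[0] + " length=" + r[1]);
    }
}
```

Since the offset can point anywhere in the first 4 GB and instructions arrive in arbitrary order, streaming the base sequentially is not enough.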

There was an attempt to stream the base, but it turned out to be too slow. See commit 62697c8d and the mail referenced in that commit comment.

Also see the comments on Gerrit change 190382.

For applying binary patches, C git has a limit of 1024 * 1024 * 1023 bytes, a little less than 1 GB. See https://github.com/git/git/blob/b9849e4f7631d80f146d159bf7b60263b3205632/apply.c#L414 .

For delta compression in pack files, I see no such limit on the total length. There is a limit of just 64 kB on the length of a single COPY instruction, though (for pack v2): https://github.com/git/git/blob/b9849e4f7631d80f146d159bf7b60263b3205632/diff-delta.c#L432 .
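The 64 kB cap means a long match against the base is simply emitted as a run of consecutive COPY instructions. A sketch of that splitting, under the assumption of the 0x10000-byte cap referenced above (hypothetical helper, not git or JGit code):

```java
import java.util.ArrayList;
import java.util.List;

final class CopySplitter {
    static final int MAX_COPY = 0x10000; // 64 KiB cap per COPY instruction

    /**
     * Splits one long match (offset, length) into the (offset, length)
     * pairs of the COPY instructions that would cover it.
     */
    static List<long[]> split(long offset, long length) {
        List<long[]> ops = new ArrayList<>();
        while (length > 0) {
            long n = Math.min(length, MAX_COPY);
            ops.add(new long[] { offset, n });
            offset += n;
            length -= n;
        }
        return ops;
    }
}
```

So the per-instruction cap does not limit object size; it only affects how many instructions a large delta contains.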

Given that the offset in a COPY instruction is limited to 4 GB, one actually "only" needs fast random access to the first 4 GB of a base. Perhaps just using multiple arrays (as mentioned in Gerrit change 190382) to cover those first 4 GB might be a way. Of course, it might need 4 GB (plus some more) of JVM heap...
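A sketch of that "multiple arrays" idea: the base is split into fixed-size segments so that positions beyond Integer.MAX_VALUE remain addressable while each individual array stays within the JVM's per-array limit. The names and the segment size are assumptions for illustration, not anything from the Gerrit change:

```java
final class SegmentedBase {
    // 1 MiB segments keep this sketch cheap to demo; a real
    // implementation would likely use much larger segments.
    private static final int SEGMENT_SHIFT = 20;
    private static final int SEGMENT_SIZE = 1 << SEGMENT_SHIFT;
    private static final long SEGMENT_MASK = SEGMENT_SIZE - 1;

    private final byte[][] segments;
    private final long length;

    SegmentedBase(long length) {
        this.length = length;
        int n = (int) ((length + SEGMENT_SIZE - 1) >>> SEGMENT_SHIFT);
        segments = new byte[n][];
        for (int i = 0; i < n; i++) {
            long remaining = length - ((long) i << SEGMENT_SHIFT);
            segments[i] = new byte[(int) Math.min(SEGMENT_SIZE, remaining)];
        }
    }

    long length() { return length; }

    byte get(long pos) {
        return segments[(int) (pos >>> SEGMENT_SHIFT)][(int) (pos & SEGMENT_MASK)];
    }

    void put(long pos, byte b) {
        segments[(int) (pos >>> SEGMENT_SHIFT)][(int) (pos & SEGMENT_MASK)] = b;
    }

    /** Copies len bytes starting at pos into dst, spanning segments if needed. */
    void copyTo(long pos, byte[] dst, int dstOff, int len) {
        while (len > 0) {
            byte[] seg = segments[(int) (pos >>> SEGMENT_SHIFT)];
            int off = (int) (pos & SEGMENT_MASK);
            int n = Math.min(len, seg.length - off);
            System.arraycopy(seg, off, dst, dstOff, n);
            pos += n;
            dstOff += n;
            len -= n;
        }
    }
}
```

Random access stays O(1) (one shift, one mask, two array lookups), which is what the out-of-order COPY instructions need; the cost is, as noted, holding the whole base on the heap.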

Another idea from that Gerrit change was to apply the 2 GB limit only to deltas. But that might give strange effects. (A blob could be handled initially if it is not delta-compressed, but not after repacking, when it might have become delta-compressed?)