basepi / libgit2

The Library
http://libgit2.github.com

Implement core diff function #3

Open hausdorf opened 13 years ago

hausdorf commented 13 years ago

Tentatively leaning towards Patience. We will need to look at both that and the classical diff algorithm.

This step will mostly consist of writing the core diff function, possibly in both algorithms, and profiling each. The next step will be integrating it with the protocols (e.g., the internal diff protocol used for merging, etc.).

The goals are:

trane commented 13 years ago

http://kerneltrap.org/mailarchive/git/2009/1/7/4632964 <-- Read the comments to this.

trane commented 13 years ago

Basically, it comes down to this: Patience might more often cause merge conflicts than classical diff. I think we should remove that possibility of merge conflicts and just implement classical diff.

hausdorf commented 13 years ago

Andrew and I will be meeting sometime on Friday to hash this out BTW.

trane commented 13 years ago

http://www.xmailserver.org/xdiff-lib.html Source for this is available on that website. Also, http://code.google.com/p/google-diff-match-patch/

trane commented 13 years ago

I think I am beginning to lean towards Patience, for many reasons... In that second link (the one where you somehow thought it was awesome to put three links in the word Patience), Bram responds to those comments from the mailing list I posted with some compelling arguments. I don't think we would run into merge-conflict issues any more often with Patience than with classic LCS, in which case Patience seems like a more elegant, less complex solution.

Also, looking at the algorithm, it might be possible to thread it at one point: the step where the lines that occur only once are pulled out of each file and compared, before the rest of the algorithm runs. However, threading a diffing algorithm seems silly, as it could lead to all sorts of false positives from one thread to the other and would probably be slower in most cases. The idea of threading on a file-by-file basis (where one thread does one file, another does another file, etc.) does not work in the git world. Files don't exist according to git; everything is just content with deltas, which is what makes following lines that move from one file to another so "easy".

Basically, if we are going to find a way to make it faster, it will be by decreasing the clock cycles it takes to do any operation. Git's source already has a lot of special optimizations that we could look at, including that awesome Rabin fingerprint lookup table.
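For reference, here is a minimal sketch of the step described above, i.e. pulling out the lines that occur exactly once in each file and pairing them up. The type and function names are made up, and the counting is deliberately naive and quadratic; a real implementation would hash lines into a table, the way xdiff does.

    /*
     * Sketch of the Patience pre-pass: find lines whose hash occurs
     * exactly once in file A and exactly once in file B, and pair them.
     * Lines are represented here by precomputed hashes.
     */
    typedef struct {
        long a_idx;    /* line number in file A */
        long b_idx;    /* line number in file B */
    } unique_match;

    static long count_occurrences(const unsigned long *hashes, long n,
                                  unsigned long h)
    {
        long i, count = 0;

        for (i = 0; i < n; i++)
            if (hashes[i] == h)
                count++;
        return count;
    }

    /* Writes at most na matches into out; returns how many were found. */
    long find_unique_common_lines(const unsigned long *a, long na,
                                  const unsigned long *b, long nb,
                                  unique_match *out)
    {
        long i, j, nmatches = 0;

        for (i = 0; i < na; i++) {
            if (count_occurrences(a, na, a[i]) != 1)
                continue;
            if (count_occurrences(b, nb, a[i]) != 1)
                continue;
            for (j = 0; j < nb; j++)
                if (b[j] == a[i])
                    break;
            out[nmatches].a_idx = i;
            out[nmatches].b_idx = j;
            nmatches++;
        }
        return nmatches;
    }

Patience then takes the longest increasing subsequence of the b_idx values as its anchor lines and recurses on the gaps between them.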

hausdorf commented 13 years ago

I don't think we'll be able to parallelize it effectively. On the plus side, git is pretty effective at diffing, and they get their code almost verbatim from libxdiff, so maybe we should do the same. Bazaar focuses on usability rather than speed, so git's diffing may actually be better for our purposes.

trane commented 13 years ago

I think I might have figured out a way to speed this up, but it will require that git doesn't already do this: memoization in the LCS. Here's my thought...

Most instances of LCS are going to be run on inputs that don't actually differ, which is expensive when we must fill in every bucket of the two-dimensional table of possible outcomes. If we memoize the result of each comparison, we can cut the NxM outcomes down to only the diagonal tuples of that cross product. The problem lies in instances where the strings are very different, in which case memoization is more expensive than plain recursion.

I think I have a good solution to this, however... Consider the common pairing of selection/insertion sort with quicksort: if n < 15..17, use selection/insertion sort, otherwise use quicksort. My idea is that we only use memoization when two things happen:

1) the last T characters of the two lines are equal (this might require some profiling to determine T, as gains could come from using as little as 1-2 characters, or maybe more)
2) the lines are the same length and longer than Q characters (again, Q would be obtained by profiling)

If either of those two conditions is not met, we fall back to classic non-memoized LCS. There might be better ways to determine when to memoize, but these naive conditions might yield good results.
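For concreteness, here is a rough sketch of what the memoized LCS recurrence over line hashes could look like. All names here are mine, and this is not claiming to be what git or xdiff actually do:

    #include <stdlib.h>

    /*
     * Memoized LCS length over two arrays of line hashes.
     * memo[i * (nb + 1) + j] caches the answer for suffixes a[i..] and
     * b[j..]; -1 means "not computed yet".
     */
    static long lcs_memo(const unsigned long *a, long na,
                         const unsigned long *b, long nb,
                         long i, long j, long *memo)
    {
        long *slot = &memo[i * (nb + 1) + j];
        long best;

        if (i == na || j == nb)
            return 0;
        if (*slot >= 0)
            return *slot;

        if (a[i] == b[j]) {
            best = 1 + lcs_memo(a, na, b, nb, i + 1, j + 1, memo);
        } else {
            long skip_a = lcs_memo(a, na, b, nb, i + 1, j, memo);
            long skip_b = lcs_memo(a, na, b, nb, i, j + 1, memo);
            best = skip_a > skip_b ? skip_a : skip_b;
        }
        return *slot = best;
    }

    long lcs_length(const unsigned long *a, long na,
                    const unsigned long *b, long nb)
    {
        long cells = (na + 1) * (nb + 1);
        long *memo = malloc(cells * sizeof(*memo));
        long result, k;

        if (!memo)
            return -1;
        for (k = 0; k < cells; k++)
            memo[k] = -1;
        result = lcs_memo(a, na, b, nb, 0, 0, memo);
        free(memo);
        return result;
    }

For identical or nearly identical inputs the recursion only ever walks the diagonal, which is the win described above; the T/Q thresholds would simply gate whether this path is taken at all. For real file sizes the recursion would also need to be converted into an iterative fill to avoid blowing the stack.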

trane commented 13 years ago

OK. Reading the code for xdiff, it does something way cooler, imo (which is what is proposed in the Eugene Myers paper): it divides the problem by walking SW and NE at the same time and determining whether a change has occurred, thus removing the extra complexity for obviously similar/different lines. /me reads more

trane commented 13 years ago

Alright, here is a threading idea... This might actually work. According to Myers, "a simple O(ND) time and space algorithm is developed where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B". However, if D is large this turns into an O(NlgN + D^2) problem, using O(ND) space becomes a bad idea, and the user should consider a different algorithm. Here is my idea...

Let's assume that D is large. Let's also assume that, in most instances, vertically neighboring lines will be more likely to match than lines further apart. Thus, we can work from this hypothesis: if neighboring lines are similar, they will remain similar across commits and can thus be considered to be in partially sorted order. This is where threading comes in...

If we assume there is at least partial order, we can divide and conquer. Create two threads, one that diffs the upper half of lines and one that diffs the lower half of lines. Remaining non-matching lines will then be diff'd after the two threads join.

This, again, would only potentially work if D is large.
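A minimal sketch of that two-thread split, assuming POSIX threads and a hypothetical diff_range() routine standing in for whatever core diff function we end up with (neither of these names exists in libgit2):

    #include <pthread.h>

    /* Hypothetical core routine: diff lines [a_lo, a_hi) of file A
     * against lines [b_lo, b_hi) of file B. */
    extern void diff_range(const char **a_lines, long a_lo, long a_hi,
                           const char **b_lines, long b_lo, long b_hi);

    struct half_job {
        const char **a_lines, **b_lines;
        long a_lo, a_hi, b_lo, b_hi;
    };

    static void *diff_half(void *arg)
    {
        struct half_job *job = arg;

        diff_range(job->a_lines, job->a_lo, job->a_hi,
                   job->b_lines, job->b_lo, job->b_hi);
        return NULL;
    }

    void threaded_diff(const char **a_lines, long na,
                       const char **b_lines, long nb)
    {
        pthread_t upper, lower;
        struct half_job top = { a_lines, b_lines, 0, na / 2, 0, nb / 2 };
        struct half_job bot = { a_lines, b_lines, na / 2, na, nb / 2, nb };

        pthread_create(&upper, NULL, diff_half, &top);
        pthread_create(&lower, NULL, diff_half, &bot);
        pthread_join(upper, NULL);
        pthread_join(lower, NULL);

        /* A serial pass would still be needed here to reconcile lines
         * that actually matched across the halfway boundary. */
    }

The whole bet is in that final reconciliation pass: if the "partially sorted" hypothesis holds it is small, and if it doesn't, the two halves effectively have to be re-diffed against each other and the threading buys nothing.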

trane commented 13 years ago

This in no way changes the O(NlgN + D^2) bound; it merely divides the constant in front by a factor of (2 - (cost of thread spin/join) - L), with L being some number representing the potential worst-case LCS time cost.

trane commented 13 years ago

I misrepresented D: D is referred to as a line, whereas I described it as a sequence of lines. I have edited the comments to reflect that.

trane commented 13 years ago

Actually this doesn't require that D be large... since determining D is merely determining the line. In order to use this threading we would just always do it without any idea of what D is. So, forget any reference to D at all. It would be based on line count alone.

trane commented 13 years ago

Not sure how we missed this: http://www.xmailserver.org/xdiff.txt

hausdorf commented 13 years ago

chastore and chanode appear not to be used anywhere, so I'm not going to implement them.

hausdorf commented 13 years ago

Just kidding. It's used somehow in xdl_prepare_ctx().

hausdorf commented 13 years ago

Figured it out. You start by initializing a chastore with xdl_cha_init(). You tell it what sort of structs you want to store (e.g., xdfile_t structs). From there, you pass the chastore into xdl_cha_alloc(), and it dynamically allocates memory for those structs.

A chanode is basically used to iterate through this memory. For example, to allocate memory for these structs we malloc it all ahead of time and then use chastore->ancur to iterate through it, handing off addresses until it runs out. chastore->sncur, in contrast, is a free iterator, as far as I can tell.
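If that reading is right, the usage pattern boils down to something like the sketch below. The record type is made up and the chunk count of 16 is arbitrary (git derives it from the estimated number of records); it assumes the internal xinclude.h header for the chastore declarations:

    #include "xinclude.h"  /* chastore_t, xdl_cha_init(), xdl_cha_alloc(), xdl_cha_free() */

    /* Made-up record type, standing in for whatever we actually store. */
    typedef struct {
        unsigned long ha;
        const char *ptr;
        long size;
    } my_record_t;

    int allocate_records(long nrec)
    {
        chastore_t cha;
        my_record_t *rec;
        long i;

        /* Tell the chastore the element size and how many elements to
         * carve out of each internally malloc'd chunk. */
        if (xdl_cha_init(&cha, sizeof(my_record_t), 16) < 0)
            return -1;

        for (i = 0; i < nrec; i++) {
            /* Hands back the next free slot, allocating a fresh chunk
             * (and advancing cha.ancur) whenever the current one fills. */
            rec = xdl_cha_alloc(&cha);
            if (!rec) {
                xdl_cha_free(&cha);
                return -1;
            }
            rec->ha = 0;
            rec->ptr = NULL;
            rec->size = 0;
        }

        xdl_cha_free(&cha);
        return 0;
    }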

If you're interested see comments to the git repo I made here.

hausdorf commented 13 years ago

That, by the way, was about 5 or 6 hours of solid work.

trane commented 13 years ago

This is something we already knew from examining the code for hours, but it is spelled out neatly by Davide himself in xpatience. Basically, the 'ha' variable that took forever to figure out is, after preparation, no longer the raw hash but a linearized version of it, where ha[0] == 0 and the next record's ha can only be 0 or 1, and so on.

    /*
     * After xdl_prepare_env() (or more precisely, due to
     * xdl_classify_record()), the "ha" member of the records (AKA lines)
     * is _not_ the hash anymore, but a linearized version of it.  In
     * other words, the "ha" member is guaranteed to start with 0 and
     * the second record's ha can only be 0 or 1, etc.
     *
     * So we multiply ha by 2 in the hope that the hashing was
     * "unique enough".
     */
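To make that concrete: after classification, each distinct line ends up with a small dense index instead of its raw hash. A toy illustration of what the resulting values look like, assuming order-of-first-appearance numbering (this is not the actual xdl_classify_record() code):

    #include <stdio.h>

    /* Replace each hash with an index assigned in order of first
     * appearance, so ha[0] == 0 and each new value is at most one
     * larger than anything seen so far. Toy fixed-size table. */
    static void linearize(const unsigned long *hashes, long n, unsigned long *ha)
    {
        unsigned long seen[64];
        long nseen = 0, i, j;

        for (i = 0; i < n; i++) {
            for (j = 0; j < nseen; j++)
                if (seen[j] == hashes[i])
                    break;
            if (j == nseen)
                seen[nseen++] = hashes[i];
            ha[i] = (unsigned long)j;
        }
    }

    int main(void)
    {
        /* Three lines where the first and third are identical. */
        unsigned long hashes[] = { 0xdeadbeef, 0xcafebabe, 0xdeadbeef };
        unsigned long ha[3];
        long i;

        linearize(hashes, 3, ha);
        for (i = 0; i < 3; i++)
            printf("ha[%ld] = %lu\n", i, ha[i]);  /* 0, 1, 0 */
        return 0;
    }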
hausdorf commented 13 years ago

Noted, thanks.