ScottDuckworth / python-anyvcs

A Python abstraction layer for multiple version control systems
BSD 3-Clause "New" or "Revised" License

object-based commit lookups cause errors in GitRepo.ls(report=['commit']) #40

Closed. OEP closed this issue 10 years ago.

OEP commented 10 years ago

I noticed a while back that anyvcs seems to mis-report the commit for identical files created in different commits. To illustrate the problem, I wrote unit tests:

https://github.com/OEP/python-anyvcs/tree/feature/fix-copy-cache

(this test also exposes an unrelated problem that HgRepo doesn't seem to report the commit at all)

The problem makes sense: the docs say git object names are based on the content of the file, so git only stores identical copies once in its database. We're using object names to cache a 'last-changed' revision, but objects are not one-to-one with revisions.
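
To illustrate: git names a blob by hashing a "blob <size>" header plus the file's bytes, so identical files map to the same object no matter which commit introduced them. A minimal sketch:

import hashlib

def git_blob_id(data):
    # git's blob object name: SHA-1 over a "blob <size>\0" header
    # followed by the file contents.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Two files with identical contents get the same object name, regardless
# of which commits created them, so git stores one blob and our cache
# sees a single key for what are really two different files.
assert git_blob_id(b"Sherlock\n") == git_blob_id(b"Sherlock\n")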

I haven't really thought of a fix yet. I thought we could use a SHA-1 hash of the object name and path, but that doesn't cover the case where a file is created, changed, and reverted (each in a different revision).

ScottDuckworth commented 10 years ago

I pulled your feature branch into branch feature/fix-copy-cache of this repo. With bug #42 fixed, HgCopyTest fails test_ls1 but passes test_ls2. To try to understand this, I reproduced the test case on the command line:

$ hg init
$ echo Sherlock > main
$ hg add main
$ hg ci -m '1: create main'
$ echo Sherlock > copy
$ hg add copy
$ hg ci -m '2: create copy'
$ hg st
$ echo Sherlock > revert
$ hg add revert
$ hg ci -m '3: create revert'
$ echo Watson > revert
$ hg ci -m '4: change revert'
$ echo Sherlock > revert
$ hg ci -m '5: revert revert'

Here are the manifests. Note that the numbers in the commit messages are one-based but hg revision numbers are zero-based.

$ hg manifest --debug -r 0
b4cf127712d354824a8b9db2fe0040932cabebbe 644   main
$ hg manifest --debug -r 1
b4cf127712d354824a8b9db2fe0040932cabebbe 644   copy
b4cf127712d354824a8b9db2fe0040932cabebbe 644   main
$ hg manifest --debug -r 2
b4cf127712d354824a8b9db2fe0040932cabebbe 644   copy
b4cf127712d354824a8b9db2fe0040932cabebbe 644   main
b4cf127712d354824a8b9db2fe0040932cabebbe 644   revert
$ hg manifest --debug -r 3
b4cf127712d354824a8b9db2fe0040932cabebbe 644   copy
b4cf127712d354824a8b9db2fe0040932cabebbe 644   main
4c5c624157b347f3f18616fd89053fe4ed281936 644   revert
$ hg manifest --debug -r 4
b4cf127712d354824a8b9db2fe0040932cabebbe 644   copy
b4cf127712d354824a8b9db2fe0040932cabebbe 644   main
4207cd7c0c091f6f813023aa666f0f93ca24abde 644   revert

According to the docs, in the "How do Mercurial hashes get calculated?" section, hashes are a function of the file contents and the hashes of its parents (I don't quite understand how the parents are factored in). The manifest at revision 4 shows that the hash of revert is different from that of main, presumably because the hashes of its parents are factored in (whatever that means), and this is why test_ls2 passes.
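
If the formula is roughly SHA-1 over the two parent nodeids (sorted) followed by the file text, as I read the docs, that would explain the manifests above. A rough sketch (ignoring copy metadata and other details):

import hashlib

NULL_ID = b"\x00" * 20  # hg's null nodeid, used when a parent is absent

def hg_filelog_nodeid(text, p1=NULL_ID, p2=NULL_ID):
    # Approximation of the documented formula:
    # nodeid = sha1(min(p1, p2) + max(p1, p2) + text)
    return hashlib.sha1(min(p1, p2) + max(p1, p2) + text).hexdigest()

# main, copy, and the first revision of revert all have null parents and
# identical contents, so this gives them all the same nodeid.
first = hg_filelog_nodeid(b"Sherlock\n")

# Once revert has been changed and then reverted, its parent is no longer
# null, so the final "Sherlock" revision of revert hashes differently
# than main, which is why test_ls2 passes.
changed = hg_filelog_nodeid(b"Watson\n", p1=bytes.fromhex(first))
reverted = hg_filelog_nodeid(b"Sherlock\n", p1=bytes.fromhex(changed))
assert reverted != first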

Regardless, this is a problem that will take some work to fix.

OEP commented 10 years ago

It seems kind of weird to me that, despite having a different history, copy shares a nodeid with main, yet revert is special because that path has been in the history at some point; I guess it makes more sense the more I read. This page about revlogs says that each path in the repository basically has its own separate bookkeeping; it just happens that on the first introduction of a path, two files with the same contents in different revisions get the same nodeids. So, at least for Mercurial, I think it may be sufficient to use hash(path + nodeid) as the key for the nodeid->changeset conversion.
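
Concretely, I'm picturing something like this for the key (cache_key is a hypothetical helper, not anything in anyvcs today):

import hashlib

def cache_key(path, nodeid_hex):
    # Mix the path into the hash so that two files which merely share
    # contents (and therefore share a nodeid on their first revision) no
    # longer collide in the nodeid->changeset cache.
    return hashlib.sha1(
        path.encode("utf-8") + b"\0" + nodeid_hex.encode("ascii")
    ).hexdigest()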

There is some information about how the parent revisions are used but I'm not sure it's helpful.

ScottDuckworth commented 10 years ago

I tried to fix this problem using an approach similar to the one we use in hg with the files-cache.log.

Here's how it works in hg: files-cache.log contains a list of commits, ordered by revision number, each listing its parents and which files were changed in that commit. New commits can be added to the log by appending the output of a single hg command to the file. The file is parsed into an array indexed by revision number. To find the commit that last changed each file, the DAG is traversed in Python from the starting point back through the parents, at each step looking for changes to any of the files that have not yet been found, until the last change to every file has been found. This is much faster than using individual hg commands to find the last change for each file, which would traverse the DAG once per file instead of just once.
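
Roughly, the traversal looks like this (a sketch of the idea, not the actual anyvcs code; commits is assumed to be files-cache.log already parsed into a dict of rev -> (parent revs, set of changed files)):

import heapq

def last_changed(commits, start, paths):
    # Walk ancestors of `start` in decreasing revision order (in hg a
    # parent always has a lower revision number than its child) and record
    # the first commit seen that touches each requested path.
    remaining = set(paths)
    found = {}
    heap = [-start]  # max-heap of revision numbers via negation
    seen = {start}
    while heap and remaining:
        rev = -heapq.heappop(heap)
        parents, changed = commits[rev]
        for path in changed & remaining:
            found[path] = rev
        remaining -= changed
        for p in parents:
            if p not in seen:
                seen.add(p)
                heapq.heappush(heap, -p)
    return found

With the example repo above, this reports main last changed in revision 0, copy in revision 1, and revert in revision 4, and it stops as soon as all three are found.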

For git, there are no revision numbers, and ordering by committer date is not reliable (a commit may have been made in another repo long before being pushed to this one), so there is no way to do a linear ordering like we do in hg's files-cache.log. However, I was able to store the required data for each commit (its parents and changed files) in a HashDict keyed by the commit hash. A cache miss on this HashDict requires running a git show command on a single commit.
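
In outline it looks something like the following (a sketch only: a plain dict stands in for HashDict, and merge commits would need extra care since git show prints a combined diff for them):

import subprocess

class CommitMetadataCache(object):
    # Maps commit sha -> (parent shas, set of changed paths), filling
    # cache misses with one `git show` per commit.
    def __init__(self, git_dir):
        self.git_dir = git_dir
        self.cache = {}

    def lookup(self, sha):
        if sha not in self.cache:
            out = subprocess.check_output(
                ["git", "--git-dir", self.git_dir, "show",
                 "--name-only", "--format=%P", sha]).decode("utf-8")
            lines = out.splitlines()
            parents = lines[0].split() if lines else []
            changed = set(line for line in lines[1:] if line)
            self.cache[sha] = (parents, changed)
        return self.cache[sha]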

This works, and it solves the problem, but it is painfully slow for large repos. I tested repo.ls('HEAD', '/', report=['commit']) on the official git repo and let it warm up the cache for ~10 minutes before killing it, at which point it had only about 8000 entries (of the >36000 commits in the repo).

So that's where things stand.

Back to the drawing board...

ScottDuckworth commented 10 years ago

Maybe someone will help us out: http://stackoverflow.com/q/21314493/666619

ScottDuckworth commented 10 years ago

I have tried multiple approaches with libgit2, via both the C API and pygit2, and none of them have been what I would consider "fast enough." Using the C API I got roughly what we need, albeit with a few bugs, down to ~10 seconds for git.git.
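
For reference, the pygit2 attempt was shaped roughly like this (illustrative only, not the exact code I ran; merges are handled naively via the first parent):

import pygit2

def last_change_per_path(repo_path, ref="HEAD"):
    # Walk history newest-first and record, for each path, the first
    # commit encountered whose diff against its first parent touches
    # that path.
    repo = pygit2.Repository(repo_path)
    target = repo.revparse_single(ref).id
    found = {}
    for commit in repo.walk(target, pygit2.GIT_SORT_TOPOLOGICAL):
        if commit.parents:
            diff = repo.diff(commit.parents[0], commit)
        else:
            # Root commit: diff the empty tree against the commit's tree.
            diff = commit.tree.diff_to_tree(swap=True)
        for delta in diff.deltas:
            found.setdefault(delta.new_file.path, str(commit.id))
    return found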

To try to figure out how something like git log HEAD -- file.c is so fast, I used strace. It shows that git is not even attempting to access individual commit objects on the file system; it goes directly to the pack files (which are mmap'ed).

Running strace on my programs that use libgit2 (both the C and Python versions) shows that each commit object is looked up as a loose file on the file system, even though it's not there (it's in the pack file). I see a bunch of these:

access("/local/sduckwo/git.git/objects/83/ba6d8aa00c4b0d7ae10500fa04cfaf1ee4d155", F_OK) = -1 ENOENT (No such file or directory)
access("/local/sduckwo/git.git/objects/ee/d10dcbed7a53968a322849650287642b3bf23d", F_OK) = -1 ENOENT (No such file or directory)
access("/local/sduckwo/git.git/objects/64/841a9faeb8ee6ab90b75457768c54b46b4abf6", F_OK) = -1 ENOENT (No such file or directory)
access("/local/sduckwo/git.git/objects/83/ba6d8aa00c4b0d7ae10500fa04cfaf1ee4d155", F_OK) = -1 ENOENT (No such file or directory)

I see this even when using the git_revwalk_* functions, and I don't see any functions that implement read access to pack files; only writing pack files is covered by that API.

So I think the answer to this problem is going to be either some aggressive caching or some really really low-level access to the git pack files.

OEP commented 10 years ago

As for how GitHub's page loads are so fast, it looks like their solution involves caching the rendered HTML for each commit and path in the repo. On a cache miss, it looks like the browser is sent off to fetch the file list from a different URL, which can take 2-3 seconds in the worst case. Subsequent page loads don't seem to do this; they come back as a single page.

I thought it might be a good idea to see what their API offers too, but there's nothing that lists commits alongside files that I can see.

So it's not cheap enough to do in a user's main request, and it's not cheap enough to expose in a public API. I think that supports your conclusion. Still, I'm not sure what to make of the 2-3 second cache warmup time.

ScottDuckworth commented 10 years ago

Fixed in 267fc082aa3013b196e37b710c945a11111b912b