apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Create a (cleaned up) SVN history in git [LUCENE-6933] #7991

Closed asfimport closed 8 years ago

asfimport commented 8 years ago

Goals:

Non goals:

Impossible:


Migrated from LUCENE-6933 by Dawid Weiss (@dweiss), resolved Dec 29 2015
Attachments: multibranch-commits.log
Linked issues:

asfimport commented 8 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

For some reference, here is a wiki page managing Maven's migration to git: https://cwiki.apache.org/confluence/display/MAVEN/Git+Migration

Here is one of the INFRA JIRAs: https://issues.apache.org/jira/browse/INFRA-5266 Migrate Maven subprojects to git (surefire,scm,wagon)

Much of it isn't relatable to us, but it's a way into the INFRA tickets for a past migration.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Thanks Mark. I don't think I'll use automated scripts; I'll most likely put together something that will translate raw history revision by revision (cleaning up the local SVN dump first). It's fine if it takes a long time, since it's a one-time conversion. I realize it's mind-bending, but let's see if it works. I'll need some time to work through it; these are huge files.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Also, renames-with-modifications may not be picked up correctly for linear file history, like this one (r894959):

Node-path: lucene/solr/branches/cloud/src/java/org/apache/solr/cloud/CountdownWatcher.java
Node-kind: file
Node-action: add
Node-copyfrom-rev: 892824
Node-copyfrom-path: lucene/solr/branches/cloud/src/java/org/apache/solr/util/zookeeper/CountdownWatcher.java
Text-delta: true
Text-delta-base-md5: ba60152c2bd0eebe18755e4c555a62d2
Text-content-length: 924
Text-content-md5: f7173659406b1d7e07ee024cbaf78506
Content-length: 924 
(diff of changes)
asfimport commented 8 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Most of these issues are not unique to us though right? I assume that a lot of projects that have converted from svn to git have had to just bite the bullet on these issues?

Is it easy to show how widespread the issues are? (eg how many classes are affected)
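
One rough way to put a number on it, assuming a local SVN dump file is at hand: count the file nodes that are both copied from another path and carry new text, like the r894959 record above. This is only a sketch. The function name and the line-based header parsing are assumptions of mine, and it gives an estimate rather than an exact figure.

```shell
# count_copy_mods DUMPFILE
# Counts nodes whose header block has both a Node-copyfrom-path and a
# Text-content-length entry, i.e. a copy/rename that also changes content.
# Only header blocks are inspected; a blank line ends each header block.
count_copy_mods() {
  awk '
    function finish() {
      if (in_node && copied && changed) n++
      in_node = 0
    }
    /^Node-path: / { finish(); in_node = 1; copied = 0; changed = 0; next }
    in_node && /^Node-copyfrom-path: /  { copied = 1 }
    in_node && /^Text-content-length: / { changed = 1 }
    in_node && /^$/ { finish() }
    END { finish(); print n + 0 }
  ' "$1"
}
```

Hypothetical usage: count_copy_mods lucene.dump (the file name is illustrative). Copies without a text change and plain adds are not counted.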

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I don't know, to be honest. I think most conversions are done automatically using git-svn and if there are any erroneous merges they're never detected. Note this doesn't mean the final files are incorrect – they will be the same, it's just the history that is different. git doesn't "track" paths, it tracks changes. So you can have two (slightly different) files that will be detected as a "copy-with-change" operation.

I think the whole conversion should be based on the "best-effort" principle. I can't promise miracles, let's just see how it comes out.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

After some more digging and experiments it seems realistic that the following multi-step process will get us to the goals above.

I'll proceed and try to do all the above locally. If it works, I'll push a "test" repo to github so that folks can inspect. Everything takes ages. Patience.

asfimport commented 8 years ago

Upayavira (@upayavira) (migrated from JIRA)

@dweiss Just for clarity's sake - what impact will this have on existing clones/forks on Github? Would they continue to work, or break?

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

They will break (because we plan to remove JARs and binary blobs). They are only partially correct anyway (no history past Solr/Lucene merger). You should be able to rebase custom patches fairly easily though since the content of each SVN revision should be identical, only commit hashes will differ.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Everything looks good so far. I stitched Solr's and Lucene's histories together beautifully locally. Lots of interesting plot twists on the way.

Had to restart git-svn fetches because it occurred to me that: 1) the source of git-svn cannot be my local mirror (because it'd show in commit logs); if not for anything else, then for legal reasons we should fetch from Apache's SVN directly, 2) fixing author entries is easier in git-svn (via authors.txt).
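
For reference, a git-svn authors file has one line per committer in the form "svnid = Full Name <email>". The sketch below builds a skeleton of such a file from "svn log --quiet" output. The function name is made up, and both the id-as-name placeholder and the @apache.org address are assumptions that would need editing by hand.

```shell
# authors_skeleton: reads `svn log --quiet` output on stdin and prints one
# authors-file line per distinct committer id. The display name and the
# @apache.org address are placeholders, not verified data.
authors_skeleton() {
  awk -F '|' '/^r[0-9]+ \|/ { gsub(/^ +| +$/, "", $2); print $2 }' |
    sort -u |
    while read -r id; do
      printf '%s = %s <%s@apache.org>\n' "$id" "$id" "$id"
    done
}
```

Hypothetical usage: svn log --quiet https://svn.apache.org/repos/asf/lucene/ | authors_skeleton > authors.txt, then fix the names and pass the file to git svn with --authors-file=authors.txt.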

while (!successful()) retry();

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Does anybody know Scala? I'd love to filter the JAR files to zero size using https://rtyley.github.io/bfg-repo-cleaner/ but the source code is way beyond my comprehension.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Never mind, I did it myself.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I pushed a test repo with merged history to: https://github.com/dweiss/lucene-solr-svn2git

A few remarks.

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Is it still expected that there is a problem with lucene core/ history?

E.g. here is IndexWriter: https://github.com/dweiss/lucene-solr-svn2git/commits/master/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java?page=8

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

git log (and GitHub) doesn't display history past a rename by default.

http://stackoverflow.com/questions/5646174/github-follow-history-by-default

Try this though:

git log --follow lucene/core/src/java/org/apache/lucene/index/IndexWriter.java

Shows the history all the way back to 2001.

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Dawid, I installed the Chrome extension (https://chrome.google.com/webstore/detail/github-follow/agalokjhnhheienloigiaoohgmjdpned/) which seems to work.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I can go down to git repo size of 160mb by removing any of these files (not currently used on any of the active branches):

*.mem
*.dat
*.war
*.zip

These are mostly precompiled automata, etc. Current blobs (in any of branch_x and master) are not affected, but tags are. Don't know if it makes sense.
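
To judge whether it makes sense, one could first measure how much those patterns contribute across the whole history. A minimal sketch, assuming a local clone; the function name is hypothetical, and it sums uncompressed blob sizes, which overstates the effect on the packed repository.

```shell
# blob_bytes_by_ext REPO EXT...
# Sums the uncompressed size of every blob reachable from any ref whose
# path ends in one of the given extensions (for example: mem dat war zip).
blob_bytes_by_ext() {
  repo=$1; shift
  exts=$(printf '%s|' "$@"); exts=${exts%|}
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v re="[.](${exts})\$" \
      '$1 == "blob" && $0 ~ re { total += $2 } END { print total + 0 }'
}
```

Hypothetical usage: blob_bytes_by_ext . mem dat war zip prints a single byte count for the clone in the current directory.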

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

SVN-git merging procedure (outline). For historical reference.

asfimport commented 8 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

"I can go down to git repo size of 160mb"

Call me silly, but I'm +1 on that. Same reason as for the jars - if you want those files, they are in SVN and that is the best place to try and deal with that level. The Git repo should just try to capture all the code / build history it can.

asfimport commented 8 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

"I installed the chrome extension"

For cmd line, if you have Git 2.6 or above, you are supposed to be able to make --follow the default when it makes sense with something like git config --global log.follow "true"

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I'd keep those resources at least in the releases made in the past 12 months or so. It should still truncate nicely. You can play with it yourself if you wish, the instructions are attached to the issue. I'll attach the custom tool too.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Some tools used during the migration process (customized bfg).

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

I cloned from https://github.com/dweiss/lucene-solr-svn2git.git, and it works as advertised. After a git gc, the total file size is:

find . -type f -print0 | xargs -0 cat | wc
2942604 13472825 347467457

This is just under 350MB, which does not seem to be consistent with the 214MB that was mentioned above. Did I do something wrong?

To me the actual size is not a problem at all.

For reference, the total number of files in the local git repo is 9322:

find . -type f | wc
9322 9324 694864

And thanks for showing how and when to graft.

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

git gui reports this:

Number of packed objects: 741540
Number of packs: 1
Disk space used by packed objects: 228602 KiB.

Sorry for the noise, the earlier counts include the working tree.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

The exact number will depend slightly on the git version used (I had 1.x on one machine and 2.x on the other). I used simple estimates in the form of

du -sh .git

on a clean clone.

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Thanks for all the hard work you put into this Dawid!

I was trying to test out how far the history goes back on the Solr side, using SearchComponent.java as an example. I tried this: git log --follow solr/core/src/java/org/apache/solr/handler/component/SearchComponent.java but it only goes back to 2012-04. But when I used another tool I'm familiar with, Atlassian SourceTree, I found early commit messages with "SearchComponent" in them, revealing commit 4a490cff561e9ab492ec27fdc55c51c0db02ffed in 2007-12. Any ideas why git log --follow didn't work in this case?

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Look at comments above, David – the tools probably don't "follow" renames. There should be an answer in those tools' docs how to fix this behavior, the history of renames is in the repo, for sure.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Oops, sorry. I misread your comment. I don't know. Will look into it tomorrow.

asfimport commented 8 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Sounds like David is saying the opposite - the other tools are following and --follow with git is not working.

@dsmiley, is your git at least 1.5.3? I think that's where it was introduced on a quick google search.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

No, it's indeed truncated. The reason for this is, like I mentioned, the fact that git doesn't really remember the exact "path" of a file (renames). It just tries its best to guess renames by moving paths of objects with the same hash.

The history of SearchComponent ends at this commit in git:

svn log -v -r 1144761 https://svn.apache.org/repos/asf/lucene/

If you look at the SVN log you'll see that this commit does both renames from a branch and changes to code; this can't be reflected in the counterpart git commit. I didn't track down the exact reason why git can't follow the diff-change. Like I mentioned multiple times, it's best effort, not exact history – SVN and git differ in how they manage file tracking. Feel free to browse the object graph though (gitk), perhaps you can improve upon this situation!

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Btw. this is a good observation, David – exactly what I was hoping for when I solicited feedback. I'll see if there's anything to be improved in the import/ conversion process, but like I said, I wouldn't be too optimistic.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

All I can say is the "continuity" of this file with respect to git log gets truncated somewhere when files have been moved from src/java to src/core/... Note that git blame does show changes to this file correctly though (or at least stretches back to Ryan's initial commit).

asfimport commented 8 years ago

Stefan Pohl (migrated from JIRA)

Would it technically be feasible to detect such rename/move + change commits and split them up into two git commits? Within git, I typically do separate commits for rename/move operations, not having to rely on git's best-effort detection of very similar files.
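
To illustrate the difference, here is a small self-contained sketch on throwaway repositories (not the actual conversion). All file names and helper functions are made up for the demo. It compares what git log --follow reports when a rename and a full rewrite share one commit versus when they are split into two.

```shell
# Build a throwaway repository; identity is set locally so commits succeed.
new_repo() {
  dir=$(mktemp -d)
  git -C "$dir" init -q >/dev/null
  git -C "$dir" config user.name "Demo"
  git -C "$dir" config user.email "demo@example.invalid"
  echo "$dir"
}
# Print "line N" for N in FROM..TO.
lines() { i=$1; while [ "$i" -le "$2" ]; do echo "line $i"; i=$((i + 1)); done; }
# Stage everything and commit with the given message.
commit_all() { git -C "$1" add -A && git -C "$1" commit -q -m "$2"; }
# Count the commits that `git log --follow` reports for a path.
follow_count() { git -C "$1" log --follow --oneline -- "$2" | wc -l | tr -d ' '; }

# Case 1: rename and rewrite in a single commit (like a mixed SVN revision).
a=$(new_repo)
lines 1 20 > "$a/Old.java"
commit_all "$a" "add Old.java"
git -C "$a" mv Old.java New.java
lines 101 120 > "$a/New.java"
commit_all "$a" "rename to New.java and rewrite it"
combined=$(follow_count "$a" New.java)

# Case 2: the same change split into a pure rename, then the rewrite.
b=$(new_repo)
lines 1 20 > "$b/Old.java"
commit_all "$b" "add Old.java"
git -C "$b" mv Old.java New.java
commit_all "$b" "rename to New.java"
lines 101 120 > "$b/New.java"
commit_all "$b" "rewrite New.java"
split=$(follow_count "$b" New.java)

echo "combined commit: $combined entries, split commits: $split entries"
```

In this toy setup the combined case shares no content between the old and new file, so the rename is not detected and the followed history stops at the rename commit, while the split case is followed back to the original add. Real revisions usually change less than this, so detection often still succeeds.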

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Technically it's how it should be done in git (a good practice to preserve history of renames). But practically no, I can't do it – it's what git-svn does by default and the prospect of doing it by hand for the bazillion of mixed change/merges in the project's history is not an appealing one.

Perhaps you could try to fix this one particular merge somehow (git should be seeing the rename with options to detect renames harder, but it still doesn't), but I'm afraid I won't have the time to do it. Besides, this would be serious fiddling with commit history – all I did was fuse histories together, I didn't add new commits or alter existing commits. Whether we should do it just so that git log works... don't know. Look at the merge history around the problematic commit with gitk --all... no wonder git gets confused. I definitely get confused!

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I think there's still something wrong with the migration process (with respect to the newest Solr history). Too many root commits in Solr history, something is wrong – perhaps this is the source of the problem with history logging. I'll be looking into this while waiting for Santa.
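
For anyone who wants to check this on their own clone, a root commit is simply a commit with no parents, and git can list those directly. A minimal sketch; the function name is hypothetical.

```shell
# root_commit_count REPO
# Prints how many parentless (root) commits are reachable from any ref.
root_commit_count() {
  git -C "$1" rev-list --max-parents=0 --all | wc -l | tr -d ' '
}
```

Hypothetical usage: root_commit_count . in a clone. A history stitched from several SVN lines can legitimately have more than one root, so the number is a hint about where to look rather than proof of a problem.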

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Thanks for pointing out the problem, David. The cause of the issue was Steve's rename-and-merge a long time ago... very complex, not worth mentioning. I fixed it with some manual tweaks and updated the repo (your local clone will be invalid and will contain stale refs, fetch a fresh one).

https://github.com/dweiss/lucene-solr-svn2git

The migration procedure is 100% repeatable and I can roll out an up-to-date copy any time. It looks super good to me. I did not size-optimize anything except JAR files so that releases and diffs between commits are true. I don't think it's worth the trouble; a clone from GitHub on my machine slurps a few MB/s.

I think this issue is ready and I'm closing it.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Ready. Whenever we decide to switch, it's there.

asfimport commented 8 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Thanks Dawid, awesome job! That missing history in git made some things painful for me in the past... so glad it's fixed!

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Excellent; this is great! I tried with another old source file too and git followed it. Thanks again Dawid.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Thanks. I've placed the scripts and know-how on how the migration process is performed here: https://github.com/dweiss/lucene-solr-svn2git-migration

The current git mirror of SVN at Apache is broken and cannot be reused; author tags are messed up:

> git remote -v
origin  git://git.apache.org/lucene-solr.git (fetch)
origin  git://git.apache.org/lucene-solr.git (push)
> git log --all | grep "Author: " | sort -u
...
Author: Adrien Grand <jpountz@apache.org =  jpountz = Adrien Grand jpountz@apache.org@apache.org>
Author: Adrien Grand <jpountz@apache.org>
...
Author: dsmiley <dsmiley@13f79535-47bb-0310-9956-ffa450edef68>
Author: ehatcher <ehatcher@13f79535-47bb-0310-9956-ffa450edef68>
... (and more)
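
A rough filter for spotting such entries, as a sketch only: the function name and the two patterns (an SVN repository UUID in place of a mail domain, or a leftover " = " from an authors-file line) are assumptions based on the samples above, not an exhaustive check.

```shell
# suspicious_authors: reads "Author: Name <email>" lines on stdin and prints
# those whose email ends in a UUID instead of a domain, or that still
# contain a literal " = " from a mangled authors-file mapping.
suspicious_authors() {
  grep -E '@[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}>| = '
}
```

Hypothetical usage: git log --all --format='Author: %an <%ae>' | sort -u | suspicious_authors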

I fetched everything from scratch via git-svn (see the scripts if you're interested). I also introduced a few minor synthetic commits that reshuffle folders or do some cleanups so that the repository looks more sensible. An overview of what it looks like conceptually (with revision numbers and sources) is here:

https://raw.githubusercontent.com/dweiss/lucene-solr-svn2git-migration/master/docs/dev-lines-overview.png

As mentioned previously, I also cleaned up tags and branches (moving all current branches to tags under history/*). These (and graft tags) can be deleted of course - I left them as a reference. All releases use the release/(project)/(version) convention, again converted to a more modern, dot-separated naming scheme (SVN tags used underscores back from CVS days).
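
As an illustration of the naming change only (the actual mapping lives in the migration scripts linked above), a sketch that turns an underscore-style tag into the release/(project)/(version) form. The function name and the sample tag names are hypothetical.

```shell
# to_release_tag PROJECT SVN_TAG
# Drops the leading alphabetic prefix of the SVN tag, turns the remaining
# underscores into dots, and prefixes the result with release/PROJECT/.
to_release_tag() {
  version=$(printf '%s\n' "$2" | sed -E 's/^[A-Za-z_]*_([0-9])/\1/; s/_/./g')
  printf 'release/%s/%s\n' "$1" "$version"
}
```

Hypothetical usage: to_release_tag lucene-solr lucene_solr_4_10_0 prints release/lucene-solr/4.10.0.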