Closed asfimport closed 8 years ago
Mark Miller (@markrmiller) (migrated from JIRA)
For reference, here is a wiki page managing Maven's migration to git: https://cwiki.apache.org/confluence/display/MAVEN/Git+Migration
Here is one of the INFRA JIRAs: https://issues.apache.org/jira/browse/INFRA-5266 Migrate Maven subprojects to git (surefire,scm,wagon)
Not all of it relates to us, but it is a way into the INFRA tickets for a past migration.
Dawid Weiss (@dweiss) (migrated from JIRA)
Thanks Mark. I don't think I'll use automated scripts; I'll most likely put together something that will translate raw history revision-by-revision (cleaning up the local SVN dump first). It can take a long time if it's a one-time conversion. I realize it's mind-bending, but let's see if it works. I'll need some time to work through it; these are huge files.
Dawid Weiss (@dweiss) (migrated from JIRA)
Also, renames-with-modifications may not be picked up correctly for linear file history, like this one (r894959):
Node-path: lucene/solr/branches/cloud/src/java/org/apache/solr/cloud/CountdownWatcher.java
Node-kind: file
Node-action: add
Node-copyfrom-rev: 892824
Node-copyfrom-path: lucene/solr/branches/cloud/src/java/org/apache/solr/util/zookeeper/CountdownWatcher.java
Text-delta: true
Text-delta-base-md5: ba60152c2bd0eebe18755e4c555a62d2
Text-content-length: 924
Text-content-md5: f7173659406b1d7e07ee024cbaf78506
Content-length: 924
(diff of changes)
Mark Miller (@markrmiller) (migrated from JIRA)
Most of these issues are not unique to us though right? I assume that a lot of projects that have converted from svn to git have had to just bite the bullet on these issues?
Is it easy to show how widespread the issues are? (eg how many classes are affected)
Dawid Weiss (@dweiss) (migrated from JIRA)
I don't know, to be honest. I think most conversions are done automatically using git-svn and if there are any erroneous merges they're never detected. Note this doesn't mean the final files are incorrect – they will be the same, it's just the history that is different. git doesn't "track" paths, it tracks changes. So you can have two (slightly different) files that will be detected as a "copy-with-change" operation.
I think the whole conversion should be based on the "best-effort" principle. I can't promise miracles, let's just see how it comes out.
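To illustrate the point about git tracking changes rather than paths, here is a minimal Python sketch of similarity-based rename guessing. It is not git's actual algorithm: the `similarity` and `guess_renames` helpers are hypothetical, the line-based `difflib` ratio is only a stand-in for git's own similarity index, and the 0.5 threshold simply mirrors the 50% default of git's `-M` option.

```python
from difflib import SequenceMatcher


def similarity(old_text, new_text):
    """Line-based similarity in [0, 1]; a stand-in for git's similarity index."""
    return SequenceMatcher(None, old_text.splitlines(), new_text.splitlines()).ratio()


def guess_renames(deleted, added, threshold=0.5):
    """Pair each added path with the most similar deleted path.

    `deleted` and `added` map path -> file content for a single commit.
    Returns a list of (old_path, new_path, score) tuples; added paths with
    no candidate at or above `threshold` are treated as plain additions.
    """
    pairs = []
    for new_path, new_text in added.items():
        best = None
        for old_path, old_text in deleted.items():
            score = similarity(old_text, new_text)
            if score >= threshold and (best is None or score > best[2]):
                best = (old_path, new_path, score)
        if best is not None:
            pairs.append(best)
    return pairs
```

The more a commit changes a file while moving it, the lower the score gets, and below the threshold the move looks like an unrelated delete plus add. That is the failure mode to expect for renames-with-modifications.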
Dawid Weiss (@dweiss) (migrated from JIRA)
After some more digging and experiments it seems realistic that the following multi-step process will get us to the goals above.
Use git-svn to mirror (separately) lucene/java/*, lucene/dev/* and Solr's pre-merge history.
I'll proceed and try to do all the above locally. If it works, I'll push a "test" repo to github so that folks can inspect. Everything takes ages. Patience.
Upayavira (@upayavira) (migrated from JIRA)
@dweiss Just for clarity's sake - what impact will this have on existing clones/forks on Github? Would they continue to work, or break?
Dawid Weiss (@dweiss) (migrated from JIRA)
They will break (because we plan to remove JARs and binary blobs). They are only partially correct anyway (no history past Solr/Lucene merger). You should be able to rebase custom patches fairly easily though since the content of each SVN revision should be identical, only commit hashes will differ.
Dawid Weiss (@dweiss) (migrated from JIRA)
Everything looks good so far. I stitched Solr's and Lucene's history together beautifully locally. Lots of interesting plot twists on the way.
Had to restart git-svn fetches because it occurred to me that: 1) the source of git-svn cannot be my local mirror (because it'd show in commit logs); if not for anything else, then for legal reasons we should fetch from Apache's SVN directly, 2) fixing author entries is easier in git-svn (via authors.txt).
while (!successfull()) retry();
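For the author fix-up mentioned above, a git-svn authors file holds one `login = Full Name <email>` mapping per line. A rough Python sketch for checking such a file against the SVN logins seen in the history might look like this. The helper names and the sample entry in the usage note are assumptions for illustration, not part of the actual migration scripts.

```python
import re

# One mapping per line: svn_login = Full Name <email>
AUTHOR_LINE = re.compile(r"^\s*(?P<login>\S+)\s*=\s*(?P<name>.+?)\s*<(?P<email>[^<>]*)>\s*$")


def parse_authors(text):
    """Parse authors-file text into {login: (name, email)}; blank lines are skipped."""
    mapping = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue
        match = AUTHOR_LINE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: expected 'login = Name <email>', got {line!r}")
        mapping[match.group("login")] = (match.group("name"), match.group("email"))
    return mapping


def unmapped(svn_logins, mapping):
    """Return the SVN logins that have no entry in the authors mapping."""
    return sorted(set(svn_logins) - set(mapping))
```

For example, `unmapped(["jpountz", "dsmiley"], parse_authors("jpountz = Adrien Grand <jpountz@apache.org>"))` reports `dsmiley` as still missing, so the file can be completed before a long fetch is started.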
Dawid Weiss (@dweiss) (migrated from JIRA)
Does anybody know Scala? I'd love to filter the JAR files to zero size using https://rtyley.github.io/bfg-repo-cleaner/ but the source code is way beyond my comprehension.
Dawid Weiss (@dweiss) (migrated from JIRA)
Never mind, I did it myself.
Dawid Weiss (@dweiss) (migrated from JIRA)
I pushed a test repo with merged history to: https://github.com/dweiss/lucene-solr-svn2git
A few remarks.
branch_3x, branch_4x and branch_5x as active branches. trunk becomes master.
master's history is not entirely up to date; we can fill in remaining commits by fast-forwarding the remaining commits manually if we switch to git.
historical/branches/*, invoke git tag to see the list of tags.
releases/lucene,solr,lucene-solr/number. Previous "tags" from SVN are available under historical tags (see above).
grafts/*.
git config core.filemode false to ignore.
git log --follow lucene/core/src/java/org/apache/lucene/index/IndexWriter.java.
gitk --all makes for very interesting reading.
Robert Muir (@rmuir) (migrated from JIRA)
Is it still expected that there is a problem with lucene core/ history?
E.g., here is IndexWriter: https://github.com/dweiss/lucene-solr-svn2git/commits/master/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java?page=8
Dawid Weiss (@dweiss) (migrated from JIRA)
git log (and GitHub) doesn't display history past a rename by default.
http://stackoverflow.com/questions/5646174/github-follow-history-by-default
Try this though:
git log --follow lucene\core\src\java\org\apache\lucene\index\IndexWriter.java
Shows the history all the way back to 2001.
Robert Muir (@rmuir) (migrated from JIRA)
Thanks Dawid, I installed the Chrome extension (https://chrome.google.com/webstore/detail/github-follow/agalokjhnhheienloigiaoohgmjdpned/) which seems to work.
Dawid Weiss (@dweiss) (migrated from JIRA)
I can go down to git repo size of 160mb by removing any of these files (not currently used on any of the active branches):
*.mem
*.dat
*.war
*.zip
These are mostly precompiled automata, etc. Current blobs (in any of branch_x and master) are not affected, but tags are. Don't know if it makes sense.
Dawid Weiss (@dweiss) (migrated from JIRA)
SVN-git merging procedure (outline). For historical reference.
Mark Miller (@markrmiller) (migrated from JIRA)
I can go down to git repo size of 160mb
Call me silly, but I'm +1 on that. Same reason as for the jars - if you want those files, they are in SVN and that is the best place to try and deal with that level. The Git repo should just try to capture all the code / build history it can.
Mark Miller (@markrmiller) (migrated from JIRA)
i installed the chrome extension
For cmd line, if you have Git 2.6 or above, you are supposed to be able to make --follow the default when it makes sense with something like git config --global log.follow "true"
Dawid Weiss (@dweiss) (migrated from JIRA)
I'd keep those resources at least in the releases made in the past 12 months or so. It should still truncate nicely. You can play with it yourself if you wish, the instructions are attached to the issue. I'll attach the custom tool too.
Dawid Weiss (@dweiss) (migrated from JIRA)
Some tools used during the migration process (customized bfg).
Paul Elschot (migrated from JIRA)
I cloned from https://github.com/dweiss/lucene-solr-svn2git.git, and it works as advertised. After a git gc, the total file size is:
find . -type f -print0 | xargs -0 cat | wc
2942604 13472825 347467457
This is just under 350MB, which does not seem to be consistent with the 214MB that was mentioned above. Did I do something wrong?
To me the actual size is not a problem at all.
For reference, the total number of files in the local git repo is 9322:
find . -type f | wc
9322 9324 694864
And thanks for showing how and when to graft.
Paul Elschot (migrated from JIRA)
git gui reports this:
Number of packed objects: 741540
Number of packs: 1
Disk space used by packed objects: 228602 KiB.
Sorry for the noise, the earlier counts include the working tree.
Dawid Weiss (@dweiss) (migrated from JIRA)
The exact number will depend slightly on the git version used (I had 1.x on one machine and 2.x on the other). I used simple estimates in the form of du -sh .git on a clean clone.
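The gap between the two measurements comes down to what gets counted: the whole checkout or only the .git directory. A small Python sketch that separates the two is below. The `dir_size` helper is hypothetical, and it sums file lengths in bytes, so its numbers will differ somewhat from du, which reports disk blocks.

```python
import os


def dir_size(root):
    """Sum the sizes in bytes of all regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    clone = "."  # assumption: run from the root of a fresh clone
    repo_bytes = dir_size(os.path.join(clone, ".git"))
    total_bytes = dir_size(clone)
    print(f".git: {repo_bytes} bytes, working tree: {total_bytes - repo_bytes} bytes")
```

Only the first number says anything about what a clone has to transfer and store as history. The second depends on which branch is checked out.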
David Smiley (@dsmiley) (migrated from JIRA)
Thanks for all the hard work you put into this Dawid!
I was trying to test out how far the history goes back on the Solr side, using SearchComponent.java as an example. I tried this:
git log --follow solr/core/src/java/org/apache/solr/handler/component/SearchComponent.java
but it only goes back to 2012-04. But when I use another tool I'm familiar with, Atlassian SourceTree, I found early commit messages with "SearchComponent" in them, revealing commit 4a490cff561e9ab492ec27fdc55c51c0db02ffed from 2007-12. Any ideas why git log --follow didn't work in this case?
Dawid Weiss (@dweiss) (migrated from JIRA)
Look at comments above, David – the tools probably don't "follow" renames. There should be an answer in those tools' docs how to fix this behavior, the history of renames is in the repo, for sure.
Dawid Weiss (@dweiss) (migrated from JIRA)
Oops, sorry. I misread your comment. I don't know; I'll look into it tomorrow.
Mark Miller (@markrmiller) (migrated from JIRA)
Sounds like David is saying the opposite - the other tools are following and --follow with git is not working.
@dsmiley, is your git at least 1.5.3? I think that's where it was introduced on a quick google search.
Dawid Weiss (@dweiss) (migrated from JIRA)
No, it's indeed truncated. The reason for this is, like I mentioned, the fact that git doesn't really remember the exact "path" of a file (renames). It just tries its best to guess renames by moving paths of objects with the same hash.
The history of SearchComponent ends at this commit in git:
svn log -v -r 1144761 https://svn.apache.org/repos/asf/lucene/
If you look at the SVN log you'll see that this commit does both renames from a branch and changes to code; this can't be reflected in the counterpart git commit. I didn't track down the exact reason why git can't follow the diff-change. Like I mentioned multiple times, it's best effort, not exact history: SVN and git differ in how they manage file tracking. Feel free to browse the object graph though (gitk), perhaps you can improve upon this situation!
Dawid Weiss (@dweiss) (migrated from JIRA)
Btw. this is a good observation, David – exactly what I was hoping for when I solicited feedback. I'll see if there's anything to be improved in the import/ conversion process, but like I said, I wouldn't be too optimistic.
Dawid Weiss (@dweiss) (migrated from JIRA)
All I can say is the "continuity" of this file with respect to git log gets truncated somewhere when files have been moved from src/java to src/core/... Note that git blame does show changes to this file correctly though (or at least stretches back to Ryan's initial commit).
Stefan Pohl (migrated from JIRA)
Would it technically be feasible to detect such rename/move + change commits and split them up into two git commits? Within git, I typically do separate commits for rename/move operations, not having to rely on git's best-effort detection of very similar files.
Dawid Weiss (@dweiss) (migrated from JIRA)
Technically it's how it should be done in git (a good practice to preserve history of renames). But practically no, I can't do it: it's what git-svn does by default, and the prospect of doing it by hand for the bazillion mixed change/merges in the project's history is not an appealing one.
Perhaps you could try to fix this one particular merge somehow (git should be seeing the rename with options to detect renames harder, but it still doesn't), but I'm afraid I won't have the time to do it. Besides, this would be serious fiddling with commit history; all I did was fuse histories together, I didn't add new commits or alter existing commits. Whether we should do it just so that git log works... don't know. Look at the merge history around the problematic commit with gitk --all... no wonder git gets confused. I definitely get confused!
Dawid Weiss (@dweiss) (migrated from JIRA)
I think there's still something wrong with the migration process (with respect to the newest Solr history). Too many root commits in Solr history, something is wrong – perhaps this is the source of the problem with history logging. I'll be looking into this while waiting for Santa.
Dawid Weiss (@dweiss) (migrated from JIRA)
Thanks for pointing out the problem, David. The cause of the issue was Steve's rename-and-merge a long time ago... very complex, not worth mentioning. I fixed it with some manual tweaks and updated the repo (your local clone will be invalid and will contain stale refs, fetch a fresh one).
https://github.com/dweiss/lucene-solr-svn2git
The migration procedure is 100% repeatable and I can roll out an up-to-date copy any time. It looks super good to me. I did not size-optimize anything except JAR files so that releases and diffs between commits are true. I don't think it's worth the trouble; a clone from github on my machine slurps a few mb/s.
I think this issue is ready and I'm closing it.
Dawid Weiss (@dweiss) (migrated from JIRA)
Ready. Whenever we decide to switch, it's there.
Yonik Seeley (@yonik) (migrated from JIRA)
Thanks Dawid, awesome job! That missing history in git made some things painful for me in the past... so glad it's fixed!
David Smiley (@dsmiley) (migrated from JIRA)
Excellent; this is great! I tried with another old source file too and git followed it. Thanks again Dawid.
Dawid Weiss (@dweiss) (migrated from JIRA)
Thanks. I've placed the scripts and know-how on how the migration process is performed here: https://github.com/dweiss/lucene-solr-svn2git-migration
The current git mirror of SVN at Apache is broken and cannot be reused; author tags are messed up:
> git remote -v
origin git://git.apache.org/lucene-solr.git (fetch)
origin git://git.apache.org/lucene-solr.git (push)
> git log --all | grep "Author: " | sort -u
...
Author: Adrien Grand <jpountz@apache.org = jpountz = Adrien Grand jpountz@apache.org@apache.org>
Author: Adrien Grand <jpountz@apache.org>
...
Author: dsmiley <dsmiley@13f79535-47bb-0310-9956-ffa450edef68>
Author: ehatcher <ehatcher@13f79535-47bb-0310-9956-ffa450edef68>
... (and more)
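A check like the one above can be automated. The Python sketch below flags author lines that are not a plain `Name <user@host.tld>` pair or whose host part is a bare UUID (git-svn falls back to `login@<repository UUID>` when it has no author mapping). The patterns and the `suspicious_authors` helper are assumptions made for illustration and would need tuning for other repositories.

```python
import re

# A "clean" entry: a name without '<', '>' or '=', then a single address with a dotted host.
CLEAN = re.compile(r"^Author: [^<>=]+ <[^<>@\s]+@[^<>@\s]+\.[^<>@\s]+>$")
# Host part that is a bare UUID, as produced by an unmapped git-svn import.
UUID_HOST = re.compile(r"@[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}>$")


def suspicious_authors(lines):
    """Return the 'Author: ...' lines that look malformed or unmapped."""
    flagged = []
    for line in lines:
        line = line.strip()
        if not line.startswith("Author: "):
            continue
        if UUID_HOST.search(line) or not CLEAN.match(line):
            flagged.append(line)
    return flagged
```

Feeding it the sorted, de-duplicated output of the git log command above would list every entry that still needs a mapping or a manual fix.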
I fetched everything from scratch via git-svn (see the scripts if you're interested). I also introduced a few minor synthetic commits that reshuffle folders or do some cleanups so that the repository looks more sensible. An overview of what it looks like conceptually (with revision numbers and sources) is here:
As mentioned previously, I also cleaned up tags and branches (moving all current branches to tags under history/*). These (and graft tags) can be deleted of course - I left them as a reference. All releases use the release/(project)/(version) convention, again converted to a more modern, dot-separated naming scheme (SVN tags used underscores back from CVS days).
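The renaming from underscore-separated SVN tags to the release/(project)/(version) convention could be sketched in Python roughly as follows. The input tag names used here (such as `lucene_solr_4_10_0`) and the `modern_tag` helper are hypothetical; the real migration scripts may handle more tag shapes than this pattern covers.

```python
import re

# Assumed SVN tag shape: lowercase project words, then numeric version parts,
# all joined by underscores (for example "lucene_solr_4_10_0").
SVN_TAG = re.compile(r"^(?P<project>[a-z]+(?:_[a-z]+)*)_(?P<version>\d+(?:_\d+)*)$")


def modern_tag(svn_tag):
    """Map an underscore-style SVN tag to release/(project)/(version), or None."""
    match = SVN_TAG.match(svn_tag)
    if match is None:
        return None  # not a release-style tag; leave it for manual handling
    project = match.group("project").replace("_", "-")
    version = match.group("version").replace("_", ".")
    return f"release/{project}/{version}"
```

Tags that do not fit the assumed shape come back as `None` rather than being guessed at, which keeps odd historical names out of the release namespace.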
Goals:
lucene/site
lucene/nutch
lucene/lucy
lucene/tika
lucene/hadoop
lucene/mahout
lucene/pylucene
lucene/lucene.net
lucene/old_versioned_docs
lucene/openrelevance
lucene/board-reports
lucene/java/site
lucene/java/nightly
lucene/dev/nightly
lucene/dev/lucene2878
lucene/sandbox/luke
lucene/solr/nightly
lucene/java
lucene/solr
lucene/dev/trunk
lucene/dev/branches/branch_3x
lucene/dev/branches/branch_4x
lucene/dev/branches/branch_5x
Non goals:
Impossible:
r1569975) and merges from multiple branches in one commit (r940806).
Migrated from LUCENE-6933 by Dawid Weiss (@dweiss), resolved Dec 29 2015
Attachments: multibranch-commits.log
Linked issues: 7996, 7995