cstroe / svndumpapi

A Java library for manipulating a Subversion dump file.
GNU Affero General Public License v3.0

PathCollisionValidator is not very memory efficient #49

Open ghost opened 9 years ago

ghost commented 9 years ago

PathCollisionValidator will quickly choke on a large repository, because revisionSnapshots stores a map of every full path in the repository for each revision. With many tags this grows very quickly, and Java runs out of memory :)

I solved that same issue (I'm doing something very similar to help me refactor a giant repository, but for work so I can't share code ;) ) by instead using a tree structure, with each node storing its 'file' name and its revision. The big trick is that for each new revision only the root node is replaced; the sub-nodes are the nodes from the previous revision. On each add/delete/replace/modify the nodes along the path are also replaced, but no others. When adding a copy, again only the top node is new, and it links to the nodes from the copied revision. This creates a structure which holds every revision completely, in an easily traversable format, so you can perform the checks, but which is far friendlier on memory use.
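The structure described above can be sketched roughly like this (a minimal sketch with made-up names, not the commenter's actual code): nodes are immutable, and adding a path to a revision produces a new root that shares every untouched subtree with the previous revision.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the persistent tree described above. Each revision is a
// new root; only the nodes along a changed path are replaced, and all
// untouched subtrees are shared with the previous revision's tree.
final class RevNode {
    final String name;
    final Map<String, RevNode> children;

    RevNode(String name, Map<String, RevNode> children) {
        this.name = name;
        this.children = children;
    }

    static RevNode emptyRoot() {
        return new RevNode("", new HashMap<>());
    }

    // Returns a NEW root; the receiver is left untouched, so it still
    // represents the previous revision.
    RevNode add(String path) {
        return addParts(path.split("/"), 0);
    }

    private RevNode addParts(String[] parts, int i) {
        // Shallow copy: child subtrees not on the changed path are shared.
        Map<String, RevNode> copied = new HashMap<>(children);
        String head = parts[i];
        RevNode newChild;
        if (i == parts.length - 1) {
            newChild = new RevNode(head, new HashMap<>());
        } else {
            RevNode base = children.containsKey(head)
                    ? children.get(head)
                    : new RevNode(head, new HashMap<>());
            newChild = base.addParts(parts, i + 1);
        }
        copied.put(head, newChild);
        return new RevNode(name, copied);
    }

    boolean exists(String path) {
        RevNode cur = this;
        for (String part : path.split("/")) {
            cur = cur.children.get(part);
            if (cur == null) return false;
        }
        return true;
    }
}
```

Because old roots are never mutated, every past revision stays fully queryable while the memory cost per revision is only the changed path, not a full snapshot.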

Hope this makes sense, hope this helps ;)

cstroe commented 9 years ago

Thank you for the feedback @cranphin. I'm glad to hear that you find this project useful.

I understood your design from the description, and I think it's much better than the current PathCollisionValidator implementation. It should be refactored as you describe, taking special care to test the effect of a very large number of tags.

cstroe commented 9 years ago

@cranphin: Here's an update on my plan. For FileContentReplace I'm using Google's Jimfs to keep track of SVN files in memory. I'm planning to adapt that to make PathCollisionValidator more efficient.

Might need some hacks to deal with revisions, though.
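For anyone unfamiliar with Jimfs: it is Google's in-memory implementation of the `java.nio.file.FileSystem` API. A minimal sketch of using it to model one revision's paths might look like this (the SVN paths are illustrative):

```java
import com.google.common.jimfs.Configuration;
import com.google.common.jimfs.Jimfs;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.Files;
import java.nio.file.Path;

// Demo of Jimfs as an in-memory filesystem: create a directory and a file,
// then check the path exists, exactly as you would with a real filesystem.
class JimfsDemo {
    static boolean demo() throws IOException {
        try (FileSystem fs = Jimfs.newFileSystem(Configuration.unix())) {
            Path trunk = fs.getPath("/trunk");
            Files.createDirectories(trunk);
            Files.write(trunk.resolve("a.txt"),
                    "contents".getBytes(StandardCharsets.UTF_8));
            return Files.exists(fs.getPath("/trunk/a.txt"));
        }
    }
}
```

Note the limitation discussed later in this thread: the in-memory filesystem only reflects the current state, so a delete erases any trace of the path in earlier revisions.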

ghost commented 8 years ago

Awesome :) One of the things I tried to do was write a program that could generate the exact include arguments for svn's svndumpfilter, to keep only the history required for a certain project in HEAD. But I found out that's pretty much impossible for a repository with lots of history, moves, and svn copies :) (e.g. what happens when two folders which are different entities in svn have had the same path/name at some point in the history? svndumpfilter's basic path matching just isn't good enough for cases like that ;)

So I guess if I want that, I will have to skip svndumpfilter entirely and create the filtered file myself. Possibly by starting at the revision/path I want the history for, then backtracking all sources/copies back through history, keeping a list, and then creating a dump file from that list in forward order. Not that easy XD Ran out of time for now anyway, maybe someday based on your code ;)

Cheers :)

ghost commented 8 years ago

Oh, and I like the use of Jimfs. I don't know anything about it, but it sounds like an interesting way to do it ;)

cstroe commented 8 years ago

Yes, that's the hard part: the fact that a file's path depends on the revision it exists in. That forces us to track paths through revisions.

I was able to use Jimfs to get basic test cases to work, however I ran into a scenario similar to the one you describe: a file existed at some point in the past, but doesn't exist now. Since Jimfs reflects only the paths of the most recent revision, it is not a sufficient solution.

My current thinking is to create a Conditionally Linked Tree data structure. The traversal of the tree will be controlled by a condition (in our case whether that node existed at a specific revision). The tree will be maintained across revisions, and tree nodes will have a range of revisions for which they exist in the tree.

I'm sure a Conditionally Linked Tree is not a novel idea, since it's just another way of implementing custom tree traversal, but I have not been able to find an implementation that puts the conditions in the tree nodes themselves.
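A minimal sketch of that idea (names are hypothetical, not from the project): each node records the revision range in which it exists, and lookups pass a revision number that the traversal uses to decide which links to follow.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a "Conditionally Linked Tree": a node exists for revisions in
// [createdRev, deletedRev), and child lookup is filtered by revision, so one
// tree can answer queries about every revision.
final class RangedNode {
    final String name;
    final int createdRev;
    int deletedRev = Integer.MAX_VALUE; // MAX_VALUE means "still alive"
    final List<RangedNode> children = new ArrayList<>();

    RangedNode(String name, int createdRev) {
        this.name = name;
        this.createdRev = createdRev;
    }

    boolean existsAt(int rev) {
        return createdRev <= rev && rev < deletedRev;
    }

    // Find the child with this name that existed at the given revision.
    // Several same-named children may exist with disjoint revision ranges
    // (e.g. a path that was deleted and later re-added as a new entity).
    RangedNode child(String childName, int rev) {
        for (RangedNode c : children) {
            if (c.name.equals(childName) && c.existsAt(rev)) return c;
        }
        return null;
    }
}
```

Unlike the in-memory filesystem approach, deleting a path here just closes its revision range, so queries against older revisions still see it.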

ghost commented 8 years ago

Well, I'm mostly where I want to be with my app :) For 'storing' files I chose a pretty space-efficient method: in memory I just store the position and length in the original dump file, and then use a RandomAccessFile to get the files when I need them, based on those :) Of course this only works as long as the original file doesn't change. Before that I used an embedded Apache Derby instance; that works too, just use some simple SQL and blobs :)
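The offset+length approach described above might look roughly like this (a minimal sketch with made-up names, not the commenter's code): only a `(offset, length)` pair is held in memory per file, and the bytes are re-read from the original dump on demand.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Sketch of a content store that keeps no file bodies in memory: each file is
// remembered only as its byte offset and length inside the original dump
// file, and a RandomAccessFile re-reads the bytes when they are needed.
// This assumes the original dump file is never modified.
final class DumpContentStore {
    private final RandomAccessFile dump;

    DumpContentStore(Path dumpFile) throws IOException {
        this.dump = new RandomAccessFile(dumpFile.toFile(), "r");
    }

    byte[] read(long offset, int length) throws IOException {
        byte[] buf = new byte[length];
        dump.seek(offset);
        dump.readFully(buf);
        return buf;
    }

    void close() throws IOException {
        dump.close();
    }
}
```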

For marking and storing all changes needed to keep the history for a certain path at a certain revision, I just keep a list of each revision and each change, and then walk backwards through both the changes and revisions. I keep a list of 'roots' I'm interested in (each only contains a revision and path); each copy adds a new root (or multiple, if some roots fall underneath the copied-to path), each add deletes a root (since it didn't exist in the repo before the add), and each change matching a root is tagged to be included in the output. The hard part is in the details: I actually keep two kinds of roots, ones which we want the subnodes of, and ones that are only parent paths (so the parent paths are also included :) ). I'll add a snippet of just the Analyzer; if you want to do something like that, it'll help a lot ;) analyzer.txt
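The backward pass described above can be sketched roughly like this (an exact-path-match simplification with hypothetical names; the real analyzer must also match path prefixes and keep the two kinds of roots described):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hedged sketch of the backward analysis: walk changes newest-to-oldest,
// keeping a set of "roots" (paths whose history we want). A copy retargets
// interest to the copy source; an add drops the root, because the path did
// not exist before it. Matching changes are kept, in forward order.
final class HistoryAnalyzer {
    record Change(int rev, String kind, String path, String copyFrom) {}

    static List<Change> keep(List<Change> changes, String wantedPath) {
        Set<String> roots = new HashSet<>();
        roots.add(wantedPath);
        List<Change> kept = new ArrayList<>();
        for (int i = changes.size() - 1; i >= 0; i--) {
            Change c = changes.get(i);
            if (!roots.contains(c.path())) continue;
            kept.add(0, c); // prepend, so output ends up in forward order
            if (c.kind().equals("copy")) {
                roots.add(c.copyFrom()); // history continues at the source
                roots.remove(c.path());
            } else if (c.kind().equals("add")) {
                roots.remove(c.path()); // nothing to follow before an add
            }
        }
        return kept;
    }
}
```

With a history `add trunk, modify trunk, copy trunk -> tags/v1`, asking for `tags/v1` keeps all three changes and ignores unrelated paths.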