dsuni / svndumpsanitizer

A program aspiring to be a more advanced version of svndumpfilter
https://miria.homelinuxserver.org/svndumpsanitizer/
GNU General Public License v3.0

dumpsanitizer stuck #7

Closed. Cypselus closed this issue 9 years ago.

Cypselus commented 10 years ago

Hi, I'm having trouble sanitizing one of my repositories (57,000 revisions), with about 50 paths in the "include" parameter.

Sanitizing the stripped repository dump works pretty well:

$ time ./svndumpsanitizer --infile stripped --outfile sripped.sanitized -d --include trunk ...
Step 7/7: Adding revision deleting surplus nodes... NOT NEEDED
real 1m49.972s
user 1m48.283s
sys 0m0.540s

But the original full dump has now been processing for about 37 hours at 100% CPU usage:

Step 1/7: Reading the infile... OK
Step 2/7: Removing unwanted nodes...

It's not the biggest repository I've filtered so far:

7.2G dump
455M stripped
95M stripped.sanitized

I have not found a direct contact email, unfortunately. Please contact me if you are interested in any additional information. Looking forward to hearing from you, Evgeny
dsuni commented 10 years ago

Hmm... Not sure if I'm quite following what the problem is here.

The "NOT NEEDED" message isn't an error, it just means that step 7 was not performed because it wasn't needed. This particular step may or may not be needed depending on repository structure. Svndumpsanitizer will itself determine whether it's needed or not, and inform the user of whether it was performed or not.

As for the time it takes, that will vary a lot depending on repo size, number of parameters provided and overall complexity of the repository. For more info/documentation you can visit the project home page: http://miria.homelinuxserver.org/svndumpsanitizer

Cypselus commented 10 years ago

I understand that; I had sanitized dozens of repositories before running into this issue. The problem is that I'm stuck sanitizing one particular repository: it's been running for 1915 minutes now and is still on the second step.

On the other hand, I can filter it with dumpstrip and then sanitize the same repository in 2 minutes.

It seems to me that the problem is not in the structure of the files or revisions. Maybe some user data in the original dump is being misinterpreted; maybe it incorrectly treats some user data as metadata.

Unfortunately I'm not well enough versed in C to add debug messages to your code.

dsuni commented 10 years ago

Ok... That sounds brutal indeed. I've never had to run svndumpsanitizer myself with that many parameters. Step 2 involves some nested for loops, so it's not surprising that adding a lot of parameters causes a performance hit, but for it to take over 24 hours... basically I'm not sure whether that's normal or not, because I've never had to do anything that extreme. :-)

If you want to know whether it's gotten stuck or not, the simplest way is to add a print statement in the outermost loop. You're using includes, so inserting this between lines 518 & 519 would do the trick (i.e. give you a revision progress countdown to 0):

fprintf(messages, "%d\n", i); fflush(messages);

That way you can at least see whether progress is happening or not. Logically, it should be, because they are for loops with simple integer incrementation or decrementation, so I don't see how it could possibly get stuck in an infinite loop there. There is a while loop inside, but barring a string of infinite length, I don't see how that could get stuck either.
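Roughly speaking, the shape of the loop nesting is something like the sketch below, with the print added at the top of the outermost loop. This is just an illustrative sketch, not the actual svndumpsanitizer code; the struct layout and the mark_wanted_nodes name are invented for the example, but it shows why the work grows with revisions x nodes x include paths:

/* Illustrative sketch only; not the real svndumpsanitizer source.
 * The structs and the function name are invented for this example. */
#include <stdio.h>
#include <string.h>

typedef struct {
	char *path;
	int wanted;
} node;

typedef struct {
	node *nodes;
	int size;
} revision;

void mark_wanted_nodes(revision *revisions, int rev_len, char **include, int inc_len, FILE *messages)
{
	int i, j, k;
	for (i = rev_len - 1; i >= 0; --i) {       /* outermost loop: one pass per revision */
		fprintf(messages, "%d\n", i);      /* the suggested progress countdown to 0 */
		fflush(messages);
		for (j = 0; j < revisions[i].size; ++j) {       /* every node in the revision... */
			for (k = 0; k < inc_len; ++k) {         /* ...checked against every include path */
				if (strncmp(revisions[i].nodes[j].path, include[k], strlen(include[k])) == 0) {
					revisions[i].nodes[j].wanted = 1;
					break;
				}
			}
		}
	}
}

With roughly 57,000 revisions and 50 include paths, the innermost comparison runs an enormous number of times, so the wall-clock time can blow up even though each individual iteration is cheap.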

Dumpstrip can't really compare. It doesn't actually do any sanitizing. It just strips all the data out, leaving the metadata. This is always a linear operation O(n), and will not take particularly long even on a monster repo.
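For contrast, a strip pass is conceptually just a single sweep over the file, something like the sketch below. This is not dumpstrip's actual code and it simplifies the dump format considerably, but it shows why one pass over n bytes stays O(n) regardless of how complex the repository is:

/* Conceptual sketch of a linear strip pass; NOT dumpstrip's real code.
 * Reads a (simplified) dump on stdin, copies headers and properties to
 * stdout and discards the file text. */
#include <stdio.h>

static void pass_bytes(FILE *in, FILE *out, long n)
{
	char buf[8192];
	while (n > 0) {
		size_t want = n < (long)sizeof(buf) ? (size_t)n : sizeof(buf);
		size_t got = fread(buf, 1, want, in);
		if (got == 0)
			break;                       /* truncated input; just stop */
		if (out)
			fwrite(buf, 1, got, out);    /* either copy the bytes... */
		n -= (long)got;                      /* ...or silently drop them */
	}
}

int main(void)
{
	char line[4096];
	long prop_len = 0, text_len = 0, v;
	while (fgets(line, sizeof(line), stdin)) {
		if (sscanf(line, "Prop-content-length: %ld", &v) == 1)
			prop_len = v;
		else if (sscanf(line, "Text-content-length: %ld", &v) == 1)
			text_len = v;
		fputs(line, stdout);                         /* metadata lines are kept as-is */
		if (line[0] == '\n') {                       /* blank line ends a header block */
			pass_bytes(stdin, stdout, prop_len); /* keep the properties */
			pass_bytes(stdin, NULL, text_len);   /* drop the file contents */
			prop_len = text_len = 0;
		}
	}
	return 0;
}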

Cypselus commented 10 years ago

Ok, to start from the beginning... it is still running. 3577 minutes.

I added the lines and started a new process. I can now see that some revisions take about a minute each; for instance, it spent roughly a minute filtering each of revisions 48859, 48837, 48809, 48762 and 48747, and less than a second on the ones in between. So there is progress, but no way to estimate the total time. :) It's better than nothing.

And the other story, about dumpstrip. I see that my previous explanation was not clear. I thought that dumpstrip produces exactly the same file/revision structure, just without the user data. With that assumption I stripped my repository dump and started svndumpsanitizer with the stripped dump as the source, and got a result in 1 minute 49 seconds. That's why I opened this issue.

Today I did exactly the same thing as a test. I ran the patched svndumpsanitizer with the stripped repository dump as the source and saw a significant difference in progress compared to the non-stripped file: Step 2 started counting down from revision 3105, instead of from 52159 with the non-stripped dump.

Step 1/7: Reading the infile... OK
Step 2/7: Removing unwanted nodes...
3105
3104

It's interesting that the result nevertheless contains more than 3105 revisions:

grep -c Revision-number *

dump:52159
stripped:52159
stripped.sanitized:32887

dsuni commented 9 years ago

Nothing on this for a couple of months. Assuming case closed.