Piping of input data - GitHub Issues

dpetzen commented 11 years ago

Hello Daniel

Let me first thank you for an excellent tool. It has helped me resolve several really tricky data operations on our Subversion repositories that simply didn't work with svndumpfilter. Great work and thanks for your efforts!

So, I've spent some time looking into your code and made a few feeble attempts to allow svndumpsanitizer to read from stdin, but even though I managed to create a wrapper for fseek(), the rewind() call on line 662 (v1.0.2) is a bit more tricky.

I've looked at how svndumpfilter solves it, but the code to handle stdin is buried deeply inside the APR library.

We have huge repositories where we need to compress input and output data to avoid running out of disk space, so this would be a brilliant new functionality for us.

It would be great if you would consider looking into this. I'm happy to help as much as I can. I know my way around in C, even though I may be a bit rusty. I've tried quite a few things to buffer stdin that I'm happy to share with you if you think it helps.

Regards, Daniel Petzen

dsuni commented 11 years ago

This is something I gave a lot of thought to right when I started to design this software. I agree that being able to read from stdin would be a nice feature, but if there is a way to do it, I sure haven't figured it out*.

The reason is that svndumpsanitizer needs the data twice, because it basically works like this:
1) Read & store metadata in memory.
2) Analyze metadata and decide which parts to keep, and which to discard.
3) Read data again, this time writing the parts that should be kept to the outfile.
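The two-pass flow above can be sketched roughly as follows. This is only an illustration, not svndumpsanitizer's actual code: the function names count_revisions and two_pass_copy are made up, pass 1 merely counts "Revision-number:" headers instead of storing real metadata, and pass 2 copies everything instead of filtering. The point is the fseek() back to the start between the passes, which succeeds on a regular file but fails on a pipe.

```c
#include <stdio.h>
#include <string.h>

/* Pass 1 (hypothetical helper): scan the input and count revision headers.
   "Revision-number:" starts each revision record in a Subversion dump.
   A real parser would also honor content lengths so that file contents
   are never mistaken for headers; this sketch does not. */
static long count_revisions(FILE *in)
{
    static const char header[] = "Revision-number:";
    char line[4096];
    long count = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (strncmp(line, header, sizeof header - 1) == 0)
            count++;
    }
    return count;
}

/* Runs both passes. Returns the number of revisions seen in pass 1,
   or -1 if the input cannot be repositioned or a write fails. */
long two_pass_copy(FILE *in, FILE *out)
{
    char buf[4096];
    size_t n;
    long revisions = count_revisions(in);      /* pass 1 */

    /* Going back to the start is the step a pipe cannot do. */
    if (fseek(in, 0L, SEEK_SET) != 0)
        return -1;

    /* Pass 2: read the same data again and write it out. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }
    return revisions;
}
```

With a regular file as input both passes run; with stdin connected to a pipe the fseek() call fails and the function returns -1, which is the limitation discussed in this thread.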

How does svndumpfilter handle this? In two words: It doesn't. That's why it (in my experience) works so poorly. The problem with filtering out parts of a subversion dumpfile is that you either need to predict (=guess) the future (svndumpfilter approach) or read the data twice, and thus know the future, because you've already seen it (svndumpsanitizer approach). It should come as no surprise that knowing works better than guessing.

A brief example may serve to illustrate this. Picture the following repository:
Revision 0: (Default empty revision)
Revision 1: Directory "topsecret" added. Files "nsa_prism.txt" and "nothing_to_see_here_move_along.txt" are added under topsecret.
Revision 2: Directory "full_disclosure" added, and file "think_of_the_children.txt" added under it.
Revision 3: File nothing_to_see_here_move_along.txt is moved to directory full_disclosure.

Now let's assume we want to filter this repository and only keep the stuff in the full_disclosure directory. We do this first with svndumpfilter, and then svndumpsanitizer.

Svndumpfilter does this:
Revision 0: (Default empty revision)
Revision 1: Empty revision. (After all we only want to keep the full_disclosure dir)
Revision 2: Directory "full_disclosure" added, and file "think_of_the_children.txt" added under it.
Revision 3: Tries to move the file nothing_to_see_here_move_along.txt but craps out with an error message, because it doesn't exist. (Svndumpfilter made an incorrect guess about keeping it in revision 1.)

Svndumpsanitizer does this:
Revision 0: (Default empty revision)
Revision 1: Directory "topsecret" added. File "nothing_to_see_here_move_along.txt" is added under topsecret. (These are kept, because svndumpsanitizer has "seen the future" and knows they will be needed.)
Revision 2: Directory "full_disclosure" added, and file "think_of_the_children.txt" added under it.
Revision 3: File nothing_to_see_here_move_along.txt is moved to directory full_disclosure.
Revision 4: Directory topsecret is deleted.
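The "seen the future" idea in this example can be sketched as a marking pass over metadata that is already in memory. Everything here is an assumption made for illustration: struct node, under() and mark_needed() are hypothetical and far simpler than whatever svndumpsanitizer really stores, and the sketch leaves out the extra revision that deletes topsecret at the end.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified record of one added node. Nodes are assumed
   to be listed in revision order. */
struct node {
    int revision;
    const char *path;      /* path added in this revision */
    const char *copyfrom;  /* source path if copied or moved, else NULL */
    int keep;              /* set by mark_needed() */
};

/* True if path equals prefix or lies somewhere below it. */
static int under(const char *path, const char *prefix)
{
    size_t n = strlen(prefix);
    return strncmp(path, prefix, n) == 0 &&
           (path[n] == '\0' || path[n] == '/');
}

/* Marks every node under `include`, plus any earlier node that a kept
   node is copied from (and that source's parent directories).
   Returns the number of nodes kept. */
int mark_needed(struct node *nodes, int count, const char *include)
{
    int i, j, changed, kept = 0;

    for (i = 0; i < count; i++)
        nodes[i].keep = under(nodes[i].path, include);

    /* Repeat until stable, since a kept source may itself be a copy. */
    do {
        changed = 0;
        for (i = 0; i < count; i++) {
            if (!nodes[i].keep || nodes[i].copyfrom == NULL)
                continue;
            for (j = 0; j < i; j++) {
                if (!nodes[j].keep &&
                    under(nodes[i].copyfrom, nodes[j].path)) {
                    nodes[j].keep = 1;
                    changed = 1;
                }
            }
        }
    } while (changed);

    for (i = 0; i < count; i++)
        kept += nodes[i].keep;
    return kept;
}
```

Fed the six nodes of the example with "full_disclosure" as the include path, this keeps topsecret and nothing_to_see_here_move_along.txt from revision 1 but not nsa_prism.txt. It can only do so because revision 3 is already known when revision 1 is judged, which is why the data has to be read twice.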

I think you can see the problem by now... The only way I could think of getting around this would be adding an option that would first store the analyzed metadata in a file, and then the option of sanitizing the stream based on the content of that file... You would also need to feed the data to svndumpsanitizer twice, so it's not a very pretty solution.

dpetzen commented 11 years ago

Yes, that makes perfect sense. I thought that was the case. Thanks for taking the time to explain it in detail.

I've pondered it for a bit and my old idea, your idea and a new idea would be:

1. Buffer stdin: The only obvious option here is to read all of stdin into memory, but it partially defeats the purpose, as it'll be the full uncompressed size that is read into memory (not just a memory mapped file), which will cause a lot of disk activity if it's 150GiB.
2. Save the metadata (as you mentioned): I had a quick look and apart from being a bit cumbersome as you say, writing the "revisions" struct to disk will require logic to read a binary struct file back into memory, which feels like a lot of work for something that feels a bit like a quick-and-dirty fix.
3. Add compression: I had a look at zlib to see if it would be easy to add support for compressed files. I must admit that the documentation and example code were scary. In addition to this, it'll also make the wonderfully clean and independent svndumpsanitizer code reliant on external libs etc.

I think the only "proper" solution is native compression support. Do any of these solutions appeal to you? I'm happy to help out, even though I'm a bit time limited as I have quite a few support tasks to juggle at the same time right now.

dsuni commented 11 years ago

Truthfully none of these options really appeal to me. :-(

1. Buffering is only useful for small repositories. But if your repository is small, then you probably have enough space to not need the feature at all. A big repository probably simply couldn't be cached. (Some people have effing HUGE repositories. The biggest one I've heard about someone using svndumpsanitizer on was >700GiB.)
2. We both agree that this is a rather hackish and not very pretty approach.
3. Well, you already mentioned 2 reasons for not liking it. Added complexity, and library dependency. Doing the compression natively would also be a lot of work.

One thing that could be done with relative ease, would be to comment out the lines that print the progress, and make svndumpsanitizer write to stdout instead of the outfile. That way you could compress the output simply by piping it through gzip. This could even be permanently implemented relatively cleanly & easily, but would of course only solve half of the problem (i.e. the output).

Would it be unacceptable in this case to "just throw hardware at the problem"? A 3TB drive doesn't cost a fortune these days. Or is this one of those cases where the server is on another continent, and the only way to get storage added is to go through 2 layers of management, and 2 additional layers of outsourced barely sentient "technical" staff who're still trying to figure out how it's possible that the server runs when it doesn't have Windows on it? ;-)

dpetzen commented 11 years ago

I wrote a long comment that I just noticed had disappeared. I need to dash now, but I'll follow up tomorrow morning!

dpetzen commented 11 years ago

Hi Daniel

I'm sorry, but it got really busy, both at work and at home. I do appreciate your quick and elaborate response, so I was hoping to be a bit quicker with my responses than I've been so far.

So, to try to recreate my lost comment; it went something like this:

The idea of writing to stdout is actually quite interesting. A suggestion would be to direct the progress output to stderr, as it's quite nice and useful. That would allow you to pipe stdout to a file and still have the progress information.

Writing to stdout would also allow you to pipe the data directly into "svnadmin load ...", which would save a bit of space.
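As a rough sketch of that split (data on one stream, progress on another), something like the following would do. It is not the change that ended up in svndumpsanitizer: copy_with_progress is a made-up name, and the sketch copies the input unchanged rather than sanitizing it.

```c
#include <stdio.h>

/* Copies everything from `in` to `data_out` and reports a running byte
   count on `progress_out`. Returns the number of bytes copied, or -1 if
   a write to `data_out` fails. */
long copy_with_progress(FILE *in, FILE *data_out, FILE *progress_out)
{
    char buf[8192];
    size_t n;
    long total = 0;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, data_out) != n)
            return -1;
        total += (long)n;
        /* Progress goes to its own stream, so it never ends up in the
           dump data. The carriage return rewrites the same line. */
        fprintf(progress_out, "\r%ld bytes written", total);
    }
    fprintf(progress_out, "\n");
    return total;
}
```

Called as copy_with_progress(in, stdout, stderr), the dump data can be piped into a compressor or into svnadmin load while the progress line still shows up on the terminal.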

I'm happy to make these changes and send you a patch, but I'm guessing you'd rather do it yourself than having someone messing with your code.

With regard to disk space: it's the good old story with enterprise storage, where the infrastructure is huge and has quite sophisticated redundancy etc. That said, the SAN chaps here are brilliant and really flexible, so it's normally not too much of a problem. The bottleneck right now is justifying loads of storage in our staging environment, where we always carry out the testing prior to performing the change on production.

dsuni commented 11 years ago

I've implemented this feature in version 1.2.0.

dpetzen commented 11 years ago

Brilliant, thanks!

I've downloaded it, compiled it and tested it. It looks fine.

I'm getting our support chaps to use this new version on the large repo to see if it makes it easier for them.

I really appreciate your help with this.

On another note, I've got to try out Bashtris. I've done a few mad things in Bash myself, but that is way beyond anything I've done.

On yet another note, I saw that "Swedish and Finnish" had been added to your nonogram-qt project changelog. I'm Swedish myself, so I can't help getting curious about things relating to the Swedish language etc.

dsuni / svndumpsanitizer

Piping of input data #1