dsuni / svndumpsanitizer

A program aspiring to be a more advanced version of svndumpfilter
https://miria.homelinuxserver.org/svndumpsanitizer/
GNU General Public License v3.0

Corrupted file contents #5

Closed trentfisher closed 10 years ago

trentfisher commented 10 years ago

I was splitting a large repository, but the resulting dump file wouldn't load due to a checksum error. I looked at the original dump file and the resulting dump file, and found that the file contents in the latter were truncated. The file was binary, but not very big (489 bytes).

I've been going through the code, but have yet to find the source of this bug. I tried to increase the values of cur_max and INCREMENT in case it was a buffer length problem, but no luck.

What's worse is that it only occurs in situ. If I extract that one file, put it into a new repository, and run svndumpsanitizer, it works properly.

If anybody has any suggestions for tracking this down, I'm all ears.

dsuni commented 10 years ago

It's very hard to say what the issue could be with just this information. By how much was the file truncated?

The only value that should directly affect file truncation is CONTENT_PADDING, but changing that is not likely to produce desirable results. (Or to be more specific, it will mess up normal operations, that actually work as expected.)

trentfisher commented 10 years ago

The file should be 489 bytes, the truncated one is missing the last 53 bytes. The first two bytes in the missing section are nulls, but there are numerous nulls in the included section (so nulls are not the problem). My current hunch is that printf (at line 825) is messing with the binary data, so I changed the code to use fwrite instead, but that mangled things worse. I'm setting it aside for today.

dsuni commented 10 years ago

Odd indeed, but the file contents should not be written by the printf@825, but rather by the fputc@792. Is the Content-length value correct in the dump file? That value is parsed on line 788, and an incorrect con_len value and the constant CONTENT_PADDING (which is 2) are the only things that could, in my opinion, cause file truncation; the constant is clearly not to blame.

If it's the con_len value, the question is "Why this specific file?", and the only reason I can come up with (without seeing the file) is that the dump file could have an incorrect value for this particular revision/node.

trentfisher commented 10 years ago

You're right, I was looking at the wrong part of the code. I looked at the code in question and I think I found the problem: the "Content-length:" header was not the last one. I had run the dump through a (homegrown) filter which inadvertently changed the header order. I redid the dump (without my filter), re-ran svndumpsanitizer, and it worked far better (though it still failed for a different reason, which I am now researching).

The interesting thing to note is that I use the same filter quite often, and have not had any problems, so I think that assuming "Content-length" is at the end of the header block is not, strictly speaking, valid. But if you want to close this, or change it to an "enhancement request", that's fine.

BTW, thanks for writing this tool!

dsuni commented 10 years ago

I figured there had to be something about the dump file... As for the parsing, I'm not sure about exactly how svn does it. It surprises me a bit to find out that not having "Content-length:" last would work.

I wonder what would happen if one added a text file containing an appropriate excerpt of the contents of another svn dump to the repo, and didn't have "Content-length:" last. I thought about that when writing svndumpsanitizer, and concluded that the only way I could tell "bogus" info like that from real info was by assuming that the N bytes following "Content-length: N" were data rather than metadata.

I think I'll close this one for now. I might look into this more later if I can find the time.