Closed thoward closed 1 year ago
Thanks for pointing this out! Indeed, I've left binary diffing unhandled all together. What sort of result would you be interested in seeing while going through binary files? E.g., for the first, maybe return the header and add a "is_binary" flag to the diff
object, with no output otherwise?
Well, for my use case, I'm doing some deep git log analysis, which needs to parse each diff and dig into the specifics of the changes. So, I want to get whatever information is available for each patch, even if I can't get a full set of changes. File mode changes, renames, binary files, etc, these are all relevant to me.
So I'd like to see the library provide its best guess at whatever info is available. In the case of binary files, at minimum it could surface a header with filenames, mode changes, etc. Since git logs can produce either a full binary diff or a stub message, I think an is_binary
flag would be appropriate and then perhaps extract the binary blob and put into a single line of 'change' (as an MVP). In the full binary patch, the data is actually a diff format, once decoded, and is probably documented in the git project somewhere. In theory, it could be parsed and a set of changes could be derived (with byte offsets in the first two field of the change tuple, instead of line numbers, and binary data in the third field).
I'm not sure how this maps to the other diff formats the library supports, so it's hard to know if the general header tuple should have varied properties depending on the format, or if is_binary
should be added to all header tuples for all patch formats.
According to this reference: https://svn.apache.org/repos/asf/subversion/branches/SVNParentPathTemplate/notes/svnpatch/svnpatch-git.txt
... the full binary format is a base85 encoded, zlib compressed data blob, which I assume is the full contents of the file (so no real "diff"). I can take a stab at trying to decode that, to see if that is true also for changes to a file, or just for creation.
Yeah, the main goal of this library is breadth. This project was created from the need to process submitted patches from any source where the patch type isn't known. Have you checked out using dulwich for your needs? I've used it for heavier, git-specific work. I totally agree, though, small binaries are checked in everywhere and should at least be handled as much as possible.
guys, I implemented support of git binary files here https://github.com/cscorley/whatthepatch/pull/6
Although #6 at least gets support for listing the change, going to leave this open for git diff --binary
.
@cscorley what exactly do you mean by:
going to leave this open for git diff --binary
can you also update the title accordingly?
@graingert, currently it only parses the headers, it does not handle full binary diffs.
I have a diff from git that looks like this:
However, this is failing it's header parsing because in
parse_git_header
(https://github.com/cscorley/whatthepatch/blob/master/whatthepatch/patch.py#L186) there is a call tofindall_regex
looking to match any line in the diff togit_header_old_line
, which matches on lines that are prefixed by---
. Since this diff doesn't have that pattern in it, it doesn't attempt to parse the rest of the diff header.While it's true that committing binary files to git is a bit of an anti-pattern, small images like this are a common thing in git repos. Git is also capable of producing binary diffs by adding the
--binary
flag, which creates something more like this:Not sure what the details of the values at
literal
are, but it seems to the the byte length of the decoded/uncompressed binary blob, and that final trailing bitHcmV?d00001
is a complete mystery. This may be documented somewhere, but it's not super obvious from a Google search.Anyhow, both of those format fail header detection. I'm currently bypassing this by detecting/converting these lines into normal multi-line git file headers... but that's a PIA. :)