cscorley / whatthepatch

What The Patch!? -- A Python patch parsing library
MIT License
64 stars 24 forks source link

Parse full patches using git diff --binary #4

Closed thoward closed 1 year ago

thoward commented 8 years ago

I have a diff from git that looks like this:

diff --git a/foo.png b/foo.png
new file mode 100644
index 0000000..f91f164
Binary files /dev/null and b/foo.png differ

However, this is failing it's header parsing because in parse_git_header (https://github.com/cscorley/whatthepatch/blob/master/whatthepatch/patch.py#L186) there is a call to findall_regex looking to match any line in the diff to git_header_old_line, which matches on lines that are prefixed by ---. Since this diff doesn't have that pattern in it, it doesn't attempt to parse the rest of the diff header.

While it's true that committing binary files to git is a bit of an anti-pattern, small images like this are a common thing in git repos. Git is also capable of producing binary diffs by adding the --binary flag, which creates something more like this:

diff --git a/foo.png b/foo.png
new file mode 100644
index 0000000000000000000000000000000000000000..bac462fbda4b111ad571e7d8d4326e2e5f0c9365
GIT binary patch
literal 89791
<snip: tons of encoded binary data>
literal 0
HcmV?d00001

Not sure what the details of the values at literal are, but it seems to the the byte length of the decoded/uncompressed binary blob, and that final trailing bit HcmV?d00001 is a complete mystery. This may be documented somewhere, but it's not super obvious from a Google search.

Anyhow, both of those format fail header detection. I'm currently bypassing this by detecting/converting these lines into normal multi-line git file headers... but that's a PIA. :)

cscorley commented 8 years ago

Thanks for pointing this out! Indeed, I've left binary diffing unhandled all together. What sort of result would you be interested in seeing while going through binary files? E.g., for the first, maybe return the header and add a "is_binary" flag to the diff object, with no output otherwise?

thoward commented 8 years ago

Well, for my use case, I'm doing some deep git log analysis, which needs to parse each diff and dig into the specifics of the changes. So, I want to get whatever information is available for each patch, even if I can't get a full set of changes. File mode changes, renames, binary files, etc, these are all relevant to me.

So I'd like to see the library provide its best guess at whatever info is available. In the case of binary files, at minimum it could surface a header with filenames, mode changes, etc. Since git logs can produce either a full binary diff or a stub message, I think an is_binary flag would be appropriate and then perhaps extract the binary blob and put into a single line of 'change' (as an MVP). In the full binary patch, the data is actually a diff format, once decoded, and is probably documented in the git project somewhere. In theory, it could be parsed and a set of changes could be derived (with byte offsets in the first two field of the change tuple, instead of line numbers, and binary data in the third field).

I'm not sure how this maps to the other diff formats the library supports, so it's hard to know if the general header tuple should have varied properties depending on the format, or if is_binary should be added to all header tuples for all patch formats.

thoward commented 8 years ago

According to this reference: https://svn.apache.org/repos/asf/subversion/branches/SVNParentPathTemplate/notes/svnpatch/svnpatch-git.txt

... the full binary format is a base85 encoded, zlib compressed data blob, which I assume is the full contents of the file (so no real "diff"). I can take a stab at trying to decode that, to see if that is true also for changes to a file, or just for creation.

cscorley commented 8 years ago

Yeah, the main goal of this library is breadth. This project was created from the need to process submitted patches from any source where the patch type isn't known. Have you checked out using dulwich for your needs? I've used it for heavier, git-specific work. I totally agree, though, small binaries are checked in everywhere and should at least be handled as much as possible.

ramusus commented 8 years ago

guys, I implemented support of git binary files here https://github.com/cscorley/whatthepatch/pull/6

cscorley commented 8 years ago

Although #6 at least gets support for listing the change, going to leave this open for git diff --binary.

graingert commented 7 years ago

@cscorley what exactly do you mean by:

going to leave this open for git diff --binary

can you also update the title accordingly?

cscorley commented 5 years ago

@graingert, currently it only parses the headers, it does not handle full binary diffs.