Wilfred / difftastic

a structural diff that understands syntax 🟥🟩
https://difftastic.wilfred.me.uk/
MIT License

Binary files being displayed as lines of text #687

Closed. blackfawn closed this issue 3 months ago

blackfawn commented 3 months ago

Some binary files, which happen to be shared objects stored in LFS, are displayed as "Text" instead of "Binary" by difftastic. This makes difftastic very difficult to use, because it shows endless lines of binary data as changed lines of text. Running file -i $FILE on these files returns application/x-sharedlib; charset=binary, so file itself identifies them correctly as binary.

I see difftastic only flags PDFs and ZIPs as binary within the application/* MIME/media types: https://github.com/Wilfred/difftastic/blob/master/src/files.rs#L162-L167

I believe the large majority of application/* types are binary formats, so it might make sense to invert the application/* logic: treat anything under application/* as binary unless it is specifically application/json or application/ld+json (and perhaps a few others).
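To make the inverted rule concrete, here is a minimal sketch. It is not difftastic's actual code: the function name mime_looks_binary and the constant TEXTUAL_APPLICATION_TYPES are hypothetical, and the allowlist contains only the two types named above, so a real list would likely need more entries.

```rust
/// Hypothetical allowlist of application/* types that are really text.
/// Only the two types named in this issue are listed; others are left out.
const TEXTUAL_APPLICATION_TYPES: &[&str] = &["application/json", "application/ld+json"];

/// Hypothetical helper: returns true when a MIME string should be treated
/// as binary under the proposed "application/* is binary by default" rule.
fn mime_looks_binary(mime: &str) -> bool {
    // Drop parameters such as "; charset=binary" and normalise the case.
    let essence = mime
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    // Anything under application/ counts as binary unless it is allowlisted.
    essence.starts_with("application/")
        && !TEXTUAL_APPLICATION_TYPES.contains(&essence.as_str())
}
```

With this rule, the application/x-sharedlib; charset=binary string reported by file -i would be classified as binary, while application/json would still be diffed as text. Types outside application/* are left to whatever other checks apply.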

Alternatively, the first 4 bytes could be checked and anything that is 0x7f 0x45 0x4c 0x46 (the "magic number" present at the start of an ELF header) could at least be treated as binary to take care of ELF files.
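The magic-number check could look roughly like the following sketch. The names ELF_MAGIC and starts_with_elf_magic are hypothetical and are not taken from difftastic's source; the four bytes are the ones quoted above.

```rust
/// The four bytes at the start of every ELF header: 0x7f followed by "ELF".
const ELF_MAGIC: [u8; 4] = [0x7f, 0x45, 0x4c, 0x46];

/// Hypothetical helper: returns true when the buffer begins with the ELF
/// magic number. Buffers shorter than four bytes never match.
fn starts_with_elf_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(&ELF_MAGIC)
}
```

Only the first four bytes of the file are needed, so the check is cheap, but it covers ELF files only and says nothing about other binary formats.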

difft --version
Difftastic 0.56.1 (built with rustc 1.70.0)

Wilfred commented 3 months ago

Yeah, being stricter about application/ seems like a reasonable heuristic.

Difftastic also tries to guess whether a file looks like valid text by trying to decode it as UTF-8 and UTF-16, which should catch cases like these. Could you share a sample file, or set the environment variable DFT_LOG=trace (for example with export DFT_LOG=trace) and report what it says?
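For readers unfamiliar with this kind of check, a decode-based heuristic can be sketched as below. This is an illustration under stated assumptions, not difftastic's implementation: the function name looks_like_text is hypothetical, the UTF-16 branch assumes little-endian input without a byte order mark, and a real detector would probably tolerate a small fraction of invalid bytes instead of requiring a perfect decode.

```rust
/// Hypothetical helper: guesses whether a byte buffer is text by attempting
/// a strict UTF-8 decode first, then a strict UTF-16 (little-endian) decode.
fn looks_like_text(bytes: &[u8]) -> bool {
    // Strict UTF-8: any invalid byte sequence rejects the whole buffer.
    if std::str::from_utf8(bytes).is_ok() {
        return true;
    }

    // UTF-16 code units are two bytes wide, so an odd length cannot decode.
    if bytes.len() % 2 != 0 {
        return false;
    }

    // Assumption: little-endian byte order, no byte order mark handling.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();

    // Fails only on unpaired surrogates, so this branch is quite permissive.
    String::from_utf16(&units).is_ok()
}
```

The UTF-16 branch accepts most even-length inputs, which is one way a strict decode test alone can let a binary file through and why a MIME or magic-number check is a useful complement.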

Wilfred commented 3 months ago

This was probably fixed in c6da85759c0c3acb0c57abf62e7ffae4621d878a, but feel free to reopen if it recurs.