ReFirmLabs / binwalk

Firmware Analysis Tool
MIT License

Making entropy work with python3 #550

Open DMaroo opened 3 years ago

DMaroo commented 3 years ago

Firstly, importing numpy is necessary (see #543) for binwalk to be able to draw a graph.

Secondly, using bytes2str doesn't work with python3 (v3.9), since it returns an object with the type unicode_type (a type recognized by numba), and there is no overload for np.frombuffer(unicode_type, dtype). So, we need to convert data to a UTF-8 string before passing it to the _shannon_numpy function. This way, no exceptions or errors are thrown.
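The type mismatch is easy to reproduce outside binwalk. A minimal sketch in plain NumPy (not binwalk's actual `_shannon_numpy` code): `np.frombuffer` accepts a bytes-like object but rejects a Python `str`, which is the kind of value a `bytes2str`-style helper returns on Python 3:

```python
import numpy as np

data = b"\x00\x01\xfe\xff"

# A bytes-like object is fine: each byte becomes one uint8 element.
arr = np.frombuffer(data, dtype=np.uint8)
print(arr.tolist())  # [0, 1, 254, 255]

# A str (what a bytes2str-style helper yields on Python 3) is rejected
# with a TypeError, which is why the data must be converted back to
# bytes before it reaches np.frombuffer.
try:
    np.frombuffer("some decoded string", dtype=np.uint8)
except TypeError as exc:
    print("rejected:", exc)
```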

This fixes #543. Tested on Kali 2021.1 and Python3.9.2.

spinkham commented 3 years ago

Hmm... This doesn't seem right, but I haven't looked deeply enough to know why. When I run this on a random file generated from /dev/urandom, I get a steady ~0.78 entropy instead of the ~0.9999 I see on other platforms.

dd if=/dev/urandom of=rand.img bs=1M count=2048
binwalk -E rand.img

DECIMAL       HEXADECIMAL     ENTROPY
--------------------------------------------------------------------------------
0             0x0             Falling entropy edge (0.781577)

A zero filled file does give the expected 0.00000 entropy with this patch.

If I change line 253 to encode with latin-1 instead of UTF-8, I get the expected values for both zero-filled and random data, and the values from that modified version match what I get from other binwalk versions on a variety of test files. I haven't read the code and am not sure why this conversion is happening in the first place, so that's just a shot in the dark, but with that change it Works On My Machine.
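The ~0.78 figure is exactly what UTF-8 re-encoding predicts. A rough sketch using my own Shannon-entropy helper (not binwalk's): latin-1 maps all 256 byte values one-to-one onto code points U+0000–U+00FF, so a decode/encode round trip preserves the data byte-for-byte, while UTF-8 expands every byte >= 0x80 into a two-byte sequence (a 0xC2/0xC3 lead byte plus a continuation byte), skewing the byte distribution and depressing the measured entropy:

```python
import collections
import math
import os

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, normalized to [0, 1]."""
    n = len(data)
    counts = collections.Counter(data)
    return sum(-c / n * math.log2(c / n) for c in counts.values()) / 8

raw = os.urandom(1 << 20)
text = raw.decode("latin-1")  # lossless: one code point per byte

# latin-1 round-trips byte-for-byte, so the entropy stays ~1.0.
print(shannon_entropy(text.encode("latin-1")))

# UTF-8 turns each byte >= 0x80 into two bytes, so the measured
# entropy drops to ~0.78, matching the value reported above.
print(shannon_entropy(text.encode("utf-8")))
```

Working through the output-byte distribution analytically gives 0.7815 for uniform random input, which lines up with the 0.781577 binwalk reported.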

DMaroo commented 3 years ago

You're right, the results differ between 'utf-8' and 'latin-1' encoding. I also checked some zipped files and ELFs: the shape of the graph is almost preserved, but the exact values of the peaks and points differ. Given that 'latin-1' gives accurate results for the random file created from /dev/urandom, I guess it is the more correct approach. Thank you. As a side note, both encodings give 0.0 as the entropy for a file created from /dev/zero, but I think that is just a special case where the two encodings converge on the same answer.
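The /dev/zero agreement follows directly: NUL is in the ASCII range, where UTF-8 and latin-1 produce identical bytes, and a stream with a single distinct symbol has zero Shannon entropy regardless of encoding. A small sketch with an inline entropy computation (my own, not binwalk's):

```python
import collections
import math

zeros = b"\x00" * 4096
text = zeros.decode("latin-1")

# NUL is ASCII, so both encodings reproduce the input byte-for-byte;
# the two only diverge for byte values >= 0x80.
assert text.encode("utf-8") == text.encode("latin-1") == zeros

# One distinct symbol means every term is -1 * log2(1) = 0, so the
# entropy is 0.0 under either encoding -- a degenerate case, not
# evidence that the encodings are interchangeable.
n = len(zeros)
counts = collections.Counter(zeros)
entropy = sum(-c / n * math.log2(c / n) for c in counts.values()) / 8
print(entropy)  # 0.0
```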

lindberg commented 1 year ago

This is still an issue, would be great if this was merged.