decalage2 / oletools

oletools - python tools to analyze MS OLE2 files (Structured Storage, Compound File Binary Format) and MS Office documents, for malware analysis, forensics and debugging.
http://www.decalage.info/python/oletools

Recognize txt #836

Open christian-intra2net opened 9 months ago

christian-intra2net commented 9 months ago

olevba's heuristic for detecting plain text (no \x00 in the binary data) fails for many Unicode encodings such as UTF-16, where ASCII characters are encoded with zero bytes. This change improves the heuristic and moves it to ftguess.py, so we can at least deal with harmless text encoded as UTF-8, Latin-1, or UTF-16 (with or without a BOM). This is far from perfect and ignores popular Asian encodings, but according to Wikipedia UTF-8 is by far the most popular encoding used in software. If we need something better still, I'd recommend not reinventing the wheel here but using libmagic or another specialized library.
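To illustrate the kind of heuristic described above, here is a minimal sketch. It is not the code in ftguess.py: the function name `guess_text_encoding`, the helper `_decodes_to_text`, and the rule of rejecting control, unassigned, private-use and surrogate characters are all assumptions made for this example. Like the approach described above, it can misclassify short binary blobs and does not handle Asian encodings.

```python
import codecs
import unicodedata

# Hypothetical sketch, not the actual ftguess.py implementation.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)
_ALLOWED_CONTROLS = "\t\n\r\f"
# Assumption: control, unassigned, private-use and surrogate code points
# do not occur in harmless text.
_REJECTED_CATEGORIES = {"Cc", "Cn", "Co", "Cs"}


def _decodes_to_text(data, encoding):
    """Return True if data decodes strictly and contains only text-like chars."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return all(
        ch in _ALLOWED_CONTROLS
        or unicodedata.category(ch) not in _REJECTED_CATEGORIES
        for ch in text
    )


def guess_text_encoding(data):
    """Return a likely text encoding name for data, or None if it looks binary."""
    if not data:
        return None
    # 1. A BOM settles the encoding; the remainder must still look like text.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            rest = data[len(bom):]
            return encoding if _decodes_to_text(rest, encoding) else None
    # 2. Without zero bytes, try UTF-8 first and fall back to Latin-1.
    if b"\x00" not in data:
        for encoding in ("utf-8", "latin-1"):
            if _decodes_to_text(data, encoding):
                return encoding
        return None
    # 3. Zero bytes present: only BOM-less UTF-16 remains as a candidate.
    if len(data) % 2:
        return None
    # For mostly-ASCII text, zero bytes sit at odd offsets in little-endian
    # data and at even offsets in big-endian data; try the likelier one first.
    zeros_even = data[0::2].count(0)
    zeros_odd = data[1::2].count(0)
    if zeros_odd >= zeros_even:
        candidates = ("utf-16-le", "utf-16-be")
    else:
        candidates = ("utf-16-be", "utf-16-le")
    for encoding in candidates:
        if _decodes_to_text(data, encoding):
            return encoding
    return None
```

For example, `guess_text_encoding("hello".encode("utf-16-le"))` returns `"utf-16-le"`, whereas the old "no \x00" check would have classified the same bytes as binary.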

I created sample files for all the encodings used, along with unit tests to check them.