decalage2 / oletools

oletools - python tools to analyze MS OLE2 files (Structured Storage, Compound File Binary Format) and MS Office documents, for malware analysis, forensics and debugging.
http://www.decalage.info/python/oletools

olevba: use chardet when VBA source code encoding is unknown #628

Open decalage2 opened 4 years ago

decalage2 commented 4 years ago

In VBA_Project.extract_macros(), if the VBA project stream cannot be parsed to obtain information about the VBA modules (e.g. because of malformed data), all streams are checked to determine whether they contain a VBA module. In that case, the encoding of the VBA source code is unknown. For now, olevba falls back to the cp1252 encoding because it is the most frequently used one, but this can lead to decoding errors. A solution could be to use the third-party package "chardet" to guess the encoding. See potential implementations:

In any case, I think chardet should not be yet another mandatory dependency, so it's better to make it optional, and to fall back to cp1252 if chardet is not installed.
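A minimal sketch of the optional-dependency idea, not the actual olevba code: the helper names `guess_vba_encoding` and `decode_vba_source` and the `min_confidence` threshold are hypothetical, chosen only for illustration. It assumes `chardet.detect()` is given the raw bytes of a stream and returns a dict with `encoding` and `confidence` keys.

```python
import codecs

try:
    import chardet  # optional third-party dependency
except ImportError:
    chardet = None  # not installed: always fall back to the default

DEFAULT_ENCODING = "cp1252"  # current olevba fallback, per the issue text


def guess_vba_encoding(data, min_confidence=0.5):
    """Guess the encoding of raw VBA source bytes (hypothetical helper).

    Returns DEFAULT_ENCODING when chardet is missing, the data is empty,
    the guess is below min_confidence (an arbitrary illustrative value),
    or Python has no codec for the guessed name.
    """
    if chardet is None or not data:
        return DEFAULT_ENCODING
    result = chardet.detect(data)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        return DEFAULT_ENCODING
    try:
        codecs.lookup(encoding)  # make sure Python knows this codec
    except LookupError:
        return DEFAULT_ENCODING
    return encoding


def decode_vba_source(data):
    """Decode VBA source bytes, replacing any bytes that cannot be decoded."""
    return data.decode(guess_vba_encoding(data), errors="replace")
```

With this shape, chardet stays optional: when the import fails, every call simply returns cp1252, which matches the current behaviour.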

c-rosenberg commented 4 years ago

Maybe these commits will fit better: https://github.com/HeinleinSupport/oletools/commit/8a636acfce76dec0ef65f5145800d796fab949e3 https://github.com/HeinleinSupport/oletools/commit/1865fbda18bdfde94e3ec3c4884305f79bb1f31a https://github.com/HeinleinSupport/oletools/commit/7436ce7ff0203baf8ffcf2a7a39334f49307c75a

But your assumption is correct for now: among the many Emotet samples I have seen, only a few rare files ever reached the point where chardet was evaluated. And then the encoding was cp1252 ;)