decalage2 / oletools

oletools - python tools to analyze MS OLE2 files (Structured Storage, Compound File Binary Format) and MS Office documents, for malware analysis, forensics and debugging.
http://www.decalage.info/python/oletools

Extraction of images from "Data" directory in office OLE files #457

Open benkittner opened 5 years ago

benkittner commented 5 years ago

I'm trying to extend a production document classification application so that it can use images extracted from Office documents.

Our team has been using oletools to extract macros from the files we examine. At first glance it looks as though oletools should support image extraction, since it works with Microsoft Compound File Binary files, but none of the tools seem to look inside the "Data" directory within the file, which is where the images are held.

I was hoping that oletools could add a module that extracts all nonstandard media from Office files in a form that other tools can use. It would also be useful if oletools could report whether a document contains embedded pictures without extracting them.

[Attached image: OLEimage]

decalage2 commented 5 years ago

For now I do not plan to parse the internal structure of Word/Excel/PPT/etc files in oletools, as that would require a lot of work. However, if you are willing to contribute some code to do so, please do not hesitate to send me a pull request.

It looks like what you are trying to achieve is to carve image files from stream data. In that case, I suggest looking at file carving tools such as these:
https://hachoir.readthedocs.io/en/latest/subfile.html
https://github.com/simsong/bulk_extractor
https://github.com/sleuthkit/scalpel
http://foremost.sourceforge.net/
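As a rough illustration of what carving means here, the following is a minimal sketch that scans a byte string for PNG and JPEG signatures and cuts out each candidate up to its end marker. It is not part of oletools or of any of the tools listed above; the function name `carve_images` is hypothetical, and the approach is deliberately naive (for JPEG it stops at the first end-of-image marker, which can truncate files that embed a thumbnail).

```python
# Naive signature-based carving; a sketch, not a robust carver.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"      # 8-byte PNG file signature
PNG_END = b"IEND\xaeB`\x82"           # IEND chunk type followed by its CRC
JPEG_MAGIC = b"\xff\xd8\xff"          # SOI marker plus the start of the next marker
JPEG_END = b"\xff\xd9"                # EOI marker

SIGNATURES = (
    ("png", PNG_MAGIC, PNG_END),
    ("jpeg", JPEG_MAGIC, JPEG_END),
)


def carve_images(data):
    """Return a list of (kind, offset, payload) for image candidates in data."""
    found = []
    for kind, magic, end in SIGNATURES:
        pos = data.find(magic)
        while pos != -1:
            stop = data.find(end, pos + len(magic))
            if stop == -1:
                break  # start marker without an end marker: give up on this kind
            stop += len(end)
            found.append((kind, pos, data[pos:stop]))
            pos = data.find(magic, stop)
    # Report candidates in the order they appear in the stream
    return sorted(found, key=lambda item: item[1])
```

The input bytes could come from the olefile package that oletools builds on, for example `olefile.OleFileIO("sample.doc").openstream("Data").read()`, where `sample.doc` is a placeholder file name. Checking `bool(carve_images(data))` would give a rough answer to whether the stream holds any pictures, assuming they are stored as unmodified PNG or JPEG data.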

christian-intra2net commented 5 years ago

I started some code in the direction of "let's understand the structure the way Office does" with the ppt_record_parser. However, these files contain so many different kinds of data, and Microsoft sometimes does not adhere to its own standards (or I misread them), that pretty early on I fell back to parsing only the types of data needed to extract macros and ignored the rest. It should be easily extendable, though (at least for ppt, where everything is record-based).
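To illustrate what "record-based" means for ppt, here is a minimal sketch of walking records in a byte string. It assumes the 8-byte record header layout described in [MS-PPT] (a 4-bit version and 12-bit instance packed into one little-endian 16-bit word, then a 16-bit type and a 32-bit length), with version 0xF marking a container whose payload is itself a sequence of records. This is not the ppt_record_parser API; the name `iter_records` is hypothetical.

```python
import struct

HEADER_SIZE = 8          # recVer/recInstance (2 bytes) + recType (2) + recLen (4)
CONTAINER_VERSION = 0xF  # records with this version hold child records


def iter_records(data, offset=0, end=None, depth=0):
    """Yield (depth, rec_type, rec_instance, payload) for each record found."""
    end = len(data) if end is None else end
    while offset + HEADER_SIZE <= end:
        ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", data, offset)
        rec_ver = ver_inst & 0x000F       # low 4 bits
        rec_instance = ver_inst >> 4      # high 12 bits
        body_start = offset + HEADER_SIZE
        # Clamp to the enclosing range so a bogus length cannot run past it
        body_end = min(body_start + rec_len, end)
        yield depth, rec_type, rec_instance, data[body_start:body_end]
        if rec_ver == CONTAINER_VERSION:
            # Descend into the container's payload
            yield from iter_records(data, body_start, body_end, depth + 1)
        offset = body_end
```

Extending such a walker to pictures would then mostly be a matter of recognising the record types that carry image data and handling their payloads, while leaving all other records untouched.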