decalage2 / oletools

oletools - python tools to analyze MS OLE2 files (Structured Storage, Compound File Binary Format) and MS Office documents, for malware analysis, forensics and debugging.
http://www.decalage.info/python/oletools

olevba dropping file extensions #811

Closed samspoerl closed 5 months ago

samspoerl commented 1 year ago

Affected tool: olevba

Describe the bug: I'm trying to extract all the VBA code from an xlsm file. I initialize a VBA_Parser object and then call its extract_all_macros() method. When I iterate over the returned tuples, the vba_filename values are missing their file extensions. The behavior is inconsistent: sometimes the extensions are present (e.g., .bas or .cls) and sometimes they are not. The attached sample did not have them for me.

File/Malware sample to reproduce the bug: olevba_bug_no_vba_file_extensions.zip (password: sample)

How To Reproduce the bug

import os
from oletools.olevba import VBA_Parser

KEEP_NAME = True # Set this to True if you want to keep "Attribute VB_Name"

workbook_path = os.path.join(os.getcwd(), "test.xlsm")

def parse(workbook_path):
    vba_path = workbook_path + 'WorkbookContent'
    vba_parser = VBA_Parser(workbook_path)
    vba_modules = vba_parser.extract_all_macros() if vba_parser.detect_vba_macros() else []

    for filename, stream_path, vba_filename, content in vba_modules:
        print("filename: " + filename)
        print("stream_path: " + stream_path)
        print("vba_filename: " + vba_filename)

        lines = []
        if '\r\n' in content:
            lines = content.split('\r\n')
        else:
            lines = content.split('\n')

        if lines:
            content = []
            for line in lines:
                if line.startswith('Attribute') and 'VB_' in line:
                    if 'VB_Name' in line and KEEP_NAME:
                        content.append(line)
                else:
                    content.append(line)
            # Drop the trailing empty line left over from splitting on the final newline.
            if content and content[-1] == '':
                content.pop()
            # Write the module whenever it has at least one non-empty line,
            # whether or not the source ended with a newline.
            if any(content):
                os.makedirs(vba_path, exist_ok=True)
                with open(os.path.join(vba_path, vba_filename), 'w', encoding='utf-8') as f:
                    f.write('\n'.join(content))

    # Close the parser, as recommended
    vba_parser.close()

parse(workbook_path)

Expected behavior: The vba_filename values returned from extract_all_macros() should keep their original file extensions (e.g., .bas or .cls).
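
Until the extensions come back from the parser, a caller could patch the names on its own side. The following is only a rough sketch of such a workaround, not how olevba decides extensions: with_fallback_extension and CLASS_MARKERS are hypothetical names, and the heuristic assumes that class-like modules (classes, document modules, forms) carry attributes such as VB_Base or VB_Creatable while standard modules carry only VB_Name. It cannot tell a form from a class, so both fall back to .cls.

```python
import os

# Assumption: these attributes show up only in class-like modules
# (class, document and form modules), never in standard modules.
CLASS_MARKERS = (
    "Attribute VB_Base",
    "Attribute VB_Creatable",
    "Attribute VB_PredeclaredId",
    "Attribute VB_Exposed",
    "Attribute VB_GlobalNameSpace",
)


def with_fallback_extension(vba_filename, content):
    """Return vba_filename unchanged if it has an extension, else guess one.

    Hypothetical helper: the guess is a heuristic based on the module
    source, not information taken from the VBA project itself.
    """
    if os.path.splitext(vba_filename)[1]:
        return vba_filename
    is_class_like = any(
        line.strip().startswith(CLASS_MARKERS)
        for line in content.splitlines()
    )
    return vba_filename + (".cls" if is_class_like else ".bas")
```

In the reproduction script above, this would be applied to vba_filename just before the file is opened for writing.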

Additional context N/A

beauvankirk commented 12 months ago

I have observed this behavior as well and was quite confused by it, but I think the underlying issue must be some recent change in the way Excel writes the vbaProject.bin file. I had an old version of an .xlsm workbook on which I had been periodically running extract_vba without issue (.bas and .cls files were created); then at some point the created files started missing their extensions. To confirm, I took an older copy of the document that still produces .cls/.bas files when fed to extract_vba, made a trivial change, and resaved it. After that, extract_vba generated files with no extensions.
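
One way to check that hypothesis would be to compare the module declarations stored in the old and the resaved copies. The sketch below is only an illustration under several assumptions: read_vba_project, read_project_stream and module_extensions are hypothetical helper names, the PROJECT stream is assumed to decode as cp1252, and the key-to-extension table (Module, Class, Document, BaseClass) reflects my reading of the PROJECT stream format rather than anything confirmed in this thread. Quotes around module names are tolerated purely as a defensive guess. read_project_stream needs the olefile package, which oletools already depends on.

```python
import io
import zipfile

# Assumed mapping from PROJECT stream keys to the usual export extensions.
EXTENSION_BY_KEY = {"Module": "bas", "Class": "cls", "Document": "cls", "BaseClass": "frm"}


def read_vba_project(workbook):
    """Return the raw bytes of vbaProject.bin from an .xlsm (a ZIP), or None."""
    with zipfile.ZipFile(workbook) as archive:
        for member in archive.namelist():
            if member.lower().endswith("vbaproject.bin"):
                return archive.read(member)
    return None


def read_project_stream(vba_project_bytes):
    """Return the text of the PROJECT stream (requires olefile)."""
    import olefile  # imported lazily so the other helpers work without it

    ole = olefile.OleFileIO(io.BytesIO(vba_project_bytes))
    try:
        raw = ole.openstream("PROJECT").read()
    finally:
        ole.close()
    # Assumption: cp1252; the real code page is declared inside the project.
    return raw.decode("cp1252", errors="replace")


def module_extensions(project_text):
    """Map module names declared in PROJECT stream text to extensions."""
    extensions = {}
    for line in project_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            break  # bracketed sections follow the module declarations
        key, sep, value = line.partition("=")
        if sep and key in EXTENSION_BY_KEY:
            # Document values look like "ThisWorkbook/&H00000000".
            name = value.split("/", 1)[0].strip('"')
            extensions[name] = EXTENSION_BY_KEY[key]
    return extensions
```

Running module_extensions(read_project_stream(read_vba_project(path))) on both copies and comparing the results would show whether the module declarations themselves changed between the two saves.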

decalage2 commented 5 months ago

Fixed by PR #723