danielplohmann / smda

SMDA is a minimalist recursive disassembler library that is optimized for accurate Control Flow Graph (CFG) recovery from memory dumps.
BSD 2-Clause "Simplified" License

Exception when parsing Delphi structs #44

Closed danielplohmann closed 1 year ago

danielplohmann commented 1 year ago

When trying to parse Delphi structs, processing may fail with an exception caused by a negative seek offset.

Example file: 62f2adbc73cbdde282ae3749aa63c2bc9c5ded8888f23160801db2db851cde8f

Trace:

  File "smda/Disassembler.py", line 57, in disassembleFile
    smda_report = self._disassemble(binary_info, timeout=self.config.TIMEOUT)
  File "smda/Disassembler.py", line 109, in _disassemble
    self.disassembly = self.disassembler.analyzeBuffer(binary_info, self._callbackAnalysisTimeout)
  File "smda/intel/IntelDisassembler.py", line 443, in analyzeBuffer
    self.fc_manager.init(self.disassembly)
  File "smda/intel/FunctionCandidateManager.py", line 46, in init
    self.disassembly.language = self.lang_analyzer.identify()
  File "smda/intel/LanguageAnalyzer.py", line 222, in identify
    t_objects = self.getDelphiObjects()
  File "smda/intel/LanguageAnalyzer.py", line 164, in getDelphiObjects
    data.seek(method_table - image_base)
ValueError: negative seek value -4194260

malwarefrank commented 1 year ago

I just ran into this problem when testing out mcrit with an old DarkComet RAT builder sample. I changed the conditionals in that code block (in getDelphiObjects) to check VARIABLE - image_base instead of just VARIABLE. Even then, it ran for 45 minutes on my laptop and found many candidates, but reported zero functions.

You may be able to leverage code from NCCGroup's Pythia, which appears to find functions relatively quickly on my sample. Just make sure you are looking at the parsing-dev branch.

danielplohmann commented 1 year ago

Thanks for pointing out Pythia! I remember that when @danielenders1 was doing his master's thesis (during which he built SMDA's Delphi parser), he also looked at Pythia and quite possibly already adopted some of its internals for parsing, on top of all the format reversing and analysis code he did himself. I'll have another look to see if it can be improved further.

For the concrete issue, I just had a look, and it seems sufficient to ensure that method_table - image_base, interface_table - image_base, and dynamic_table - image_base are positive values; at least my "trouble binary" listed above worked immediately with that check. Comparing directly, I now get 98% of the exact function entry points found by IDA, plus, as usual, a good couple more due to SMDA's more aggressive function definition approach (many of them tiny anyway). A fix for this has just been pushed with SMDA 1.12.7.
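A minimal sketch of that kind of guard, not SMDA's actual code: the helper name seek_to_va and the upper-bound check are assumptions added for illustration. The failing value from the trace, -4194260, is what you get for a table pointer of 44 with an image base of 0x400000.

```python
import io


def seek_to_va(data, virtual_address, image_base, buffer_size):
    """Seek to the buffer offset of a virtual address.

    Hypothetical helper: returns False instead of raising when the
    resulting offset lies outside the buffer.
    """
    offset = virtual_address - image_base
    # A pointer below the image base yields a negative offset, which
    # io.BytesIO.seek() rejects with ValueError.
    if offset < 0 or offset >= buffer_size:
        return False
    data.seek(offset)
    return True


image_base = 0x400000
buffer_size = 0x1000
data = io.BytesIO(b"\x00" * buffer_size)

# 44 - 0x400000 == -4194260, the value reported in the trace above.
bogus_ok = seek_to_va(data, 44, image_base, buffer_size)
valid_ok = seek_to_va(data, image_base + 0x10, image_base, buffer_size)
```

Candidates for which the helper returns False would simply be skipped rather than aborting the whole analysis.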

Your binary is quite a bit larger in terms of pure file size, and I'm aware that SMDA's performance deteriorates drastically with the amount of possible functions / code to look at, so it may well take many minutes to complete the disassembly. I just threw SMDA at it to check: it no longer crashes, but it still takes a long time.

I also found out that this is primarily an issue with the internal queue used to prioritize function entry point (FEP) candidates (@yankovs pointed this out to me as well, thx!), and I hope to get that addressed at some point to improve processing speed. Based on the results from my PhD thesis, I would assume that just having unsorted buckets for different quality categories of FEP candidates would achieve about the same results, so I'm going to try that out when I find the time to port the data set from my thesis to use for regression testing here.
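To illustrate the bucket idea, here is a rough sketch under stated assumptions: the class name, the number of buckets, and the integer quality categories are all hypothetical, and SMDA's actual candidate scoring is not shown in this thread. Candidates are never sorted; they are only grouped by category, and the best non-empty bucket is drained first.

```python
from collections import deque


class BucketedCandidateQueue:
    """Hypothetical FEP candidate queue using unsorted quality buckets."""

    def __init__(self, num_buckets):
        # Bucket 0 holds the highest-quality candidates.
        self._buckets = [deque() for _ in range(num_buckets)]

    def add(self, address, quality):
        # Constant-time insert; no re-sorting of existing candidates.
        self._buckets[quality].append(address)

    def pop(self):
        # Return a candidate from the best non-empty bucket, or None.
        for bucket in self._buckets:
            if bucket:
                return bucket.popleft()
        return None

    def __len__(self):
        return sum(len(bucket) for bucket in self._buckets)


queue = BucketedCandidateQueue(num_buckets=3)
queue.add(0x401000, quality=2)
queue.add(0x402000, quality=0)
queue.add(0x403000, quality=2)
order = [queue.pop(), queue.pop(), queue.pop()]
```

Within a bucket the order is simply insertion order, which is the trade-off: finer-grained prioritization is given up in exchange for cheap inserts and pops.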

Which brings me to one final point: if you notice similar issues with SMDA not finishing on large binaries when e.g. IDA does, there is export.py, which can be used to dump the IDB into an SMDA JSON file. That file can then be loaded into MCRIT, e.g. with the CLI using python -m mcrit client submit -s <your_smda_file>. :wink:

danielplohmann commented 1 year ago

Alright, so I had a deeper look at Pythia and did some tests. Based on that, I decided that for now (i.e. smda-1.13.7) I will enable the existing Delphi struct parsing only for binaries <5MB, which should cover most binaries of interest. Since the struct parsing is now disabled for your DarkComet builder, SMDA finishes in ~70sec on my machine and still produces very decent function entry point overlap with IDA. Only the direct symbol extraction in SMDA is now missing, but I hope it may come back if I adopt Pythia's parsing approach.
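As a sketch of such a size gate (the constant and function names are hypothetical, and reading "5MB" as 5 * 1024 * 1024 bytes is an assumption; the thread does not say how SMDA defines the limit):

```python
# Assumption: "5MB" interpreted as 5 MiB.
MAX_DELPHI_STRUCT_PARSE_SIZE = 5 * 1024 * 1024


def should_parse_delphi_structs(binary, limit=MAX_DELPHI_STRUCT_PARSE_SIZE):
    """Return True only for buffers strictly smaller than the limit."""
    return len(binary) < limit


small_binary = bytes(1024)
large_binary = bytes(MAX_DELPHI_STRUCT_PARSE_SIZE)
parse_small = should_parse_delphi_structs(small_binary)
parse_large = should_parse_delphi_structs(large_binary)
```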