RFC: Python extensions for auto-analysis

subreption-research commented 1 month ago

Currently, if we are not mistaken, there is no mechanism to dynamically load or plug Python extensions into the auto-analysis process. Extensions must be written in Java, which is not necessarily a problem, but it can be a maintenance and end-user burden.

Ideally, we would like to see (or contribute to) a standardized API/mechanism that can load Python extensions providing auto-analysis functionality, with their own settings integrated into the existing configuration handling, and the possibility of adding widgets or UI elements programmatically.

This could be done through static variables and callbacks, with no direct widget-related calls from the Python side. For example, an extension might define N tabs populated through a dictionary, each with settings that are assigned a unique ID and can be translated to settings that can be saved "as is". This removes the complexity of bridging widget/UI control.
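
A rough sketch of the dictionary-driven idea, in plain Python with no Ghidra dependency. The `TABS` layout, the `<tab>.<name>` ID scheme, and both helper functions are hypothetical names chosen for illustration; none of them exist in Ghidra or PyGhidra.

```python
# Hypothetical declarative layout: tab name -> {setting name -> default value}.
TABS = {
    "General": {"enabled": True, "max_depth": 4},
    "Rules": {"rules_path": "", "fast_mode": False},
}


def flatten_settings(tabs):
    """Map every setting to a unique "<tab>.<name>" ID that can be saved as is."""
    flat = {}
    for tab, settings in tabs.items():
        for name, default in settings.items():
            flat[f"{tab}.{name}"] = default
    return flat


def apply_saved(tabs, saved):
    """Overlay saved values onto the defaults, ignoring unknown IDs."""
    flat = flatten_settings(tabs)
    flat.update({key: value for key, value in saved.items() if key in flat})
    return flat
```

The host side would only need to render one tab per top-level key and persist the flat dictionary, so the extension never touches a widget directly.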

The initial design could be as simple as providing the following callbacks:

initialize: the extension is in a functional state, all dependencies are met. It will be listed accordingly in the auto-analysis UI. A 'ready' attribute can also be provided in the base class.
priority: this might require more forward-thinking, but in essence, you would provide a numeric "weight" that tells Ghidra when the specific extension should be invoked during auto-analysis.
process: the extension performs whichever analysis/processing it is supposed to do.
abort: for any possible use of coroutines or background tasks.
cleanup
finished
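
The callbacks above could look roughly like this on the Python side. This is only a sketch: the `AnalysisExtension` base class, the `run_extensions` driver, the "lower weight runs first" rule, and the order in which `abort`, `finished`, and `cleanup` fire are all assumptions made for illustration, not an existing Ghidra API.

```python
class AnalysisExtension:
    """Hypothetical base class; method names follow the callbacks listed above."""

    priority = 100  # numeric weight; lower values are assumed to run first
    ready = False

    def initialize(self):
        """Check dependencies; return True when the extension is usable."""
        self.ready = True
        return self.ready

    def process(self, program):
        raise NotImplementedError

    def abort(self):
        pass

    def cleanup(self):
        pass

    def finished(self):
        pass


def run_extensions(extensions, program):
    """Run ready extensions in priority order and return the names that completed."""
    ran = []
    for ext in sorted(extensions, key=lambda e: e.priority):
        if not ext.initialize():
            continue
        try:
            ext.process(program)
        except Exception:
            ext.abort()  # assumed: abort is the failure/cancellation hook
        else:
            ext.finished()
            ran.append(type(ext).__name__)
        finally:
            ext.cleanup()  # assumed: cleanup always runs
    return ran
```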

In our case, this idea was floated by one of our developers in relation to #6781.

The reason for not limiting such extensions to a script or similar is mostly the additional steps needed to run them. The scripting capabilities also seem more like a feature for small ad-hoc operations, and have grown into a relatively disorganized repository of one-off solutions. This might be debatable, but it can be argued that more seamless integration will open the path to more complex tooling. Ultimately, it's a quality of life issue.

ryanmkurtz commented 1 month ago

We have had some discussions about making all ExtensionPoints "scriptable", so you can distribute an analyzer/loader/filesystem source file instead of distributing a heavy-weight prebuilt extension. These are just initial discussions though...no work has been planned yet. But that's the level at which we'd likely want to tackle the problem, so more than just analyzers would benefit. Ideally this would also work with Python source.

subreption-research commented 1 month ago

We just recently completed the YARA analyzer (in Java) and it seems doable to explore the options in PyGhidra for creating a "fabric" between Analyzers and the Python-side. The main issue with Java extensions is the maintenance burden of any dependencies, especially native ones (since we need to build against OS X, Linux and Windows).

A realistic first milestone could be writing the loader and event handler to support the methods for Analyzer classes. We will look into this when time permits. Supporting the core functionality isn't too daunting, but handling corner cases properly might be (for example, cancelling the Analyzer gracefully).
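
As a minimal illustration of the graceful-cancellation corner case, the sketch below uses cooperative cancellation points. `FakeMonitor` and `Cancelled` are hypothetical stand-ins written in plain Python; only the `checkCancelled` method name mirrors the task monitor call used elsewhere in this thread.

```python
class Cancelled(Exception):
    """Stand-in for the exception a real task monitor would raise."""


class FakeMonitor:
    """Hypothetical stand-in for a task monitor with a cancel flag."""

    def __init__(self):
        self.cancelled = False

    def cancel(self):
        self.cancelled = True

    def checkCancelled(self):
        if self.cancelled:
            raise Cancelled()


def analyze(items, monitor, on_item=None):
    """Process items one by one, returning (done, completed_normally)."""
    done = []
    try:
        for item in items:
            monitor.checkCancelled()  # cooperative cancellation point
            done.append(item)
            if on_item:
                on_item(item)
    except Cancelled:
        return done, False  # partial results are kept, nothing is left half-written
    return done, True
```

Checking the monitor once per unit of work keeps cancellation latency bounded without requiring the Python side to be interrupted mid-operation.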

It would be helpful to put together more documentation for the new PyGhidra capabilities.

Will comment on #6781 for the Yara extension progress meanwhile.

astrelsky commented 1 month ago

Maybe this is what you're looking for?

https://github.com/NationalSecurityAgency/ghidra/tree/master/Ghidra/Features/PyGhidra/src/main/py#registering-an-entry-point

subreption-research commented 1 month ago

> Maybe this is what you're looking for?
>
> https://github.com/NationalSecurityAgency/ghidra/tree/master/Ghidra/Features/PyGhidra/src/main/py#registering-an-entry-point

This seems to be limited to Java extensions/external code. What we would like to have is an entire Python-side layer that integrates seamlessly into the Analyzer process, so that we can write Analyzer classes in Python and handle the methods there (options, added, ended, etc.), with no functional differences versus a compiled Analyzer extension. This would also immediately expose the OS libraries and Python modules, making things easier in the long run.

astrelsky commented 1 month ago

Looks like the ClassSearcher functionality would need to be "extendable" such that Python can locate and provide instances of the requested ExtensionPoint interfaces. You can't instantiate a proxy class in Java so I think the getClasses methods would be unusable outside of the Java case.

astrelsky commented 1 month ago

Probably have to do something like this. I whipped this up in about 30 minutes, so it's probably full of flaws.

import importlib
import importlib.metadata
import pkgutil
import typing

import jpype

from java.lang import UnsupportedOperationException

_ExtensionPoints = dict()

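def load_subpackages(monitor, pkg):
    # iter_modules yields ModuleInfo tuples, so import by fully qualified name
    for subpkg in pkgutil.iter_modules(pkg.__path__):
        monitor.checkCancelled()
        if subpkg.ispkg:
            module = importlib.import_module(f"{pkg.__name__}.{subpkg.name}")
            # recurse so nested subpackages are loaded as well
            load_subpackages(monitor, module)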

def ExtensionPoint(extension: typing.Union[jpype.JClass, str]):

    def wrapper(cls):
        nonlocal extension
        # JImplements returns a decorator, so it must be applied to cls
        cls = jpype.JImplements(extension)(cls)
        # only add it if it succeeds
        if not isinstance(extension, jpype.JClass):
            extension = jpype.JClass(extension)
        # should be a collection sorted by priority
        _ExtensionPoints.setdefault(extension, set()).add(cls)
        return cls

    return wrapper

# this isn't an interface, I'm pretending it is to present the idea
@jpype.JImplements("ghidra.util.classfinder.ClassSearcher")
class ClassSearcher:

    @jpype.JOverride
    def search(self, monitor):
        # not as efficient as the Java searcher because we have to load the modules
        for entry in importlib.metadata.entry_points(group='pyghidra.extension_points'):
            monitor.checkCancelled()
            try:
                # load all packages and subpackages
                # use of the ExtensionPoint decorator will register them accordingly
                load_subpackages(monitor, entry.load())
            except Exception as e:
                # log in Ghidra log
                pass

    @jpype.JOverride
    def getClasses(self, *args):
        raise UnsupportedOperationException()

    @jpype.JOverride
    def getInstances(self, extension):
        if not isinstance(extension, jpype.JClass):
            extension = jpype.JClass(extension)
        return [cls() for cls in _ExtensionPoints.get(extension, [])]
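
The "collection sorted by priority" mentioned in the decorator's comment could be handled roughly as follows. This is a plain-Python sketch with no JPype or Ghidra dependency; `extension_point`, `get_instances`, the string interface keys, and the "lowest value first" ordering are hypothetical choices for illustration only.

```python
import bisect

_registry = {}  # interface name -> list of (priority, sequence, class), kept sorted
_counter = 0


def extension_point(interface, priority=0):
    """Hypothetical decorator: register cls under an interface name, ordered by priority."""

    def wrapper(cls):
        global _counter
        _counter += 1
        # the sequence number breaks ties so the classes themselves are never compared
        bisect.insort(_registry.setdefault(interface, []), (priority, _counter, cls))
        return cls

    return wrapper


def get_instances(interface):
    """Instantiate registered classes, lowest priority value first."""
    return [cls() for _, _, cls in _registry.get(interface, [])]
```

Keeping the list sorted at registration time means lookups stay a simple iteration, and registration order is preserved among entries with equal priority.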

NationalSecurityAgency / ghidra

RFC: Python extensions for auto-analysis #6917