dod-cyber-crime-center / pyhidra

Pyhidra is a Python library that provides direct access to the Ghidra API within a native CPython interpreter using jpype.
Other
182 stars 16 forks source link

handle python mod conflicts #18

Closed clearbluejar closed 2 years ago

clearbluejar commented 2 years ago

When trying to port over this script using Ghidra pdb Java package to Python leveraging pyhidra, I realized an issue with a python stdlib name pdb (python debugger) conflicting with Ghidra's base Java package pdb.

When I tried to import pdb I received the python stdlib module:

import pdb
print(pdb.__file__)
"/usr/local/lib/python3.10/pdb.py"

It ends up using a different standard frozen_importlib_external.SourceFileLoader when imported:

spec = importlib.util.find_spec("pdb")
print(spec)
ModuleSpec(name='pdb', loader=<_frozen_importlib_external.SourceFileLoader object at 0xffff5c946920>, origin='/usr/local/lib/python3.10/pdb.py')

instead of the desired jpype loader for the module.

ModuleSpec(name='pdb', loader=<jpype.imports._JImportLoader object at 0xffff892dfd90>)

This is actually a standard issue with jpype.

Python always imports local directories as modules before calling the Java importer.

This is because the jpype module spec finder is the last in the list for sys.meta_path after it is appended using https://github.com/jpype-project/jpype/blob/f6ad8f4e5c6ece1f5a224de420968fa508f109fa/jpype/imports.py#L224-L225:

# Install hooks into python importlib
sys.meta_path.append(_JImportLoader())
print(sys.meta_path)
[<class '_frozen_importlib.BuiltinImporter'>, <class '_frozen_importlib.FrozenImporter'>, <class '_frozen_importlib_external.PathFinder'>, <pkg_resources.extern.VendorImporter object at 0xffff9c4f6920>, <jpype.imports._JImportLoader object at 0xffff9beb1540>]

There are several ways to go about handling the issue. The developer could simply give priority to the jpype finder by modifying the sys.meta_path order, giving precedence to _JImportLoader:

Option 1:

sys.meta_path.reverse()
import pdb
sys.meta_path.reverse()  # restore when done 

Option2:

sys.meta_path.insert(0,_JImportLoader()) # set it as first entry

The problem here is that this would simply reverse the problem. Then standard python modules that a dev might want to access wouldn't be available by the standard import <mod>.

I thought maybe a class method to ensure the load of a Java module might work. You could force the load of a module using the jpype loader by adding this class to the base class of launcher.

@staticmethod
def load_java_mod_from_jpype(modname) ->  importlib.machinery.ModuleSpec:
    """
    Load a module using jpype loader ignoring find order in sys.meta_env"
    """        
    module = None 

    # Find spec using Jpype loader
    spec = importlib.util.spec_from_loader(modname,imports._JImportLoader())
    if spec is not None:
        spec._jname = modname
        module = importlib.util.module_from_spec(spec)
        # load module in sys.modules
        sys.modules[modname] = module
        spec.loader.exec_module(module)

    return module

The issue I see with this is that you would still need to call this method for each package you wanted to ensure used a Java package.

pdb = launcher.load_java_mod_from_jpype("pdb")

pdb.method(arg,arg1)

or just call it first and then call import as the mod would be loaded in sys.modules

launcher.load_java_mod_from_jpype("pdb")  
import pdb

pdb.method(arg,arg1)

After seeing how jpype "handles" the issue, I think it might be easier to follow their lead. They simply wrap conflicting keywords to avoid the conflict.

I derived the base Java package names from sorting all the top level directories from ghidra source:

# find /ghidra/ -name *src.zip -exec sh -c "zipinfo -1 {}" \; | grep -v "/." | sort | uniq -c | grep '/'

GHIDRA_BASE_JAVA_PACKAGES = ["SWIG", "agent", "db", "decompiler", "docking", "foundation", "functioncalls", "generic", "ghidra", "help", "log", "mdemangler", "org", "pdb", "resources", "util", "utilities", "utility"]

We can register all the base Ghidra base Java package names with a name that won't conflict. Simply append a '_' to all Ghidra base Java packages names in the initialization.

pyhidra/launcher.py

This would allow the developer to do the following and know that they are importing the correct package:

import pdb_ as pdb

The changes for the PR add a method inside the base launcher.start that would initialize each base class with the jpype.imports.registerDomain. From that point on, all the base classes are available to to developer without conflicts. The developer can still use the original import <non conflicting base class> but will need to use the import mod_ as mod for conflicting packages. This should somewhat future proof as python and Ghidra Java packages change.

Might want to add something to the README to account for this. Something like:

### Handling Python module and Java Package naming conflicts

There is an issue when you want to import a Java package whose name conflicts with a Python standard module. This is  is actually a [standard issue](https://jpype.readthedocs.io/en/latest/userguide.html#importing-java-classes) with `jpype`.

When modules name conflict:

> Python always imports local directories as modules before calling the Java importer. [jpype](https://jpype.readthedocs.io/en/latest/userguide.html#importing-java-classes)

To over come a conflicting module, `pyhidra` simply registers the module with a '\_' so that it is available for import without conflict. 

`import pdb_ as pdb`

This will ensure that the Java package is loaded instead of the standard lib python module.
dc3-tsd commented 2 years ago

Thank you for the detailed breakdown of this issue, and the pull request. Right now, we are concerned with this being a hard-coded list instead of something that will automatically handle any java libraries the user has added. For the time being we will need to hold off on this pull request, but we appreciate you bringing this to our attention as it should be addressed.

clearbluejar commented 2 years ago

Yeah that would be better to dynamically determine the packages loaded by the Ghidra. When I looked, I couldn't see easy access to list all the top level classes loaded besides manually looking at the classpath used to start the JVM. I can't see how you can list them from the jpype API. There is a call to _jpype.isPackage you can make, but this just tells you whether or not the package was loaded. That is why I settled with the static list. Perhaps there is a dynamic way...

clearbluejar commented 2 years ago

I've done some homework to enable finding dynamic packages instead of the static list.

It now dynamically finds the all the packages loaded with get_runtime_top_level_java_packages.

It seems to find them all:

{'utilities', 'org', 'log', 'resources', 'util', 'net', 'SWIG', 'docking', 'edu', 'java', 'jdk', 'agent', 'com', 'sun', 'mdemangler', 'functioncalls', 'foundation', 'decompiler', 'db', 'javax', 'help', 'utility', 'pdb', 'ghidra', 'generic'}

I had to add the call to _wrap_runtime_top_level_java_packages after the call to Application.initializeApplication(self.layout, config). Otherwise, some of the classes loaded at runtime were not being found.

I added to both HeadlessPyhidraLauncher and DeferredPyhidraLauncher, but couldn't figure out where it would go for GuiPyhidraLauncher. Where does the initialization happen in that case?

One thing I noticed is that the dev will need to use the mod name for all the imports. If you use `import mod as modit will work, but thefrom mod.submod import funcbreaks down. If the dev sticks withmod_throughout, it all works the same ajpype` wrapping. I added a test to demonstrate a from import.

I think this might cut it. I'm anxious to be a contributor, so if something isn't quite right I'd be happy to fix. Or if you would like to modify my submission please feel free. At any rate, this project is amazing. Thank you for it.

dc3-tsd commented 2 years ago

We think it should be possible to allow this to keep close to the standard python importlib logic by creating a class extending Loader and MetaPathFinder.

Only the find_spec and create_module methods need to be implemented. The exec_module method must also be defined but you can just pass or return None. An instance of this class should be appended to sys.meta_path in PyhidraLauncher.start just before launching Ghidra. By appending it to the meta_path the existing python and jpype import machinery will take precedence.

Implementing find_spec should be rather straightforward. When implementing create_module you'll want to manually create a jpype.JPackage, add it sys.modules and return the created JPackage. This will allow the module to be imported and used with the trailing underscored while handing control of the module back to jpype.

We greatly appreciate your contributions and would appreciate it if you have the time to implement it. We may not get back to you for review until Monday.

clearbluejar commented 2 years ago

This will allow the module to be imported and used with the trailing underscored while handing control of the module back to jpype.

I think the current implementation might be doing exactly as you ask, without adding a custom loader. In essence, the code now leverages the jpype import machinery and simply wraps every runtime base Java package with a '_'. It takes advantage of the imports.registerDomain aliasing functionality by wrapping each base Java package (whether they conflict or not) and making them available with an '_'.

All non-conflicting base classes can be used as normal, but they are also now available with an '_'.


import ghidra  # works
import ghidra_ # also works, but is unnecessary

# conflicting classes also work 
# the convention for the dev will just need to be to use the mod_ for conflicts
import pdb_
from pdb_ import PdbParser
from pdb_.symbolserver import HttpServer

I could write a custom loader, but I think it would be difficult to best jpype's _JImportLoader and handle all the edge cases.

class _JImportLoader:
    """ (internal) Finder hook for importlib. """

    def find_spec(self, name, path, target=None):
        # If jvm is not started then we just check against the TLDs
        if not _jpype.isStarted():
            base = name.partition('.')[0]
            if not base in _JDOMAINS:
                return None
            raise ImportError("Attempt to create Java package '%s' without jvm" % name)

        # Check for aliases
        if name in _JDOMAINS:
            jname = _JDOMAINS[name]
            if not _jpype.isPackage(jname):
                raise ImportError("Java package '%s' not found, requested by alias '%s'" % (jname, name))
            ms = _ModuleSpec(name, self)
            ms._jname = jname
            return ms

        # Check if it is a TLD
        parts = name.rpartition('.')

        # Use the parent module to simplify name mangling
        if not parts[1] and _jpype.isPackage(parts[2]):
            ms = _ModuleSpec(name, self)
            ms._jname = name
            return ms

        if not parts[1] and not _jpype.isPackage(parts[0]):
            return None

        base = sys.modules.get(parts[0], None)
        if not base or not isinstance(base, _jpype._JPackage):
            return None

        # Support for external modules in java tree
        name = unwrap(name)
        for customizer in _CUSTOMIZERS:
            if customizer.canCustomize(name):
                return customizer.getSpec(name)

        # Using isPackage eliminates need for registering tlds
        if not hasattr(base, parts[2]):
            # If the base is a Java package and it wasn't found in the
            # package using getAttr, then we need to emit an error
            # so we produce a meaningful diagnositic.
            try:
                # Use forname because it give better diagnostics
                cls = _jpype._java_lang_Class.forName(name, True, _jpype.JPypeClassLoader)

                # This code only is hit if an error was not thrown
                if cls.getModifiers() & 1 == 0:
                    raise ImportError("Class `%s` is not public" % name)
                raise ImportError("Class `%s` was found but was not expected" % name)
            # Not found is acceptable
            except Exception as ex:
                raise ImportError("Failed to import '%s'" % name) from ex

        # Import the java module
        return _ModuleSpec(name, self)

    """ (internal) Loader hook for importlib. """

    def create_module(self, spec):
        if spec.parent == "":
            return _jpype._JPackage(spec._jname)
        parts = spec.name.rsplit('.', 1)
        rc = getattr(sys.modules[spec.parent], parts[1])

        # Install the handler
        rc._handler = _JExceptionHandler
        return rc

    def exec_module(self, fullname):
        pass

If you still would like a custom loader. How did you want it to be different from _JImportLoader? Did you want to try to put the base package lookup within the loader and only alias the base packages? Were there other improvements?

If not, I wanted to get your opinion on how the current implementation might benefit from a decorator around the that around the _launch method that would call the _wrap_runtime_top_level_java_packages

https://github.com/dod-cyber-crime-center/pyhidra/blob/ab188d789e149aba5690e91730608f42095d0445/pyhidra/launcher.py#L165-L168

That way the code would call _wrap_runtime_top_level_java_packages only after the subclass calls it's own _launch. I think it is possible for extended classes to inherit a decorator. Also would appreciate any feedback on the naming, classmethods vs static, or basic python wisdom.

Let me know what your thoughts and I will take another shot once I'm clear on the task.

dc3-tsd commented 2 years ago

One of the key reasons we think it will be better to switch to a loader is that by adding it using sys.meta_path.append it will always run after jpype's loader. As such the new loader won't have to duplicate any of this logic. This will free it up to only run against items with a conflict, which are provided by Python directly so there won't be a need for additional sets or loops to keep track of items.

We expect the end result will require a lot less code and be easier to maintain as a result.

clearbluejar commented 2 years ago

Ah, I think I see it now. That does seem a bit simpler.

So we will assume the dev will still use the '_' for conflicting packages.

For a quick list of requirements for a pyhidra loader:

Quick version will be like:

class _PyhidraImportLoader:

    def find_spec(self, name, path, target=None):

        if  name.endswith('_') and _jpype.isPackage(jname):
            ms = _ModuleSpec(name, self)
            ms._jname = jname
        return ms

        def create_module(self, spec):
            if spec.parent == "":
                return _jpype._JPackage(spec._jname)
            parts = spec.name.rsplit('.', 1)
            rc = getattr(sys.modules[spec.parent], parts[1])

            # Install the handler
            rc._handler = _JExceptionHandler
            return rc

        def exec_module(self, fullname):
            pass

I will start on an implementation and see how it fairs in testing. Let me know if I'm on the right track and I'll see if I can get it in this weekend.

dc3-tsd commented 2 years ago

It sounds like you're on a good path. We look forward to reviewing the updates on Monday.

Thrameos commented 2 years ago

Does the jpype alias function not cover this? You should be able to add an alias to jpype such that it looks up Java pdb to something more name complient like gov.pdb or gov.ghidra.pdb using the register funcion.

clearbluejar commented 2 years ago

I added a _PyhidraImportLoader and agree this way is quite clean and efficient. I added test cases for every mode of import I could think of and ensured that the new loader would handle all the base packages correctly. Let me know if it needs any other changes.

clearbluejar commented 2 years ago

Does the jpype alias function not cover this? You should be able to add an alias to jpype such that it looks up Java pdb to something more name complient like gov.pdb or gov.ghidra.pdb using the register funcion.

The registerDomain function definitely handles the case, but then we would have to make the call for every package, like I did originally. https://github.com/clearbluejar/pyhidra/commit/e48968e8e5344ed63dcaa18386065d4d4251a581#diff-41d9da554fbcd3b0cd1bd1be57cb50000748fbe459ac4dcf7bed6794c41554fcR159-R160

The custom loader just makes the classes available without having to register them at all, or have wait until the Ghidra application is fully initialized.

clearbluejar commented 2 years ago

I made the review changes. Should I rebase and squash the commits?

That last force push was needed to set the author.