Properly document plugin results & Some more proposals for refactoring plugins

maringuu commented 2 years ago

Plugins set their result in the FileObject.processed_analysis[PLUGIN_NAME] Currently its contents are documented as follows:

        #: Analysis results for this file.
        #:
        #: Structure of results:
        #: The first level of this dict is a pair of ``'plugin_name': <result_dict>`` pairs.
        #: The result dict can have any content, but always has at least the fields:
        #:
        #: * analysis_date - float representing the time of analysis in unix time.
        #: * plugin_version - str defining the version of each plugin at time of analysis.
        #: * summary - list holding a summary of each file's result, that can be aggregated.

If I am not mistaken there are some plugins that dont adhere to this documentation. This should be verified and fixed.

As discussed with @jstucke we will change it so that a plugins writes its result to FileObject.processed_analysis[PLUGIN_NAME]["result"]. The reasoning behind this is that all keys of FileObject.processed_analysis[PLUGIN_NAME] should be the same for all plugins. In addition a plugin should document what keys it sets in result.

maringuu commented 2 years ago

Also: Many plugins plugins that depend on other plugins use their summary to parse the results. Imo summary should not be something that should be parsed but rather just a human readable string. My idea would be to properly document and define the result and document that only this should be used by dependencies. Summary can then be changed without breaking anything.

maringuu commented 2 years ago

Recap

A quick recap of the discussion that we (@jstucke @dorpvom @rhelmke ) had.

Enforcing a proper structure of the results will probably be done with json schema

The code will be refactored to have less logic in the Child classes and more logic in the PluginBae class. It will look like sth. alogn the lines the following snippet:

class PluginBase:
def _generate_result(self, file_object) -> dict:
    """Analyse the file and return a dict matchting the json schema.
    Must be implemented by the plugin"""
    raise NotImplementedError

def _generate_summary(self, result) -> str:
    """Generate a short summary based on the result returned by `generate_result`.
    Must be implemented by the plugin"""
    raise NotImplementedError

def analyze(self, file_object) -> None:
   """Run the plugin. The analysis result and metadata will be stored in `file_object.processed_analysis`.
   The structure is documented in the json schema located at /somewhere/."""
   # Do the actual analysis
   result = self._generate_result()file_object
   file_object.processed_analysis[self.NAME]["result"] = result
   file_object.processed_analysis[self.NAME]["summary"] = self._generate_summary(result)
   # Set metadata
   file_object.processed_analysis[self.NAME]["version"] = self.VERSION

Other ideas

Versioning

Enforcing a proper structure of the results will probably be done with json schema

With a properly defined schema we could use SemVer for plugin versions.

Plugin code structure

Currently the plugin code structure is like the following:

.
├── __init__.py
├── code
│   ├── __init__.py
│   └── qemu_exec.py
├── docker (Everything related to docker)
│   ├── Dockerfile
│   └── start_binary.py
├── install.py
├── routes (Some more templates and flask routes)
│   ├── __init__.py
│   ├── ajax_view.html
│   └── routes.py
├── test (Directory that contains all tests and test data)
│   ├── __init__.py
│   ├── data
│   ├── test_plugin_qemu_exec.py
│   └── test_routes.py
├── view
│   └── qemu_exec.html (Template used to display the results)
├── internal (Code that is not in the code directory)
│   └── some_code.py
│   └── some_module
│               └── nested.py

The name of the directory is something related to the plugin name as defined by AnalysisPlugin.NAME. The proposal is to change it to std like this:

├── __init__.py (Contains some code that allows us to do import plugins.analysis.qemu_exec.Plugin)
├── docker (unchanged)
│   ├── Dockerfile
│   └── start_binary.py
├── installer.py (Renamed)
├── main.py (Contains the PluginClass, which must not anymore be named AnalysisPlugin but rather sth like QemuExecPlugin)
├── routes.py
├── some_code.py
├── some_module (Now custom modules must not anymore be put inside an internal directory)
│   └── nested.py
├── templates (
└── test (Unchanged)
    ├── __init__.py
    ├── data
    └── test_plugin_qemu_exec.py

The biggest change is removing the code and internal directorys. They are currently only needed for plugin discovery. All files in the code directory are scanned for a class AnalysisPlugin and then this class is imported. This is not only bad because the classes dont have descriptive names but also unnecessarily complicated.

Also when we are doing this we could finally name the directory like the actual name of the plugin.

fkie-cad / FACT_core