PerchSecurity / dendrol

🌴 The STIX2 Pattern expression parser for humans
https://pypi.org/project/dendrol/
MIT License
25 stars 3 forks source link

Use AST-like classes from the stix2 package's patterns.py #3

Open theY4Kman opened 5 years ago

theY4Kman commented 5 years ago

As @chisholm pointed out and @rpiazza suggested over email, the oasis-open/cti-python-stix2 repo has a patterns.py file containing classes intended to represent Pattern expression symbols/components. It'd be nice if dendrol utilized these, to keep functionality all in the family, and encourage interoperability.

Using these classes could usher the dict-based format currently in use by PatternTrees into a purely serialization/visualization-related role.

rpiazza commented 5 years ago

@theY4Kman, @chisholm: As I was mentioning, version 1.1.0 of stix2patterns added a generic (no-op) visitor. I'm working on a visitor for python-stix2, which will convert the ANTLR AST into a STIX2 patterns AST. It should be available in the next release of python-stix2. Let me describe what I'm trying to do with this visitor.

One thing you might want to do is subclass those pattern classes so you can add some methods to do something interesting with the AST. But then the generic visitor would need to create a AST with using the subclass instances - which of course, it doesn't know about.

I used these pattern classes in the slider, as Andy (@chisholm) mentioned. The slider had subclasses for each pattern class (e.g., ComparisonExpressionForSlider), and a visitor to instantiate those classes. When I wanted to use the parser in another utility (which of course would have its own subclasses) I didn't want to have yet another visitor that did the same thing, except instantiated different subclasses.

Therefore, I implemented the python-stix2 visitor to dynamically determine which subclasses to instantiate. When you instantiate this visitor, you pass in the necessary information that enables you to create whatever subclasses you want.

In this way, there is not a lot of duplicate code.

Here is the code to do this:

class STIXPatternVisitorForSTIX2(STIXPatternVisitor):
    classes = {}

    def __init__(self, module_suffix, module_name):
        if module_suffix and module_name:
            self.module_suffix = module_suffix
            if STIXPatternVisitorForSTIX2.classes == {}:
                module = importlib.import_module(module_name)
                for k, c in inspect.getmembers(module, inspect.isclass):
                    STIXPatternVisitorForSTIX2.classes[k] = c
        else:
            self.module_suffix = None
        super(STIXPatternVisitor, self).__init__()

    def get_class(self, class_name):
        if class_name in STIXPatternVisitorForSTIX2.classes:
            return STIXPatternVisitorForSTIX2.classes[class_name]
        else:
            return None

    def instantiate(self, klass_name, *args):
        klass_to_instantiate = None
        if self.module_suffix:
            klass_to_instantiate = self.get_class(klass_name + "For" + self.module_suffix)
        if not klass_to_instantiate:
            # use the classes in python_stix2
            klass_to_instantiate = globals()[klass_name]
        return klass_to_instantiate(*args)

then use it as follows:

self.instantiate("OrObservationExpression", [children[0], children[2]])

If either of you know of a better design pattern to do this - please suggest it :-)

chisholm commented 5 years ago

Sounds like you have a preference for:

my_ast.do_something_interesting()

vs

do_something_interesting(my_ast)

In other words, you want to build novel things into the tree, as opposed to doing novel things with a generic standard tree. I think I'd probably prefer the latter, just because it seems simpler; there is no need for the module introspective voodoo magic. If you want to allow users to hang custom bits of data off AST nodes, you could have a settable user_data property on all of them.

But this is just my gut reaction; I haven't had the experience of building the slider, so perhaps there are other considerations I haven't thought of :)

rpiazza commented 5 years ago

@chisholm, I guess I'm an old-fashion OO programmer. With the classes of an AST there is a lot of ability to inherit methods. Also, whenever I seem to be calling isinstance a lot, it seems like methods make a lot more sense. But maybe you have some way to avoid this - like the delegation pattern?

chisholm commented 5 years ago

It's hard to reply well without more details... If every AST usage requires another set of subclasses, that seems like a lot of extra work.

You'd have to explain where isinstance fits in. A method would obviate the need for isinstance() only if that method were common to all relevant types. Because then you could be assured that the method exists no matter what the object type is. Are you thinking of needing isinstance checks on AST node types? Can all AST node classes be written with the same set of methods? If not, you'll still need isinstance. E.g. use of alternation in grammar rules can mean that a parent node could have one of several different types of children. It might be more natural to allow the method selection to vary over AST node classes, since the different nodes represent different types of substructure. And then it seems very likely you'll need isinstance.

As far as delegation: perhaps you are thinking that a user of an AST would want to augment it somehow, and if he doesn't subclass, he would have to create his own node classes which contain an AST node inside (in addition to whatever else he needs), and "delegate" to the inner AST node? That's a user design decision. I feel like I'd be trying to avoid replicating the AST node classes in my own app/library. There shouldn't technically be any need to augment at all; the AST is just a structure which reflects the meaning in a "sentence" (w.r.t. a grammar). It could be made immutable. But it might make life easier if you could hang some of your own data off of the nodes. That's why I thought of having a user_data property.

Anyway, this is all pretty abstract and intuitive for me. I may be totally missing the mark. And I don't have the slider experience you do. I was just giving some instinctual feedback.

theY4Kman commented 5 years ago

The AST that ANTLR spits out has a few too many details that make general usage cumbersome. Having a distinct symbol to link an observation expression with a qualifier is not something the general user cares about — they just wanna know what sorts of transformations to apply to the expression. With things like that, dendrol will "flatten" the tree to assure any node traversal by the user is semantically significant — in the case of observation expressions, it tacks the list of all associated qualifiers directly to the expression (where the first element should be applied to the expression first).

Dispatching a custom class for each symbol type wouldn't quite fit dendrol's use case — I'd imagine having the walk the tree again, anyway, to perform the simplifications.

Though, on the subject of suggestions for dispatch: a lot of digging into PyYAML went into development of dendrol; their dispatch system might be food for thought. Here's a rough example of what they do:

import yaml

class MyLoader(yaml.Loader):
    pass

def load_object_path(loader, node):
    path = loader.construct_sequence(node)
    return MyObjectPath(path)

MyLoader.add_constructor('!object_path', load_object_path)

assert MyLoader.load('!object_path ["a", "b", "c"]') == MyObjectPath(['a', 'b', 'c'])

The symbols that dendrol does retain, though, I'd like to return as some standardized symbol class — e.g. having a list including WithinQualifier, even if QualifiedObservationExpression is not used.

rpiazza commented 5 years ago

@theY4Kman,

You are probably correct, that you would walk the tree once to create the python-stix2 AST, then again to create the yaml.

On the other hand, you should be able to use everything in from stix2patterns and just subclass the visitor with your visitor. Better to rely on the standard grammar/parser than rolling your own. :-)

rpiazza commented 5 years ago

@chisholm,

I was suggesting that if you use the do_something_interesting(my_ast) pattern, you would need to call isinstance to determine what "something" is, based on the class instance.

As you indicated, using methods avoids having to do that - so I prefer the my_ast.do_something_interesting() pattern. You can probably avoid the "voodoo", but dynamically determining the class you want to instantiate is a common pattern - in more static OO languages (like, gulp, java).

To the point that every AST usage would require their own subclasses - probably - but I think that was my intention. We have no idea what will be done with the AST in the future.

chisholm commented 5 years ago

The AST that ANTLR spits out has a few too many details that make general usage cumbersome.

I think that technically speaking, ANTLR doesn't spit out an AST. It's more properly called a "parse tree". The technical definition of "parse" is to derive a parse tree, which illustrates how the sentence may be generated from the grammar. As far as I know, ASTs are intended to be simpler than parse trees for exactly the reason you stated: although parse trees give you what you need to know, they tend to be overly "verbose". Once you have a parse tree, you can probably create a simpler tree which carries the same information, e.g. by flattening it out. That can make the subsequent processing steps simpler too.

Of course with "actions", listeners, etc, you can leverage the parsing process to do your own thing and spit out whatever you want.

So the idea isn't to subclass parse tree nodes, Rich wants to subclass AST nodes. You wouldn't have to walk an AST to perform a simplication; the AST (possibly composed of instances of custom subclasses) would be the simplification. You just have to agree on the appropriate AST structure. The one comment I had is that nodes representing redundant parentheses should be removed :)