cython / cython

The most widely used Python to C compiler
https://cython.org
Apache License 2.0
9.45k stars 1.49k forks source link

Expose Parser/AST Internals #1628

Open alexbw opened 7 years ago

alexbw commented 7 years ago

Hi all, I'm trying to root around and see if I can re-use the parser and intermediate-representation (perhaps an AST?) that the Cython project uses for a separate project of mine. I've been able to find several wiki pages proposing changes to the system (ref1, ref2), but no up-to-date (read: 2016) summary of the system.

It's been awhile since I've used Cython, but I remember really enjoying the commented-out Python code printed above each block of generated C code, it greatly helped with debugging. The Python ast module doesn't support such a thing, although I believe the lib2to3 module does (although it seems a bit more heavy-weight).

cython-notifications commented 7 years ago

Cython's internal AST isn't really for public consumption... Nothing on this front has changed much since those links (other than the grammar being checked in, but it's still not used by default).

On Sun, Mar 12, 2017 at 6:43 AM, Alex Wiltschko notifications@github.com wrote:

Hi all, I'm trying to root around and see if I can re-use the parser and intermediate-representation (perhaps an AST?) that the Cython project uses for a separate project of mine. I've been able to find several wiki pages proposing changes to the system ( ref1 https://github.com/cython/cython/wiki/enhancements-Parser, ref2 https://github.com/cython/cython/wiki/enhancements-Ast), but no up-to-date (read: 2016) summary of the system.

It's been awhile since I've used Cython, but I remember really enjoying the commented-out Python code printed above each block of generated C code, it greatly helped with debugging. The Python ast module doesn't support such a thing, although I believe the lib2to3 module does (although it seems a bit more heavy-weight).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cython/cython/issues/1628, or mute the thread https://github.com/notifications/unsubscribe-auth/ATqe4s317hhRkgNn_SHEzSqTQyx5ycd1ks5rk_aMgaJpZM4MaivP .

-- You received this message because you are subscribed to the Google Groups "cython-github-notifications" group. To unsubscribe from this group and stop receiving emails from it, send an email to cython-github-notifications+unsubscribe@googlegroups.com. To post to this group, send email to cython-github-notifications@ googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/ msgid/cython-github-notifications/cython/cython/issues/1628%40github.com https://groups.google.com/d/msgid/cython-github-notifications/cython/cython/issues/1628%40github.com?utm_medium=email&utm_source=footer . For more options, visit https://groups.google.com/d/optout.

leezu commented 5 years ago

Has there been any update since mid 2017? Such Parser may be helpful for code-formatters. See https://github.com/ambv/black/issues/359 https://github.com/google/yapf/issues/175

robertwb commented 5 years ago

No, there hasn't been any update. The justification to base this on CPython's grammar was to easily see the diff from Python itself.

On Thu, Nov 8, 2018 at 7:09 AM Leonard Lausen notifications@github.com wrote:

Has there been any update since mid 2017? Such Parser may be helpful for code-formatters. See ambv/black#359 https://github.com/ambv/black/issues/359 google/yapf#175 https://github.com/google/yapf/issues/175

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cython/cython/issues/1628#issuecomment-436884221, or mute the thread https://github.com/notifications/unsubscribe-auth/AAdqgW6K7IA75ptZAXeuG1FpkbfEm2yOks5us8qTgaJpZM4MaivP .

scoder commented 5 years ago

Also note that the grammar wasn't updated with the latest syntax enhancements, especially the new features in Py3.5/6/7.

scoder commented 5 years ago

I'm not sure I want to commit to a stable, public AST based on our current parse tree since it might hinder us in modifying it in the future. Maintaining an official grammar, however, might already be a helpful step for some projects.

jbrockmendel commented 5 years ago

Moving the off-topic part of #2825 conversation here.

TL;DR: I think to accomplish this goal there needs to be a non-trivial refactoring.


An Attempt to Summarize The Status Quo

For discussion purposes, I'm going to focus on cythonize as called from within a setup.py module. The mental model I had for cythonize before actually looking at the code was that it operated in roughly four stages:

1) Use user-passed information and file-system lookups to build a CompilationOptions object and collect all the relevant source files. 2) Parse the cython source files into some sort of internal representation, call it a Pseudo-AST (pAST) 3) Apply optimizations to the pAST. 4) Convert the pAST into C code.

The goal for this Issue as I understand it is to expose the pAST at the end of Step2 (or 3). So two questions:

(NB: I've spent enough time trying to understand the code to have opinions about it, but not well-informed opinions. Take all statements below with an implicit "If I understand correctly...")

You have to squint and tilt your head a bit to see the correspondence between the code and these four steps.

Step 1 would include most of Build.Cythonize, Build.Dependencies, parts of Compiler.Main, and all of Compiler.Options, Compiler.DebugFlags, Compiler.CmdLine. Also Tempita I guess.

Step 2 would include Plex and Compiler.Scanning (possibly Errors and Future for the sake of the dependency graph). There are other Compiler modules that might belong here conceptually (e.g. Parsing), but I have a really hard time telling because of the dependencies within Compiler.

Step 3 I'm pretty sure would include Pipeline and Optimize. Everything else I can't really tell where Step 3 stops and Step 4 starts.


Brainstorming A GamePlan

I claim that it is possible to refactor the code such that Step 1 is self-contained and returns a CompilationOptions object. The remainder of Cython.Build would then have an entrypoint that just takes a CompilationOptions object as its sole input.

From there I'm not sure what to do. The first thing Step 2 would do is create a Context object from the passed CompilationOptions. After that I haven't tracked down how it gets from there to Scanning and Plex.

scoder commented 5 years ago

You might be interested in https://github.com/cython/cython/wiki/DevDocs

The main entry point for 2. is the parser in Parsing.py, which also creates and executes the scanner. The parser (usually) returns a ModuleNode, which is the main syntax tree node for a module. The syntax tree is then further processed (tranformed) in multiple traversals in the following steps, which modify and specialise it. See Pipeline.py. In the last step, the tree is traversed to generate C code from it. You can enable the "verbose pipeline" mode to see the different steps during a compiler run.

What this ticket essentially wants is to expose the parse tree with a stable API. The word "stable" is the problem here, because so far, the parse tree always was an internal implementation detail of Cython that is subject to change at any time.

jbrockmendel commented 5 years ago

You might be interested in [...]

Thanks. I read through the ppt slides and skimmed some of the other pages in the wiki. It seems like separation of stages and contributor learning curve is a recurring topic.

The main entry point for 2. is the parser in Parsing.py, which also creates and executes the scanner. The parser (usually) returns a ModuleNode

That's helpful, thank you. It may be the case that an API that lets me get back a ModuleNode will be good enough for my purposes. @alexbw what do you think?

@scoder would you be open to the refactor of Step 1 mentioned above? I think simplification of the dependency structure would soften the learning curve for newcomers. Both the learning curve and the refactor are mentioned in the Wiki you linked. The general idea of making the separation between Steps 2 and 3 explicit seems to be similar to CEP 901, CEP 905

scoder commented 5 years ago

CEP 901 is essentially how things work today, using a multi-step pipeline with transformations that use the visitor pattern. You can chain pipeline steps to get whatever outcome you want (although many steps depend on previous ones – and before you ask, yes, for good reason: it's a pipeline). Note that the Wiki pages are much older than the current implementation.

Also look at the TreeFragment module, it's a way to run a user defined pipeline against a code string. We use it in a couple of places (also in tests) to compile and integrate code fragments.

I agree that most of the modules in Cython.Build do not exactly shine from their design. They mostly work and get their job done. Dependencies implements most of cythonize(), for example, which is weird. Still, most of that isn't really important for the problem at hand. If all you want is a parse tree, then look at Main.py and its parse() function. That module is the original entry point of the cython frontend tool.

The newer cythonize() implementation is written in a way that mostly avoids touching the original code, which IMHO is the wrong trade-off. There is quite a bit of obfuscation going on when it comes to finding files, also relative to the current source file or in the Python path. Lots of jumping back and forth between methods. If you want to try a refactoring, that's a good target. But it's not in this ticket.

To get to know the parse tree it's probably best to follow the HackerGuide, enable the debug_trace_code_generation debug option, and then look at the C code that Cython generates.

fabioz commented 4 years ago

As a note, after dabbling a bit on the cython code, I'm using

from Cython.Compiler.TreeFragment import parse_from_strings
mod = parse_from_strings(name, source)

to obtain the AST and it seems it's working properly for me (as a note, I started with TreeFragment but went for parse_from_strings because with TreeFragment changes the source code removing empty lines, which means that lines don't match with that API).

MarcoGorelli commented 1 year ago

Thanks @fabioz - I'm also using parse_from_strings in https://github.com/MarcoGorelli/cython-lint and it seems to be working fine

The hardest thing is not having something like ast.walk, so I'm having to do

https://github.com/MarcoGorelli/cython-lint/blob/d03e211dcfa9aa45cc0d343f78702932c2fdae9b/cython_lint.py#L621-L646

        child_attrs = set(copy.deepcopy(node.child_attrs))

        if isinstance(node, CClassDefNode):
            child_attrs.update(['bases', 'decorators'])
        elif isinstance(node, TypecastNode):
            child_attrs.add('base_type')
        elif isinstance(node, GeneratorExpressionNode):
            if hasattr(node, 'loop'):
                child_attrs.add('loop')
            else:  # pragma: no cover
                err_msg(node, 'GeneratorExpressionNode with loop attribute')
        elif isinstance(node, CFuncDefNode):
            child_attrs.add('decorators')
        elif isinstance(node, FusedTypeNode):
            child_attrs.add('types')
        elif isinstance(node, ForInStatNode):
            child_attrs.add('target')
        elif isinstance(node, NewExprNode):
            child_attrs.add('cppclass')

        for attr in child_attrs:
            child = getattr(node, attr)
            if isinstance(child, list):
                nodes.extend([NodeParent(_child, node) for _child in child])
            else:
                nodes.append(NodeParent(child, node))
        yield node_parent
scoder commented 1 year ago

The hardest thing is not having something like ast.walk

I suggest subclassing a TreeVisitor, as implemented in Cython.Compiler.Visitor.

MarcoGorelli commented 1 year ago

Thanks @scoder ! My issue is that that would still miss some types - for example:

from Cython.Compiler.Visitor import CythonTransform
from Cython.Compiler.TreeFragment import parse_from_strings

class MyVisitor(CythonTransform):
    def __init__(self):
        super().__init__(context=None)
        self.my_names = set()

    def visit_CSimpleBaseTypeNode(self, node):
        self.my_names.add(node.name)
        return node

content = (
    'from foo cimport bar, baz, bat\n'
    '\n'
    'ctypedef fused MyType:\n'
    '    bar\n'
    '    baz\n'
    '\n'
    'cdef void myfunc():\n'
    '    cdef int a = 0\n'
    '    a = 3\n'
    '\n'
)

tree = parse_from_strings('t.pyx', content)

my_visitor = MyVisitor()
my_visitor(tree)
print(my_visitor.my_names)

prints

{'void', 'int'}

, rather than {'void', 'int', 'bar', 'baz'}. This is because FusedTypeNode.childattrs is empty

I have to patch this by adding 'types' to child_attrs - would this (and the other patches in the link) be OK to do here in Cython directly? I asked about this one in the mailing list and was told

The effect of not including it in child_attrs is that types isn't processed by ParseTreeTransforms but just left as is. As far as I can tell nothing interesting actually gets done with CSimpleBaseTypeNode or CBaseTypeNode by anything in ParseTreeTransforms. Therefore leaving it out practically makes no difference.

It possibly is an omission, but not one that has any consequence for Cython itself.

MarcoGorelli commented 1 year ago

Hi @scoder

I'm coming across more examples of when nodes exist, but aren't visited by such a visitor

For example, in

def __getstate__():
    cdef np.intp_t size = cself.tree_buffer.size() * sizeof(ckdtreenode)
    cdef np.ndarray tree = np.asarray(<char[:size]> cself.tree_buffer.data())

then size within char[:size] wouldn't be visited.

This is because CythonArrayNode has a base_type_node attribute, but it doesn't appear in child_attrs, so it doesn't get traversed:

(Pdb) tree.body.body.stats[-1].expr.args[0]
<Cython.Compiler.ExprNodes.CythonArrayNode object at 0x7fce9077bb80>
(Pdb) tree.body.body.stats[-1].expr.args[0].child_attrs
['operand', 'shapes']
(Pdb) tree.body.body.stats[-1].expr.args[0].base_type_node
<Cython.Compiler.Nodes.MemoryViewSliceTypeNode object at 0x7fce9077bac0>

Would you be open to a PR to add this and other missing nodes to child_attrs of some classes, or might it cause issues somewhere else in Cython?

0dminnimda commented 1 year ago

Would you be open to a PR to add this and other missing nodes to child_attrs of some classes, or might it cause issues somewhere else in Cython?

Sounds good, pr is surely welcome