Open alexbw opened 7 years ago
Cython's internal AST isn't really for public consumption... Nothing on this front has changed much since those links (other than the grammar being checked in, but it's still not used by default).
On Sun, Mar 12, 2017 at 6:43 AM, Alex Wiltschko notifications@github.com wrote:
Hi all, I'm trying to root around and see if I can re-use the parser and intermediate-representation (perhaps an AST?) that the Cython project uses for a separate project of mine. I've been able to find several wiki pages proposing changes to the system ( ref1 https://github.com/cython/cython/wiki/enhancements-Parser, ref2 https://github.com/cython/cython/wiki/enhancements-Ast), but no up-to-date (read: 2016) summary of the system.
It's been awhile since I've used Cython, but I remember really enjoying the commented-out Python code printed above each block of generated C code, it greatly helped with debugging. The Python ast module doesn't support such a thing, although I believe the lib2to3 module does (although it seems a bit more heavy-weight).
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cython/cython/issues/1628, or mute the thread https://github.com/notifications/unsubscribe-auth/ATqe4s317hhRkgNn_SHEzSqTQyx5ycd1ks5rk_aMgaJpZM4MaivP .
-- You received this message because you are subscribed to the Google Groups "cython-github-notifications" group. To unsubscribe from this group and stop receiving emails from it, send an email to cython-github-notifications+unsubscribe@googlegroups.com. To post to this group, send email to cython-github-notifications@ googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/ msgid/cython-github-notifications/cython/cython/issues/1628%40github.com https://groups.google.com/d/msgid/cython-github-notifications/cython/cython/issues/1628%40github.com?utm_medium=email&utm_source=footer . For more options, visit https://groups.google.com/d/optout.
Has there been any update since mid 2017? Such Parser may be helpful for code-formatters. See https://github.com/ambv/black/issues/359 https://github.com/google/yapf/issues/175
No, there hasn't been any update. The justification to base this on CPython's grammar was to easily see the diff from Python itself.
On Thu, Nov 8, 2018 at 7:09 AM Leonard Lausen notifications@github.com wrote:
Has there been any update since mid 2017? Such Parser may be helpful for code-formatters. See ambv/black#359 https://github.com/ambv/black/issues/359 google/yapf#175 https://github.com/google/yapf/issues/175
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cython/cython/issues/1628#issuecomment-436884221, or mute the thread https://github.com/notifications/unsubscribe-auth/AAdqgW6K7IA75ptZAXeuG1FpkbfEm2yOks5us8qTgaJpZM4MaivP .
Also note that the grammar wasn't updated with the latest syntax enhancements, especially the new features in Py3.5/6/7.
I'm not sure I want to commit to a stable, public AST based on our current parse tree since it might hinder us in modifying it in the future. Maintaining an official grammar, however, might already be a helpful step for some projects.
Moving the off-topic part of #2825 conversation here.
TL;DR: I think to accomplish this goal there needs to be a non-trivial refactoring.
An Attempt to Summarize The Status Quo
For discussion purposes, I'm going to focus on cythonize
as called from within a setup.py module. The mental model I had for cythonize before actually looking at the code was that it operated in roughly four stages:
1) Use user-passed information and file-system lookups to build a CompilationOptions
object and collect all the relevant source files.
2) Parse the cython source files into some sort of internal representation, call it a Pseudo-AST (pAST)
3) Apply optimizations to the pAST.
4) Convert the pAST into C code.
The goal for this Issue as I understand it is to expose the pAST at the end of Step2 (or 3). So two questions:
(NB: I've spent enough time trying to understand the code to have opinions about it, but not well-informed opinions. Take all statements below with an implicit "If I understand correctly...")
You have to squint and tilt your head a bit to see the correspondence between the code and these four steps.
Step 1 would include most of Build.Cythonize, Build.Dependencies, parts of Compiler.Main, and all of Compiler.Options, Compiler.DebugFlags, Compiler.CmdLine. Also Tempita I guess.
Step 2 would include Plex and Compiler.Scanning (possibly Errors and Future for the sake of the dependency graph). There are other Compiler modules that might belong here conceptually (e.g. Parsing), but I have a really hard time telling because of the dependencies within Compiler.
Step 3 I'm pretty sure would include Pipeline and Optimize. Everything else I can't really tell where Step 3 stops and Step 4 starts.
Brainstorming A GamePlan
I claim that it is possible to refactor the code such that Step 1 is self-contained and returns a CompilationOptions object. The remainder of Cython.Build would then have an entrypoint that just takes a CompilationOptions object as its sole input.
From there I'm not sure what to do. The first thing Step 2 would do is create a Context object from the passed CompilationOptions. After that I haven't tracked down how it gets from there to Scanning and Plex.
You might be interested in https://github.com/cython/cython/wiki/DevDocs
The main entry point for 2. is the parser in Parsing.py
, which also creates and executes the scanner. The parser (usually) returns a ModuleNode
, which is the main syntax tree node for a module. The syntax tree is then further processed (tranformed) in multiple traversals in the following steps, which modify and specialise it. See Pipeline.py
. In the last step, the tree is traversed to generate C code from it. You can enable the "verbose pipeline" mode to see the different steps during a compiler run.
What this ticket essentially wants is to expose the parse tree with a stable API. The word "stable" is the problem here, because so far, the parse tree always was an internal implementation detail of Cython that is subject to change at any time.
You might be interested in [...]
Thanks. I read through the ppt slides and skimmed some of the other pages in the wiki. It seems like separation of stages and contributor learning curve is a recurring topic.
The main entry point for 2. is the parser in
Parsing.py
, which also creates and executes the scanner. The parser (usually) returns aModuleNode
That's helpful, thank you. It may be the case that an API that lets me get back a ModuleNode will be good enough for my purposes. @alexbw what do you think?
@scoder would you be open to the refactor of Step 1 mentioned above? I think simplification of the dependency structure would soften the learning curve for newcomers. Both the learning curve and the refactor are mentioned in the Wiki you linked. The general idea of making the separation between Steps 2 and 3 explicit seems to be similar to CEP 901, CEP 905
CEP 901 is essentially how things work today, using a multi-step pipeline with transformations that use the visitor pattern. You can chain pipeline steps to get whatever outcome you want (although many steps depend on previous ones – and before you ask, yes, for good reason: it's a pipeline). Note that the Wiki pages are much older than the current implementation.
Also look at the TreeFragment
module, it's a way to run a user defined pipeline against a code string. We use it in a couple of places (also in tests) to compile and integrate code fragments.
I agree that most of the modules in Cython.Build
do not exactly shine from their design. They mostly work and get their job done. Dependencies
implements most of cythonize()
, for example, which is weird. Still, most of that isn't really important for the problem at hand. If all you want is a parse tree, then look at Main.py
and its parse()
function. That module is the original entry point of the cython
frontend tool.
The newer cythonize()
implementation is written in a way that mostly avoids touching the original code, which IMHO is the wrong trade-off. There is quite a bit of obfuscation going on when it comes to finding files, also relative to the current source file or in the Python path. Lots of jumping back and forth between methods. If you want to try a refactoring, that's a good target. But it's not in this ticket.
To get to know the parse tree it's probably best to follow the HackerGuide, enable the debug_trace_code_generation
debug option, and then look at the C code that Cython generates.
As a note, after dabbling a bit on the cython code, I'm using
from Cython.Compiler.TreeFragment import parse_from_strings
mod = parse_from_strings(name, source)
to obtain the AST and it seems it's working properly for me (as a note, I started with TreeFragment
but went for parse_from_strings
because with TreeFragment
changes the source code removing empty lines, which means that lines don't match with that API).
Thanks @fabioz - I'm also using parse_from_strings
in https://github.com/MarcoGorelli/cython-lint and it seems to be working fine
The hardest thing is not having something like ast.walk
, so I'm having to do
child_attrs = set(copy.deepcopy(node.child_attrs))
if isinstance(node, CClassDefNode):
child_attrs.update(['bases', 'decorators'])
elif isinstance(node, TypecastNode):
child_attrs.add('base_type')
elif isinstance(node, GeneratorExpressionNode):
if hasattr(node, 'loop'):
child_attrs.add('loop')
else: # pragma: no cover
err_msg(node, 'GeneratorExpressionNode with loop attribute')
elif isinstance(node, CFuncDefNode):
child_attrs.add('decorators')
elif isinstance(node, FusedTypeNode):
child_attrs.add('types')
elif isinstance(node, ForInStatNode):
child_attrs.add('target')
elif isinstance(node, NewExprNode):
child_attrs.add('cppclass')
for attr in child_attrs:
child = getattr(node, attr)
if isinstance(child, list):
nodes.extend([NodeParent(_child, node) for _child in child])
else:
nodes.append(NodeParent(child, node))
yield node_parent
The hardest thing is not having something like
ast.walk
I suggest subclassing a TreeVisitor
, as implemented in Cython.Compiler.Visitor.
Thanks @scoder ! My issue is that that would still miss some types - for example:
from Cython.Compiler.Visitor import CythonTransform
from Cython.Compiler.TreeFragment import parse_from_strings
class MyVisitor(CythonTransform):
def __init__(self):
super().__init__(context=None)
self.my_names = set()
def visit_CSimpleBaseTypeNode(self, node):
self.my_names.add(node.name)
return node
content = (
'from foo cimport bar, baz, bat\n'
'\n'
'ctypedef fused MyType:\n'
' bar\n'
' baz\n'
'\n'
'cdef void myfunc():\n'
' cdef int a = 0\n'
' a = 3\n'
'\n'
)
tree = parse_from_strings('t.pyx', content)
my_visitor = MyVisitor()
my_visitor(tree)
print(my_visitor.my_names)
prints
{'void', 'int'}
, rather than {'void', 'int', 'bar', 'baz'}
. This is because FusedTypeNode.childattrs
is empty
I have to patch this by adding 'types'
to child_attrs
- would this (and the other patches in the link) be OK to do here in Cython
directly? I asked about this one in the mailing list and was told
The effect of not including it in
child_attrs
is thattypes
isn't processed byParseTreeTransforms
but just left as is. As far as I can tell nothing interesting actually gets done withCSimpleBaseTypeNode
orCBaseTypeNode
by anything inParseTreeTransforms
. Therefore leaving it out practically makes no difference.It possibly is an omission, but not one that has any consequence for Cython itself.
Hi @scoder
I'm coming across more examples of when nodes exist, but aren't visited by such a visitor
For example, in
def __getstate__():
cdef np.intp_t size = cself.tree_buffer.size() * sizeof(ckdtreenode)
cdef np.ndarray tree = np.asarray(<char[:size]> cself.tree_buffer.data())
then size
within char[:size]
wouldn't be visited.
This is because CythonArrayNode
has a base_type_node
attribute, but it doesn't appear in child_attrs
, so it doesn't get traversed:
(Pdb) tree.body.body.stats[-1].expr.args[0]
<Cython.Compiler.ExprNodes.CythonArrayNode object at 0x7fce9077bb80>
(Pdb) tree.body.body.stats[-1].expr.args[0].child_attrs
['operand', 'shapes']
(Pdb) tree.body.body.stats[-1].expr.args[0].base_type_node
<Cython.Compiler.Nodes.MemoryViewSliceTypeNode object at 0x7fce9077bac0>
Would you be open to a PR to add this and other missing nodes to child_attrs
of some classes, or might it cause issues somewhere else in Cython?
Would you be open to a PR to add this and other missing nodes to
child_attrs
of some classes, or might it cause issues somewhere else in Cython?
Sounds good, pr is surely welcome
Hi all, I'm trying to root around and see if I can re-use the parser and intermediate-representation (perhaps an AST?) that the Cython project uses for a separate project of mine. I've been able to find several wiki pages proposing changes to the system (ref1, ref2), but no up-to-date (read: 2016) summary of the system.
It's been awhile since I've used Cython, but I remember really enjoying the commented-out Python code printed above each block of generated C code, it greatly helped with debugging. The Python
ast
module doesn't support such a thing, although I believe thelib2to3
module does (although it seems a bit more heavy-weight).