AST Improvement: Store dotted name parts

astral-sh / ruff

An extremely fast Python linter and code formatter, written in Rust.

https://docs.astral.sh/ruff

MIT License

32.6k stars 1.09k forks source link

AST Improvement: Store dotted name parts #7760

Open konstin opened 1 year ago

konstin commented 1 year ago

Currently, the path of imports is not formatted, e.g.

import a . b

remains as-is. This is due to a bug in our AST:

https://github.com/astral-sh/ruff/blob/6824b67f44d8d462f83a727ed8caf100d10c22a6/crates/ruff_python_ast/src/imports.rs#L6-L31

The entire path is represented as a single string, even though it should be dot-separated identifier (the parser calls it DottedName, but then emits an identifier), especially since identifier can not contain dots.

[ ] Fix the AST so the import path is a Vec<Identifier> or something similar
[x] Format import paths by removing whitespace (we can't insert parentheses like we do for dotted expressions)
[ ] Check that Identifier is only used for strings matching the rules

charliermarsh commented 1 year ago

I think the counterargument is that CPython uses this representation:

>>> print(ast.dump(ast.parse(s)))
Module(body=[Import(names=[alias(name='foo.bar')])], type_ignores=[])

And the ASDL uses the same identifier symbol as in other nodes: https://docs.python.org/3/library/ast.html.

But I care more about the ergonomics and performance than I do exact compatibility with CPython on these decisions. Would need to see what this makes easier or harder.

MichaReiser commented 1 year ago

@charliermarsh Is my understanding correct that CPython normalizes the identifier name (removes the whitespace)?

I was a bit surprised when I saw this representation first because it's somewhat uncommon (at least in the languages that I have used thus far)—especially considering that it re-joins the identifier tokens that the lexer identified.

Are there any upsides in the semantic model to have a single string? I would expect it to be easier to have the individual parts when e.g. resolving imports. The alternative is that we implement a components method similar to Rust's Path::components that returns the individual parts (splitting by string).

charliermarsh commented 1 year ago

Yeah it's probably an improvement to store a list of dot-separated segments. There is likely no upside in the semantic model since we always decompose into segments.

MichaReiser commented 1 year ago

Let us fix this, regardless of how it gets implemented. Splitting the names in the formatter would be silly, but something we could do.

I think we're at least lucky that the following is not valid

import (a # comment
   .b)

charliermarsh commented 1 year ago

I think this actually was fixed, we format it correctly: https://play.ruff.rs/86d1a181-4d27-4bbe-ad13-0c425e1976c0. I think this issue was about changing the AST to better reflect the real structure.