See what we can glean from Prism

gazayas commented 1 year ago

Prism (previously called YARP) is the new parser for Ruby. Looking at the different methods it has to offer, I'm thinking #parse and #parse_lex could potentially be useful. Here's a look at #parse.

# prism.rb

require "yarp"

code = <<~CODE
x = 1
puts x
CODE

p YARP.parse(code)

> bundle exec ruby prism.rb
#<YARP::ParseResult:0x00007fe8465ba0a0 @value=@ ProgramNode (location: (0...12))
├── locals: [:x]
└── statements:
    @ StatementsNode (location: (0...12))
    └── body: (length: 2)
        ├── @ LocalVariableWriteNode (location: (0...5))
        │   ├── name: :x
        │   ├── depth: 0
        │   ├── name_loc: (0...1) = "x"
        │   ├── value:
        │   │   @ IntegerNode (location: (4...5))
        │   └── operator_loc: (2...3) = "="
        └── @ CallNode (location: (6...12))
            ├── receiver: ∅
            ├── call_operator_loc: ∅
            ├── message_loc: (6...10) = "puts"
            ├── opening_loc: ∅
            ├── arguments:
            │   @ ArgumentsNode (location: (11...12))
            │   └── arguments: (length: 1)
            │       └── @ LocalVariableReadNode (location: (11...12))
            │           ├── name: :x
            │           └── depth: 0
            ├── closing_loc: ∅
            ├── block: ∅
            ├── flags: 
            └── name: "puts"
, @comments=[], @errors=[], @warnings=[], @source=#<YARP::Source:0x00007fe846647360 @source="x = 1\nputs x\n", @offsets=[0, 6, 13]>>

What's different here is that the location is displayed as a range within the entire program itself. So, for example, this program has code from 0...12, and the integer 1 shows up inside 0...12 at 4...5.

This doesn't display line number information like Ripper does (and like what we currently do with Masamune). However, having this information can be really useful for blocks. Check this out:

# prism.rb

require "yarp"

code = <<~CODE
x = 1
if x == 1
  puts "x is 1"
end
CODE

p YARP.parse(code)

> bundle exec ruby prism.rb
#<YARP::ParseResult:0x00007fc4d304bd30 @value=@ ProgramNode (location: (0...35))
├── locals: [:x]
└── statements:
    @ StatementsNode (location: (0...35))
    └── body: (length: 2)
        ├── @ LocalVariableWriteNode (location: (0...5))
        │   ├── name: :x
        │   ├── depth: 0
        │   ├── name_loc: (0...1) = "x"
        │   ├── value:
        │   │   @ IntegerNode (location: (4...5))
        │   └── operator_loc: (2...3) = "="
        └── @ IfNode (location: (6...35))
            ├── if_keyword_loc: (6...8) = "if"
            ├── predicate:
            │   @ CallNode (location: (9...15))
            │   ├── receiver:
            │   │   @ LocalVariableReadNode (location: (9...10))
            │   │   ├── name: :x
            │   │   └── depth: 0
            │   ├── call_operator_loc: ∅
            │   ├── message_loc: (11...13) = "=="
            │   ├── opening_loc: ∅
            │   ├── arguments:
            │   │   @ ArgumentsNode (location: (14...15))
            │   │   └── arguments: (length: 1)
            │   │       └── @ IntegerNode (location: (14...15))
            │   ├── closing_loc: ∅
            │   ├── block: ∅
            │   ├── flags: 
            │   └── name: "=="
            ├── statements:
            │   @ StatementsNode (location: (18...31))
            │   └── body: (length: 1)
            │       └── @ CallNode (location: (18...31))
            │           ├── receiver: ∅
            │           ├── call_operator_loc: ∅
            │           ├── message_loc: (18...22) = "puts"
            │           ├── opening_loc: ∅
            │           ├── arguments:
            │           │   @ ArgumentsNode (location: (23...31))
            │           │   └── arguments: (length: 1)
            │           │       └── @ StringNode (location: (23...31))
            │           │           ├── opening_loc: (23...24) = "\""
            │           │           ├── content_loc: (24...30) = "x is 1"
            │           │           ├── closing_loc: (30...31) = "\""
            │           │           └── unescaped: "x is 1"
            │           ├── closing_loc: ∅
            │           ├── block: ∅
            │           ├── flags: 
            │           └── name: "puts"
            ├── consequent: ∅
            └── end_keyword_loc: (32...35) = "end"
, @comments=[], @errors=[], @warnings=[], @source=#<YARP::Source:0x00007fc4d319b8e8 @source="x = 1\nif x == 1\n  puts \"x is 1\"\nend\n", @offsets=[0, 6, 16, 32, 36]>>

This gives us the exact ending location for the if statement!

@ IfNode (location: (6...35))

We do a lot of namespace/resource block editing in Bullet Train, so if we can manage to implement this in Masamune with Prism, we can be a lot more confident with finding/editing blocks instead of trying to find the line numbers based off of whitespace indentation.

Perhaps we can add a new attribute to Data Nodes in Masamune that have this location besides the one we already provide (currently we provide something like [4, 7], the line number and the index on the line where the identifier starts).

syntax_tree

I think we can still glean some things from #11 as well, but getting the exact location as a range like this seems more useful right now.

kaspth commented 1 year ago

I've just been looking through Masamune's code and one thing that sticks out to me is the result_type: that's passed all around. One of the cool things about Prism is that it takes full advantage of Ruby's pattern matching by implementing deconstruct and deconstruct_keys, and if we copy something like that to our nodes, we could remove the need for the Hash result_type.

Basically, the Hash result_type suggests to me that there's a leak in the Node API, and we improve them they'd be easier to work with than Hashes. Which could simplify calling code.

gazayas commented 1 year ago

Basically, the Hash result_type suggests to me that there's a leak in the Node API, and we improve them they'd be easier to work with than Hashes. Which could simplify calling code.

After using hashes for a while I definitely think it's best to return the node classes themselves, so the next two things I want to prioritize is:

Registering Prism nodes to get location. Also for use with search methods (i.e - msmn.variables)
Returning node classes as opposed to simple hashes.

That way we can add methods to the nodes down the line which will be easier to manage as you mentioned instead of just adding values to a hash.

Concerning the first issue, I'm finding that Prism doesn't actually contain the token sometimes in the results:

Prism.parse("7")

@ ProgramNode (location: (1,0)-(1,1))
├── locals: []
└── statements:
    @ StatementsNode (location: (1,0)-(1,1))
    └── body: (length: 1)
        └── @ IntegerNode (location: (1,0)-(1,1))
            └── flags: decimal

So instead of completely replacing the tree parsing we have in place now, I plan on opening a PR to use Prism nodes alongside the current code and see what will work best long-term. If perchance I'm overlooking something with Prism, then we can implement it in the PR as well.

gazayas commented 1 year ago

It also looks like location itself is changing. Whereas before it was a Range, now it looks like (1,0)-(1,1)) (I have a PR to remove the trailing parenthesis here).

There are some helpful methods in the Location class though:

[1] pry(main)> node.location
=> (1,0)-(2,6))
[2] pry(main)> node.location.start_line
=> 1
[3] pry(main)> node.location.start_column
=> 0
[4] pry(main)> node.location.end_line
=> 2
[5] pry(main)> node.location.end_column
=> 6

kddnewton commented 1 year ago

Hi there 👋! Author of prism here.

I took a look at your code here and I would definitely suggest just passing around Prism nodes. They're going to have all of the location that you need so you won't have to implement your own syntax tree.

The tokens themselves you can always fetch using the Prism::Node#slice (which is an delegation for Prism::Node#location#slice). Every node and location object can always fetch itself from the source. Prism::Location itself stores a byte offset and a byte length, but you can use the query methods you discovered in the above comment to retrieve any additional information you might want from it.

You can use pattern matching expressions with anything that you might want, but I would also suggest checking out the Prism::Pattern class, which is designed for finding specific patterns within a tree. The other thing you might want to check out is Prism::MutationCompiler, which is designed for doing the kind of rewriting that you're doing. (For an example of how that might work, check out Prism::DesugarCompiler.

As for Syntax Tree, the nodes that are a part of it are going to go away in the next major release, in favor of using Prisms AST directly. They're essentially a less effective version of the same nodes with less useful structure and less information.

If you have anything that you feel like is missing from the Prism AST structure or the nodes themselves, I would be more than happy to add it. You can see a couple of node-specific methods here: https://github.com/ruby/prism/blob/main/lib/prism/node_ext.rb. Anything else that you might want to add to make your use-case easier just let me know.

The point of prism was to help the community come together on a single AST design so that we could build great tooling like masamune without having to relearn the parser. So if there's anything I can do to support you, please let me know!

gazayas commented 1 year ago

@kddnewton Thank you so much for the feedback!

`slice`

I see, I was able to retrieve the token directly from the node like you mentioned:

[1] pry(main)> parse_result = Prism.parse("7")
=> #<Prism::ParseResult:0x00007f38fbe810a8
 @comments=[],
 @errors=[],
 @source=#<Prism::Source:0x00007f38fbec2a80 @offsets=[0], @source="7">,
 @value=
  @ ProgramNode (location: (1,0)-(1,1))
  ├── locals: []
  └── statements:
      @ StatementsNode (location: (1,0)-(1,1))
      └── body: (length: 1)
          └── @ IntegerNode (location: (1,0)-(1,1))
              └── flags: decimal,
 @warnings=[]>
[2] pry(main)> parse_result.value.slice
=> "7"

That will definitely make it easier to standardize on Prism as opposed to having a hybrid of Prism and pure Ripper to populate Masamune results.

`Prism::MutationCompiler`

I'll be sure to take a look into this and Prism::DesugarCompiler! Rewriting source code precisely in an easy-to-read manner is a big reason as to why Masamune came to be, so I'll see what I can implement here from Prism::MutationCompiler.

Other thoughts

I'm definitely interested in ASTs and Ruby source code manipulation, so I'd like to continue browsing through Prism. I'll be sure to open an issue for a feature request if I'd like to see something implemented, or I might open another PR if anything! Thanks again!

gazayas / masamune-ast