elixir-lang / tree-sitter-elixir

Elixir grammar for tree-sitter
https://elixir-lang.org/tree-sitter-elixir
Apache License 2.0
248 stars 25 forks source link

querying 'not in' #9

Closed the-mikedavis closed 3 years ago

the-mikedavis commented 3 years ago

Hi again :wave:

So I'm pretty sure this is a bug in the tree-sitter query mechanism but I can't get the not in binary operator to match a query.

I see in docs (which are lovely by the way :) that not in is parsed with the external scanner, so my guess is that something in the lexer (either in scanner.cc or something in tree-sitter) is not getting the full information it needs to understand the not in token (byte/codepoint starts/stops maybe?). I also suspect it might be possible to fix this with extra rules in the grammar.js.

Comparing the query results from a standard binary_operator like in with not in we see:

$ echo "a in b" > in.exs
$ echo "a not in b" > not-in.exs
$ echo "(binary_operator operator: _ @operator)" > query.scm
$ tree-sitter query query.scm in.exs
in.exs
  pattern: 0
    capture: operator, start: (0, 2), text: "in"
$ tree-sitter query query.scm not-in.exs
not-in.exs

# no match! (╯°□°)╯︵ ┻━┻

(The in.exs can be replaced by other binary_operators such as ++ to the same effect.)

Looking at the parse trees, there's some peculiar behavior where not in doesn't show up but other binary operators do:

$ tree-sitter parse -x in.exs
(source [0, 0] - [1, 0]
  (binary_operator [0, 0] - [0, 6]
    left: (identifier [0, 0] - [0, 1])
    right: (identifier [0, 5] - [0, 6])))
<source>
  <binary_operator>
    <identifier type="left">a</identifier>
in
    <identifier type="right">b</identifier>
</binary_operator>
</source>

$ tree-sitter parse -x not-in.exs
(source [0, 0] - [1, 0]
  (binary_operator [0, 0] - [0, 10]
    left: (identifier [0, 0] - [0, 1])
    right: (identifier [0, 9] - [0, 10])))
<source>
  <binary_operator>
    <identifier type="left">a</identifier>

    <identifier type="right">b</identifier>
</binary_operator>
</source>

And I think it's odd that not in is not there between the <identifier>s :thinking:. That code in the tree-sitter CLI I think is this block:

let start = node.start_byte();
let end = node.end_byte();
let value =
    std::str::from_utf8(&source_code[start..end]).expect("has a string");
write!(&mut stdout, "{}", html_escape::encode_text(value))?;

Which is why I suspect there might be some missing information about the start and stop bytes of the $._not_in rule. (Although as we'll see below, the parser does seem to know the start/stop bytes when $._not_in is changed to be a non-hidden rule.)

a possible but not-great workaround

One workaround which allows that query.scm to match (and therefore query/highlight not in the same as any other binary_operator) is to change the grammar.js's rule for $._not_in to $.not_in so as to unhide it. Then we see a parse result of

$ tree-sitter generate
$ tree-sitter parse -x not-in.exs
(source [0, 0] - [1, 0]
  (binary_operator [0, 0] - [0, 10]
    left: (identifier [0, 0] - [0, 1])
    operator: (not_in [0, 2] - [0, 8])
    right: (identifier [0, 9] - [0, 10])))
<source>
  <binary_operator>
    <identifier type="left">a</identifier>

    <not_in type="operator">not in</not_in>

    <identifier type="right">b</identifier>
</binary_operator>
</source>

And the query matches!

$ tree-sitter query query.scm not-in.exs
not-in.exs
  pattern: 0
    capture: operator, start: (0, 2), text: "not in"

It also works as expected with arbitrary whitespace like a not in b.

This seems pretty hacky to me though to throw an extra operator node in there just for the sake of making this match the query though.

I haven't dug too deep yet into the tree-sitter codebase yet to try to hunt down exactly what's going on. I thought I'd write out my findings here first in hopes that you might already know why this parses strangely and can't be queried.

jonatanklosko commented 3 years ago

Hey, thanks for the detailed report! I was playing with this for a bit, and managed to narrow it down. In the end it's not even related to the use of external scanner, but just the hidden node. I will submit an issue to tree-sitter with minimal reproduction shortly :)

jonatanklosko commented 3 years ago

Reported in https://github.com/tree-sitter/tree-sitter/issues/1441! I actually found a reasonable solution, so will go ahead and use it :)

jonatanklosko commented 3 years ago

@the-mikedavis it should works as expected now, let me know if you encounter any issues :)

the-mikedavis commented 3 years ago

Ah you're awesome, great find!

I just merged those changes in to the helix PR and it works!

not-in-compare

jonatanklosko commented 3 years ago

Beautiful! :cat: