elixir-tools / spitfire

Error tolerant parser for Elixir
https://www.elixir-tools.dev
MIT License

Error-tolerant tokenization #29

Closed · zachallaun closed this 6 months ago

zachallaun commented 6 months ago

I've been thinking more about error tolerance in Elixir parsing and wanted to start a discussion about errors that come from the tokenizer.

Currently, the behavior of the tokenizer is to stop tokenizing when an error is encountered, emitting the error and the token accumulator. This means that Spitfire can only be error-tolerant to the degree that the tokenizer is. Consider this bit of example code:

defmodule Foo do
  def bar(x) do
<<<<<<< HEAD
    x + 1
=======
    x + 2
>>>>>>> main
  end

  def baz(x), do: x + 3
end

When the tokenizer encounters the merge conflict, it bails out, so the best that Spitfire can do is to give us context about the module and function head, but nothing about the body and nothing about the definition for baz below.

If the tokenizer were to accumulate errors instead, Spitfire might be able to give us the AST for the following, as well as errors indicating that VCS merge conflict markers were present on lines 3, 5, and 7:

defmodule Foo do
  def bar(x) do

    x + 1

    x + 2

  end

  def baz(x), do: x + 3
end
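
For illustration, borrowing the `{:error, ast, errors}` shape that `Spitfire.parse/1` returns (the exact error messages here are made up), that recovery might surface as:

{:error, recovered_ast,
 [
   {[line: 3, column: 1], "unexpected version control marker: <<<<<<< HEAD"},
   {[line: 5, column: 1], "unexpected version control marker: ======="},
   {[line: 7, column: 1], "unexpected version control marker: >>>>>>> main"}
 ]}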

While I haven't reviewed every tokenizer error case, there are a significant number where an error could be accumulated and tokenization could continue. Alternatively, the tokenizer could emit "error tokens" of some kind, e.g. {{:error, :version_control_marker}, {3, 1, nil}, ~c"<<<<<<< HEAD"}, which would then be handled in the parser.
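
As a minimal, self-contained sketch of the error-token idea (nothing here is Spitfire's actual API; the token shape follows the example above), a scan pass could turn conflict-marker lines into error tokens instead of aborting, and the parser could then split them out as diagnostics and keep going:

defmodule ErrorTolerantScan do
  # Simplified line-based stand-in for the real tokenizer: conflict
  # markers become error tokens; everything else becomes a stub token.
  def scan(source) do
    source
    |> String.split("\n")
    |> Enum.with_index(1)
    |> Enum.map(fn {text, line} ->
      if String.starts_with?(text, ["<<<<<<<", "=======", ">>>>>>>"]) do
        {{:error, :version_control_marker}, {line, 1, nil}, String.to_charlist(text)}
      else
        {:line, {line, 1, nil}, String.to_charlist(text)}
      end
    end)
  end

  # The parser could partition error tokens into diagnostics and
  # continue parsing the remaining tokens.
  def partition(tokens) do
    Enum.split_with(tokens, &match?({{:error, _}, _, _}, &1))
  end
end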

mhanberg commented 6 months ago

The way handwritten parsers generally tokenize is one token at a time as you parse, but Elixir's tokenizer is designed to work with a parser generator, so it tokenizes the entire document first.

I used the internal tokenizer just as a way to skip a bunch of work and make progress, but it's been on my radar to alter the tokenizer to work how I described above.

That way as you parse, you can decide how to handle unknown tokens.

We can use this as the tracking issue.

Basically, the API of the tokenizer should be "next_token" rather than "tokenize".
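
For concreteness, the streaming interface might look something like this (a sketch only; the eventual signature is up for grabs):

defmodule StreamingTokenizer do
  # Hypothetical behaviour: each call returns one token plus the state
  # needed to produce the next one, an error token, or end-of-input.
  @type state :: term()
  @type token :: tuple()

  @callback next_token(state) ::
              {:ok, token, state} | {:error, term(), state} | :eof
end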

Are you interested in taking this on?

zachallaun commented 6 months ago

The way handwritten parsers generally tokenize is one token at a time as you parse, but Elixir's tokenizer is designed to work with a parser generator, so it tokenizes the entire document first.

Makes sense!

Are you interested in taking this on?

Potentially, yes, but it's not something I can necessarily commit to at the moment. I'll share any progress here!

mhanberg commented 6 months ago

Potentially, yes, but it's not something I can necessarily commit to at the moment. I'll share any progress here!

No worries! Just wanted to clarify, as this is something that I was going to tackle soon if the priorities aligned correctly.

I'll go ahead and write up a (hopefully detailed) ticket that describes the needed changes and the desired outcome. I'll close this issue once that's created.

Another issue with the current "total lexing" approach is that it bails if any unrecognized operators are lexed, which doesn't allow the parser to collect an error for that and continue on. Here is a pending test I have that describes that issue as well:

    # TODO: this needs a change to the tokenizer, I believe, or a way to splice out the unknown token
    @tag :skip
    test "unknown prefix operator" do
      code = "foo $bar, baz"

      assert Spitfire.parse(code) ==
               {:error, {:foo, [line: 1, column: 1], [{:__block__, [error: true, line: 1, column: 5], []}]},
                [{[line: 1, column: 5], "unknown token: $"}]}
    end

zachallaun commented 6 months ago

Reading the tokenizer, it seems like the fastest path would be modifying the existing tokenizer to return {token, next_tokenizer} (or nil when input is exhausted), where next_tokenizer is a closure that produces the next token.
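
Something like the following, where `tokenize_one/1` stands in for a single-token step carved out of the existing tokenizer (hypothetical names throughout):

defmodule LazyTokens do
  # Pull-based stream: each call yields {token, next} where next is a
  # zero-arity closure producing the rest, or nil at end of input.
  def next(state) do
    case tokenize_one(state) do
      {:ok, token, rest} -> {token, fn -> next(rest) end}
      :eof -> nil
    end
  end

  # Stand-in for the per-token step extracted from the tokenizer.
  defp tokenize_one(_state), do: :eof
end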

This could be somewhat easily tested against Elixir's tokenizer for valid inputs by consuming all tokens and comparing the lists.
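
For example, a differential test could drain the lazy stream and compare the result against the eager tokenizer (a sketch: `initial_state/1` and `reference_tokens/1` are hypothetical stand-ins, the latter wrapping however the existing tokenizer is invoked):

# Drain the {token, next} stream into a plain list.
defp all_tokens(nil), do: []
defp all_tokens({token, next}), do: [token | all_tokens(next.())]

test "agrees with the eager tokenizer on valid input" do
  code = "defmodule Foo do\n  def bar(x), do: x + 1\nend\n"

  # reference_tokens/1 would call the existing tokenizer; its exact
  # return shape varies across Elixir versions, so it is not shown here.
  assert all_tokens(LazyTokens.next(initial_state(code))) == reference_tokens(code)
end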

Then, as a second step, the tokenizer could start emitting error tokens.

mhanberg commented 6 months ago

Closed in favor of #31