Closed: zachallaun closed this issue 6 months ago
The way handwritten parsers generally tokenize is one token at a time as you parse, but Elixir's tokenizer is designed to work with a parser generator, so it tokenizes the entire document first.
I used the internal tokenizer just as a way to skip a bunch of work and make progress, but it's been on my radar to alter the tokenizer to work how I described above.
That way as you parse, you can decide how to handle unknown tokens.
We can use this as the tracking issue.
Basically, the API of the tokenizer should be "next_token" rather than "tokenize".
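Shape-wise, something like this contrast (hypothetical callbacks just to illustrate, not Spitfire's actual API):

```elixir
defmodule TokenizerAPI do
  # Hypothetical sketch of the two API shapes; not Spitfire's actual code.
  @type token :: tuple()
  @type state :: term()

  # Eager: tokenize the whole document before parsing begins.
  @callback tokenize(String.t()) :: {:ok, [token()]} | {:error, term()}

  # Incremental: the parser pulls one token per call and can decide how
  # to recover when it hits something it doesn't recognize.
  @callback next_token(state()) :: {token(), state()} | nil
end
```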
Are you interested in taking this on?
> The way handwritten parsers generally tokenize is one token at a time as you parse, but Elixir's tokenizer is designed to work with a parser generator, so it tokenizes the entire document first.
Makes sense!
> Are you interested in taking this on?
Potentially, yes, but it's not something I can necessarily commit to at the moment. I'll share any progress here!
> Potentially, yes, but it's not something I can necessarily commit to at the moment. I'll share any progress here!
No worries! Just wanted to clarify, as this is something that I was going to tackle soon if the priorities aligned correctly.
I'll go ahead and write up a (hopefully detailed) ticket that describes the needed changes and the desired outcome. I'll close this issue once I get that created.
And another issue with the current "total lexing" approach is that it bails if any unrecognized operator is lexed, which doesn't allow the parser to collect an error for it and continue on. Here is a pending test I have that demonstrates that issue as well:
```elixir
# TODO: this needs a change to the tokenizer, I believe, or a way to splice out the unknown token
@tag :skip
test "unknown prefix operator" do
  code = "foo $bar, baz"

  assert Spitfire.parse(code) ==
           {:error, {:foo, [line: 1, column: 1], [{:__block__, [error: true, line: 1, column: 5], []}]},
            [{[line: 1, column: 5], "unknown token: $"}]}
end
```
Reading the tokenizer, it seems like the fastest path would be modifying the existing tokenizer to return {token, next_tokenizer} (or nil), where next_tokenizer is a closure that will produce the next token.
This could be somewhat easily tested against Elixir's tokenizer for valid inputs by consuming all tokens and comparing the lists.
Then, as a second step, the tokenizer could start emitting error tokens.
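To make that concrete, here's a toy sketch of the shape (all names invented; a real version would wrap Spitfire's tokenizer state, and this one is ASCII-only): each call returns {token, next_fun} or nil at end of input, unknown characters become error tokens instead of halting, and to_list drains everything so the output can be compared list-against-list with an eager tokenizer.

```elixir
defmodule StepTokenizer do
  # Toy pull-based tokenizer (illustrative only, not Spitfire's code).
  def next(<<>>), do: nil
  def next(<<c, rest::binary>>) when c in [?\s, ?\t, ?\n], do: next(rest)

  def next(<<c, _::binary>> = input) when c in ?a..?z do
    {word, rest} = take_while(input, &(&1 in ?a..?z))
    {{:identifier, word}, fn -> next(rest) end}
  end

  def next(<<c, rest::binary>>) do
    # Unknown character: emit an error token and keep going, rather
    # than bailing out of tokenization entirely.
    {{:error, {:unknown_token, <<c>>}}, fn -> next(rest) end}
  end

  # Drain every token into a list, e.g. to compare against the eager
  # tokenizer's output on valid inputs.
  def to_list(input) do
    case next(input) do
      nil -> []
      {token, next_fun} -> [token | drain(next_fun)]
    end
  end

  defp drain(next_fun) do
    case next_fun.() do
      nil -> []
      {token, next_fun2} -> [token | drain(next_fun2)]
    end
  end

  defp take_while(input, pred), do: take_while(input, pred, <<>>)

  defp take_while(<<c, rest::binary>>, pred, acc) do
    if pred.(c), do: take_while(rest, pred, <<acc::binary, c>>), else: {acc, <<c, rest::binary>>}
  end

  defp take_while(<<>>, _pred, acc), do: {acc, <<>>}
end

StepTokenizer.to_list("foo $bar")
#=> [{:identifier, "foo"}, {:error, {:unknown_token, "$"}}, {:identifier, "bar"}]
```

Because the closure captures everything needed to continue, the parser owns the recovery decision: it can record the error token as a diagnostic and resume, or splice in a placeholder node.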
Closed in favor of #31
I've been thinking more about error tolerance in Elixir parsing and wanted to start a discussion about errors that come from the tokenizer.
Currently, the behavior of the tokenizer is to stop tokenizing when an error is encountered, emitting the error and the token accumulator. This means that Spitfire can only be error-tolerant to the degree that the tokenizer is. Consider this bit of example code:
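(An illustrative reconstruction; the module is invented, but it puts the conflict markers on lines 3, 5, and 7.)

```elixir
defmodule Example do
  def foo do
<<<<<<< HEAD
    :old
=======
    :new
>>>>>>> feature-branch
  end

  def baz do
    :ok
  end
end
```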
When the tokenizer encounters the merge conflict, it bails out, so the best that Spitfire can do is give us context about the module and function head, but nothing about the body and nothing about the definition of baz below. If the tokenizer were to accumulate errors instead, Spitfire might be able to give us the AST for the following, as well as errors noting that VCS merge conflict markers were present on lines 3/5/7:
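(Again illustrative: this assumes the marker lines are simply dropped and both branches' bodies are kept, which may not match the actual recovery strategy.)

```elixir
defmodule Example do
  def foo do
    :old
    :new
  end

  def baz do
    :ok
  end
end
```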
While I haven't reviewed every tokenizer error case, there are a significant number where an error could be accumulated and tokenization could continue. Alternatively, the tokenizer could emit "error tokens" of some kind, e.g. {{:error, :version_control_marker}, {3, 1, nil}, ~c"<<<<<<< HEAD"}, which could then be handled in the parser.
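For instance, the parser's token-handling loop could pattern match on those error tokens, record a diagnostic, and keep going (a sketch with invented names, not Spitfire's internals):

```elixir
defmodule ParserLoopSketch do
  # Invented names; only meant to show the shape of handling an error
  # token in the parser rather than failing inside the tokenizer.
  def handle_token({{:error, reason}, {line, column, _}, _original}, state) do
    # Record a diagnostic and continue parsing from the next token.
    %{state | errors: [{reason, {line, column}} | state.errors]}
  end

  def handle_token(_token, state) do
    # ...normal handling for recognized tokens...
    state
  end
end
```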