Issues with Tokenizer

jakobnissen commented 1 year ago

There are a few things I'd like to address before v1:

Should the Tokenizer even exist? It's fairly easy to make one yourself. If it should exist, perhaps it should be even more high-level, simply creating an enum and returning (enum, start:end), and only for byte buffers (no need to parse IOs with Tokenizers).
The goto generator should work with tokenizers. No one wants the slow tokenizer.
Perhaps do a full rewrite, and if so, also remove the special handling of :final actions in re2nfa, which exists only for tokenizer actions. Maybe there is little need for a Tokenizer object: perhaps it could just be a function that takes a series of action-less regexes and compiles them into an ordinary Machine.
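
To make the "(enum, start:end) over a byte buffer" idea concrete, here is a minimal sketch in Python. It is illustrative only and is not Automa.jl's API: the names `Kind`, `PATTERNS`, and `tokenize` are hypothetical, and Python's `re` module stands in for a compiled Machine. Note that the sketch uses Python's half-open byte offsets, whereas a Julia `start:end` range would be 1-based and inclusive.

```python
import re
from enum import Enum, auto


class Kind(Enum):
    # Hypothetical token kinds, chosen only for this example.
    NUMBER = auto()
    NAME = auto()
    SPACE = auto()


# Action-less patterns: each one only identifies a token kind.
PATTERNS = [
    (Kind.NUMBER, re.compile(rb"[0-9]+")),
    (Kind.NAME, re.compile(rb"[A-Za-z_][A-Za-z0-9_]*")),
    (Kind.SPACE, re.compile(rb"[ \t\n]+")),
]


def tokenize(data: bytes):
    """Yield (kind, start, end) triples, with end exclusive.

    At each position the longest match wins; on a tie, the pattern
    listed first wins.
    """
    pos = 0
    while pos < len(data):
        best = None
        for kind, pattern in PATTERNS:
            m = pattern.match(data, pos)
            if m and (best is None or m.end() > best[2]):
                best = (kind, pos, m.end())
        if best is None:
            raise ValueError(f"no token matches at byte {pos}")
        yield best
        pos = best[2]
```

The caller only ever sees token kinds and byte ranges, so the matching strategy behind `tokenize` can be swapped out (for example, for generated code) without changing the interface.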

kescobo commented 1 year ago

Forgive me if this issue is rhetorical / documentation for your thought process, but I'm just curious about the benefit of the tokenizer. If it's not clear that it's important, wouldn't the safe thing be to remove it for 1.0, and then have the flexibility to add it back in a later feature release?

jakobnissen commented 1 year ago

I've worked a bit more on the tokenizer the last few days, and I think tokenizers should stay for v1. The background is that tokenization (aka lexing) is a common first step in complicated parsers. Typically this is because tokenization can be done in approximately O(N) time, even for parsing tasks that would be too complicated to implement in Automa. For example, the upcoming new Julia parser for Julia v1.10 will use a fork of Tokenize.jl to parse Julia, and the devs briefly considered using Automa.jl for the tokenization instead.

After having worked with the tokenization for a few days, it's clear to me that 1) it's not that easy to make a tokenizer yourself (I underestimated several quirks), and 2) compared to the existing tokenizer, it should be possible to make one that is significantly nicer.

Right now I'm eyeing these improvements:

It should be faster (maybe; not a huge priority as long as it's very fast in absolute terms). It's particularly tricky to make the tokenizer use SIMD, but that's a challenge for post v1. The important part is that the interface leaves room for lots of optimisation.
It should be much easier to make a tokenizer.
It should be able to gracefully handle non-tokenizable data by emitting an error token.
It should still support arbitrary code execution when a token is emitted, although this should be optional, unlike now.
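
The last two points, error tokens and optional code execution, can be sketched roughly as below. This is a Python illustration under stated assumptions, not the design that ended up in Automa.jl: `Token`, `RULES`, `lex`, and the `on_token` callback are all hypothetical names, and emitting one error token per unmatched byte is just one possible policy.

```python
import re
from enum import Enum, auto


class Token(Enum):
    # Hypothetical token kinds; ERROR marks bytes no rule can match.
    ERROR = auto()
    WORD = auto()
    COMMA = auto()


RULES = [
    (Token.WORD, re.compile(rb"[a-z]+")),
    (Token.COMMA, re.compile(rb",")),
]


def lex(data: bytes, on_token=None):
    """Yield (kind, start, end) triples, with end exclusive.

    Unmatched input never raises: it is reported as a one-byte ERROR
    token. If on_token is given, it is called with each token before
    the token is yielded; otherwise no user code runs.
    """
    pos = 0
    while pos < len(data):
        kind, end = Token.ERROR, pos
        for candidate, pattern in RULES:
            m = pattern.match(data, pos)
            if m and m.end() > end:
                kind, end = candidate, m.end()
        if kind is Token.ERROR:
            end = pos + 1  # consume one byte so lexing always advances
        token = (kind, pos, end)
        if on_token is not None:
            on_token(token)
        yield token
        pos = end
```

For example, `list(lex(b"ab,?c"))` reports the `?` as an `ERROR` token and then carries on with the following word, so the caller decides how to handle bad input. A real implementation might prefer to merge runs of adjacent error bytes into a single token.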

It turns out Rust has a similar package for lexing, logos. I'm thinking of borrowing some ideas from that package.

jakobnissen commented 1 year ago

This is now completed in the v1 branch. I'll keep this open until v1 is released. I must say, I'm pretty happy with the changes! :)

BioJulia / Automa.jl

Issues with Tokenizer #116