BioJulia / Automa.jl

A julia code generator for regular expressions

Issues with Tokenizer #116

Closed jakobnissen closed 1 year ago

jakobnissen commented 1 year ago

There are a few things I'd like to address before v1:

kescobo commented 1 year ago

Forgive me if this issue is rhetorical / documentation for your thought process, but I'm just curious about the benefit of the tokenizer. If it's not clear that it's important, wouldn't the safe thing be to remove it for 1.0, and then have the flexibility to add it back in a later feature release?

jakobnissen commented 1 year ago

I've worked a bit more on the tokenizer the last few days, and I think it should stay for v1. The background is that tokenization (aka lexing) is a common first step in complicated parsers. Typically this is because tokenization can be done in approximately O(N) time, even when the full parsing task would be too complicated to implement in Automa. For example, the upcoming new Julia parser for Julia v1.10 will use a fork of Tokenize.jl to parse Julia - and the devs briefly considered using Automa.jl for the tokenization instead.
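
To make the "approximately O(N)" point concrete, here is a minimal hand-written sketch in plain Julia. It does not use Automa's tokenizer API. The `TokenKind` enum, the `scan_while` helper and the toy grammar (numbers, identifiers, a few operators, whitespace) are all made up for illustration. The input is scanned once from left to right, and anything unrecognised becomes an error token instead of aborting the scan.

```julia
# Hypothetical token kinds for a toy arithmetic language; not Automa's API.
@enum TokenKind T_NUMBER T_IDENT T_OP T_SPACE T_ERROR

# Return the index just past the run of characters, starting at `i`,
# for which `pred` holds.
function scan_while(pred, s::AbstractString, i::Int)
    while i <= ncodeunits(s) && pred(s[i])
        i = nextind(s, i)
    end
    return i
end

# Tokenize `s` in one left-to-right pass. Each character is inspected a
# constant number of times, so the work grows linearly with the input length.
function tokenize(s::AbstractString)
    tokens = Tuple{TokenKind, String}[]
    i = firstindex(s)
    while i <= ncodeunits(s)
        c = s[i]
        kind, stop = if isdigit(c)
            (T_NUMBER, scan_while(isdigit, s, i))
        elseif isletter(c)
            (T_IDENT, scan_while(ch -> isletter(ch) || isdigit(ch), s, i))
        elseif isspace(c)
            (T_SPACE, scan_while(isspace, s, i))
        elseif c in "+-*/()"
            (T_OP, nextind(s, i))
        else
            (T_ERROR, nextind(s, i))  # unknown character becomes an error token
        end
        push!(tokens, (kind, s[i:prevind(s, stop)]))
        i = stop
    end
    return tokens
end
```

Calling `tokenize("x1 + 42")` on this sketch yields an identifier, a space, an operator, a space and a number, in that order. A later parsing stage would then work on this flat token stream rather than on raw characters.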

After having worked with the tokenization for a few days, it's clear to me that 1) it's not that easy to make a tokenizer yourself (I underestimated several quirks), and 2) compared to the existing tokenizer, it should be possible to make one that is significantly nicer.

Right now I'm eyeing these improvements:

It turns out Rust has a similar package for lexing, logos. I'm thinking of borrowing some ideas from that package.

jakobnissen commented 1 year ago

This is now completed in the v1 branch. I'll keep this open until v1 is released. I must say, I'm pretty happy with the changes! :)