BioJulia / Automa.jl

A Julia code generator for regular expressions

Use separate tokenizers #125

Closed · tlienart closed this issue 1 year ago

tlienart commented 1 year ago

Hello,

In the docs, there's the following general pattern to tokenize a string:

rules = [
    :name => re"...",
    ...
]
@eval @enum Token errortoken $(first.(rules)...)
make_tokenizer((errortoken,
    [Token(i) => j for (i, j) in enumerate(last.(rules))]
)) |> eval
tokens = collect(tokenize(Token, s))

Is there an approach where I could call make_tokenizer for several different sets of rules and then call tokenize with the respective tokenizer? Ideally I'd like to have:

function get_tokens_1(s)
  # uses first set of rules
  return collect(tokenize(...))
end
function get_tokens_2(s)
  # uses second set of rules
  return collect(tokenize(...))
end

Using make_tokenizer in each of these functions would work but is undesirable as its overhead is significant (and I need to call these functions a lot).

I tried with the version argument, but either I didn't understand how to use it or it didn't work:

make_tokenizer(set_of_rules_1; version=1) |> eval
make_tokenizer(set_of_rules_2; version=2) |> eval

tokenize(Token, s, 2)

I'd have expected this to use the second set of rules, but it doesn't look like that's the case.

Thanks in advance

jakobnissen commented 1 year ago

You should be able to just call make_tokenizer once for each of your different rulesets; I think that's the best approach.

> Using make_tokenizer in each of these functions would work but is undesirable as its overhead is significant (and I need to call these functions a lot).

The thing is that make_tokenizer creates code which, when evaluated, defines a bunch of methods. It's designed to be called at top level in your module, so that the work happens at precompile time, when the package is installed. So, while it has a lot of overhead, that overhead should be completely compiled away. Of course, this also implies you can't make hundreds of different rulesets, because then you'd generate too much code, and both compile times and package image sizes would get too large.
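In package code, that means the calls live at the top level of the module, so the generated methods are compiled during precompilation. A minimal sketch of that layout (the module name and ruleset here are hypothetical):

module MyTokenizers

using Automa

# Rules, the enum, and the generated methods are all defined at top
# level, so the codegen cost is paid once, at precompile time.
rules = [:digits => re"[0-9]+"]
@eval @enum Token errortoken $(first.(rules)...)
make_tokenizer((errortoken,
    [Token(i) => j for (i, j) in enumerate(last.(rules))]
)) |> eval

end # module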

Here is an example:

julia> rules_1 = [
        :name => re"[A-Z]+",
        :digits => re"[0-9]+"
       ]
       rules_2 = [
        :name => re"[a-zA-Z]+", # different rule here
        :digits => re"[0-9]+"
       ]
       @assert first.(rules_1) == first.(rules_2)
       @eval @enum Token errortoken $(first.(rules_1)...)
       make_tokenizer(
           (errortoken,
           [Token(i) => j for (i,j) in enumerate(last.(rules_1))]);
           version=1
       ) |> eval
       make_tokenizer(
           (errortoken,
           [Token(i) => j for (i,j) in enumerate(last.(rules_2))]);
           version=2
       ) |> eval

julia> collect(Tokenizer{Token, String, 1}("hello123"))
2-element Vector{Tuple{Int64, Int32, Token}}:
 (1, 5, errortoken)
 (6, 3, digits)

julia> collect(Tokenizer{Token, String, 2}("hello123"))
2-element Vector{Tuple{Int64, Int32, Token}}:
 (1, 5, name)
 (6, 3, digits)
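Given those top-level definitions, the two functions from your question become thin wrappers; a minimal sketch, assuming the Token enum and rules above:

# No codegen happens here; the methods were generated at top level.
get_tokens_1(s::String) = collect(Tokenizer{Token, String, 1}(s))  # rules_1
get_tokens_2(s::String) = collect(Tokenizer{Token, String, 2}(s))  # rules_2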

Now there IS a bug where calling tokenize will simply ignore the version and always use version=1. I'll get that patched and tested.

jakobnissen commented 1 year ago

By the way, you don't actually need the token names of the two rulesets to be the same.
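For example, each ruleset could get its own enum entirely; a sketch with hypothetical rulesets, following the same pattern as above:

rules_a = [:word => re"[a-z]+"]
rules_b = [:num => re"[0-9]+"]
@eval @enum TokenA errorA $(first.(rules_a)...)
@eval @enum TokenB errorB $(first.(rules_b)...)
make_tokenizer(
    (errorA,
    [TokenA(i) => j for (i,j) in enumerate(last.(rules_a))]);
    version=1
) |> eval
make_tokenizer(
    (errorB,
    [TokenB(i) => j for (i,j) in enumerate(last.(rules_b))]);
    version=2
) |> eval

collect(Tokenizer{TokenA, String, 1}("abc123"))  # (1, 3, word), (4, 3, errorA)
collect(Tokenizer{TokenB, String, 2}("abc123"))  # (1, 3, errorB), (4, 3, num)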

jakobnissen commented 1 year ago

Oh dear, looking through this code again and at the tests for the tokenizer, I see several quite embarrassing issues that I don't know how I could have missed! I even get LLVM errors running the test suite on Julia 1.10-beta2. I'm unfortunately quite busy for the next five days or so, but I'll put it in my calendar to check this early next week.

tlienart commented 1 year ago

Thanks a lot, your example with the version thing should be all I need. Glad it made you find some bugs along the way too 😄.

jakobnissen commented 1 year ago

An upstream bug prevents me from working too much on this right now: https://github.com/JuliaLang/julia/issues/51267 - but once that is fixed, I'll update the Tokenizer docs and fix up the code.

jakobnissen commented 1 year ago

Solved in #126 - I just worked around the upstream issue since any bugfix will not be backported to 1.6 anyway.