Closed andreypopp closed 2 years ago
Reduced example:
import Automa
import Automa.RegExp: @re_str
re = Automa.RegExp
tokenizer = Automa.compile(re"ab" => :(emit(:string)),)
context = Automa.CodeGenContext()
@eval function tokenize(data)
$(Automa.generate_init_code(context, tokenizer))
p_end = p_eof = sizeof(data)
$(Automa.generate_exec_code(context, tokenizer))
return cs
end
@assert tokenize("a") < 0
It looks like the original author forgot to add a clause for when the tokenizer just stops in the middle of a token. I think it should be fairly easy to fix, I'll see if I can't push a fix later today. In tokenize.jl
, lines 85-ish
if p > p_eof ≥ 0 && cs ∈ $(tokenizer.machine.final_states)
$(eof_action_code)
cs = 0
elseif cs < 0
p -= 1
end
The tokenizer needs to set cs = -cs
if p > p_eof && cs ∉ $(tokenizer.machine.final_states)
...
More ambitiously, the tokenizer could use most of the Machine
's codegen machinery, perhaps by simply calling it internally since it contains a machine.
Given this code (a stripped down tokenizer example):
the case:
produces just a single token for space and notices no error. I'd expect it to fail.