djspiewak / sparse

Generalized, incremental parser combinators for scalaz-stream
Apache License 2.0

interesting case to solve for tokenizer #4

Open mandubian opened 9 years ago

mandubian commented 9 years ago

I'm implementing a tokenizer following your sample and I have found an interesting case.

Imagine you want to parse a floating point number with the classic regex:

"""(-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?)""".r

If I use your tokenizer, I get the following:

buffer:-
buffer:-1
buffer:-12
buffer:-123
buffer:-123.
buffer:-123.3
buffer:-123.34
buffer:-123.345
buffer:-123.345e

So the parser will fail on this, but if it tried one step more, it would succeed. The basic idea would then be to tokenize the exponent separately and interpret it in the parser, but that requires more steps. So, would you see another way to get the whole floating-point number in one step?
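
For what it's worth, the stall is visible with the regex alone: the buffer is a complete match, then stops matching at the e, and only matches again once a digit follows. A quick check with plain scala.util.matching.Regex, outside of the tokenizer (names here are just for illustration):

val Num = """(-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?)""".r

def fullMatch(s: String): Boolean = Num.pattern.matcher(s).matches()

fullMatch("-123.345")    // true  -> looks like a finished token
fullMatch("-123.345e")   // false -> one extra character says "cut at -123.345"
fullMatch("-123.345e2")  // true  -> two characters later it is a longer valid token

So a tokenizer that only checks the current buffer (or one character ahead) gives up exactly at the e.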

(BTW, yes the code of tokenize can be optimized ;))

djspiewak commented 9 years ago

I think the problem here is that the regular expression in question requires more than a single character of lookahead in order to finalize a token unambiguously. I mean, I can't quite explain the exact results you're getting (some of those results should have been ruled out by the one character lookahead).

I'm not quite sure what to do about this. Maybe have tokenize take a tuning parameter which indicates how much lookahead to use?
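
Roughly, the knob would control a decision like the following (just a standalone sketch with made-up names, not the actual tokenize signature): only finalize the buffer as a token when no extension within the lookahead window still yields a full match.

import scala.util.matching.Regex

// Hypothetical helper, not part of sparse: emit the buffered token only if no
// extension within `lookahead` pending characters would still be a full match.
def shouldEmit(r: Regex, buffer: String, pending: String, lookahead: Int): Boolean = {
  val bufferMatches = r.pattern.matcher(buffer).matches()
  val canStillGrow = (1 to math.min(lookahead, pending.length)).exists { k =>
    r.pattern.matcher(buffer + pending.take(k)).matches()
  }
  bufferMatches && !canStillGrow
}

val Num = """(-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?)""".r
shouldEmit(Num, "-123.345", "e2,", lookahead = 1)  // true  -> cuts the token too early
shouldEmit(Num, "-123.345", "e2,", lookahead = 2)  // false -> keeps buffering through the exponent

With a lookahead of 1 the float above still gets cut before the exponent; with 2 it survives.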

djspiewak commented 9 years ago

Actually, when I run your example with the input -123.345e, I just get the stream of -\/-projected characters, meaning that the tokenizer rejected the entire input. Honestly, I think my implementation of tokenize is just buggy, considering that it's not even tokenizing {} correctly. :-)

mandubian commented 9 years ago

It might be buggy, yes :) For strings, the partial function should be

case body :: _ :: Nil => // some code

I'll push some code soon to show you my experiments.

mandubian commented 9 years ago

I corrected your function a bit to make it work for Strings:

I'll rework this tomorrow.

mandubian commented 9 years ago

Here is a version of tokenize, optimized in a simple, mutable way, that can match deeper:

https://github.com/mandubian/scaledn/blob/master/stream-parser/src/main/scala/package.scala#L48-L129

Now it works for my parsers at least ;)

djspiewak commented 9 years ago

LOL Very gross. However, the whole thing (as written) is pretty awful anyway since it doesn't compile to the unified DFA that would actually be not-gross. I'll try pulling in your changes and see if maybe it resolves some of the lingering issues that my wip branch has.

mandubian commented 9 years ago

Yes ugly but not important there :) Does DFA mean Deterministic Finite Automaton or something else?

The regex Map doesn't seem so nice, since it doesn't preserve the order of the regexes, which is meaningful in general; a Seq would certainly be better. Anyway, a lexer based only on regexes is problematic when several regexes can match, like Integers & Floats. The lexer must produce smaller lexemes (like Exponential, Decimal, Sign) that are then matched by the parsers. WDYT?
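
Something like this is what I mean by the ordering (illustrative rules, not the sparse API): a Seq keeps "Float before Int" as a priority, which a plain Map cannot guarantee, and the first rule whose prefix matches decides where the lexeme is cut.

import scala.util.matching.Regex

val rules: Seq[(String, Regex)] = Seq(
  "Float" -> """-?(\d+\.\d*|\d*\.\d+)([eE][+-]?\d+)?""".r,
  "Int"   -> """-?\d+""".r
)

// Try the rules in order and take the first prefix match.
def lexPrefix(input: String): Option[(String, String)] =
  rules.iterator
    .map { case (name, r) => r.findPrefixOf(input).map(tok => (name, tok)) }
    .collectFirst { case Some(hit) => hit }

lexPrefix("123.45,")  // Some(("Float", "123.45"))
// With "Int" listed first, the same input would be cut too early as ("Int", "123").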