blynn / nex

Lexer for Go
http://cs.stanford.edu/~blynn/nex/
GNU General Public License v3.0
416 stars 47 forks source link

regex does not match with parens correctly #50

Closed purpleidea closed 6 years ago

purpleidea commented 6 years ago

I have the following regex:

/[a-z]([a-z0-9:]*[a-z0-9]+)?/

This is intended to match: g or hey42 or hey:there but NOT: blah: trailing colons should not be allowed. Yet it does.

I need more extensive testing, but AFAIK this is pretty serious matching bug!

I also tested

/[a-z]+([a-z0-9:]*[a-z0-9]+)?/

Which has the same problem :/

Digging deeper... Help welcome =D

purpleidea commented 6 years ago

Confirmed, this must be a bug. Here's a simple reproducer:

/[ \t\n]/   { /* skip over whitespace */ }
/[a-z]([a-z0-9:]*[a-z0-9]+)?/
        {
            s := yylex.Text()
            fmt.Printf("TEXT: %s\n", s)
        }
/#[^\n]*/
        {
            s := yylex.Text()
            fmt.Printf("COMMENT: %s\n", s)
        }
/./     {
            s := yylex.Text()
            fmt.Printf("ERROR: %s\n", s)
        }
//
package main
import ("fmt";"os")
type yySymType struct { l, c int }
func main() {
    v := new(yySymType)
    NewLexer(os.Stdin).Lex(v)
    //fmt.Printf("%d %d\n", v.l, v.c)
}

Put that code in bug.nex, then run nex bug.nex Then make an input file to test. I used:

# comment 1
hello42:
# comment 2
a
blah:42x

Then I run:

cat input | go run bug.nn.go

and the output is:

COMMENT: # comment 1
TEXT: hello42:
COMMENT: # comment 2
TEXT: a
TEXT: blah:42x

As you can see it matches the hello42: string, which should NOT match. To prove it, here's that regexp in pure golang:

https://play.golang.org/p/0xbSLawtmgw

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Compile the expression once, usually at init time.
    // Use raw strings to avoid having to quote the backslashes.
    var validID = regexp.MustCompile(`[a-z]([a-z0-9:]*[a-z0-9]+)?`)

    fmt.Println(validID.FindString("a"))         // true
    fmt.Println(validID.FindString("hello"))     // true
    fmt.Println(validID.FindString("hello42"))   // true
    fmt.Println(validID.FindString("hello42:"))  // false
    fmt.Println(validID.FindString("hello42@"))  // false
    fmt.Println(validID.FindString("hello42:x")) // true
}

So there's a very serious bug in the nex lexer. Understanding the code base is a bit more difficult than I was hoping, so with any chance, someone can take a look at this. @blynn @wolfkdy @drewwells

Thanks

blynn commented 6 years ago

Thanks for the bug report. I found an embarrassing bug that caused the code to build the wrong graph in some cases for '?'.

purpleidea commented 6 years ago

Thanks for the bug report. I found an embarrassing bug that caused the code to build the wrong graph in some cases for '?'.

You mean I found an embarrassing bug, and you found the fix! ;)

My initial tests look like this is fixed now. Much appreciated, I really didn't want to dig into figuring out all the DFA stuff =D

Would it be helpful if I wrote more tests, or would it be annoying PR noise? (Alternatively I can put them in https://github.com/purpleidea/mgmt/ instead)

Thanks again!

blynn commented 6 years ago

Tests would be most welcome. Would you like write access to the repo?

In fact, I'd love someone to take this repo off my hands. It's not just because the code is messy, but also because it should use a different algorithm.

Instead of Thompson's construction (regex -> NFA -> DFA), I should have used regex derivatives, which support more powerful regexes yet are simpler to implement, e.g. I wrote an regex engine using derivatives in 30 lines or so: http://benlynn.blogspot.com/2018/06/regex-derivatives.html and I can actually understand my code a week after I wrote it!

I'd be willing to beef it up so it's on par with Nex, but I'm not sure how people would feel about a tool for Go whose source is in Haskell.

purpleidea commented 6 years ago

Tests would be most welcome. Would you like write access to the repo?

That would be great!

In fact, I'd love someone to take this repo off my hands. It's not just because the code is messy, but also because it should use a different algorithm.

I'm happy to maintain it as best I can, except I likely have zero chance in being able to make any fixes like the ? fix you just merged. I just don't have all the theory background and I completely forget my regex -> DFA transforms. Assuming nothing major like that comes up, I'm sure I can keep up with adding tests, looking at issues, and keeping it healthy for at least the near future. I've got a couple of blog posts coming about nex as well. You're welcome to transfer it to purpleidea if you like, or add me here.

Instead of Thompson's construction (regex -> NFA -> DFA), I should have used regex derivatives, which support more powerful regexes yet are simpler to implement, e.g. I wrote an regex engine using derivatives in 30 lines or so: http://benlynn.blogspot.com/2018/06/regex-derivatives.html and I can actually understand my code a week after I wrote it!

I'll have a look. To be honest, the FRP DSL that I am implementing in mgmt is my first real language, so lexing/parsing/etc is somewhat new to me, but I'll try to keep up where possible. https://purpleidea.com/blog/2018/02/05/mgmt-configuration-language/ for more information if you're curious.

I'd be willing to beef it up so it's on par with Nex, but I'm not sure how people would feel about a tool for Go whose source is in Haskell.

I think it's better to have it in golang unfortunately, mainly just to avoid the build dep :/

In any case, I've been pretty happy with nex, so thank you!

blynn commented 6 years ago

This is a little off-topic, but funnily enough, recently I've been trying to learn more about FRP as well as DSLs! My interest was piqued by one of my coworkers, who researched FRP during his PhD: http://thev.net/PaulLiu/ (his advisor, Paul Hudak, wrote one of the classic FRP papers with Conal Elliott).

And speaking of Conal Elliott, I just watched his talk https://www.youtube.com/watch?v=vzLK_xE9Zy8 on an intriguing new approach to DSLs. I'm unsure how widely applicable it is, but when it does work, it combines the best features of "shallow" DSLs and "deep" DSLs.

purpleidea commented 6 years ago

@blynn Please excuse my embarrassingly long delay in replying. I was working through the "dependency DAG" in my brain, and since you linked me to a video, I added that in as a pre-req to replying, and other life things were dependencies that needed resolving before I could watch it, and here we are...

On to FRP: I'm also quite new to it, and the mgmt language is my first FRP, but I've learned a lot in doing it, and have even needed to rewrite certain parts based on stuff I've learned writing the earlier implementations. If you'd like to be involved, I'd certainly value your help! There's an interesting FRP "execution" problem that needs solving in https://github.com/purpleidea/mgmt/ which I think you'd find quite interesting. It's currently only described in my brain, but if that kind of fun theoretical FRP thing interests you I can try and write it up.

Lastly, when I decided on golang for my implementation of mgmt, it was not without difficulty. I almost chose haskell or rust, but ultimately they were missing things I needed. I'm glad I chose golang, and I hope a lot of my interesting future work happens in the FRP DSL.

Would love to chat more if you're interested. IRC or email are probably better. I'm purpleidea on Freenode and at gmail.

Cheers!

blynn commented 6 years ago

I do want to learn more about configuration management, but before I even think about working on it myself, I want to learn Nix and NixOS. I'm impressed with their approach (e.g. https://nixos.org/~eelco/pubs/nspfssd-lisa2004-final.pdf and https://nixos.org/~eelco/pubs/nixos-jfp-final.pdf) and also by what I've seen people do with Nix.

purpleidea commented 6 years ago

@blynn Cool. Since terminology is quite often misunderstood in this area, I'll point out one or two things which are hopefully helpful to you:

Hope this was interesting, and would love to see you join the project.

blynn commented 6 years ago

I see, so it's something like Kubernetes? I must confess it's not my area of interest. At my old job, dedicated teams of engineers worked on configuration management, so the rest of us didn't have to think about it. But now that I've left, maybe I'll be forced to take an interest sooner or later!

On Tue, Aug 7, 2018 at 1:57 AM, James notifications@github.com wrote:

@blynn https://github.com/blynn Cool. Since terminology is quite often misunderstood in this area, I'll point out one or two things which are hopefully helpful to you:

-

Nix/etc, while I'm personally not using them, AIUI, they are concerned with package management for the single machine use cases, where as in https://github.com/purpleidea/mgmt/ we are interested in the distributed system of machines, which can obviously simplify down to a single host.

We have "resources" in mgmt. Currently there is a pkg resource: https://github.com/purpleidea/mgmt/blob/master/engine/resources/pkg.go https://github.com/purpleidea/mgmt/blob/master/engine/resources/pkg.go and it is based on the packagekit library. It does have nix support ( hughsie/PackageKit#126 https://github.com/hughsie/PackageKit/pull/126 ) but I've never tested it. That's how these things would fit together. Perhaps there are even more nix features or resources needed.

I believe the correct term for what I'm working on is "configuration management", but some people believe it should really be "choreography" because of the multi-machine coordination (via FRP) that I'm doing in the network. The world is not orchestration, because by definition, orchestration implies a central point (of failure) which executes commands in order, like Ansible, for example. We instead use distributed algorithms to ensure the desired distributed state.

Hope this was interesting, and would love to see you join the project.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/blynn/nex/issues/50#issuecomment-410985485, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAOWGw3TMweBOzLZD6nrQ0SMEUDy3AAks5uOVZtgaJpZM4VGiAk .

purpleidea commented 6 years ago

@blynn It's quite different than kubernetes in many ways, but there is some overlap in so far as that they're both used for infrastructure related things. Put another way, if you were determined to use kubernetes (which I personally think is usually not a good idea) then mgmt could be used to set it up and manage it.

If you're interested and want to learn more, ping me on IRC or something. Cheers!