VerbalExpressions / implementation

In this repo we will document how each method in the library should behave
34 stars 2 forks source link

Add named group support #6

Open jehna opened 8 years ago

jehna commented 8 years ago

As implemented in Python repo's pull request by @andrii1812

https://github.com/VerbalExpressions/PythonVerbalExpressions/pull/16

jehna commented 8 years ago

Named groups is something that is missing in at least JavaScript, so this could potentially be an issue with cross-language support.

roger- commented 8 years ago

How about something like:

tester = (verbal_expression.
            start_of_line().
               find('http').
               maybe('s').
               find('://').
               start_group('address').
                  maybe('www.').
                  anything_but(' ').
               end_group('address').
            end_of_line()
)
jehna commented 8 years ago

I was thinking more of like the Python implementation:

VerEx().
  start_of_line()
  .group("protocol",
    VerEx()
      .find('http')
      .maybe('s')
      .find('://')
  )
  .group("domain",
    VerEx()
      .maybe('www.')
      .anything_but(' ')
  )
  .end_of_line()

The real problem is, that some languages (e.g. JavaScript) do not natively support named groups for regex, like python does. The following would be implemented in vanilla python regex like:

^(?P<protocol>https?://)(?P<domain>(www\.)?[^\ ]*)$

Which allows us to ask values for protocol and domain groups.

So getting this to function in languages that don't support named groups would be awesome.

roger- commented 8 years ago

Not bad, but I'm not a fan of the dangling, JavaScript-esque parentheses.

jehna commented 8 years ago

Also adding group members as arguments would kind of be against how the rest of the library works. (all other ones working with xx().xx() but groups being an exception to work xx(xx()))

How about automatically wrapping the preceding "ungrouped" part automatically with group()?

Like:

VerEx()
  .start_of_line()
  .find('http')
  .maybe('s')
  .find('://')
  .group("protocol")
  .maybe('www.')
  .anything_but(' ')
  .end_of_line()
  .group("domain")

that would compile to:

(?P<protocol>^https?://)(?P<domain>(www\.)?[^\ ]*$)
roger- commented 8 years ago

I think it's a little clearer to label the start/end of a group instead of just the end.

jehna commented 8 years ago

Hmm.. But then there's a human error possibility, especially with large expressions. If a dev forgets to start/end one group, that can be hard to debug.

If there's just a general group() function, it will potentially create unnecessary groups; for example if you'd just like to capture the middle of a string, it would look something like this:

// Capture only names of John and Tracy
var words = ['Name: John', 'Something', 'Name: Tracy', 'Name: Bob'];
VerEx()
  .first('Name: ')
  .group()
  .maybe('John')
  .maybe('Tracy')
  .group("name")

that would create an unnecessary group around the first('Name: '), but it would always compile, even if you removed either of the groups

roger- commented 8 years ago

How about begin_group('address') ... end_group('address')?

jehna commented 8 years ago

But consider if you'd have a big expression with lots of things going on. You start modifying the code and somewhere between the copy-pastes you lose a begin_group('address') line. How should the code work? Should it? What would happen if you typo a group's name to other of the function calls? How would you debug your code?

I really think using just the single group method would answer pretty logically to all of these questions.

roger- commented 8 years ago

I don't think we need to worry about cases where people mess up their code, there's not much we can do about that. That said, if there's an end_group('name') with no matching start_group('name') (and vice versa) then that's easy to detect and trigger an exception.

I just think it should be clear that a group has a well defined start and end. Also a single group() probably won't work when you have nested groups.

mihai-vlc commented 8 years ago

I would also vote for an approach with 2 methods. It makes it more obvious.

VerEx()
  .start_of_line()
  .find('http')
  .maybe('s')
  .find('://')
  .open_group('domain')
    .maybe('www.')
    .anything_but(' ')
    .end_of_line()
  .close_group();

In this case you specify the name only when you open it.

To avoid issues in situations like the one below we could add the group wrapper to the expression only when you call the close_group() method.

VerEx()
  .start_of_line()
  .find('http')
  .maybe('s')
  .find('://')
  .open_group('domain')
    .maybe('www.');
// this would compile to: /(?:http)(?:s)?(?:\:\/\/)(?:www\.)?/gm

For debugging maybe we could add a helper method like check_groups that will throw an error if any previous opened group is not closed.

lukewestby commented 8 years ago

Given the variety of languages that implement VerbalExpressions I just want to throw in my opinion that I hope we can avoid functions that throw, as some languages don't have a concept of exceptions or try/catch and so the failure case would have to be lifted, adding friction to the actual use of the API

roger- commented 8 years ago

For debugging maybe we could add a helper method like check_groups

I'd do that by default just to cut down on errors like @jehna mentioned.

I also think it's better to return lists of dicts (or namedtuples?) with results instead of SRE_Match objects.

jehna commented 8 years ago

Or we could just omit the whole group method for now and just make the then and maybe accept VerEx objects as arguments. Could this cause issues in strongly typed languages?

jehna commented 8 years ago

Also a bit related: https://github.com/VerbalExpressions/PurescriptVerbalExpressions/issues/2

sharkdp commented 8 years ago

Also a bit related: VerbalExpressions/PurescriptVerbalExpressions#2

Thanks for the reference. I have implemented this now. It allows to write something like this:

pattern = do
  find "start"
  some whitespace
  possiblyV do
    find "middle"
    some whitespace
  find "end"

I realize it probably doesn't help this discussion here too much, but its a nice example where a monadic interface has some advantages over the JS-style method-chaining interface, because we can also bind to intermediate names:

pattern = do
  firstWord <- capture word
  whitespace
  capture word
  whitespace
  findAgain firstWord

This pattern matches "foo bar foo" but not "foo bar baz".

The expression x <- capture inner creates a new capture group for the inner expression and binds the index of this capture group to the name x.

This also allows us to "simulate" named groups for purposes of replacing. For more details, see the Replacing with named groups example.