Open jehna opened 8 years ago
Named groups is something that is missing in at least JavaScript, so this could potentially be an issue with cross-language support.
How about something like:
tester = (verbal_expression.
start_of_line().
find('http').
maybe('s').
find('://').
start_group('address').
maybe('www.').
anything_but(' ').
end_group('address').
end_of_line()
)
I was thinking more of like the Python implementation:
VerEx().
start_of_line()
.group("protocol",
VerEx()
.find('http')
.maybe('s')
.find('://')
)
.group("domain",
VerEx()
.maybe('www.')
.anything_but(' ')
)
.end_of_line()
The real problem is, that some languages (e.g. JavaScript) do not natively support named groups for regex, like python does. The following would be implemented in vanilla python regex like:
^(?P<protocol>https?://)(?P<domain>(www\.)?[^\ ]*)$
Which allows us to ask values for protocol
and domain
groups.
So getting this to function in languages that don't support named groups would be awesome.
Not bad, but I'm not a fan of the dangling, JavaScript-esque parentheses.
Also adding group members as arguments would kind of be against how the rest of the library works.
(all other ones working with xx().xx()
but groups being an exception to work xx(xx())
)
How about automatically wrapping the preceding "ungrouped" part automatically with group()
?
Like:
VerEx()
.start_of_line()
.find('http')
.maybe('s')
.find('://')
.group("protocol")
.maybe('www.')
.anything_but(' ')
.end_of_line()
.group("domain")
that would compile to:
(?P<protocol>^https?://)(?P<domain>(www\.)?[^\ ]*$)
I think it's a little clearer to label the start/end of a group instead of just the end.
Hmm.. But then there's a human error possibility, especially with large expressions. If a dev forgets to start/end one group, that can be hard to debug.
If there's just a general group()
function, it will potentially create unnecessary groups; for example if you'd just like to capture the middle of a string, it would look something like this:
// Capture only names of John and Tracy
var words = ['Name: John', 'Something', 'Name: Tracy', 'Name: Bob'];
VerEx()
.first('Name: ')
.group()
.maybe('John')
.maybe('Tracy')
.group("name")
that would create an unnecessary group around the first('Name: ')
, but it would always compile, even if you removed either of the group
s
How about begin_group('address')
... end_group('address')
?
But consider if you'd have a big expression with lots of things going on. You start modifying the code and somewhere between the copy-pastes you lose a begin_group('address')
line. How should the code work? Should it? What would happen if you typo a group's name to other of the function calls? How would you debug your code?
I really think using just the single group
method would answer pretty logically to all of these questions.
I don't think we need to worry about cases where people mess up their code, there's not much we can do about that. That said, if there's an end_group('name')
with no matching start_group('name')
(and vice versa) then that's easy to detect and trigger an exception.
I just think it should be clear that a group has a well defined start and end. Also a single group()
probably won't work when you have nested groups.
I would also vote for an approach with 2 methods. It makes it more obvious.
VerEx()
.start_of_line()
.find('http')
.maybe('s')
.find('://')
.open_group('domain')
.maybe('www.')
.anything_but(' ')
.end_of_line()
.close_group();
In this case you specify the name only when you open it.
To avoid issues in situations like the one below we could add the group wrapper to the expression only when you call the close_group()
method.
VerEx()
.start_of_line()
.find('http')
.maybe('s')
.find('://')
.open_group('domain')
.maybe('www.');
// this would compile to: /(?:http)(?:s)?(?:\:\/\/)(?:www\.)?/gm
For debugging maybe we could add a helper method like check_groups
that will throw an error if any previous opened group is not closed.
Given the variety of languages that implement VerbalExpressions I just want to throw in my opinion that I hope we can avoid functions that throw, as some languages don't have a concept of exceptions or try/catch and so the failure case would have to be lifted, adding friction to the actual use of the API
For debugging maybe we could add a helper method like check_groups
I'd do that by default just to cut down on errors like @jehna mentioned.
I also think it's better to return lists of dicts (or namedtuples?) with results instead of SRE_Match
objects.
Or we could just omit the whole group
method for now and just make the then
and maybe
accept VerEx objects as arguments. Could this cause issues in strongly typed languages?
Also a bit related: https://github.com/VerbalExpressions/PurescriptVerbalExpressions/issues/2
Also a bit related: VerbalExpressions/PurescriptVerbalExpressions#2
Thanks for the reference. I have implemented this now. It allows to write something like this:
pattern = do
find "start"
some whitespace
possiblyV do
find "middle"
some whitespace
find "end"
I realize it probably doesn't help this discussion here too much, but its a nice example where a monadic interface has some advantages over the JS-style method-chaining interface, because we can also bind to intermediate names:
pattern = do
firstWord <- capture word
whitespace
capture word
whitespace
findAgain firstWord
This pattern matches "foo bar foo" but not "foo bar baz".
The expression x <- capture inner
creates a new capture group for the inner
expression and binds the index of this capture group to the name x
.
This also allows us to "simulate" named groups for purposes of replacing. For more details, see the Replacing with named groups example.
As implemented in Python repo's pull request by @andrii1812
https://github.com/VerbalExpressions/PythonVerbalExpressions/pull/16