GrandMoff100 / RegexFactory

Dynamically construct regex patterns with native python representations
https://regexfactory.readthedocs.io
GNU General Public License v3.0
7 stars 4 forks source link

adding does not properly group #18

Open Alan-Chen99 opened 6 months ago

Alan-Chen99 commented 6 months ago
import re

from regexfactory import *

r = str(ANCHOR_START + Or("abc", "xyz"))
print(repr(r))
print(re.search(r, "-xyz"))

results in

'^(?:abc)|(?:xyz)'
<re.Match object; span=(1, 4), match='xyz'>

which is incorrect

GrandMoff100 commented 6 months ago

Oh interesting, I didn't know that ^ had a higher precedence than |.

I guess I can have a precedence system and surround a pattern with a non capturing group in the __radd__ method when we append a character that has a higher precedence than itself.

I'm on it.

GrandMoff100 commented 6 months ago

Thanks for posting this btw!

GrandMoff100 commented 6 months ago

@Alan-Chen99 try version regexfactory==1.0.1

Alan-Chen99 commented 6 months ago

Thanks!

I think "operations" also need to assign precedence. For ex

import re

from regexfactory import *

r = str(Or("a", "b") + Or("x", "y"))
print(repr(r))
print(re.search(r, "a"))

outputs

'(?:a)|(?:b)(?:x)|(?:y)'
<re.Match object; span=(0, 1), match='a'>

The precedence of a raw regex also need to be lower:

import re

from regexfactory import *

r = str(RegexPattern("^") + RegexPattern("a|b"))
print(repr(r))
print(re.search(r, "-b"))

outputs

'^a|b'
<re.Match object; span=(1, 2), match='b'>

I think two things need to be done:

Alan-Chen99 commented 6 months ago

@GrandMoff100

GrandMoff100 commented 6 months ago

Could you provide examples of what regex you would have it generate ideally in specific scenarios? This will help me implement the functionality you're looking for.

On another note: RegexPattern is not a class intended to be instantiated directly. Until I develop a parser system to parse raw regex strings into a tree of RegexFactory objects I don't intend to provide any support for "raw" regex strings. Without that parser system I don't want to restructure the parent class, RegexPattern, to take on a whole new functionality as a raw regex pattern class because it doesn't make sense for the children to inherit that raw string functionality.

However, the example you sent with the Or's being concatenated does look a little wonky so I will look into that in the next few days.

GrandMoff100 commented 6 months ago

Looking at the Or example more closely it looks like the b and x patterns are getting interpreted as a merged Or option. So instead of the compiled string being a two character pattern with two options per character I think it might be interpreting the pattern as "a" or "bx" or "y" with three cases. I need to confirm this, but if I'm right then this shouldn't happen and I think I just need to implement a group around Or's specifically when they get concatted. Rather than creating a precedence system for operations which I don't entirely understand how would work.

Alan-Chen99 commented 6 months ago

I made a PR which assigns a "_precedence" to a RegexPattern, which is the precedence of the "root node" if one would to parse the underlying regex.

It generates some excessive parenthesis, for ex ANCHOR_START + "ab" now returns '(?:^)ab'. the emacs rx library appears to solve this by (see https://github.com/emacs-mirror/emacs/blob/master/lisp/emacs-lisp/rx.el)

;; The `rx--translate...' functions below return (REGEXP . PRECEDENCE),
;; where REGEXP is a list of string expressions that will be
;; concatenated into a regexp, and PRECEDENCE is one of
;;
;;  t    -- can be used as argument to postfix operators (eg. "a")
;;  seq  -- can be concatenated in sequence with other seq or higher (eg. "ab")
;;  lseq -- can be concatenated to the left of rseq or higher (eg. "^a")
;;  rseq -- can be concatenated to the right of lseq or higher (eg. "a$")
;;  nil  -- can only be used in alternatives (eg. "a\\|b")
;;
;; They form a lattice:
;;
;;           t          highest precedence
;;           |
;;          seq
;;         /   \
;;      lseq   rseq
;;         \   /
;;          nil         lowest precedence

imo we should just use an extra group to keep stuff simpler

Alan-Chen99 commented 6 months ago

I don't intend to provide any support for "raw" regex strings

doesnt a literal str currently represent a "raw" regex?

GrandMoff100 commented 6 months ago

doesnt a literal str currently represent a "raw" regex?

Yeah, I suppose concatenating a literal string would represent "raw" regex, but what I meant was that I don't intend to be responsible for behavior of regexfactory when users concat literal strings because there are so many edge cases and head-aches that I don't want to deal with. Now that I think about it, we should raise exception when we try to concat a non-RegexPattern object. (i.e. ANCHOR_START + "ab").

I think then I'd also add LiteralString class as well in the case where you want to match a specific string of text. (This would let us do something like ANCHOR_START + LiteralString("ab") ---> ^ab. That would make my life easier when I parse raw regex expressions in RegexFactory components, to have a dedicated component for literal text patterns which keeps type consistency.

imo we should just use an extra group to keep stuff simpler

yeah, by that I presume you mean putting a group around the Or's when they get concatenated together? If so, that's what I wanted to do to begin with:

I think I just need to implement a group around Or's specifically when they get concatted.

Alan-Chen99 commented 6 months ago

extra group to keep stuff simpler

actually i meant '(?:^)ab'.

I thought one need to use the diamond above, but actually regex in python seem to work differently then elisp

it seems that i can treat "^" just as a normal char. not sure though, its marked as having a precedence below concatenation according to https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04_08, so if we follow that "^ab" is invalid and we need to make '(?:^)ab' but turns out python just treat ^ as a char?

GrandMoff100 commented 6 months ago

If you wanted to match the string "^ab" literally you can use the pattern \^ab you just have to escape the ^