Open Alan-Chen99 opened 6 months ago
Oh interesting, I didn't know that ^
had a higher precedence than |
.
I guess I can have a precedence system and surround a pattern with a non capturing group in the __radd__
method when we append a character that has a higher precedence than itself.
I'm on it.
Thanks for posting this btw!
@Alan-Chen99 try version regexfactory==1.0.1
Thanks!
I think "operations" also need to assign precedence. For ex
import re
from regexfactory import *
r = str(Or("a", "b") + Or("x", "y"))
print(repr(r))
print(re.search(r, "a"))
outputs
'(?:a)|(?:b)(?:x)|(?:y)'
<re.Match object; span=(0, 1), match='a'>
The precedence of a raw regex also need to be lower:
import re
from regexfactory import *
r = str(RegexPattern("^") + RegexPattern("a|b"))
print(repr(r))
print(re.search(r, "-b"))
outputs
'^a|b'
<re.Match object; span=(1, 2), match='b'>
I think two things need to be done:
@GrandMoff100
Could you provide examples of what regex you would have it generate ideally in specific scenarios? This will help me implement the functionality you're looking for.
On another note: RegexPattern is not a class intended to be instantiated directly. Until I develop a parser system to parse raw regex strings into a tree of RegexFactory objects I don't intend to provide any support for "raw" regex strings. Without that parser system I don't want to restructure the parent class, RegexPattern, to take on a whole new functionality as a raw regex pattern class because it doesn't make sense for the children to inherit that raw string functionality.
However, the example you sent with the Or's being concatenated does look a little wonky so I will look into that in the next few days.
Looking at the Or example more closely it looks like the b and x patterns are getting interpreted as a merged Or option. So instead of the compiled string being a two character pattern with two options per character I think it might be interpreting the pattern as "a" or "bx" or "y" with three cases. I need to confirm this, but if I'm right then this shouldn't happen and I think I just need to implement a group around Or's specifically when they get concatted. Rather than creating a precedence system for operations which I don't entirely understand how would work.
I made a PR which assigns a "_precedence" to a RegexPattern, which is the precedence of the "root node" if one would to parse the underlying regex.
It generates some excessive parenthesis, for ex ANCHOR_START + "ab"
now returns '(?:^)ab'. the emacs rx library appears to solve this by
(see https://github.com/emacs-mirror/emacs/blob/master/lisp/emacs-lisp/rx.el)
;; The `rx--translate...' functions below return (REGEXP . PRECEDENCE),
;; where REGEXP is a list of string expressions that will be
;; concatenated into a regexp, and PRECEDENCE is one of
;;
;; t -- can be used as argument to postfix operators (eg. "a")
;; seq -- can be concatenated in sequence with other seq or higher (eg. "ab")
;; lseq -- can be concatenated to the left of rseq or higher (eg. "^a")
;; rseq -- can be concatenated to the right of lseq or higher (eg. "a$")
;; nil -- can only be used in alternatives (eg. "a\\|b")
;;
;; They form a lattice:
;;
;; t highest precedence
;; |
;; seq
;; / \
;; lseq rseq
;; \ /
;; nil lowest precedence
imo we should just use an extra group to keep stuff simpler
I don't intend to provide any support for "raw" regex strings
doesnt a literal str currently represent a "raw" regex?
doesnt a literal str currently represent a "raw" regex?
Yeah, I suppose concatenating a literal string would represent "raw" regex, but what I meant was that I don't intend to be responsible for behavior of regexfactory when users concat literal strings because there are so many edge cases and head-aches that I don't want to deal with. Now that I think about it, we should raise exception when we try to concat a non-RegexPattern object. (i.e. ANCHOR_START + "ab"
).
I think then I'd also add LiteralString
class as well in the case where you want to match a specific string of text. (This would let us do something like ANCHOR_START + LiteralString("ab")
---> ^ab
. That would make my life easier when I parse raw regex expressions in RegexFactory components, to have a dedicated component for literal text patterns which keeps type consistency.
imo we should just use an extra group to keep stuff simpler
yeah, by that I presume you mean putting a group around the Or's when they get concatenated together? If so, that's what I wanted to do to begin with:
I think I just need to implement a group around Or's specifically when they get concatted.
extra group to keep stuff simpler
actually i meant '(?:^)ab'.
I thought one need to use the diamond above, but actually regex in python seem to work differently then elisp
it seems that i can treat "^" just as a normal char. not sure though, its marked as having a precedence below concatenation according to https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04_08, so if we follow that "^ab" is invalid and we need to make '(?:^)ab' but turns out python just treat ^ as a char?
If you wanted to match the string "^ab" literally you can use the pattern \^ab
you just have to escape the ^
results in
which is incorrect