GrandMoff100 / RegexFactory

Dynamically construct regex patterns with native python representations
https://regexfactory.readthedocs.io
GNU General Public License v3.0
7 stars 4 forks source link

Feature Suggestion: Add union / intersect / subset operators to the Regex Classes #17

Open jeffrey82221 opened 1 year ago

jeffrey82221 commented 1 year ago

Hi @GrandMoff100 ,

I would like to suggest a new feature for this package. Since a regex is a union of multiple patterns, it can be considered a set. How about building union (or) / intersect (and) / is subset of (in) operators to each regex classes?

For example:

Range("1", "5") | Range("4", "8")
-> Range("1", "8")

Range("1", "5") & Range("4", "8")
-> Range("4", "5")

Range("3", "4") in Range("1", "7")
-> True

I am trying to build a regex inference engine. This feature will be beneficial for organizing the regex(es) when the size of the regex object list is large. I wonder if it's possible to add this feature? I am willing to make this contribute.

GrandMoff100 commented 1 year ago

Ooooh that's an interesting feature! I think it's definitely possible. We can define custom dunder methods like __or__ and __and__ and __contains__ for the syntax you suggested. We'd also need some type checking in these methods because you can't union the patterns abc+ and [a-z] because the former isn't a really a set. What would the output of Range("1", "5") | Range("8", "10") be? Seeing as there's a gap interval between 5 and 8 that we'd have to exclude. Secondly, would those behavior only apply to Range? If not what would you like the behavior to be for other classes?

jeffrey82221 commented 1 year ago

Let's first focus on the __or__ operator, since it can help simplify the regex and make regex more generalize.

My suggestion is that we can build up rules for those regex pairs that can intuitively merged together and simply return Or or Set regex class for those that is difficult to merge.

For the cases you mentioned about:

  1. A: 'abc+'; B: r'[a-z]' -> A | B = ? -> A is a patterns ab follows by more than one c. B is a-z. -> A and B are two entirely different patterns -> Result is Or('abc+', '[a-z]')

  2. A: Range("1", "5"); B: Range("8", "9") -> A | B = ? -> A is 1-5, B is 8-9. -> A and B does not overlap -> Result is Set(Range("1", "5"), Range("8", "9"))

Of course, we need to specify the behavior between all the classes. It can be hard to elaborate, but let me try to explain my first idea.

We can first categorize the regex classes into char-level classes such as Range, Set, NotSet, and Special Chars (Any, WORD, DIGIT, ...) and compositional classes such as Amount, Multi, Optional, Or.

For the char-level classes, we can do the A | B = C operation in the following steps:

  1. Convert the two regex A and B into two groups of ascii code.
  2. Obtain the union group of the two ascii code.
  3. Identify consecutive ascii code in the group and separate the group into sub-groups of consecutive ascii code.
  4. Convert sub-groups with more than 3 element into Range regex.
  5. Convert sub-groups that match the special chars into special chars. (e.g., [0-9] -> \d)
  6. Merge the sub-groups and the Range and special chars with Set.

For the compositional regex, Amount, Multi, Optional.

  1. Check if the containing pattern are the same between A and B. If not, return Or(A, B).
  2. Check by the counting parameters (e.g., i and j in Amount) imply that the composition is overlapping. If not, return Or(A, B).
  3. Return Amount, Multi, or Optional with the correct counting parameters according to the way A and B are merged.

For Or, return Or, too.

Currently, I have no idea for the advance classes that inherent Extension. (e.g., NamedGroup)

GrandMoff100 commented 1 year ago

Sounds good. I'll review your PR when you're finished. I'll also add set operations for the advanced classes too before that. 👍