SonOfLilit / kleenexp

modern regular expression syntax everywhere with a painless upgrade path
MIT License
73 stars 16 forks source link

Inline flags #25

Open SonOfLilit opened 2 years ago

SonOfLilit commented 2 years ago

Implement syntax for inline flags, e.g.:

[ignore_case "hello, world, my name is " [ignore_case:unset #uppercase [1+ #lowercase]]]

Should compile to:

(?i)hello, world, my name is (?-i:[A-Z][a-z]+)

See https://docs.python.org/3/library/re.html section (?aiLmsux-imsx:...)

PrakritiRaw commented 2 years ago

can you please elaborate the issue a little more.

SonOfLilit commented 2 years ago

Python supports a feature called "inline flags", which you can read about in the python regular expression docs.

See https://docs.python.org/3/library/re.html section (?aiLmsux-imsx:...)

Currently, there is no syntax in Kleenexp that would compile to this regex syntax.

This issue proposes syntax for this. e.g.

[ignore_case "hello, world, my name is " [ignore_case:unset #uppercase [1+ #lowercase]]]

Should compile to:

(?i)hello, world, my name is (?-i:[A-Z][a-z]+)

This would involve adding a line to builtin_operators in compile.py, adding a class to asm.py that implements an inline flag, and implementing methods like to_regex() on it (see other similar asm classes like Capture or Lookahead).

Then we could do something similar for the rust compiler, the mapping is pretty straightforward but with different names (Regexable instead of Asm).

SonOfLilit commented 2 years ago

Names for all the settings:

[ascii_only ...]
[ignore_case ...]
[locale_dependent ...]
[multiline ...]
[any_matches_all ...]
[unicode ...]

We obviously don't need to support the verbose flag.

Yehuda-blip commented 1 year ago

Hey, I started implementing this feature - a few questions: a. I found a syntax error in asm.py on line 228, not really sure what that is or why it's there. b. Assuming I want to open a draft PR, should I push changes to here or to my own fork? I am not sure about the standards you are trying to withhold and some think that you might consider some of the computation I do more fitting for the compile stage rather than the assembly stage (mainly: validating that a flag cannot be set and unset in the same flag-setting. the expression re.compile(r"(?i-i:example)" raises "bad inline flags" and I figured that so should "ignore_case ignore_case:unset "example"). Since I'm new here (and to open-source as a whole) I figured somebody will have a lot of comments. c. Running the pytest command shows a NotImplemented failure on asm.py line 21 - again, not sure if that's meant to be there. d. I've yet to write any tests (or figure out how the testing works), but I have some code here: https://github.com/Yehuda-blip/kleenexp/tree/feature/add_inline_flags (yehudaKLEIN and yehuda-blip are both myself, I'll fix that problem at some point)

e. Almost forgot: There seems to be a special case for inline flags in the beginning of a regular expression that's considered as some global parameter to the compiling function (in the example above, that would be the '(?i)' in '(?i)hello...'. If I remember correctly a syntax node in the current implementation holds a position value that can be used in the special case parenthesis wrapping of these global flags, but since these flags are apparently meant to replace actual parameters to the re.compile(), and since the original re has a different syntax for these, I thought maybe a special syntax should be considered here? I figured it might prove useful when building the auto-convert tool for python (although very likely it will not be, just figured I'd ask).

SonOfLilit commented 1 year ago

Hi Yehuda, good to hear you're trying your hand at this. Welcome to kleenexp!

a. Seems like someone accidentally pressed Enter while a symbol was selected. Fixed.

b. Fork, push to fork, open a PR against my repo. Specifically, I don't think we should raise for this case, just use the innermost setting, but I could be convinced otherwise.

c. I'm going to guess you're testing the rust compiler. Try KLEENEXP_RUST=0 pytest, do you still get an error?

d. Look at test_api.py for how to write the simplest kind of test, e.g. test_ip() has both test of ke.re() and of ke.match().

e. As far as I understand, we never need this special case, right? Nothing wrong will happen if we only implement the local version and not the global version, and nothing will be felt as missing by our users?

Yehuda-blip commented 1 year ago

b. Innermost setting it is.

c. I know for a fact that I'm running python tests because my tests keep breaking :) the error is still there. image

e. From what I can see there is no way to use an operator outside of a "[]", which means that it's impossible to set a flag and for it to "feel" global - however I don't think this is reason enough to add special Behavior.

f. Just to make sure I'm not missing something, It IS impossible to use an operator outside box braces or without anything to operate on, right? good: [ascii_only "operator input"] bad: [ascii_only], [ascii_only "operator input" ignore_case "operator input 2"] , [ascii_only ignore_case operator input 2] not-an-operator: ascii_only "operator input"

g. There are some changes on flag compatibilty between python 3.6 and 3.7. I'm currently writing under the assumption that this feature is compatible with python 3.10 (meaning it is compatible with 3.7 and does not support the weird noflag thing).

h. Again - almost forgot: In the last 4 hours I learned that a string being a bytes-like affects the validity of locale_dependent and unicode flags. I also saw that there is some implementation regarding passing flags to the compile function, but I didn't see any validations on this matter. I'm not about to deal with it in this PR, but I figured it's worth noting.

SonOfLilit commented 1 year ago

c. I think this line is your fix to the typo you found in (a)? Use my fix instead (return False) and it will work.

e. Agreed. This is a design mistake in Regex that we're fixing.

f. I don't remember if bad[0] is always a syntax error or sometimes just meaningless, but otherwise you're correct.

g. I just went over all the "Changed in version 3.x" notes in the docs, none of them seems to affect us in the sense that users will get the expected behavior from their version (e.g. flags are correctly ints in old versions and RegexFlag in new versions, NOFLAG behaves identically in all 3.x since it was always just an alias for 0 and I define it this way if it doesn't exist in the re module)

h. If a user tries to use LOCALE with a non-byteslike, or pass in LOCALE | ASCII, they will get an error from re.compile(), as intended. We aim for full bug-for-bug compatibility by wrapping the existing API.

SonOfLilit commented 1 year ago

Also, you're being very thorough, that's a good habit to have!

Yehuda-blip commented 1 year ago

g + h: My PR includes some validations on the flags locale_dependent, ascii_only and unicode. From what I understand, you would want these removed, and the error messages should come from regex and not kleenexp?

SonOfLilit commented 1 year ago

I'll have to have a look at the code before I form an opinion, but my gut instinct is to rely on the re validations and maybe just wrap their exception with a CompileError.