firasdib / Regex101

This repository is currently only used for issue tracking for www.regex101.com
3.26k stars 199 forks source link

eregex (posix-extended) flavor #165

Open hammer opened 9 years ago

hammer commented 9 years ago

As implemented by Boost, for example.

firasdib commented 9 years ago

I'm pretty sure all of that is already implemented. Is there something specific?

hammer commented 9 years ago

I see only "pcre (php)", "javascript", and "python" in the "flavor" section.

firasdib commented 9 years ago

Yes, but POSIX-Extended regex falls under the PCRE-category.

On 06 Nov 2014, at 19:00, Jeff Hammerbacher notifications@github.com wrote:

I see only "pcre (php)", "javascript", and "python" in the "flavor" section.

— Reply to this email directly or view it on GitHub https://github.com/firasdib/Regex101/issues/165#issuecomment-62022306.

nhahtdh commented 9 years ago

@firasdib: POSIX's matching behavior is different from PCRE/JavaScript/Python. POSIX searches for left-most longest string, while most other flavors gives the left most string, which is the first string which matches the expansion of the search tree.

A simple example is a|aaa on input string "aaa". POSIX will always match "aaa" without fail, and other flavors will match "a".

By the way, collation is one feature available in POSIX which most other flavors are lacking of. The feature is powerful, and allows you to search for all characters similar to a with very simple syntax.

firasdib commented 9 years ago

I see. In that case I would need to compile a POSIX-regex engine down to JS. Is there a stand-alone C-library I can use, @nhahtdh ?

nhahtdh commented 9 years ago

@firasdib: You may want to take a look at Boost Regex library, but I don't know whether it implements any extension to the base specification or not, since the document at opengroup leaves some undefined behavior.

Another thing is the locale-specific behavior of POSIX regex, which is currently not implemented in the interface.

cheako commented 7 years ago

I'm not sure what you'r looking for, but I would start with glibc implementation.

  1. https://code.woboq.org/userspace/glibc/posix/regex_internal.c.html
  2. https://code.woboq.org/userspace/glibc/posix/regexec.c.html
cheako commented 7 years ago

I think regex support is included in emscripten libc/musl/src/regex, you just need to write a wrapper and compile it.

cheako commented 7 years ago

I've tested this out on a simple example and emcc did all on it's own build and include the backend functions: https://github.com/cheako/glibc/blob/emscripten-hack/test.js#L12716 https://github.com/cheako/glibc/blob/emscripten-hack/test.js#L7033 Example usage: https://github.com/cheako/glibc/blob/emscripten-hack/test.js#L1943

cheako commented 6 years ago

Now that there is a Javascript implementation, how long until we can see this being added?

izxle commented 6 years ago

Is this still being developed?

cheako commented 6 years ago

I figured out what I needed by writing many small regex programs, brute force. Though I would still like to see this implemented in case I need it in the future.

firasdib commented 5 years ago

@cheako If you could explain the process more thoroughly, I could look into this again.

Is this related to https://github.com/firasdib/Regex101/issues/431 ?

cheako commented 5 years ago

That's great. All I had to do was use https://www.gnu.org/software/libc/manual/html_node/Matching-POSIX-Regexps.html and emscripten includes support automatically. As for if this is also the same as used in boost I wouldn't know.

Just keep in mind that there are a number of flags used when compiling posix regex, like to use an regex extension.

firasdib commented 5 years ago

@cheako Thanks. I'll have to read up and see what the implications of implementing this flavor is.

pke commented 4 years ago

Just ran into this when I tried my fav regex tester with a POSIX regex and it validated \d+ which is not valid in POSIX regex. It should be [0-9]+ or [[:digit:]]

So this tool needs a true POSIX flavour ;)

Roy-Orbison commented 3 years ago

Getting that 7-year itch? :hourglass:

cheako commented 3 years ago

Hey, now this could be implemented simply with wasm. If it wasn't seen previously as simple.

tgross35 commented 2 years ago

I wasted my time putting a good comment in #1230 before I saw this issue. So I'll just copy it to here:

I can probably add some context here (in addition to an upvote for the feature request). From the grep man page:

grep understands three different versions of regular expression syntax: “basic” (BRE), “extended” (ERE) and “perl” (PCRE). In GNU grep there is no difference in available functionality between basic and extended syntaxes. In other implementations, basic regular expressions are less powerful. The following description applies to extended regular expressions; differences for basic regular expressions are summarized afterwards. Perl-compatible regular expressions give additional functionality, and are documented in pcresyntax(3) and pcrepattern(3), but work only if PCRE is available in the system.

The rest of that page goes on to document BRE vs. ERE and is a great reference for anybody looking. Long story short, the only difference between the two is escaping and handling of special characters (?, +, (), {}, |). Support is as follows:

  • awk: only supports ERE
  • sed: supports BRE (default) or ERE (with -E)
  • grep: supports BRE (default) or ERE (with -E). May support PCRE (with -P) but this is pretty uncommon; bourne shell doesn't support it

BRE (default) and ERE (activated with the -E flag) are the two implementations that are always available, and are also used for sed. -P for PCRE is out there but pretty uncommon, and not available with sed or awk anyway.

So, long story short, adding BRE and ERE support to the best string-related resource out there would be amazing!

The wasm idea really isn't bad. The usual rust regex parser is closer to perl than BRE/ERE and I don't see any other implementations, but that would honestly be kind of fun to code.

UrsineRaven commented 1 year ago

I just want to add my support for adding both POSIX BRE and ERE. Relevant excerpt from re_format(7) man page:

REGEX(7)                            Linux Programmer's Manual                            REGEX(7)

NAME
       regex - POSIX.2 regular expressions

DESCRIPTION
       Regular expressions ("RE"s), as defined in POSIX.2, come in two forms: modern REs (roughly
       those of egrep; POSIX.2 calls these "extended" REs) and obsolete  REs  (roughly  those  of
       ed(1); POSIX.2 "basic" REs).  Obsolete REs mostly exist for backward compatibility in some
       old programs; they will be discussed at the end.  POSIX.2 leaves some aspects of RE syntax
       and  semantics open; "(!)" marks decisions on these aspects that may not be fully portable
       to other POSIX.2 implementations.
knollet commented 1 year ago

Yes, but POSIX-Extended regex falls under the PCRE-category.

On 06 Nov 2014, at 19:00, Jeff Hammerbacher notifications@github.com wrote: I see only "pcre (php)", "javascript", and "python" in the "flavor" section. — Reply to this email directly or view it on GitHub #165 (comment).

no, it doesn't

grafik the second line should match (and in fact does with libregex), and the parenthesised group should just be t. In regex101 it doesn't match with any flavour.

firasdib commented 1 year ago

Did you include the g flag?

I've never encountered the {m}? pattern before, what is that expected to match exactly?

cheako commented 1 year ago

I haven't looked into this, but just a plain reading and I see. The first line captures 51 chars and the second line only captures 1. For the pattern to make sense, though, it should end in $?

I remember from when I was trying to use this, I tested a good number of settings and couldn't find one that matched experiment. If you feel a certain set of settings matches the behavior, please add an alias for those settings. That way, ppl won't have to go back and forth with you as the settings are further refined. I believe the power of this tool is the confidence using it gives, if users are required to guess settings that say one thing but are meant to mean another that removes this valuable feature.

Roy-Orbison commented 1 year ago

@pke

Just ran into this when I tried my fav regex tester with a POSIX regex and it validated \d+ which is not valid in POSIX regex. It should be [0-9]+ or [[:digit:]]

On the day you commented that, the current release of boost was v1.73.0 which supports \d. However, it's not universal so regex101 would probably need to allow such escape sequences, but put a warning in the Explanation panel.

@knollet You have an off-by-one error. Your test pattern matches 51 characters in the subgroup, not 50: https://regex101.com/r/A12aQJ/1

@firasdib The {m}? is valid syntax for making the previous atom ungreedy, but since {m} is a fixed length, the ? does nothing. It would have to be {m,n}? or have an extra grouping like (.{50})? to have any effect.


Still hoping ERE will be supported one day.

cheako commented 1 year ago

@pke

You have an off-by-one error. Your test pattern matches 51 characters in the subgroup, not 50: https://regex101.com/r/A12aQJ/1

https://jdoodle.com/ia/Ktj

I did test this, I got .{2}?. is either 3 or 1 in size.

// Online C compiler to run C program online
#include <stdio.h>

#include <string.h>

#include <regex.h>

int main() {
  char * source_a = "1234";
  char * source_b = "12";
  char * regexString = "(.{2}?.).*$";
  size_t maxGroups = 2;

  regex_t regexCompiled;
  regmatch_t groupArray[maxGroups];

  if (regcomp( & regexCompiled, regexString, REG_EXTENDED)) {
    printf("Could not compile regular expression.\n");
    return 1;
  };

  {
    char * source = source_a;
    if (regexec( & regexCompiled, source, maxGroups, groupArray, 0) == 0) {
      unsigned int g = 0;
      for (g = 0; g < maxGroups; g++) {
        if (groupArray[g].rm_so == (size_t) - 1)
          break; // No more groups

        char sourceCopy[strlen(source) + 1];
        strcpy(sourceCopy, source);
        sourceCopy[groupArray[g].rm_eo] = 0;
        printf("Group %u: [%2ld-%2ld]: %s\n",
          g, groupArray[g].rm_so, groupArray[g].rm_eo,
          sourceCopy + groupArray[g].rm_so);
      }
    }
  }

  {
    char * source = source_b;
    if (regexec( & regexCompiled, source, maxGroups, groupArray, 0) == 0) {
      unsigned int g = 0;
      for (g = 0; g < maxGroups; g++) {
        if (groupArray[g].rm_so == (size_t) - 1)
          break; // No more groups

        char sourceCopy[strlen(source) + 1];
        strcpy(sourceCopy, source);
        sourceCopy[groupArray[g].rm_eo] = 0;
        printf("Group %u: [%2ld-%2ld]: %s\n",
          g, groupArray[g].rm_so, groupArray[g].rm_eo,
          sourceCopy + groupArray[g].rm_so);
      }
    }
  }
  regfree( & regexCompiled);

  return 0;
}
Group 0: [ 0- 4]: 1234
Group 1: [ 0- 3]: 123
Group 0: [ 0- 2]: 12
Group 1: [ 0- 1]: 1
Roy-Orbison commented 1 year ago

@cheako I meant in @knollet's PCRE example, the ? does not behave as a {0,1} shortcut. It's acting as an ungreedy modifier for an atom that cannot match fewer than 50 characters, so it shouldn't be expected that the () group will match 50 characters (because of the extra .). That makes the assertion "the second line should match" untrue.

Clearly the behaviour in ERE is different, as you demonstrated, which adds to the case for a separate ERE flavour.

knollet commented 1 year ago

... so it shouldn't be expected that the () group will match 50 characters (because of the extra .). That makes the assertion "the second line should match" untrue.

I didn't say I expected the first group to match 50. I know it matches 51 OR 1. I said, the second line should match completely. The parenthesized group matches t (1 char, the last . in the group), and the .* (greedily) matches the rest. A $ for it to "make sense" is not really neccessary, it only may be for some use cases, which I am not talking about here.

@cheako I meant in @knollet's PCRE example, the ? does not behave as a {0,1} shortcut. It's acting as an ungreedy modifier

How would it be possible for ? to act as an ungreedy modifier on something, which in itself has no options? x{50} matches exactly 50 xes. Not "one OR more" as +, which you could modify with "as few as possible" or "as many as possible".

But ok, the very similar x{1,50} on the other hand would make sense to modify. I would have to check if it is. Still, there is a certani behaviour of posix-extended which is not represented in any of these flavours.

knollet commented 1 year ago

@firasdib

Did you include the g flag?

Yes. But I don't see what difference that should make. Even without g as meaning "global" giving all the matches instead of only the first should still return the ... first. But it returns none.

Roy-Orbison commented 1 year ago

@knollet I misunderstood you. You meant ‘If PCRE and ERE had equivalent matching, the second line would match in this PCRE example, but it doesn't.’ We can verify your example with grep as an ERE:

grep -E '(.{50}?.).*' <<<'the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 123456
the quick brown fox jumps over the lazy dog 1234567'
the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 123456
the quick brown fox jumps over the lazy dog 1234567

then PCRE:

grep -P '(.{50}?.).*' <<<'the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 123456
the quick brown fox jumps over the lazy dog 1234567'
the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 1234567
cheako commented 1 year ago

I don't get how that test is accurate, you are not printing the capture group.

I'm unsure where grep(is that a GNU extension?) gets its PCRE *implementation so for those playing along at home YMMV, if you don't get good results you can try this emulator: https://www.tutorialspoint.com/linux_terminal_online.php

Roy-Orbison commented 1 year ago

@cheako It doesn't need to print the subgroup, it's a logical conclusion. The PCRE test simply doesn't match the line that's only 50 characters (ending in 6), but the ERE does. The pattern as a whole is greedy (in both flavours) because of the trailing .* so can only ever match once per line. Therefore, the ERE matching the line at all means it must be capturing only 1 character in the group, followed by ‘the rest’. This agrees with the findings of your C program.

According to this post, grep has had -P support, using libpcre for about 23 years, so its man page warning about it being "experimental" means it's been a very long experiment. Regardless, Perl is not PCRE, so if you distrust grep, you can get the same result directly from PCRE using pcregrep or pcre2grep:

pcre2grep '(.{50}?.).*' <<<'the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 123456
the quick brown fox jumps over the lazy dog 1234567'
the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 1234567

I also found this article that suggests the {m}? construct may be undefined behaviour but, much like grep's "expiremental" -P, is probably consistent because maintainers of old software don't like breaking compatibility.

knollet commented 1 year ago

According to this post, grep has had -P support, using libpcre for about 23 years, so its man page warning about it being "experimental" means it's been a very long experiment. Regardless, Perl is not PCRE, so if you distrust grep, you can get the same result directly from PCRE using pcregrep or pcre2grep

This is just madness

jxu commented 1 year ago

If this hasn't been added in nine years and it's not open source, the author won't implement it. Unfortunately I haven't found an open source tool as good.

firasdib commented 1 year ago

@jxu: if you can transpile it to js, I can work with it. Vänliga hälsningar / Best regards,Firas DibOn 5 Aug 2023, at 04:25, jxu @.***> wrote: If this hasn't been added in nine years and it's not open source, the author won't implement it. Unfortunately I haven't found an open source tool as good.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

cheako commented 1 year ago

@firasdib That was done, but honestly at this point it would be better to compile the C into wasm. Either solution would work. Start with the connecting tissue between the POSIX API and the internal representation.

AverseABFun commented 6 months ago

Is this still being worked on? I love regex101, but I am examining the xz backdoor and it would be VERY helpful to be able to parse grep regexes

tcodes0 commented 2 months ago

just another me too comment

tgross35 commented 2 months ago

If anybody wants to help move this along, it is possible to provide an implementation that the author will be able to pick up.

I think the easiest way to support BRE and ERE is probably to fork my repo, which was used to add Rust to regex101, (https://github.com/tgross35/wasm-regex), and add new entrypoints that make use of the onig crate (bindings to Oniguruma). Use Syntax::posix_basic() or Syntax::posix_extended() for BRE/ERE.

Swapping regex implementations within my repo should be easy since only the engine needs to change, the annoying parts stay the same (UTF8 <-> UTF16 transcoding, UTF16 offset calculation, serialization, interfaces, list replacements, flags, CI, building for WASM, size minimization ...).

The onig crate also provides a ton of other flavors - emacs, ruby, perl, python, GNU, java. In theory you could probably replace the existing Java and Python implementations if their WASM is heavy 😄.

See also the discussion we had adding Rust https://github.com/firasdib/Regex101/issues/1208

cheako commented 2 months ago

As I said back on https://github.com/firasdib/Regex101/issues/165#issuecomment-332087227 enscripten will compile the original C into javascript... However at this point I's compile the original C into WASM. Either way, it's trivially easy to get the OG code into a format executable by modern browsers. There is no need to assist anyone in doing so, that person just needs to try for once.

jxu commented 2 months ago

If anyone is making an open source alternative, I will help to the extent I can