Open hammer opened 10 years ago
I'm pretty sure all of that is already implemented. Is there something specific?
I see only "pcre (php)", "javascript", and "python" in the "flavor" section.
Yes, but POSIX-Extended regex falls under the PCRE-category.
On 06 Nov 2014, at 19:00, Jeff Hammerbacher notifications@github.com wrote:
I see only "pcre (php)", "javascript", and "python" in the "flavor" section.
— Reply to this email directly or view it on GitHub https://github.com/firasdib/Regex101/issues/165#issuecomment-62022306.
@firasdib: POSIX's matching behavior is different from PCRE/JavaScript/Python. POSIX searches for left-most longest string, while most other flavors gives the left most string, which is the first string which matches the expansion of the search tree.
A simple example is a|aaa
on input string "aaa". POSIX will always match "aaa" without fail, and other flavors will match "a".
By the way, collation is one feature available in POSIX which most other flavors are lacking of. The feature is powerful, and allows you to search for all characters similar to a
with very simple syntax.
I see. In that case I would need to compile a POSIX-regex engine down to JS. Is there a stand-alone C-library I can use, @nhahtdh ?
@firasdib: You may want to take a look at Boost Regex library, but I don't know whether it implements any extension to the base specification or not, since the document at opengroup leaves some undefined behavior.
Another thing is the locale-specific behavior of POSIX regex, which is currently not implemented in the interface.
I'm not sure what you'r looking for, but I would start with glibc implementation.
I think regex support is included in emscripten libc/musl/src/regex, you just need to write a wrapper and compile it.
I've tested this out on a simple example and emcc did all on it's own build and include the backend functions: https://github.com/cheako/glibc/blob/emscripten-hack/test.js#L12716 https://github.com/cheako/glibc/blob/emscripten-hack/test.js#L7033 Example usage: https://github.com/cheako/glibc/blob/emscripten-hack/test.js#L1943
Now that there is a Javascript implementation, how long until we can see this being added?
Is this still being developed?
I figured out what I needed by writing many small regex programs, brute force. Though I would still like to see this implemented in case I need it in the future.
@cheako If you could explain the process more thoroughly, I could look into this again.
Is this related to https://github.com/firasdib/Regex101/issues/431 ?
That's great. All I had to do was use https://www.gnu.org/software/libc/manual/html_node/Matching-POSIX-Regexps.html and emscripten includes support automatically. As for if this is also the same as used in boost I wouldn't know.
Just keep in mind that there are a number of flags used when compiling posix regex, like to use an regex extension.
@cheako Thanks. I'll have to read up and see what the implications of implementing this flavor is.
Just ran into this when I tried my fav regex tester with a POSIX regex and it validated \d+
which is not valid in POSIX regex. It should be [0-9]+
or [[:digit:]]
So this tool needs a true POSIX flavour ;)
Getting that 7-year itch? :hourglass:
Hey, now this could be implemented simply with wasm. If it wasn't seen previously as simple.
I wasted my time putting a good comment in #1230 before I saw this issue. So I'll just copy it to here:
I can probably add some context here (in addition to an upvote for the feature request). From the grep man page:
grep understands three different versions of regular expression syntax: “basic” (BRE), “extended” (ERE) and “perl” (PCRE). In GNU grep there is no difference in available functionality between basic and extended syntaxes. In other implementations, basic regular expressions are less powerful. The following description applies to extended regular expressions; differences for basic regular expressions are summarized afterwards. Perl-compatible regular expressions give additional functionality, and are documented in pcresyntax(3) and pcrepattern(3), but work only if PCRE is available in the system.
The rest of that page goes on to document BRE vs. ERE and is a great reference for anybody looking. Long story short, the only difference between the two is escaping and handling of special characters (?, +, (), {}, |). Support is as follows:
- awk: only supports ERE
- sed: supports BRE (default) or ERE (with -E)
- grep: supports BRE (default) or ERE (with -E). May support PCRE (with -P) but this is pretty uncommon; bourne shell doesn't support it
BRE (default) and ERE (activated with the -E flag) are the two implementations that are always available, and are also used for sed. -P for PCRE is out there but pretty uncommon, and not available with sed or awk anyway.
So, long story short, adding BRE and ERE support to the best string-related resource out there would be amazing!
The wasm idea really isn't bad. The usual rust regex parser is closer to perl than BRE/ERE and I don't see any other implementations, but that would honestly be kind of fun to code.
I just want to add my support for adding both POSIX BRE and ERE. Relevant excerpt from re_format(7) man page:
REGEX(7) Linux Programmer's Manual REGEX(7)
NAME
regex - POSIX.2 regular expressions
DESCRIPTION
Regular expressions ("RE"s), as defined in POSIX.2, come in two forms: modern REs (roughly
those of egrep; POSIX.2 calls these "extended" REs) and obsolete REs (roughly those of
ed(1); POSIX.2 "basic" REs). Obsolete REs mostly exist for backward compatibility in some
old programs; they will be discussed at the end. POSIX.2 leaves some aspects of RE syntax
and semantics open; "(!)" marks decisions on these aspects that may not be fully portable
to other POSIX.2 implementations.
Yes, but POSIX-Extended regex falls under the PCRE-category.
On 06 Nov 2014, at 19:00, Jeff Hammerbacher notifications@github.com wrote: I see only "pcre (php)", "javascript", and "python" in the "flavor" section. — Reply to this email directly or view it on GitHub #165 (comment).
no, it doesn't
the second line should match (and in fact does with libregex), and the parenthesised group should just be t
. In regex101 it doesn't match with any flavour.
Did you include the g
flag?
I've never encountered the {m}?
pattern before, what is that expected to match exactly?
I haven't looked into this, but just a plain reading and I see. The first line captures 51 chars and the second line only captures 1. For the pattern to make sense, though, it should end in $
?
I remember from when I was trying to use this, I tested a good number of settings and couldn't find one that matched experiment. If you feel a certain set of settings matches the behavior, please add an alias for those settings. That way, ppl won't have to go back and forth with you as the settings are further refined. I believe the power of this tool is the confidence using it gives, if users are required to guess settings that say one thing but are meant to mean another that removes this valuable feature.
@pke
Just ran into this when I tried my fav regex tester with a POSIX regex and it validated
\d+
which is not valid in POSIX regex. It should be[0-9]+
or[[:digit:]]
On the day you commented that, the current release of boost was v1.73.0 which supports \d
. However, it's not universal so regex101 would probably need to allow such escape sequences, but put a warning in the Explanation panel.
@knollet You have an off-by-one error. Your test pattern matches 51 characters in the subgroup, not 50: https://regex101.com/r/A12aQJ/1
@firasdib The {m}?
is valid syntax for making the previous atom ungreedy, but since {m}
is a fixed length, the ?
does nothing. It would have to be {m,n}?
or have an extra grouping like (.{50})?
to have any effect.
Still hoping ERE will be supported one day.
@pke
You have an off-by-one error. Your test pattern matches 51 characters in the subgroup, not 50: https://regex101.com/r/A12aQJ/1
I did test this, I got .{2}?.
is either 3 or 1 in size.
// Online C compiler to run C program online
#include <stdio.h>
#include <string.h>
#include <regex.h>
int main() {
char * source_a = "1234";
char * source_b = "12";
char * regexString = "(.{2}?.).*$";
size_t maxGroups = 2;
regex_t regexCompiled;
regmatch_t groupArray[maxGroups];
if (regcomp( & regexCompiled, regexString, REG_EXTENDED)) {
printf("Could not compile regular expression.\n");
return 1;
};
{
char * source = source_a;
if (regexec( & regexCompiled, source, maxGroups, groupArray, 0) == 0) {
unsigned int g = 0;
for (g = 0; g < maxGroups; g++) {
if (groupArray[g].rm_so == (size_t) - 1)
break; // No more groups
char sourceCopy[strlen(source) + 1];
strcpy(sourceCopy, source);
sourceCopy[groupArray[g].rm_eo] = 0;
printf("Group %u: [%2ld-%2ld]: %s\n",
g, groupArray[g].rm_so, groupArray[g].rm_eo,
sourceCopy + groupArray[g].rm_so);
}
}
}
{
char * source = source_b;
if (regexec( & regexCompiled, source, maxGroups, groupArray, 0) == 0) {
unsigned int g = 0;
for (g = 0; g < maxGroups; g++) {
if (groupArray[g].rm_so == (size_t) - 1)
break; // No more groups
char sourceCopy[strlen(source) + 1];
strcpy(sourceCopy, source);
sourceCopy[groupArray[g].rm_eo] = 0;
printf("Group %u: [%2ld-%2ld]: %s\n",
g, groupArray[g].rm_so, groupArray[g].rm_eo,
sourceCopy + groupArray[g].rm_so);
}
}
}
regfree( & regexCompiled);
return 0;
}
Group 0: [ 0- 4]: 1234
Group 1: [ 0- 3]: 123
Group 0: [ 0- 2]: 12
Group 1: [ 0- 1]: 1
@cheako I meant in @knollet's PCRE example, the ?
does not behave as a {0,1}
shortcut. It's acting as an ungreedy modifier for an atom that cannot match fewer than 50 characters, so it shouldn't be expected that the ()
group will match 50 characters (because of the extra .
). That makes the assertion "the second line should match" untrue.
Clearly the behaviour in ERE is different, as you demonstrated, which adds to the case for a separate ERE flavour.
... so it shouldn't be expected that the
()
group will match 50 characters (because of the extra.
). That makes the assertion "the second line should match" untrue.
I didn't say I expected the first group to match 50. I know it matches 51 OR 1.
I said, the second line should match completely. The parenthesized group matches t
(1 char, the last .
in the group), and the .*
(greedily) matches the rest. A $
for it to "make sense" is not really neccessary, it only may be for some use cases, which I am not talking about here.
@cheako I meant in @knollet's PCRE example, the ? does not behave as a {0,1} shortcut. It's acting as an ungreedy modifier
How would it be possible for ?
to act as an ungreedy modifier on something, which in itself has no options? x{50}
matches exactly 50 x
es. Not "one OR more" as +
, which you could modify with "as few as possible" or "as many as possible".
But ok, the very similar x{1,50}
on the other hand would make sense to modify. I would have to check if it is.
Still, there is a certani behaviour of posix-extended which is not represented in any of these flavours.
@firasdib
Did you include the g flag?
Yes. But I don't see what difference that should make. Even without g
as meaning "global" giving all the matches instead of only the first should still return the ... first. But it returns none.
@knollet I misunderstood you. You meant ‘If PCRE and ERE had equivalent matching, the second line would match in this PCRE example, but it doesn't.’ We can verify your example with grep
as an ERE:
grep -E '(.{50}?.).*' <<<'the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 123456
the quick brown fox jumps over the lazy dog 1234567'
the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 123456
the quick brown fox jumps over the lazy dog 1234567
then PCRE:
grep -P '(.{50}?.).*' <<<'the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 123456
the quick brown fox jumps over the lazy dog 1234567'
the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 1234567
I don't get how that test is accurate, you are not printing the capture group.
I'm unsure where grep(is that a GNU extension?) gets its PCRE *implementation so for those playing along at home YMMV, if you don't get good results you can try this emulator: https://www.tutorialspoint.com/linux_terminal_online.php
$ perl -e'print"$1:$2\n"if("1234" =~ /((.{2}?.).*)$/)'
1234:123
$ perl -e'print"$1:$2\n"if("12" =~ /((.{2}?.).*)$/)'
$
@cheako It doesn't need to print the subgroup, it's a logical conclusion. The PCRE test simply doesn't match the line that's only 50 characters (ending in 6
), but the ERE does. The pattern as a whole is greedy (in both flavours) because of the trailing .*
so can only ever match once per line. Therefore, the ERE matching the line at all means it must be capturing only 1 character in the group, followed by ‘the rest’. This agrees with the findings of your C program.
According to this post, grep
has had -P
support, using libpcre
for about 23 years, so its man page warning about it being "experimental" means it's been a very long experiment. Regardless, Perl is not PCRE, so if you distrust grep
, you can get the same result directly from PCRE using pcregrep
or pcre2grep
:
pcre2grep '(.{50}?.).*' <<<'the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 123456
the quick brown fox jumps over the lazy dog 1234567'
the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 1234567
I also found this article that suggests the {m}?
construct may be undefined behaviour but, much like grep
's "expiremental" -P
, is probably consistent because maintainers of old software don't like breaking compatibility.
According to this post, grep has had -P support, using libpcre for about 23 years, so its man page warning about it being "experimental" means it's been a very long experiment. Regardless, Perl is not PCRE, so if you distrust grep, you can get the same result directly from PCRE using pcregrep or pcre2grep
This is just madness
If this hasn't been added in nine years and it's not open source, the author won't implement it. Unfortunately I haven't found an open source tool as good.
@jxu: if you can transpile it to js, I can work with it. Vänliga hälsningar / Best regards,Firas DibOn 5 Aug 2023, at 04:25, jxu @.***> wrote: If this hasn't been added in nine years and it's not open source, the author won't implement it. Unfortunately I haven't found an open source tool as good.
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>
@firasdib That was done, but honestly at this point it would be better to compile the C into wasm. Either solution would work. Start with the connecting tissue between the POSIX API and the internal representation.
Is this still being worked on? I love regex101, but I am examining the xz backdoor and it would be VERY helpful to be able to parse grep regexes
just another me too comment
If anybody wants to help move this along, it is possible to provide an implementation that the author will be able to pick up.
I think the easiest way to support BRE and ERE is probably to fork my repo, which was used to add Rust to regex101, (https://github.com/tgross35/wasm-regex), and add new entrypoints that make use of the onig
crate (bindings to Oniguruma). Use Syntax::posix_basic()
or Syntax::posix_extended()
for BRE/ERE.
Swapping regex implementations within my repo should be easy since only the engine needs to change, the annoying parts stay the same (UTF8 <-> UTF16 transcoding, UTF16 offset calculation, serialization, interfaces, list replacements, flags, CI, building for WASM, size minimization ...).
The onig
crate also provides a ton of other flavors - emacs, ruby, perl, python, GNU, java. In theory you could probably replace the existing Java and Python implementations if their WASM is heavy 😄.
See also the discussion we had adding Rust https://github.com/firasdib/Regex101/issues/1208
As I said back on https://github.com/firasdib/Regex101/issues/165#issuecomment-332087227 enscripten will compile the original C into javascript... However at this point I's compile the original C into WASM. Either way, it's trivially easy to get the OG code into a format executable by modern browsers. There is no need to assist anyone in doing so, that person just needs to try for once.
If anyone is making an open source alternative, I will help to the extent I can
As implemented by Boost, for example.