Open atomczak opened 1 month ago
… and this case is a Selection (any one of two or more facets)
Yes, in RASE terms that would be a Selection. Choosing a regex flavour that supports negation would enable Selection/Exception use cases without changing the IDS schema.
Worth referencing #29 for the origins of the decision to only target xs:pattern regexes, and #177 for the wider Selection/Exclusion topic with regex.
I also found a useful resource comparing the different Regex flavours https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816 (with a handy table in the comments)
It feels like we should be able to find a baseline set of features that's above xsd's limited pattern that is widely enough supported to give a useful trade-off between functionality and breadth of implementor support.
Bearing in mind that almost all tech platforms can access third-party regex engine implementations, the decision to constrain IDS to a small subset just because, say, Golang doesn't support negative lookaheads in its standard library seems limiting.
Given the ubiquity of JavaScript and its standardisation, I'd support basing features on the ECMA feature set. We could always subtract a feature if it did look like a major tech segment would be hindered by its inclusion.
I think the concern here is not so much the features themselves as the differences between implementations that arise from the complex interaction of such features. Even if ports of well-known libraries exist, taking @andyward's example of Go and PCRE (which I think is the closest to a de facto standard, used as a reference in other implementations), it would mean you're stuck with a package updated 8 years ago and used 26 times (https://pkg.go.dev/github.com/gijsbers/go-pcre). Maybe I'm unlucky in my example, but I don't think this is the best way forward. I prefer to keep regexes as simple pattern matches and use proper semantic structures where needed.
For a moment, I thought `(test){0}` would work, but it looks like instead of meaning "there can't be any 'test'", it works like "no 'test' is fine".
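To make the `(test){0}` behaviour concrete, here is a minimal sketch using Python's `re` module (an assumption for illustration; other engines in this thread should behave the same way for these constructs). The helper name `excludes_test` is hypothetical. It contrasts the zero-repetition group with a negative lookahead, which is the kind of feature that would actually express an exclusion.

```python
import re

# `(test){0}` repeats the group zero times, so it only ever matches the
# empty string. With search semantics that succeeds on every input,
# including inputs that contain "test".
zero_times = re.compile(r"(test){0}")
hit = zero_times.search("a test value")
assert hit is not None and hit.group(0) == ""

# With whole-string semantics (as xs:pattern uses), it accepts only "".
assert zero_times.fullmatch("") is not None
assert zero_times.fullmatch("test") is None

# A negative lookahead rejects any value that contains "test".
no_test = re.compile(r"(?!.*test).*")


def excludes_test(value: str) -> bool:
    """Return True when the whole value matches and contains no 'test'."""
    return no_test.fullmatch(value) is not None
```

Note that the lookahead version is not available in xs:pattern, which is what the rest of this thread is about.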
How about instead of choosing one flavour, we select only shared aspects of popular regexes, based on the table shared by @andyward? This way we could make sure those are supported by most languages.
Category | Feature | .NET | Java | PCRE | Python | XML |
---|---|---|---|---|---|---|
Characters | Backslash escapes one metacharacter | ✅ | ✅ | ✅ | ✅ | ✅ |
Characters | \n (LF), \r (CR) and \t (tab) | ✅ | ✅ | ✅ | ✅ | ✅ |
Character Classes or Character Sets [abc] | [abc] character class | ✅ | ✅ | ✅ | ✅ | ✅ |
Character Classes or Character Sets [abc] | [^abc] negated character class | ✅ | ✅ | ✅ | ✅ | ✅ |
Character Classes or Character Sets [abc] | [a-z] character class range | ✅ | ✅ | ✅ | ✅ | ✅ |
Character Classes or Character Sets [abc] | Backslash escapes one character class metacharacter | ✅ | ✅ | ✅ | ✅ | ✅ |
Character Classes or Character Sets [abc] | \D, \W and \S shorthand negated character classes | ✅ | ✅ | ✅ | ✅ | ✅ |
Dot | . (dot; any character except line break) | ✅ | ✅ | ✅ | ✅ | ✅ |
Alternation | \| (alternation) | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | ? (0 or 1) | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | * (0 or more) | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | + (1 or more) | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | {n} (exactly n) | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | {n,m} (between n and m) | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | {n,} (n or more) | ✅ | ✅ | ✅ | ✅ | ✅ |
Grouping and Backreferences | (regex) (numbered capturing group) | ✅ | ✅ | ✅ | ✅ | ✅ |
Characters | \x00 through \xFF (ASCII character) | ✅ | ✅ | ✅ | ✅ | ❌ |
Characters | \f (form feed) and \v (vtab) | ✅ | ✅ | ✅ | ✅ | ❌ |
Characters | \a (bell) | ✅ | ✅ | ✅ | ✅ | ❌ |
Character Classes or Character Sets [abc] | [\b] backspace | ✅ | ✅ | ✅ | ✅ | ❌ |
Anchors | ^ (start of string/line) | ✅ | ✅ | ✅ | ✅ | ❌ |
Anchors | $ (end of string/line) | ✅ | ✅ | ✅ | ✅ | ❌ |
Anchors | \A (start of string) | ✅ | ✅ | ✅ | ✅ | ❌ |
Quantifiers | ? after any of the above quantifiers to make it "lazy" | ✅ | ✅ | ✅ | ✅ | ❌ |
Grouping and Backreferences | (?:regex) (non-capturing group) | ✅ | ✅ | ✅ | ✅ | ❌ |
Grouping and Backreferences | \1 through \9 (backreferences) | ✅ | ✅ | ✅ | ✅ | ❌ |
Modifiers | (?i) (case insensitive) | ✅ | ✅ | ✅ | ✅ | ❌ |
Modifiers | (?s) (dot matches newlines) | ✅ | ✅ | ✅ | ✅ | ❌ |
Modifiers | (?m) (^ and $ match at line breaks) | ✅ | ✅ | ✅ | ✅ | ❌ |
Modifiers | (?x) (free-spacing mode) | ✅ | ✅ | ✅ | ✅ | ❌ |
Lookaround | (?=regex) (positive lookahead) | ✅ | ✅ | ✅ | ✅ | ❌ |
Lookaround | (?!regex) (negative lookahead) | ✅ | ✅ | ✅ | ✅ | ❌ |
Free-Spacing Syntax | Free-spacing syntax supported | ✅ | ✅ | ✅ | ✅ | ❌ |
Grouping and Backreferences | \10 through \99 (backreferences) | ✅ | ✅ | ✅ | ✅ | ❌ |
Grouping and Backreferences | Backreferences to non-existent groups are an error | ✅ | ✅ | ✅ | ✅ | ❌ |
Grouping and Backreferences | Backreferences to failed groups also fail | ✅ | ✅ | ✅ | ✅ | ❌ |
Free-Spacing Syntax | # starts a comment | ✅ | ✅ | ✅ | ✅ | ❌ |
We could also add regex test cases making sure that each implementation interprets regex features the same way.
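As a rough idea of what such shared test cases could look like, here is a small sketch in Python. The patterns, values and expected results are made up for illustration and are not taken from the IDS test suite; the assumption is that every pattern must match the whole value, as xs:pattern does. Each implementation would run the same table through its own engine and report any disagreement.

```python
import re

# Hypothetical shared test vectors:
# (pattern, value, expected result of a whole-string match).
CASES = [
    (r"[A-Z]{2}[0-9]+", "EW01", True),
    (r"[A-Z]{2}[0-9]+", "ew01", False),
    (r"(?:in|ex)ternal", "external", True),   # non-capturing group
    (r"[0-9]+?", "42", True),                 # lazy quantifier
    (r"(?!.*temp).*", "Wall-01", True),       # negative lookahead
    (r"(?!.*temp).*", "Wall-temp", False),
]


def run_cases(cases):
    """Return the cases whose observed result differs from the expected one."""
    failures = []
    for pattern, value, expected in cases:
        observed = re.fullmatch(pattern, value) is not None
        if observed != expected:
            failures.append((pattern, value, expected, observed))
    return failures
```

Keeping the vectors as plain data means the same list could be exported (for example as JSON) and replayed against the .NET, Java or JavaScript engines without rewriting the cases.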
> I prefer to keep regexes as simple pattern matches and use proper semantic structures where needed.

Totally agree. The old jwz quip about "having a problem ... using regex and now having two problems" comes to mind. But given the design choices in IDS 1.0, patterns are often the only 'trap door' we have available to implement some of the more complex requirements. In particular the lack of 'exclusions' is a blocker.
But the point about Go and its patchy PCRE regex support kind of backs up my point. In our small niche, I'd wager every single IDS solution out there, whether commercial or open source is built in one of Java, Python, .NET, PHP or JavaScript. While I know they are all great languages, I'm unaware of any Go, Haskell or indeed Fortran 77 implementations of IFC (which is a pre-requisite for IDS model checking) - but we're concerning ourselves with how IDS could be supported in languages that have no penetration in our problem space. I feel like we maybe need to apply a bit of Pareto Principle?
> How about instead of choosing one flavour, we select only shared aspects of popular regexes, based on the table shared by @andyward? This way we could make sure those are supported by most languages.

Great, I was doing something similar. There's significant commonality amongst those 4 mainstream engines (and seemingly ECMA too).
I agree on the test cases - this would help baseline what I suspect is a lot of different behaviour across implementors. There are probably only 5-6 of those features that are ever going to be used, so that may limit the testing (anchors, lookaround and maybe the modifiers).
There you go, I added ECMA (and JGsoft), which excludes ten more rows (good!):
Category | Feature | .NET | Java | PCRE | Python | JGsoft | ECMA | XML |
---|---|---|---|---|---|---|---|---|
Characters | Backslash escapes one metacharacter | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Characters | \n (LF), \r (CR) and \t (tab) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Character Classes/Sets | [abc] character class | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Character Classes/Sets | [^abc] negated character class | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Character Classes/Sets | [a-z] character class range | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Character Classes/Sets | Backslash escapes one character class metacharacter | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Character Classes/Sets | \D, \W and \S shorthand negated character classes | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Dot | . (dot; any character except line break) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Alternation | \| (alternation) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | ? (0 or 1) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | * (0 or more) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | + (1 or more) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | {n} (exactly n) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | {n,m} (between n and m) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Quantifiers | {n,} (n or more) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Grouping and Backreferences | (regex) (numbered capturing group) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Characters | \x00 through \xFF (ASCII character) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Characters | \f (form feed) and \v (vtab) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Character Classes/Sets | [\b] backspace | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Anchors | ^ (start of string/line) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Anchors | $ (end of string/line) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Quantifiers | ? after any of the above quantifiers to make it "lazy" | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Grouping and Backreferences | (?:regex) (non-capturing group) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Grouping and Backreferences | \1 through \9 (backreferences) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Lookaround | (?=regex) (positive lookahead) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Lookaround | (?!regex) (negative lookahead) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Grouping and Backreferences | \10 through \99 (backreferences) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
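The "shared aspects only" selection can also be expressed as a small filter over the support matrix. The sketch below is only an illustration: it copies four rows from the table above, and the names `row`, `MATRIX` and `shared_features` are hypothetical. It shows how the resulting baseline depends on which engines are required to agree.

```python
# Column order matches the table above.
ENGINES = [".NET", "Java", "PCRE", "Python", "JGsoft", "ECMA", "XML"]


def row(*flags):
    """Build an engine -> supported mapping from flags in column order."""
    return dict(zip(ENGINES, flags))


# Excerpt of the matrix (four rows copied from the table).
MATRIX = {
    "[^abc] negated character class": row(True, True, True, True, True, True, True),
    "{n,m} (between n and m)": row(True, True, True, True, True, True, True),
    "^ (start of string/line)": row(True, True, True, True, True, True, False),
    "(?!regex) (negative lookahead)": row(True, True, True, True, True, True, False),
}


def shared_features(matrix, engines):
    """Return the features supported by every engine in `engines`."""
    return [
        feature
        for feature, support in matrix.items()
        if all(support[engine] for engine in engines)
    ]


# Requiring XML support keeps only the xs:pattern-compatible rows;
# dropping that requirement admits anchors and lookahead as well.
with_xml = shared_features(MATRIX, ENGINES)
without_xml = shared_features(MATRIX, ENGINES[:-1])
```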
Discussed in https://github.com/buildingSMART/IDS/discussions/347
Use Case: