buildingSMART / IDS

Computer interpretable (XML) standard to define Information Delivery Specifications for BIM (mainly used for IFC)
https://www.buildingsmart.org/standards/bsi-standards/information-delivery-specification-ids/

Flavour of RegEx allowing negation #349

Open atomczak opened 1 month ago

atomczak commented 1 month ago

Discussed in https://github.com/buildingSMART/IDS/discussions/347

Originally posted by **atomczak** on September 25, 2024:

I want to have a requirement that applies to all walls that don't have the "FireRating" property. In other words, I want an applicability that includes the IFCWALL entity and a property allowing any name/pset except "FireRating" and "Pset_WallCommon". I thought I could do that with a pattern, but I see that the XML (XSD) regex flavour doesn't support negative lookahead. Has anyone found a way to express it?
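To make the request concrete, here is a minimal sketch of what a negative lookahead would allow at the level of name matching only. It uses Python's `re` module, which supports `(?!...)`; the pattern and the helper `name_allowed` are illustrative assumptions, not part of IDS or any IDS library, and the sketch does not model how an IDS checker evaluates facets.

```python
import re

# Hypothetical pattern: accept any name except exactly "FireRating".
# (?!...) is a negative lookahead, available in Python's `re`
# but not in the XSD xs:pattern flavour.
NOT_FIRE_RATING = re.compile(r"(?!FireRating$).*")


def name_allowed(name: str) -> bool:
    """Return True if the whole name matches the pattern.

    fullmatch is used to mimic the whole-value matching of xs:pattern.
    """
    return NOT_FIRE_RATING.fullmatch(name) is not None


names = ["FireRating", "FireRatingNote", "IsExternal"]
allowed = [n for n in names if name_allowed(n)]
print(allowed)  # ['FireRatingNote', 'IsExternal']
```

Note that the lookahead only rejects the exact name; a longer name such as "FireRatingNote" still passes.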

Use Case:

Suppose we have two properties, "ID 1" and "ID 2". The BIM team is required to assign one of the two. It is difficult to write the `ids:applicability` section for such cases, e.g.:

<ids:applicability>
ID 1 property does not exist
</ids:applicability>
<ids:requirements>
ID 2 is mandatory
</ids:requirements>

As a solution, I propose to reconsider agreeing on RegEx flavours other than XSD, such as PCRE or JavaScript / Python. We could also agree on explicit IDS flavour, but this will make implementation harder.

NickNisbet commented 1 month ago

… and this case is a Selection (any one of two or more facets)


atomczak commented 1 month ago

Yes, in RASE terms that would be a Selection. Choosing a RegEx flavour that supports negation would enable Selection/Exception use cases without changing the IDS schema.
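For readers less familiar with the term, the Selection from the "ID 1" / "ID 2" use case can be sketched in plain Python as "at least one of the listed properties is present". The function name and the dictionary representation of an element's properties are hypothetical and only illustrate the intended semantics; they are not an IDS or IFC API.

```python
from typing import Iterable, Mapping


def satisfies_selection(properties: Mapping[str, str], names: Iterable[str]) -> bool:
    """Return True if at least one of the named properties is present."""
    return any(name in properties for name in names)


# Hypothetical elements, represented as property-name -> value mappings.
element_a = {"ID 1": "A-001"}
element_b = {"Description": "no identifier assigned"}

print(satisfies_selection(element_a, ["ID 1", "ID 2"]))  # True
print(satisfies_selection(element_b, ["ID 1", "ID 2"]))  # False
```

If the requirement is meant as "exactly one of the two", the `any(...)` call would be replaced by a count that must equal 1.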

andyward commented 1 month ago

Worth referencing #29 for the origins of the decision to only target xs:pattern regexes, and #177 for the wider Selection/Exclusion topic with regex.

I also found a useful resource comparing the different Regex flavours https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816 (with a handy table in the comments)

It feels like we should be able to find a baseline set of features that goes beyond XSD's limited pattern syntax and is widely enough supported to give a useful trade-off between functionality and breadth of implementer support.

Bearing in mind that almost all tech platforms can access third-party regex engine implementations, the decision to constrain IDS to a small subset just because, say, Golang doesn't support negative lookaheads in its standard library seems limiting.

Given the ubiquity of JavaScript and its standardisation, I'd support basing features on the ECMA feature set. We could always subtract a feature if it did look like a major tech segment would be hindered by its inclusion.

aothms commented 1 month ago

I think the concern here is not so much the features themselves, but that the complex interaction of such features will give rise to differences between implementations. Even if ports of well-known libraries exist, taking @andyward's example of Go and PCRE (which I think is the closest to a de facto standard used as a reference by other implementations), it would mean you're stuck with a package that was last updated 8 years ago and is used 26 times (https://pkg.go.dev/github.com/gijsbers/go-pcre). Maybe I'm unlucky in my example, but I don't think this is the best way forward. I prefer to keep regexes as simple pattern matches and use proper semantic structures where needed.

atomczak commented 1 month ago

For a moment, I thought `(test){0}` would work, but it looks like instead of meaning "there must be no 'test'", it works like "no 'test' is fine".
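The behaviour of `(test){0}` can be checked quickly in Python. This is only a sketch: Python's `re` is not the XSD engine, and `fullmatch` is used here as an approximation of the whole-value matching that xs:pattern applies.

```python
import re

ZERO_TIMES = re.compile(r"(test){0}")

# Whole-value matching: "zero repetitions of the group" is just the empty
# pattern, so only the empty string is accepted.
print(ZERO_TIMES.fullmatch("") is not None)       # True
print(ZERO_TIMES.fullmatch("test") is not None)   # False
print(ZERO_TIMES.fullmatch("other") is not None)  # False

# Unanchored search: the empty match succeeds at position 0 of any input,
# including one that contains "test".
found = ZERO_TIMES.search("test")
print(repr(found.group(0)))  # ''
```

Either way the quantifier never asserts that "test" is absent, so it cannot stand in for a negative lookahead.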

How about instead of choosing one flavour, we select only shared aspects of popular regexes, based on the table shared by @andyward? This way we could make sure those are supported by most languages.

Engines compared: .NET, Java, PCRE, Python, XML

| Category | Feature |
|---|---|
| Characters | Backslash escapes one metacharacter |
| Characters | \n (LF), \r (CR) and \t (tab) |
| Character Classes or Character Sets [abc] | [abc] character class |
| Character Classes or Character Sets [abc] | [^abc] negated character class |
| Character Classes or Character Sets [abc] | [a-z] character class range |
| Character Classes or Character Sets [abc] | Backslash escapes one character class metacharacter |
| Character Classes or Character Sets [abc] | \D, \W and \S shorthand negated character classes |
| Dot | . (dot; any character except line break) |
| Alternation | \| (alternation) |
| Quantifiers | ? (0 or 1) |
| Quantifiers | * (0 or more) |
| Quantifiers | + (1 or more) |
| Quantifiers | {n} (exactly n) |
| Quantifiers | {n,m} (between n and m) |
| Quantifiers | {n,} (n or more) |
| Grouping and Backreferences | (regex) (numbered capturing group) |
| Characters | \x00 through \xFF (ASCII character) |
| Characters | \f (form feed) and \v (vtab) |
| Characters | \a (bell) |
| Character Classes or Character Sets [abc] | [\b] backspace |
| Anchors | ^ (start of string/line) |
| Anchors | $ (end of string/line) |
| Anchors | \A (start of string) |
| Quantifiers | ? after any of the above quantifiers to make it "lazy" |
| Grouping and Backreferences | (?:regex) (non-capturing group) |
| Grouping and Backreferences | \1 through \9 (backreferences) |
| Modifiers | (?i) (case insensitive) |
| Modifiers | (?s) (dot matches newlines) |
| Modifiers | (?m) (^ and $ match at line breaks) |
| Modifiers | (?x) (free-spacing mode) |
| Lookaround | (?=regex) (positive lookahead) |
| Lookaround | (?!regex) (negative lookahead) |
| Free-Spacing Syntax | Free-spacing syntax supported |
| Grouping and Backreferences | \10 through \99 (backreferences) |
| Grouping and Backreferences | Backreferences non-existent groups are an error |
| Grouping and Backreferences | Backreferences to failed groups also fail |
| Free-Spacing Syntax | # starts a comment |

We could also add regex test cases making sure that each implementation interprets regex features the same way.
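As a rough idea of what such shared test cases could look like, here is a minimal Python sketch: a table of (pattern, value, expected result) triples that each implementation would run through its own engine. The cases, the `RegexCase` type and `run_cases` are hypothetical, and the patterns deliberately use only basic features (character classes, quantifiers, alternation, `\D`).

```python
import re
from typing import List, NamedTuple


class RegexCase(NamedTuple):
    pattern: str
    value: str
    should_match: bool


# Hypothetical cases, not taken from the IDS test suite.
CASES = [
    RegexCase(r"[A-Z]{2}[0-9]+", "EI60", True),
    RegexCase(r"[A-Z]{2}[0-9]+", "ei60", False),
    RegexCase(r"REI|EI", "EI", True),
    RegexCase(r"[^0-9]*", "Wall-01", False),
    RegexCase(r"\D+", "Wall", True),
]


def run_cases(cases: List[RegexCase]) -> List[RegexCase]:
    """Return the cases whose actual result differs from the expected one."""
    failures = []
    for case in cases:
        # fullmatch approximates the whole-value matching of xs:pattern.
        matched = re.fullmatch(case.pattern, case.value) is not None
        if matched != case.should_match:
            failures.append(case)
    return failures


print(run_cases(CASES))  # []
```

An implementation in another language would keep the same table and swap only the matching call, so any divergence shows up as a non-empty failure list.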

andyward commented 1 month ago

> I prefer to keep regexes as simple pattern matches and use proper semantic structures where needed.

Totally agree. The old jwz quip about "having a problem ... using regex and now having two problems" comes to mind. But given the design choices in IDS 1.0, patterns are often the only 'trap door' we have available to implement some of the more complex requirements. In particular, the lack of 'exclusions' is a blocker.

But the point about Go and its patchy PCRE regex support kind of backs up my point. In our small niche, I'd wager every single IDS solution out there, whether commercial or open source, is built in one of Java, Python, .NET, PHP or JavaScript. While I know they are all great languages, I'm unaware of any Go, Haskell or indeed Fortran 77 implementations of IFC (which is a prerequisite for IDS model checking), yet we're concerning ourselves with how IDS could be supported in languages that have no penetration in our problem space. I feel like we maybe need to apply a bit of the Pareto Principle?

andyward commented 1 month ago

> How about instead of choosing one flavour, we select only shared aspects of popular regexes, based on the table shared by @andyward? This way we could make sure those are supported by most languages.

Great, I was doing something similar. There's significant commonality amongst those 4 mainstream engines (and seemingly ECMA too).

I agree on the test cases: this would help baseline what I suspect is a lot of different behaviour across implementers. There are probably only 5-6 of those features that are ever going to be used, so that may limit the testing (anchors, lookaround and maybe the modifiers).
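Anchors are a good example of where such test cases would earn their keep. XSD patterns are matched against the whole value, whereas most general-purpose engines search for a match anywhere unless anchored, and even the anchors have engine-specific edge cases. A small Python sketch of the kind of difference a shared test suite would need to pin down (the values are made up for illustration):

```python
import re

# Unanchored search finds a partial match; whole-value matching does not.
print(re.search(r"Rating", "FireRating") is not None)     # True
print(re.fullmatch(r"Rating", "FireRating") is not None)  # False

# In Python, $ also matches just before a trailing newline,
# while \Z and fullmatch require the true end of the string.
value = "FireRating\n"
print(re.search(r"^FireRating$", value) is not None)   # True
print(re.search(r"FireRating\Z", value) is not None)   # False
print(re.fullmatch(r"FireRating", value) is not None)  # False
```

Whether a checker should behave like the first or the last line of each group is exactly the kind of decision the test cases would make explicit.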

atomczak commented 1 month ago

There you go, I added ECMA (and JGsoft), which excludes ten more rows (good!):

Engines compared: .NET, Java, PCRE, Python, JGsoft, ECMA, XML

| Category | Feature |
|---|---|
| Characters | Backslash escapes one metacharacter |
| Characters | \n (LF), \r (CR) and \t (tab) |
| Character Classes/Sets | [abc] character class |
| Character Classes/Sets | [^abc] negated character class |
| Character Classes/Sets | [a-z] character class range |
| Character Classes/Sets | Backslash escapes one character class metacharacter |
| Character Classes/Sets | \D, \W and \S shorthand negated character classes |
| Dot | . (dot; any character except line break) |
| Alternation | \| (alternation) |
| Quantifiers | ? (0 or 1) |
| Quantifiers | * (0 or more) |
| Quantifiers | + (1 or more) |
| Quantifiers | {n} (exactly n) |
| Quantifiers | {n,m} (between n and m) |
| Quantifiers | {n,} (n or more) |
| Grouping and Backreferences | (regex) (numbered capturing group) |
| Characters | \x00 through \xFF (ASCII character) |
| Characters | \f (form feed) and \v (vtab) |
| Character Classes/Sets | [\b] backspace |
| Anchors | ^ (start of string/line) |
| Anchors | $ (end of string/line) |
| Quantifiers | ? after any of the above quantifiers to make it "lazy" |
| Grouping and Backreferences | (?:regex) (non-capturing group) |
| Grouping and Backreferences | \1 through \9 (backreferences) |
| Lookaround | (?=regex) (positive lookahead) |
| Lookaround | (?!regex) (negative lookahead) |
| Grouping and Backreferences | \10 through \99 (backreferences) |
| Characters | \a (bell) |
| Anchors | \A (start of string) |
| Modifiers | (?i) (case insensitive) |
| Modifiers | (?s) (dot matches newlines) |
| Modifiers | (?m) (^ and $ match at line breaks) |
| Modifiers | (?x) (free-spacing mode) |
| Free-Spacing Syntax | Free-spacing syntax supported |
| Grouping and Backreferences | Backreferences non-existent groups are an error |
| Grouping and Backreferences | Backreferences to failed groups also fail |
| Free-Spacing Syntax | # starts a comment |