XSD regular expression flavor

buildingSMART / IDS

Computer interpretable (XML) standard to define Information Delivery Specifications for BIM (mainly used for IFC)

XSD regular expression flavor #255

Open atomczak opened 1 month ago

atomczak commented 1 month ago

https://github.com/buildingSMART/IDS/blob/6d71cdf3547a0383c6cfbbead81a7cef7521ac3a/Development/IDS_oma.ids#L158

I found this in the sample files, and it doesn't look like a correct regular expression in the XSD pattern flavor. By default, every XSD pattern is matched against the whole value, so the ^...$ anchors are not needed (or even supported).
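To illustrate the implicit anchoring, here is a minimal sketch in Python. Python's `re` module is not the XSD flavor, so this is only an approximation: `re.fullmatch` stands in for the whole-value behavior of an XSD pattern, and the helper name `matches_whole_value` is made up for this example.

```python
import re


def matches_whole_value(pattern: str, value: str) -> bool:
    """Approximate XSD's implicit anchoring: the pattern must cover the entire value."""
    return re.fullmatch(pattern, value) is not None


# re.search looks for the pattern anywhere in the string,
# while fullmatch requires the whole string to match.
print(re.search(r"[0-9]+", "abc123") is not None)  # substring search
print(matches_whole_value(r"[0-9]+", "abc123"))    # whole-value match
print(matches_whole_value(r"[0-9]+", "123"))       # whole-value match
```

With whole-value matching there is nothing left for explicit anchors to do, which is why an XSD pattern is written without them.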

I'm not sure about the shorthand \d. I think it is supported by XSD and matches all Unicode digits: 0-9¹¾六௰Ⅹ೬Дに... but it would be good if someone could confirm.

CBenghi commented 1 month ago

From memory, I think I removed that regex in the current Development branch because it conflicted with the datatype. Anyway:

https://github.com/buildingSMART/IDS/blob/6d71cdf3547a0383c6cfbbead81a7cef7521ac3a/Development/IDS_oma.ids#L149-L161

My view is that IFCLENGTHMEASURE requires xs:double in the base type, which in turn disallows the pattern node.

Your point is of course still valid with respect to the need for documentation of the regex flavour. My hope is to enforce it appropriately via the audit tool.

janbrouwer commented 1 month ago

I think I made that regex as an experiment to see whether it is possible to validate a positivelengthmeasure. I believe the regex validation site mentioned in the IDS docs accepted it, but there are probably better ways to do this.

gverduci commented 1 month ago

> I'm not sure about the shorthand \d. I think it is supported by XSD and matches all Unicode digits: 0-9¹¾六௰Ⅹ೬Дに... but it would be good if someone could confirm.

@atomczak I think the shorthand \d is valid: this link shows all supported multi-character escapes:

https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/datatypes.html#cces-mce

and it matches only \p{Nd} (the Unicode General_Category value Nd, "Decimal_Number"; see https://www.unicode.org/reports/tr18/#General_Category_Property).

Using the unicode database it is possible to find all characters in this set:

https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

aothms commented 1 month ago

Great suggestions @gverduci, this indeed confirms @atomczak's suspicion:

$ grep ';Nd;' UnicodeData.txt | cut -d\; -f1 | xargs -I{} printf \\U000{} 2> /dev/null
𐒠𐒡𐒢𐒣𐒤𐒥𐒦𐒧𐒨𐒩𐴰𐴱𐴲𐴳𐴴𐴵𐴶𐴷𐴸𐴹𑁦𑁧𑁨𑁩𑁪𑁫𑁬𑁭𑁮𑁯𑃰𑃱𑃲𑃳𑃴𑃵𑃶𑃷𑃸𑃹𑄶𑄷𑄸𑄹𑄺𑄻𑄼𑄽𑄾𑄿𑇐𑇑𑇒𑇓𑇔𑇕𑇖𑇗𑇘𑇙𑋰𑋱𑋲𑋳𑋴𑋵𑋶𑋷𑋸𑋹...

(these are just a couple of them, I couldn't quickly figure out how to generically get the hex formatted code points to printable characters)
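One way to print the whole set without shell escaping is to use Python's bundled Unicode database instead of UnicodeData.txt. This is a sketch, not part of the IDS tooling, and the result depends on the Unicode version shipped with the Python interpreter in use.

```python
import sys
import unicodedata

# Collect every code point whose General_Category is Nd (decimal digit).
nd_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Nd"
]

print(unicodedata.unidata_version)  # Unicode version bundled with this Python
print(len(nd_chars))                # count depends on that version
print("".join(nd_chars[:30]))       # ASCII 0-9, then Arabic-Indic digits, and so on
```

Characters such as ¹ and ¾ (category No) or Ⅹ (category Nl) do not appear in this list, so \d does not match them.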

atomczak commented 1 month ago

Thanks all, I mainly wanted to make sure I wasn't mistaken. And yes, this example has already been removed from the latest Dev branch.

> My hope is to enforce it appropriately via the audit tool.

I see a potential problem with auditing regexes: ^ABC$ is not an invalid pattern, but it matches only the literal string that starts with a caret and ends with a dollar sign, whereas the user probably just wanted to allow the value 'ABC'. So it is not an error, but it deserves a soft warning :)
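Such a soft warning could be sketched roughly as follows. The function name `anchor_warnings` and the message texts are hypothetical and are not taken from the actual audit tool. The check is a simple heuristic that only looks at the first and last character of the pattern.

```python
def anchor_warnings(pattern: str) -> list[str]:
    """Return soft warnings for ^ / $ that look like intended anchors (hypothetical check)."""
    warnings = []
    if pattern.startswith("^"):
        warnings.append("leading '^' is matched literally in XSD patterns")
    # A trailing '$' preceded by a backslash was probably written deliberately, so skip it.
    if pattern.endswith("$") and not pattern.endswith("\\$"):
        warnings.append("trailing '$' is matched literally in XSD patterns")
    return warnings


for candidate in ["ABC", "^ABC$", "^ABC"]:
    print(candidate, anchor_warnings(candidate))
```

Returning a list of messages rather than raising an error keeps the pattern valid while still flagging the likely mistake to the user.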

> Using the unicode database it is possible to find all characters in this set

Thanks! If I read this right, \d in XSD matches hundreds of digit characters (everything in \p{Nd}), not just the ten ASCII ones. While this is fine for most cases, [0-9] serves my purpose better, as I only want those 10.
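The difference can be seen with a quick check in Python, whose `re` module also treats \d as "any Unicode decimal digit" for text patterns. This is an approximation of the XSD behavior, not an XSD validator, and the constant name below is only illustrative.

```python
import re

ARABIC_INDIC_THREE = "\u0663"  # '٣', General_Category Nd

print(re.fullmatch(r"\d+", "42") is not None)                   # ASCII digits match \d
print(re.fullmatch(r"\d+", ARABIC_INDIC_THREE) is not None)     # \d accepts a non-ASCII digit
print(re.fullmatch(r"[0-9]+", ARABIC_INDIC_THREE) is not None)  # [0-9] rejects it
```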