admin-shell-io / aas-specs

Repository of the Asset Administration Shell Specification IDTA-01001 - Metamodel
https://admin-shell-io.github.io/aas-specs-antora/index/home/index.html
Creative Commons Attribution 4.0 International

No UTF32 Characters in the Regex for Strings #362

Closed sebbader-sap closed 2 months ago

sebbader-sap commented 7 months ago

Describe the bug

The regex pattern in the JSON Schema has only UTF-16 characters, while constraint AASd-130 demands the following: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$

The JSON Schema, however, uses a UTF-16 pattern, e.g., at https://github.com/admin-shell-io/aas-specs/blob/2ab08f92bdd1d44edc1cfee52552fe5429d2178e/schemas/json/aas.json#L44C22-L44C36:

      "pattern": "^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

Additional context

Needs to be adopted in the SwaggerHub Domains for Part 1 and Part 2.

mristin commented 7 months ago

@sebbader-sap we transpiled the patterns into UTF-16 since most JSON schema engines we tested operated on UTF-16 and could not handle UTF-32.

It is a trade-off between correctness and practicality -- if we put UTF-32 in JSON schema (e.g., the pattern you mentioned: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$), many JSON schema validators will fail. I suppose any schema validator written as a library for C# or Java will be in that group.

You can test it online. The first answer on Google for "JSON schema validator" for me is https://www.jsonschemavalidator.net/. This validator does support UTF-32:

{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "AssetAdministrationShellEnvironment",
  "type": "string",
  "pattern": "^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$"
}

The following value passes:

"test"

This is not a big change in aas-core-codegen, so whatever the decision, it shouldn't be hard to fix.

mristin commented 7 months ago

Please consider also the SHACL -- I think the same issue appears there as well.

sebbader-sap commented 7 months ago

The original pattern from the constraint has the same problems with OpenAPI-based validators, as they usually translate the YAML into JSON Schema and then use the same JSON Schema validation libraries, with the same UTF-32 problems.

I am uncertain how to proceed now.

Requirement 1: I want to transform aas.json into the Part 1 Domain.
Requirement 2: Attributes from Part 2 with the same meaning as Part 1 attributes shall have exactly the same regex pattern.
Requirement 3: Data conforming to Constraint AASd-130 should survive a validation.
Requirement 4: Widely used libraries should accept the schema / OpenAPI files.

Requirements 3 and 4 conflict with each other. However, for Java-based servers, I need to adjust the pattern from Part 2 anyway, so I would rather see it 100% correct in the OpenAPI files and make the implementation-specific adjustments on top of it.

This then means that the OpenAPI file I actually use is no longer string-equal to the one published by IDTA...

sebbader-sap commented 7 months ago

@BirgitBoss I think we need a formal decision for all parts. Either way, the Part 2 Domain must go the same way as the Part 1 Domain & the schemas.

g1zzm0 commented 6 months ago

Because there was some clarification needed in the taskforce:

^: Asserts the start of the string.

[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]: Defines a character class that allows various Unicode characters.

  - \x09: ASCII horizontal tab.
  - \x0A: ASCII line feed (newline).
  - \x0D: ASCII carriage return.
  - \x20-\uD7FF: The range from the ASCII space up to U+D7FF, the last code point before the surrogate block (U+D800 to U+DFFF).
  - \uE000-\uFFFD: The range from U+E000, the first code point after the surrogate block, up to U+FFFD; the noncharacters U+FFFE and U+FFFF are excluded.
  - \u00010000-\u0010FFFF: The code points of the supplementary planes beyond the BMP, U+10000 to U+10FFFF. In UTF-16, these are the code points that are encoded as surrogate pairs.

*: Allows for zero or more occurrences of the characters within the character class.

$: Asserts the end of the string.
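
To illustrate the breakdown above, the character class can also be written as explicit code point ranges. The following is a minimal Python sketch; the names ALLOWED_RANGES and is_allowed_by_aasd_130 are made up for this illustration and do not appear in the specification.

```python
# Code point ranges of the AASd-130 character class (inclusive bounds).
ALLOWED_RANGES = [
    (0x09, 0x09),         # horizontal tab
    (0x0A, 0x0A),         # line feed
    (0x0D, 0x0D),         # carriage return
    (0x20, 0xD7FF),       # space up to the last code point before the surrogates
    (0xE000, 0xFFFD),     # after the surrogates, without U+FFFE and U+FFFF
    (0x10000, 0x10FFFF),  # supplementary planes
]


def is_allowed_by_aasd_130(text: str) -> bool:
    """Return True if every code point of text lies in one of the ranges."""
    return all(
        any(low <= ord(char) <= high for low, high in ALLOWED_RANGES)
        for char in text
    )


if __name__ == "__main__":
    samples = ["plain text", "tab\there", "\x00", "\x0b", "\ufffe", "\U0001F600"]
    for sample in samples:
        print(repr(sample), is_allowed_by_aasd_130(sample))
```

Working on code points rather than on a regex sidesteps the question of how a particular engine spells escapes above U+FFFF.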

g1zzm0 commented 6 months ago

Maybe we should change the Constraint AASd-130 in the following way:

An attribute with data type "string" shall consist of these characters only:

  - The string contains only valid Unicode characters encoded in UTF-16 format.
  - The string can include common characters like tabs, newlines, carriage returns, and spaces.
  - It allows a broad range of Unicode characters, including those beyond the Basic Multilingual Plane (BMP), which are represented using surrogate pairs in UTF-16 encoding.
  - It ensures that the entire string adheres to the rules of UTF-16 encoding, which is a standard way of representing a wide range of characters from different languages.

mristin commented 6 months ago

I think that the most important bit is missing: this constraint is required, so that the text can be represented in XML.

For example, you cannot represent \x00 in XML text, not even with a character reference such as &#x0;.
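
A small Python sketch of this point, using the standard library's xml.etree.ElementTree (which is not mentioned in this thread and is only used here for illustration): a character reference to U+0000 is rejected by the parser, while a reference to the tab character is accepted. The helper name parses_as_xml is made up.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(document: str) -> bool:
    """Return True if the document is accepted by the XML parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # The tab character is part of the AASd-130 character class.
    print(parses_as_xml("<value>&#x9;</value>"))
    # U+0000 is outside the character class and outside what XML 1.0 allows.
    print(parses_as_xml("<value>&#x0;</value>"))
```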

g1zzm0 commented 6 months ago

https://stackoverflow.com/questions/496321/utf-8-utf-16-and-utf-32

mristin commented 6 months ago

Here is a short example as illustration.

The smiley "😀" is represented as character code 128512. This code is encoded in UTF-32 as "\U0001F600". In UTF-16, this is encoded as "\ud83d\ude00".

Hence, if you want to match this smiley with a regex engine that uses UTF-16, you have to write the pattern "\ud83d\ude00" even though it is a single character. If your regex engine operates on UTF-32, you can simply write "\U0001F600".
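
The numbers in this example can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name utf16_code_units is made up.

```python
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


smiley = "\U0001F600"

if __name__ == "__main__":
    # The code point as a decimal number and in hexadecimal.
    print(ord(smiley), hex(ord(smiley)))
    # The two surrogates that a UTF-16 engine sees instead of one character.
    print([hex(unit) for unit in utf16_code_units(smiley)])
```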

mristin commented 6 months ago

For example, this schema works on an on-line JSON schema tester:

{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "AssetAdministrationShellEnvironment",
  "type": "string",
  "pattern": "^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"
}

This matches "😀" on https://www.jsonschemavalidator.net/.

The smiley does not pass with:

{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "AssetAdministrationShellEnvironment",
  "type": "string",
  "pattern": "^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$"
}

sebbader-sap commented 6 months ago

@g1zzm0 wrote:

Maybe we should change the Constraint AASd-130 in the following way:

Update from the latest state of Part 1 V3.1.0: Description for AASd-130 is already extended: https://github.com/admin-shell-io/aas-specs/blob/c9d6c3beb85ee13680ac543f602fc1e96fc57f9c/documentation/IDTA-01001/modules/ROOT/pages/Spec/IDTA-01001_Metamodel_Constraints.adoc?plain=1#L73

sebbader-sap commented 6 months ago

Proposal from a meeting of us (@mristin, @g1zzm0, and myself):

  1. AASd-130 shall be described using ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$ --> no further change needed.
  2. This pattern is mapped to ^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$ for the schemas. In particular, this applies to the JSON Schema (which is already the case), OpenAPI Part 1 (also already the case, as Part 1 is a translation of the JSON Schema), and OpenAPI Part 2. There is no need to introduce the pattern into the XML Schema, as XML is the reason why AASd-130 is needed in the first place.
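
The mapping in item 2 can be tried out by emulating an engine that works on UTF-16 code units. The following Python sketch rests on two assumptions: the pattern is rewritten with regex escapes throughout (intended to be equivalent to the schema pattern after JSON decoding), and each UTF-16 code unit is represented as one Python character. The names UTF16_PATTERN, as_utf16_code_units, and matches_on_utf16 are made up for this illustration.

```python
import re

# UTF-16 variant of the pattern: BMP characters without surrogates, or a
# high surrogate followed by a low surrogate.
UTF16_PATTERN = re.compile(
    r"^([\t\n\r -\ud7ff\ue000-\ufffd]"
    r"|\ud800[\udc00-\udfff]"
    r"|[\ud801-\udbfe][\udc00-\udfff]"
    r"|\udbff[\udc00-\udfff])*$"
)


def as_utf16_code_units(text: str) -> str:
    """Re-express text as one Python character per UTF-16 code unit."""
    data = text.encode("utf-16-le", "surrogatepass")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "little"))
        for i in range(0, len(data), 2)
    )


def matches_on_utf16(text: str) -> bool:
    """Apply the UTF-16 pattern to the code units of text."""
    return UTF16_PATTERN.match(as_utf16_code_units(text)) is not None


if __name__ == "__main__":
    for sample in ["test", "\U0001F600", "\U00010000", "\U0010FFFF", "\x00"]:
        print(repr(sample), matches_on_utf16(sample))
```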

Therefore, the following activities are needed:

sebbader-sap commented 6 months ago

Side effect: Depending on the implementation technology, developers must replace the pattern with the regex variant that matches their technology.

Example: ^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$ is the pattern in the official JSON schema.

  1. If the server is implemented in Python, whose regex engine is already UTF-32 capable, the needed pattern is ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$ (the original one from AASd-130).
  2. If the server is coded in Java, the JSON Schema works out of the box.
  3. If it's in C#, yet another pattern is needed: "^[\\u{9}\\u{a}\\u{d}\\u{20}-\\u{d7ff}\\u{e000}-\\u{fffd}\\u{10000}-\\u{10ffff}]*$"
  4. TypeScript: "^[\t\n\r -\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]*$"
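
As a sketch for the Python case, assuming CPython's built-in re module: its \u escape reads exactly four hex digits, so the last range of the constraint text is not read as U+10000 to U+10FFFF, and code points above U+FFFF have to be spelled with the eight-digit \U escape instead. The names CONSTRAINT_TEXT, PYTHON_PATTERN, and compiles are made up for this illustration.

```python
import re

# The pattern as written in AASd-130. With four-digit \u escapes, "\u00010000"
# is parsed by re as U+0001 followed by the text "0000".
CONSTRAINT_TEXT = r"^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$"

# Variant for Python: code points above U+FFFF use the eight-digit \U escape.
PYTHON_PATTERN = r"^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*$"


def compiles(pattern: str) -> bool:
    """Return True if re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    print("constraint text compiles:", compiles(CONSTRAINT_TEXT))
    print("Python variant compiles:", compiles(PYTHON_PATTERN))
    print("smiley matches:", re.match(PYTHON_PATTERN, "\U0001F600") is not None)
```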

BirgitBoss commented 6 months ago

Decision Proposal TF Metamodel AAS 2024-03-27

Change the formulation of Constraint AASd-130 from

Constraint AASd-130: an attribute with data type "string" shall consist of these characters only: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$.

to

Constraint AASd-130: an attribute with data type "string" shall be restricted to the characters as defined in XML Schema 1.0, i.e. the string shall consist of these characters only: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$.

Constraint AASd-130 ensures that encoding and interoperability between different serializations is possible. See https://www.w3.org/TR/xml/#charsets for more information on XML Schema 1.0 string handling.

@g1zzm0 : please check

BirgitBoss commented 5 months ago

Proposal from a meeting of us (@mristin, @g1zzm0, and myself): [...]

2. This pattern is mapped to `^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$` 

[...]

This representation causes problems in the Swagger rendering.

How about this RegEx (note \\ instead of \ before the first ud7ff and before ue000 and ufffd at the beginning): "^([\\t\\n\\r-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

@mristin could you please check whether the regex we are using is really OK? Thank you!

mristin commented 5 months ago

@BirgitBoss wrote:

How about this RegEx (note \\ instead of \ before the first ud7ff and before ue000 and ufffd at the beginning): "^([\\t\\n\\r-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

There is no single standard syntax for regular expressions. It all depends on the engine that you plan to use and support.

It is best to fix the engine and test against it, and then also to document somewhere why you picked that engine and not another one.

Whatever engine you pick, the particular syntax will be incompatible with some other engine.

sebbader-sap commented 5 months ago

But independent of the engine, having for some unicode characters one backslash (e.g. "\ud7ff") but for the others two (e.g. "\\ud800") in the same pattern seems pretty strange.

mristin commented 5 months ago

But independent of the engine, having for some unicode characters one backslash (e.g. "\ud7ff") but for the others two (e.g. "\\ud800") in the same pattern seems pretty strange.

Ah, I haven't even noticed that -- yes, it should be consistent.

BirgitBoss commented 5 months ago

Thank you @sebbader-sap and @mristin for reviewing, so we change to "^([\\t\\n\\r-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

to be consistent; this form is also supported by Swagger.
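
The difference between one and two backslashes lies in which layer decodes the escape: with one backslash the JSON parser produces the character itself, with two backslashes the escape sequence is passed on as text to the regex engine. A minimal Python sketch, assuming an engine that understands \uXXXX escapes itself (as Python's re does); the shortened pattern and the variable names are made up for this illustration.

```python
import json
import re

# One backslash in the JSON text: the JSON parser itself decodes the escape.
decoded_by_json = json.loads(r'"^[\ue000-\ufffd]*$"')

# Two backslashes in the JSON text: the JSON parser yields a literal backslash,
# and the escape is left for the regex engine to interpret.
decoded_by_regex = json.loads(r'"^[\\ue000-\\ufffd]*$"')

if __name__ == "__main__":
    # The two decoded strings differ in length ...
    print(len(decoded_by_json), len(decoded_by_regex))
    # ... but describe the same character class to this engine.
    for pattern in (decoded_by_json, decoded_by_regex):
        print(re.match(pattern, "\ue123") is not None, re.match(pattern, "a") is not None)
```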

BirgitBoss commented 3 months ago

Workstream AAS Specs accepted

BirgitBoss commented 2 months ago

solved in IDTA-01001-3-0-1: https://industrialdigitaltwin.org/wp-content/uploads/2024/06/IDTA-01001-3-0-1_SpecificationAssetAdministrationShell_Part1_Metamodel.pdf