accordproject / concerto

Business schema language and runtime
https://concerto.accordproject.org
Apache License 2.0
110 stars 100 forks source link

Document / Clarify naming rules for identifiers #883

Open dselman opened 1 month ago

dselman commented 1 month ago

Bug Report 🐛

.cto (o String \u0032\u005C3D) --parse--> AST ("2\3D") --serialize--> JSON ("2\3D") -serialize--> Concerto (\u0032\u005C3D)

The regex should apply to the name after parsing (i.e. 2\3D in this example), because that's the true name, and that's what's stored in the Concerto object for the AST. The name/AST can be rendered into JSON (using JSON escaping rules) or rendered into Concerto (using Concerto escaping rules), but the escaping is not part of the value. It's part of the serialization format (JSON, Concerto, whatever).

Expected Behavior

Unclear. We need to tighten / document the expectations for round tripping from CTO to JSON of escaped identifiers and ensure that the CTO parser grammar is in sync with the AST regex, and code that checks for valid identifiers.

Current Behavior

Possible Solution

Steps to Reproduce

1. 2. 3. 4.

Context (Environment)

Desktop

Detailed Description

Possible Implementation

DS-AdamMilazzo commented 1 month ago

The regular expression in the metamodel expresses the rules for a valid identifier in the Concerto (.cto) language, including the escaping rules allowed in the Concerto (.cto) language, but it shouldn't do that.

For example, you can define an enum in Concerto like:

enum Dimensions {
  o \u0031D // 1D
  o \u0032D // 2D
  o \u0033D // 3D
}

This is legal, and it gives you enum values that look like this in the AST (when saved as JSON):

[
  {"$class": "concerto.metamodel@1.0.0.EnumProperty", "name": "1D"},
  {"$class": "concerto.metamodel@1.0.0.EnumProperty", "name": "2D"},
  {"$class": "concerto.metamodel@1.0.0.EnumProperty", "name": "3D"}
]

This model works fine, and you can validate objects against it: {"$class":"...", "dims": "2D"}

Where it breaks down is when trying to validate the AST JSON against the metamodel, because the metamodel has this regex validator applied to the "name" field of the EnumProperty concept (as well as the Property and Declaration concepts):

/^(\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}|\p{Nl}|\$|_|\\u[0-9A-Fa-f]{4})(?:\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}|\p{Nl}|\$|_|\\u[0-9A-Fa-f]{4}|\p{Mn}|\p{Mc}|\p{Nd}|\p{Pc}|\u200C|\u200D)*$/u

But this regex is expressing the rules of (unparsed, potentially escaped) identifier strings in the Concerto (.cto) language, which it shouldn't be doing. The strings in the AST are already parsed and are not escaped. Because the value from the AST (e.g. "2D") doesn't match this regex, the model fails to validate against the metamodel, even though the model is valid.

Put graphically with a similar example, we have something like this:

Concerto language (\u0032\u005C3D) --parse--> AST ("2\3D") --serialize--> JSON ("2\\3D")
                                                          \--serialize--> Concerto language (\u0032\u005C3D)

The regex should apply to the name after parsing (i.e. 2\3D in this example), because that's the true name, and that's what's stored in the Concerto object for the AST. The name/AST can be rendered into JSON (using JSON escaping rules) or rendered into Concerto (using Concerto escaping rules), but the escaping is not part of the value. It's part of the serialization format (JSON, Concerto, whatever).

A somewhat analogous bug would be if the regex was trying to account for JSON escaping rules, when it would never receive a JSON-escaped value because the JSON unescaping would already have been done by the JSON parser. In this case, the regex accounts for Concerto escaping rules, but it never receives a Concerto-escaped value because the unescaping has already been done by the Concerto parser.

DS-AdamMilazzo commented 1 month ago

I'm not sure the "Difficulty: Challenging" tag is quite warranted, unless there are backwards-compatibility concerns or you want to change the design. I'm not aware of any backwards-compatibility concerns, though. The parser seems to be parsing the name correctly, and the AST seems to be storing it correctly, and in most or all cases the Concerto tools work fine with the generated model and interpret the names from the AST correctly. It's just that some names which should be legal (according to the Concerto language) are considered to be invalid according to the metamodel. Fixing the regex just makes something work that didn't work before. Also, I think the design for Concerto names is okay and nothing needs to change except the regex.

The solution is simply to express any identifier naming rules that you want in the metamodel against the parsed, non-escaped name rather than the Concerto-escaped name. The current regex is purely syntactic and allows any non-empty name as long as it's appropriately escaped, so with escaping out of the picture "any non-empty name" can be matched by simply replacing the regex with a length validator ([1,]).

However, you have the option of making the regex smarter than that by incorporating some of your higher-level rules that are currently only expressed in Javascript code. For example, you probably have a rule that says "no declaration can be named '$class' or '$identifier' because those are reserved property names", and that rule could potentially be expressed in the regex. But that is optional. The current regex doesn't attempt that kind of validation at all, and attempting to add new validations into the regex is not really in the scope of fixing the existing regex bug.

dselman commented 1 month ago

Thank you @DS-AdamMilazzo that is very helpful. Another dimension to consider is all the code generation targets. An identifier like 2D would need to be escaped/mangled to become a valid identifier in C#, Java etc.

DS-AdamMilazzo commented 1 month ago

Thank you @DS-AdamMilazzo that is very helpful. Another dimension to consider is all the code generation targets. An identifier like 2D would need to be escaped/mangled to become a valid identifier in C#, Java etc.

Yes, but only the JSON property name truly needs to be preserved, and I think all those languages have a way to specify the JSON property name used for serialization. If 2D became _2D in C#/Java, I don't think anyone would mind. Like all escaping schemes, some cleverness is needed to avoid collisions (e.g. we can't blindly escape 2D into _2D because somebody could have also defined _2D in the Concerto model, in theory). But that's pretty easy to deal with, I think.