Open eclipse-ocl-bot opened 2 months ago
By Ed Willink on Jun 21, 2020 03:14
(In reply to Ed Willink from comment #0)
Xtext already has serialization support. No change needed.
No. This is where it all goes wrong. The creation of strictly alternating Text/HiddenRegions makes premature whitespace decisions (and needs backtracking).
Coming from a QVT perspective, I see a pattern matching problem. Discover a satisfactory covering of the user EObject graph with the grammar production rule patterns/sub-graphs. We could therefore use the QVTd scheduler, but infecting Xtext with a dependency on an immature QVTd is very undesirable, and because the Xtext problem is much simpler; no intermediate nodes/matching cascades. Every production rule produces precisely one EObject with a small, usually one, number of possible EClasses. The slots of the one created user EObject is populated by a single depth of EStructuralFeature assignments. Two good reasons for a custom solution much of which can be performed at compile time to provide expectation checks at the choice points avoiding the need for naive exploration of each choice until a good one is found which nessitatesg backtracking for each bad choice.
We need at compile time
It would appear that the very strong almost 1:1 relationship between user EClass and grammar EClass allows the "prune candidates to credible containment hierarchy" to identify a single canndidate 90% if not 99% of the time; few EObject-level choces are necessary.
By Sebastian Zarnekow on Jun 22, 2020 05:18
Hi Ed,
thank you for the effort you are putting into this. A few remarks on the conclusions that you made.
The creation of strictly alternating Text/HiddenRegions makes premature whitespace decisions (and needs backtracking).
How is the fact that we do have strictly alternating regions relevant / a show-stopper? Can you please elaborate.
Every production rule produces precisely one EObject
This is not true. Example:
MyMultiObjectProducer: \ {First} ({More.prev = current} 'more') ({EvenMore.prev = current} 'even' 'more') {YetMore.prev = current}\ ;
Please also see the next example with an opposite case.
The slots of the one created user EObject is populated by a single depth of EStructuralFeature assignments.
This is not true. Example where we have three rules but only instantiate a single object:
RuleA returns Instance:\ RuleB fromRuleA = ID\ ;
RuleB returns Instance:\ RuleC fromRuleB = ID\ ;
RuleC returns Instance:\ {Instance} fromRuleC = ID\ ;
Another example with fragments:
RuleA returns Instance;\ {Instance} Fragment\ ;
fragment Fragment *:\ fromFragment = ID\ ;
We need at compile time [snip]
This is basically how the serializer works (on a very high level).
that the very strong almost 1:1 relationship between user EClass and grammar EClass
I'm afraid this is not true. Almost all reasonable complex languages have some form of expression hierarchy that clearly does not fall into that category.
What I understand from your conclusion is a vast simplification of the underlying problem by ignoring the expressiveness of the grammar language.
I'm sure that a formatter for a small set of even complex language may appear to be a problem that can be solved rather easily. Providing a generally applicable infrastructure is a hard problem and the co-notation of your comments does not suggest that you already explored the entire problem space. Before you put to much time in a solution that is not generally applicable, I strongly suggest to implement a solution for your languages at hand and deduce a general solution afterwards.
By Ed Willink on Jun 22, 2020 07:33
Ok. Clearly the Xtext grammar is much more powerful than I ever use, perhaps more than anyone should use. Once I see an example of the horrors I might study it to see identify an M2M from horrible hierarchical rules to simple one level rules that I have exploited. (Of course I have a phobia about using precedence so I don't.)
My minimum goal is a formatter for the new generator that can accurately format comments.
Strictly alternating text/hidden regions requires the concatenation of
By Sebastian Zarnekow on Jun 22, 2020 07:59
Ok. Clearly the Xtext grammar is much more powerful than I ever use, perhaps more than anyone should use.
Claiming that nobody should use something because you as an individual do not use it .. appears a little ignorant to me. But that's probably just me.
headed off down the rewrite path
I'm curious to learn from your solution.
By Ed Willink on Jun 27, 2020 06:03
(In reply to Sebastian Zarnekow from comment #4)
Ok. Clearly the Xtext grammar is much more powerful than I ever use, perhaps more than anyone should use.
After looking more carefully....
Multiple rules producing the same type is not unexpected. When I wrote
"Every production rule produces precisely one EObject with a small, usually one, number of possible EClasses."
I did not exclude the possibility that there may be many rules producing the same EClass. This is why a transitive containment analysis is helpful to prune many candidate rules that are impossible before even considering the children.
Your "{YetMore.prev = current}" frightened me but actually in OCL I have a
ExpCS returns ExpCS:\ (PrefixedPrimaryExpCS ({InfixExpCS.ownedLeft=current} name=BinaryOperatorName ownedRight=ExpCS)?)\ | PrefixedLetExpCS;
to contend with. For the M2T serialization purpose I think this 'multiplies out' to the three alternatives:
ExpCS returns ExpCS: PrefixedPrimaryExpCS\ ExpCS returns InfixExpCS: ownedLeft=PrefixedPrimaryExpCS name=BinaryOperatorName ownedRight=ExpCS\ ExpCS returns ExpCS: PrefixedLetExpCS
allowing the type-plus-contents vetting to choose the rule. In this particular case the user EClass is nicely unique.
I suspect that your YetMore is amenable to similar flattening/permutation of the options.
By Sebastian Zarnekow on Jun 27, 2020 15:59
Hi Ed,
This is why a transitive containment analysis is helpful to prune many candidate rules
please take a look at the generated semantic sequence for your language. You'll find that it does exactly that. And it pretty much implements a bunch of the other ideas you outlined so far. Maybe you can \ a) take some inspiration from that, and\ b) find other important aspects that are potentially relevant to your design of the formatter.
By Ed Willink on Jun 29, 2020 03:22
(In reply to Ed Willink from comment #1)
(and needs backtracking).
While backtracking is not necessary for truly pathological cases, some hard solution finding is necessary.
(Converting rules to a disjunctive normal form of alternate sequences with multiplicity avoids the horrors of nested alternatives, but nested sequences can also be horrible.)
The typical:
x+=X (',' x+=X)*
is amenable to a simple solution of 11 + N21 = x.size() where N2 is the iteration count of the second rule element.
But cascaded homogeneous sequences:
x+=X (',' x+=X '+' x+=X) (',' x+=X '+' x+=X '/' x+=X)
This requires an integer solution to 11 + N22 + N3*3 = x.size() which is impossible for x.size() = 2 and ambiguous for large x.size(). This is a totally unrealistic example since, if the 2/3-element groups are significant to the user then it is very strange that this grouping is lost in the model. I can only conceive of it happening through evolution, the metamodel used to persist grouping and it was removed. But then the grammar should have evolved to have a no-group syntax that should lexically occlude the legacy syntaxes.
Nested homogeneous sequences are even worse and completely stupid:
x+=X ((',' x+=X '+' x+=X) (',' x+=X '+' x+=X '/' x+=X)) (',' x+=X '+' x+=X '/' x+=X x+=X)
We now have to solve for N2, N2.1, N2.2 and N3 with potentially distinct values for N2.1 and N2.2 for each N2 iteration. For large x.size() there may be numerous solutions for which we pick the left-aggressice first. If the sequences invoke a perverse mixture of multiple assignments in varying ratios, the integer solution may be sufficiently hard as to require a PIP solver. A significant effort/overhead for a cloud cuckoo land problem. But a 10 dimensional search over a 1000 model element could take forever.
The complexity of communicating the pathological solution tree potentially doubles the intermediate serialization overheads. Probably need three gears;
By Ed Willink on Jul 22, 2020 04:02
In practice there seem to be a very small number of over-specified simultaneous equations to solve to determine the actual repeat count of each grammar term from the actual slot sizes; pace a little care wrt eIsSet and rules that do/do-not use a default value. These seem to be soluble at compile time avoiding run-time costs. However some fall-back is necessary for the complete stupid / unforeseen case.
Next problem: comments. An additional motivation. Fix Bug 405184.
After some consideration it appears that Terminals.xtext has given everyone a very bum steer by demonstrating that major documentation comments can be hidden and so not part of the true model (Xtext.ecore omits comments).
Bug 565412 discusses how comments could be handled more sensibly to avoid some of the current OCLinEcore kludges.
By Ed Willink on Aug 06, 2020 03:30
Heterogeneous multi-assignments can cause pathological problems. The EssentialOCL RoundBracketedClauseCS is:
x+=X1 x+=(X2|X3)*
which is flattened to
x+=X1 x+=X2* x+=X3*
(oops) violating the ordering but legal if not ordered. However whereas the attribute equivalent
x+='X1' x+=('X2'|'X3')*
has inherently orthogonal X1/X2/X3, the rule calls for reference assignments may not be obviously orthogonal and might require deep run-time solving/backtracking.
For the EssentialOCL RoundBracketedClauseCS ,the rules are trivially orthogonal and the flattening was invalid anyway.
x+=X1 x+=(X2|X3)*
must be retained as a leaf alternative to allow the elements of x to be consumed in actual mixed X2/X3 order.
A valid heterogeneous challenge arises with PathNameCS
x+=X1 ('::' x+=X2)*
but again it's not really a problem since the first x must be an X1 and the others X2. Obviously if x is not ordered a deterministiic serialization is impossible, and searching x for for the first X1 compatibility is unnecessary.
So while sensible cases of heterogeneous multi-assignments are statically soluble, a pathological problem exists for not-orthogonal X1, X2:
... x+=X1* ... x+=X2* ...
when an indeterminate number of X1 are followed by an indeterminate number of X2.
By Sebastian Zarnekow on Aug 06, 2020 03:56
@Ed is this in fact a ticket that should be moved to OCL for now?
By Ed Willink on Aug 06, 2020 06:31
Move to OCL if you like, but the solution is intended to be independent of OCL/QVTd although developed using OCL grammars as examples and in the OCL GIT.
I hope to get the OCL/QVTd editors fully away from old Xtext infrastructure for RC1, however there is very little likelihood that all 'discussions' about migrating to Xtext GIT will be complete by then. There is also some vagueness as to how/if pathologically stupid grammar serialization is accommodated.
By Sebastian Zarnekow on Aug 06, 2020 06:33
there is very little likelihood that all 'discussions' about migrating to Xtext GIT will be complete by then
True. Let's get this battle-tested first and than see how to proceed from there.
By Ed Willink on Aug 10, 2020 03:24
It works - for the hierarchy of OCL editors; using a .idioms model per grammar, re-using via an import. Only custom code is a single CommentSegmentSupport class whose getComment(EObject) method finds the comment body for a semantic element. The ValueConverterService and CrossReferenceSerializer are re-used for DataType / reference serialization. An Xtext grammar could make the .idioms model much easier to develop / understand.
Need some usage experience to discover the idiom idioms for problems such as 'the first colon in an X has no preceding space but the second does'. Probably solved by an idiom comprising a (first) colon-in-X locator followed by a (second) colon-in-X locator. These locations are moderately stable avoiding the regular generated formatter failures that currently occur as ...comma_5_1 moves to ...comma_7_1.
The problem of blank lines between paragraphs but not before first or after last seems to have a nice solution with a half-new-line character that prefixes and suffixes each paragraph (rule). The intervening two halves become one, the outer halves vanish. Much nicer than fighting the groups to try and get the right member,
'just' need to persist the compile-time analysis in a Java file to avoid the current lazy analysis every serialization.
(In reply to Ed Willink from comment #7)
The latter has some analogies to backtracking though we just search for a first valid solution parameterization rather than repeatedly generate and reject actual serializations.
The OCL grammar has an unpleasant duplication of expression and let-expression rules to avoid a parser backtrack / shift-reduce conflict. The serializer has to choose the correct alternative which cannot be done locally. Overall the search space is a tree of all possible alternative serializations, one node for each possible serialization rule of each user EObject, one edge for each assignment. The static analysis substantially reduces the alternatives per user EObject and eases dynamic checking. A depth first search finds the first universally successful match. Most nodes are locally/ancestrally resolved, but occasionally child nodes must be checked to discover and propagate a non-match backwards. This is a matching stage 'backtrack'.
Instrumenting shows that for OCL nesrly 10 alternatives are (re-)matched for user EObject. Introducing a cache to re-use the lookahead reduces the overhead to less than two match attempts per user EObject. The instrumenting does not distinguish between rapid local rejections and deep failures. For well-behaved grammars will just be a trivial descent, for bad grammars the match miss overhead is controlled.
(Serialization difficulties could fairly easily be resolved by grammar/semantic model tweaks, but we want to cope with real user problems, so no CS model changes for now.)
(For the pathological case, a naive search through the integer possibilities of each term multiplicity is needed. Since most non-trivial searializations occur for a small number of objects, a naive search may not be too horrible in performance. But it is not needed for OCL; the cardinalities of all terms of all rules are amenable to a statically planned solution.)
By Ed Willink on Sep 26, 2020 04:33
Bug 567383 handles the evolution to make the development acceptable to the Xtext project.
By Ed Willink on Sep 28, 2020 04:03
(In reply to Ed Willink from comment #13)
It works
Aargh! as well as the near-total formatting for an OCLinEcore-style *.ecore load, there is the reformatting for e.g. a selected Ctrl-Shift F range.
A first attempt tries to identify the narrowest EObject that covers the range and serialize that, but there are edge effect blending challenges aggravated by Xtext's treatment of a new-line as part of the subsequent semantic object.
But this ignores the real challenge; given some Java such as "abstract\t \t static\npublic" reformatting should respect the user's preferred ordering to give "abstract static public" rather than impose the single deterministic total serialiation order.
In the reformatting case, the tree of serialization rules can be identified from the semantic EObjects just as for the total formatting. The INode model is available so the incoming text range for each SerializationStep can be correlated ignoring strict order where necessary. Formatting therefore just serializes the tail SerializationSegments of the preceding SerializationStep and the head SerializationSegments of the following SerializationStep. Reformatting therefore just resynthesizes the whitespace. If it occasionally gets it slightly wrong it will only be cosmetically wrong.
The existing formatters have an identify-the-indentation algorithm which could allow resynthesized whitespace to blend better with the inconsistent indentation of surrounding lines, rather than imposing a globally correct indentation locally. Maybe this is a good idea.
By Ed Willink on Oct 05, 2020 02:50
(In reply to Ed Willink from comment #14)
Bug 567383 handles the evolution to make the development acceptable to the Xtext project.
By Ed Willink on Oct 08, 2020 04:57
(In reply to Ed Willink from comment #15)
But this ignores the real challenge; ... reformatting should respect the user's preferred ordering to
Serialization(+formatting) as required for OCLinEcore to format from *.ecore is very different to what is required by typical Xtext formatting that pretifies text layout.
Serialization must discover the grammar mapping before formatting normalized content. Xtext comments are only available if a Node model is available from a parse.
Formatting must re-use the accidental user ordering. This is all part of a parse with ordering.
Re-using too much of the serilaization rules approach for formatting requires unpleasant run-time searches to correlate serialzation rule steps with rule nodes as a prelude to rescuing comments.
If instead we create separate grammar rule metadata to map each grammar element to its serialization segments, formatting is easy just traverse the nodes in linear order to access the grammar element and so serialization segments; no serialization rules to discover. Cost perhaps 10% increase in SerializationMetaData.
But the grammar rules have alternatives, which serialization rules don't so some pathological grammars may defeat idiom matching.
silly: ('('|'<') x=X (')'|'>');
may not reliably match the '<'...'>' idiom with out run-time checking.
Tough; the formatting may be slightly wrong. Maybe we can emit a diagnostic recommending avoiding subidiom matching within alternatives.
silly: silly1 | silly2;\ silly1: '(' x=X ')';\ silly2: '<' x=X '>';
The existing formatters probably can't enforce no spacing inside possibly paired <> either.
By Ed Willink on Nov 25, 2020 13:15
Pushed to master for M3.
In use for OCL, QVTd. Further work needed to avoid a stack overflow from Xtext's recursion that parses (X) as X.
By Ed Willink on Nov 29, 2020 12:06
(In reply to Ed Willink from comment #18)
Further work needed to avoid a stack overflow from Xtext's recursion that parses (X) as X.
Discussed in Bug 569273.
By Ed Willink on Dec 08, 2020 11:01
(In reply to Ed Willink from comment #18)
Pushed to master for M3.
Reverted for RC2. I spent too much time after M3 on the advanced problem of Xtext's eccentric cyclic rules rather than the mundane details of polishing the basic OCL/QVTd support. Cannot fix enough for RC2 so mustrevert.
By Ed Willink on Dec 29, 2020 11:13
Another horror. After fixing the idioms grammar to allow "with x.y.z.Z" just like the Xtext grammar, the compound datatype rule for QualifiedID fails, omitting elements or spacing them out. Amazingly even Java allows
package x/*/./\ */y;
so clearly formatting of each datatype rule term is necessary. Currently we just use node.getText() attempting to treat the compound as a monolith.
By Ed Willink on Jan 20, 2021 10:20
Now ready for OCL/QVTd M2, but ...
This started as a nice simple idiom-driven replacement serializer prettifier.
But it grew to encompass the whole serializer since the prevailing intermediate was unhelpful.
Oops. Has to do the formatter too which encounters considerable pain with the unhelpful Node model design e.g. who wants SyntheticNodes?
Oops. Should do arbitrary grammars such as Xtext that are not well-behaved (shift-reduce/reduce-reduce free LALR-wise). Xtext has no fewer than 4 production loops some involving nine productions, and these loops nest.
After a few more weeks struggling with the consequences of Xtext's loops I have an auto-generated XtextSerializationMetaData.java that successfully serializes a *.xtext. Not very pretty because:
a) the nested loops cause gratuitous parentheses. Fixable with stronger static analysis to prefer useful terms.
b) irregular serialization of STRING is required, see Bug 570497. Solved with a custom SerializationStringSegment.
c) the AbstractRule.returns is always present in the AS but the corresponding ecore import declaration may be missing. Since always present, it is always serialized, and without an ecore import, a global scope is misformatted. Needs a custom feature content filter capability.
... and no doubt more to be discovered.
... and line wrapping is not yet implemented.
Realistically support for badly-behaved grammars is taking far too much extra time and my grammars are well-behaved. If the idiom-driven serializer/formatter is promoted to Xtext, there will be a non-trivial support burden as new users come up with yet more variants of badly-behaved grammars. Providing this support is not in my plans, and I very much doubt it is in the Xtext team's plans.
Therefore continuing the struggle with badly-behaved grammars seems unjustified even if the extra support adds flexibility albeit at a performance cost.
Bottom line. Contributing to Xtext may require another six months work over the next two years. Polishing for well-behaved grammars only may need just a month.
By Ed Willink on Oct 18, 2021 15:46
The Declarative Formatter/Serializer is in use for OCL/QVTd that have lossless T2M.
Irritatingly Idioms.xtext has 8 benign errors.
The final ewillink/564265 branch reverts the new Serializer at the expense of on test failure. Of course this unfixes the comment formatting and reverts to the old Xtext infrastructure.
So we keep the development.
Bug 567161 fixes the Idioms.xtext errors.
By Ed Willink on Nov 15, 2023 10:39
https://github.com/eclipse/xtext/issues/2811 \ "Say goodbye to Xpand" prompts a revist as the SimRel build breaks through OCL's use of the old xtext.generator.
Not clear why this issue was RESOLVED given that development was put on hold with too many buigs appearing as RC1 loomed.
Reactivating and putting in another onth's developement it looks plausible enough to push to master for M3. Serialization is good, formatting needs a fix to avoid losing comments. Lin-wrapping still to do.
| --- | --- | | Bugzilla Link | 564265 | | Status | REOPENED | | Importance | P3 enhancement | | Reported | Jun 13, 2020 03:51 EDT | | Modified | Nov 15, 2023 10:39 EDT | | Depends on | 439440, 564254 | | Blocks | 567383 | | See also | 563046, 565412, 405184, 567161, 569273 | | Reporter | Ed Willink |
Description
Migration of OCL from the old generator to the new forced me to look at migration from the old formatter to the new and conclude that, with Xtend dying, the new formatter is a mostly retrograde step. See Bug 563046. However since the new formatter is more imperative it is more debuggable and so more inspiring.
Previously I have advocated, https://github.com/eclipse/xtext/issues/1721, that it would be nice to mark up formatting in an alternate Xtext editor view. That is declarative and much better but misses a major principle of software engineering; use an abstraction to avoid repetition.
Let us consider a new design based on the formatting idiom abstraction.
The commonest idiom could be declared in a definition DSL as:
idiom default(token : Token) {\