[parser] Eliminate CSTToken.startOffset/endOffset

eclipse-ocl-bot commented 1 month ago

| --- | --- | | Bugzilla Link | 297606 | | Status | CLOSED WONTFIX | | Importance | P3 normal | | Reported | Dec 11, 2009 12:50 EDT | | Modified | May 27, 2011 06:41 EDT | | Blocks | 318368 | | Reporter | Ed Willink |

Description

Following the fix for NUMERIC operations, there is no need for CSTToken to have both startToken/endToken and startOffset/endOffset. The offsets can be derived by startToken.getStartOffset(), endToken.getEndOffset().

A quick experiment changing startOffset/endOffset to derived,readonly demonstrates that there are quite a few calls to setStartOffset/setEndOffset since those are what parsers used to have to call.

This is therefore a breaking change, so for now all we can do is deprecate the two setStartOffset/setEndOffset methods. I'll attach a patch that deprecates these methods and removes the deprecated calls. I'll submit a patch to QVTo for its calls.

We ought to wait for MDT/OCL 4.0.0 before removing these methods and fields completely.

eclipse-ocl-bot commented 1 month ago

By Radomil Dvorak on Dec 11, 2009 14:18

(In reply to comment #0)

Following the fix for NUMERIC operations, there is no need for CSTToken to have both startToken/endToken and startOffset/endOffset. The offsets can be derived by startToken.getStartOffset(), endToken.getEndOffset().

A quick experiment changing startOffset/endOffset to derived,readonly demonstrates that there are quite a few calls to setStartOffset/setEndOffset since those are what parsers used to have to call.

This is therefore a breaking change, so for now all we can do is deprecate the two setStartOffset/setEndOffset methods. I'll attach a patch that deprecates these methods and removes the deprecated calls. I'll submit a patch to QVTo for its calls.

We ought to wait for MDT/OCL 4.0.0 before removing these methods and fields completely.

Ed, the offset integers allow to make CST model more reusable in other contexts, where the tokens of LPG parser may not be available.\ I rather do not understand why the model needs to carry the data of a particular parser implementation, along with its buffers, etc.\ To my understanding, I should able to deduce the tokens from the offsets via the used parser, so why a deprecation in this direction?

Also, when looking into the code, who is using the tokens from the CST model?

eclipse-ocl-bot commented 1 month ago

By Ed Willink on Dec 11, 2009 15:08

The use of tokens was introduced to make AST to CST to Token to Chars navigation much easier for editors. Without the Token objects the location of the chars required unnecessary searches to recover lost context.

The CST is an 'internal' model for communication between the parser and the analyzer, so alternate CST sources seem unlikely, but if you have a specific need then we'll take account of it. Orginally the CST packages had an internal in them, this was removed to allow derived parsers access. It was not inttended to allow independent creation.

Glancing at the QVTo code that is affected by the possible deprecation, it's just a matter of deleting the getStartOffset()/getEndOffset() calls in all the setOffsets() calls in the grammar.

eclipse-ocl-bot commented 1 month ago

By Radomil Dvorak on Dec 11, 2009 16:12

(In reply to comment #2)

The use of tokens was introduced to make AST to CST to Token to Chars navigation much easier for editors. Without the Token objects the location of the chars required unnecessary searches to recover lost context.

I remember that time when it appeared, and I was not much happy to see that.\ Actually, it was detected by GMF Xpand running into out of memory problems.\ I looked for usages and found it was just set and nothing more. In bugzilla I found it got there as a sort of optimization for an IMP-based editor. Well,\ ok if one side is happy but there are many ways how to do things, so others may \ have taken other ways.\

The CST is an 'internal' model for communication between the parser and the analyzer, so alternate CST sources seem unlikely, but if you have a specific need then we'll take account of it. Orginally the CST packages had an internal in them, this was removed to allow derived parsers access. It was not inttended to allow independent creation.

Users might have different ideas, once it's part of API, it's hard to guess the intent in this case.\ As an example, one might want to implement incremental parsing to speed up reconciling in editors. In a lang like QVTo it makes a perfect sense to do that\ in the scope of operation. Its body, if modified, can cause re-parse of only the affected operation, resulting in a new partial CST. This can be easily incorporated into the whole CST as we can just shift offsets accordingly, getting a consistent CST. If tokens are there, especially used to derive integer offsets, I'm in troubles. With position only base CST, it's OK.\ Obviously, AST is fine as we keep the operation element already resolved by others, but replace only its internals. Does that make sense?

What I wanted to say is that one optimization can kill another optimization. Better if people have an option to do great features in different ways. QVTo editor has codesense, semantical highlighting without CSTNode:tokens and IMP, so it's clearly not a must.

Glancing at the QVTo code that is affected by the possible deprecation, it's just a matter of deleting the getStartOffset()/getEndOffset() calls in all the setOffsets() calls in the grammar.

As per QVTo in CVS yes, but there are other consumers, not only at Eclipse.\ I guess GMF Xpand integration might be interested.

eclipse-ocl-bot commented 1 month ago

By Ed Willink on Dec 11, 2009 16:43

Actually if you use tokens optimisation gets easier. Replacement tokens can describe text under control of an alternate parser avoiding the need to change any indexes at all; just so long as you always ask the token which lexer stream it refers to.

After loking at the QVTo code more closely there are few routines such as createStepOutCS that (needlessly) pass offsets rather than tokens, requiring minor signature simplifications to eliminate the setOffsets(). There appears to be one place where offsets and tokens are inconsistent. In OCL, this was an indication of a lexer bug.

For Helios, this change will at most deprecate setStart/EndOffset and realise getStart/EndOffset by using the start/endOffset if either start/endToken is null, but computing it from start/endToken otherwise. If the deprecations provoke significant protest we'll just undeprecate. If both setStartToken and setStartOffset are used, the setStartToken will now have precedence. Further protection might by afforded by an IllegalStateException for invoking setXxxOffset after setXxxToken and vice-versa.

This latter computation allows a longstanding uninvestigated funny to be resolved. (See Bug 277620). The implicit API of getStartOffset()/getEndOffset() was that

 getStartOffset() <= getEndOffset() + 1

which was true till the backtracking parser was used. The backtracking parser swaps start and end offsets for ErrorTokens, invalidating the above. A variety of bodges have been tried to avoid the bad string length/index exceptions.

The CSTNode-comments will document that end offset is inclusive and that getStartOffset() <= getEndOffset() + 1, requiring CSTNodeImpl.getStart/EndOffset to undo the ErrorToken interchange before allowing consuming code to see it. Hopefully this will fix the funnies. Still working on this.

eclipse-ocl-bot commented 1 month ago

By Ed Willink on Jun 26, 2010 10:43

The planned evolution to use of Xtext and consequently ANTLr for editing, while retaining LPG for fast AST generation suggests that this bug should change direction.

The LPG CST should be as lightweight as possible while aligning slightly with the Xtext CST.

It therefore seems appropriate to head towards just startOffset/endOffset once again.

eclipse-ocl-bot commented 1 month ago

By Ed Willink on Feb 01, 2011 02:28

The LPG parser is now mature. It is not obvious that this change is practical anyway.

eclipse-ocl-bot commented 1 month ago

By Ed Willink on May 27, 2011 02:26

Closing.

eclipse-ocl-bot commented 1 month ago

By Ed Willink on May 27, 2011 06:41

Resolved for Indigo is 3.1.0 not 3.2.0.

eclipse-ocl / org.eclipse.ocl

[parser] Eliminate CSTToken.startOffset/endOffset #481

Description