Capital letters break autocomplete in VS Code extension

drhagen commented 8 months ago

A grammar of a keyword followed by /[A-Z]+/ will not correctly autocomplete the keyword, but the same keyword followed by /[a-z]+/ will autocomplete just fine. This might be a bug on the VS Code side because the same grammar in the Langium Playground autocompletes fine.

Langium version: 2.1.3
Package name: hello-world

Steps To Reproduce

1. `npm install -g yo generator-langium`
2. `yo langium`
   - Keep defaults except do not create CLI or webworker
   - Accept open in VS Code
3. Replace `hello-world.langium` with:
   ```
   grammar HelloWorld

   entry Model: 'header' value=ID;

   hidden terminal WS: /\s+/;
   terminal ID: /[A-Z]+/;
   ```
4. Purge validation in `hello-world-validator.ts` because we don't need it:
   ```typescript
   import type { HelloWorldServices } from './hello-world-module.js';
   export function registerValidationChecks(services: HelloWorldServices) { }
   export class HelloWorldValidator { }
   ```
5. `npm run langium:generate`
6. `npm run build`
7. Run extension in Code to open a new window with the extension installed
8. Create a file `test.hello`
9. In the file, try to auto-complete the first keyword `he<tab>`

The current behavior

When starting to type the keyword, the correct completion appears. But when pressing Tab or Enter to accept the autocomplete, it types in the whole keyword again instead of the remainder of the word.

Now switch ID from /[A-Z]+/ to /[a-z]+/. Rebuild and restart the extension. With this grammar autocomplete works as expected.

The expected behavior

Autocomplete completes the keyword instead of typing the whole keyword in again regardless of the token that follows.

msujew commented 8 months ago

Ok, fascinating. This is a really hard-to-catch edge case in very special grammars for completions within the first token of a file. I'm honestly surprised someone was able to create reproduction steps for this. Kudos, I guess. We basically run into this branch, which then later assumes that no tokens have been parsed. As a consequence, it doesn't even attempt to fuzzy match the previous code to override it. This logic got into Langium fairly recently, whereas the playground lags behind by a minor version, which is why it doesn't exhibit the behavior.

I'm not sure whether we can actually change this part of the logic though. The fuzzy matcher isn't allowed to look too far back in the token stream to find the text to replace. It should only look for the current token, which is exactly what's happening right now. In some cases, the current token just cannot be lexed, which leads to the behavior you're experiencing.
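
To make the symptom concrete, here is a minimal sketch of how a completion edit plays out depending on where the replace range starts. It is a toy model, not Langium's actual implementation; `applyCompletion` and its parameters are hypothetical names used only for illustration.

```typescript
// Toy model of applying a completion item (hypothetical, not Langium code).
// The edit overwrites everything from `replaceStart` up to the cursor.
function applyCompletion(
    text: string,
    cursor: number,
    replaceStart: number,
    newText: string
): string {
    return text.slice(0, replaceStart) + newText + text.slice(cursor);
}

const typed = 'he';
const cursor = typed.length;

// The partial word was recognized as a token starting at offset 0,
// so the edit overwrites it with the full keyword.
const withToken = applyCompletion(typed, cursor, 0, 'header');

// No token was recognized before the cursor, so the edit degenerates
// into a plain insertion at the cursor position.
const withoutToken = applyCompletion(typed, cursor, cursor, 'header');

console.log(withToken);    // header
console.log(withoutToken); // heheader
```

The second case corresponds to the reported behavior: the whole keyword is typed in again after the characters that were already there.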

drhagen commented 8 months ago

> within the first token of a file

I minimized this down, but failed autocompletion can trigger further than the first token, unless we have different definitions of "token".

For example, using this grammar:

grammar ReactionModel

entry ReactionModel:
    EOL? '%%' 'ReactionModel@2' EOL
    'initialization' '=' initialization=Initialization EOL
    '%' 'components' EOL
;

Initialization:
    InitialValue | SteadyState;

InitialValue:
    {infer InitialValue} 'initial_value' '(' ')';

SteadyState:
    'steady_state' '(' 'time_scale' '=' time_scale=FLOAT (',' 'max_scale' '=' FLOAT )? ')';

hidden terminal WS: /[ \t]+/;
terminal EOL: /((#.*)?\n[ \t]*)*(#.*)?((\n[ \t]*)|\Z)/;
terminal FLOAT returns number: /[+-]?\d+(\.\d+)?([Ee][+-]?\d+)?/;

with this valid file

%% ReactionModel@2
initialization = steady_state(time_scale = 1.0, max_scale=1.0)
% components

not a single keyword autocompletes correctly while typing it in or when going back to edit it. It knows what can be autocompleted there (e.g. after "initialization =" then "steady_state" or "initial_value" are valid autocompletes), but it types in the whole word instead of completing the word.

msujew commented 8 months ago

@drhagen Let me rephrase: in your language, `initial`, for example, isn't actually a token (even though `initial_value` is), since there's neither a keyword nor something like an ID terminal that could lex it. Instead, the lexer simply ignores the characters. Since we can only know where a token starts or ends if the lexer recognizes it as a token, the completion provider assumes that the characters before the cursor position are invalid and ignores them as well. This is actually independent of the issue that we don't lex any tokens at all. The real issue is that we have no idea "how much" of a token already exists at a given point.
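
The effect can be illustrated with a toy lexer. This is only a sketch under simplifying assumptions: Langium's real lexer is built on Chevrotain and behaves differently in detail, and the `Terminal`, `Token`, and `lex` names below are hypothetical. It uses the terminals from the hello-world reproduction and skips any character that no terminal matches.

```typescript
interface Token { type: string; text: string; offset: number; }
interface Terminal { name: string; pattern: RegExp; hidden?: boolean; }

// Toy lexer: try each terminal at the current offset, first match wins.
// Characters that no terminal matches are skipped and produce no token.
function lex(input: string, terminals: Terminal[]): Token[] {
    const tokens: Token[] = [];
    let offset = 0;
    while (offset < input.length) {
        let matched = false;
        for (const terminal of terminals) {
            // The sticky flag anchors the match at exactly `offset`.
            const sticky = new RegExp(terminal.pattern.source, 'y');
            sticky.lastIndex = offset;
            const match = sticky.exec(input);
            if (match && match[0].length > 0) {
                if (!terminal.hidden) {
                    tokens.push({ type: terminal.name, text: match[0], offset });
                }
                offset += match[0].length;
                matched = true;
                break;
            }
        }
        if (!matched) {
            offset++; // unrecognized character: ignored, no token emitted
        }
    }
    return tokens;
}

const base: Terminal[] = [
    { name: 'header', pattern: /header/ },
    { name: 'WS', pattern: /\s+/, hidden: true },
];
const upperId: Terminal[] = [...base, { name: 'ID', pattern: /[A-Z]+/ }];
const lowerId: Terminal[] = [...base, { name: 'ID', pattern: /[a-z]+/ }];

// With ID = /[A-Z]+/ the partial keyword "he" yields no token at all.
const upperTokens = lex('he', upperId);
// With ID = /[a-z]+/ the same input is lexed as a single ID token.
const lowerTokens = lex('he', lowerId);

console.log(upperTokens.length, lowerTokens.length); // 0 1
```

In the first case there is no token whose start offset could serve as the beginning of the text to replace, which is why the partial keyword is left in place.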

In order to successfully perform completion, even "broken" keywords need to be recognized as tokens by the lexer. Most languages (i.e. all that I've encountered so far) have an ID terminal that can be expressed as /\w+/, which automatically fixes this issue.

I don't think we can fix this as part of our framework. You are free to override how the completion provider attempts its fuzzy matching, so you should be able to fix this behavior for your language yourself.
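
One possible shape for such an override is sketched below. It is an assumption-laden illustration, not Langium API: `findWordStart` and `completeKeyword` are hypothetical helpers, the match is a simple prefix check rather than Langium's fuzzy matcher, and how to wire the logic into a custom completion provider is not shown. The idea is to scan backwards from the cursor over word characters instead of relying on the lexer to find the start of the current token.

```typescript
// Hypothetical helper: find where the partially typed word before the cursor starts
// by scanning backwards over word characters (the same class as /\w/).
function findWordStart(text: string, cursor: number): number {
    let start = cursor;
    while (start > 0 && /\w/.test(text.charAt(start - 1))) {
        start--;
    }
    return start;
}

// Hypothetical helper: overwrite the partial word only if the proposed keyword
// begins with it; otherwise fall back to inserting at the cursor.
function completeKeyword(text: string, cursor: number, keyword: string): string {
    const start = findWordStart(text, cursor);
    const partial = text.slice(start, cursor);
    const replaceStart = keyword.startsWith(partial) ? start : cursor;
    return text.slice(0, replaceStart) + keyword + text.slice(cursor);
}

const line = 'initialization = stea';
const completed = completeKeyword(line, line.length, 'steady_state');

console.log(completed); // initialization = steady_state
```

A character-based scan like this deliberately ignores the grammar, so it may pick the wrong range in languages where keywords contain characters outside `\w`.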

eclipse-langium / langium