Warnings cursor position is offsetted & error types have inconsistent cursor offset

XenHat commented 7 years ago

This one is a little complicated:

Text version:

//jump.lsl
default
{
    state_entry()
    {
        jump X;
        llOwnerSay("Still here!");
        jump X;
        llOwnerSay("You can't see me");
        @X;
        llOwnerSay("OK");

    }
}

// lint_cursor.lsl
default
{
    state_entry()
    {
        llOwnerSay("Some String");
        xlSomeBogusUndefined();
        llOwnerSay("Another String");
        @Unused;
        key nya_k = (key)"nya";
        garbage
        llOwnerSay("OK");
    }
}

Warnings: If you look at jump.lsl. the warning for jump X is on jump instead of on X (off by one word to the left) If you look at @Unused, same as above, cursor is off by one word to the left If you look at key nya_k, same as above, cursor is off by one word to the left

Errors: If you look at xlSomeBogusUndefined() (Undeclared function), the cursor is correct If you look at garbage, cursor is one line under (!?)

Since the LSL build system (F7) also reports the same numbers and exhibits the same behavior, I am fairly confident this is cause by lslint, not my linter plugin regex.

I am unsure if this is intended to be wrong due to "Emulation" of the LSL compiler, but I would prefer my linter to report my errors correctly. I cannot/won't hack my linter plugin to correct this due to the gigantic regex required to encompass all quirks, assuming it is doable)

@Makopo @Sei-Lisa Do you think this is fixable?

EDIT: I just noticed lslint's -l switch. Am I supposed to use this? EDIT2: Even -l appears to be wrong, in yet more inconsistent ways:

ERROR:: (  6,  9)-(  6, 31): `xlSomeBogusUndefined' is undeclared.
ERROR:: ( 11,  9)-( 11, 19): syntax error, unexpected IDENTIFIER
WARN:: (  8,  9)-(  8, 17): label `Unused' declared but never used.
WARN:: (  9,  9)-(  9, 31): variable `nya_k' declared but never used.

Sei-Lisa commented 7 years ago

Can you please post a test case as text?

XenHat commented 7 years ago

Sure, I forgot about that. EDIT: Moved to OP

Sei-Lisa commented 7 years ago

Thanks. Warnings seem off by one line, following the viewer's line numbering that starts on line zero instead of the conventional numbering that starts on line 1. I will investigate that.

The error is actually more or less correct. Understanding why requires a bit of explanation.

Parsing is divided into two stages, lexical and grammatical analysis. The lexical analysis consists of grouping the input into units, called tokens, that form the basic syntactic elements. These elements do not typically depend on the context for them to be parsed, and can be isolated by reading the input file. Reserved keywords are distinguished from regular identifiers at this stage, and they are assigned different tokens.

For example, this script:

default
{
    state_entry()
    {
        llSay(0, "Hello, Avatar!");
    }
}

would be output by the lexical analyzer like this: TOKEN_DEFAULT_KEYWORD, TOKEN_OPEN_BRACE, TOKEN_STATE_ENTRY_KEYWORD (in LSL, event names are reserved words, and it's quite unique in that sense; most languages would parse that as an identifier and distinguish the kind at a later stage), TOKEN_OPEN_PARENTHESIS, TOKEN_CLOSE_PARENTHESIS, TOKEN_OPEN_BRACE, TOKEN_IDENTIFIER data="llSay", TOKEN_OPEN_PARENTHESIS, TOKEN_INTEGER data=0, TOKEN_COMMA, TOKEN_STRING data="Hello, Avatar!", TOKEN_CLOSE_PARENTHESIS, TOKEN_SEMICOLON, TOKEN_CLOSE_BRACE, TOKEN_CLOSE_BRACE, TOKEN_EOF.

Note the absence of tokens for spaces and newlines. These are skipped by the lexical analyzer; the grammar doesn't need them, so the lexer doesn't feed those to the syntax analysis.

The above example is not exact, but it's a quite good approximation. In practice, for brevity and convenience, the word TOKEN is usually implied and some single-character tokens are returned as the symbols themselves, rather than have a token word for them, though some double symbols like ++ or *= etc. do need token words. The actual sequence would therefore be:

DEFAULT, '{', STATE_ENTRY, '(', ')', '{', IDENTIFIER data="llSay", '(', INTEGER data=0, ',', STRING data="Hello, Avatar!", ')', ';', '}', '}', EOF

Then comes the grammar, which is a set of rules that govern whether a sequence of input tokens is acceptable. Grammars typically don't (or rarely do) look into the data associated to a token; they mostly care about whether the input is well-formed. The grammar constructs a tree, which will then be read by the compiler (or by lslint in this case). lslint has a command-line option to see the constructed tree.

A reduced grammar limited to parsing the above file and little else, using a syntax similar to the BISON format (but without the { } actions), would be:

start: states EOF;
states: state | states state; // you can read the | as "or", and yes, recursive rules are OK
state: DEFAULT state_definition | STATE IDENTIFIER state_definition;
state_definition: '{' events '}'
events: event_header event_body | events event_header event_body;
event_header: event_state_entry | event_touch_start; // just two events in this simplified grammar
event_state_entry: STATE_ENTRY '(' ')';
event_touch_start: TOUCH_START '(' TYPE IDENTIFIER ')';
event_body: '{' statements '}';
statements: |  // before the | is an empty element; having 'statements' emtpty is allowed
            ';' | expression ';' | statements expression ';';
expression: function_call | IDENTIFIER | STRING | INTEGER; // that's a very limited expression, for brevity
function_call: IDENTIFIER '(' parameter_list ')';
parameter_list: | nonempty_parameter_list; // empty parameter lists also allowed
nonempty_parameter_list: expression | nonempty_parameter_list ',' expression;

Parsing would proceed as follows. The first token fed by the lexer is DEFAULT. The parser starts at start. It expects states to follow. That's a non-terminal (i.e. not a token from the lexer; the name non-terminal means that in the tree, these represent nodes with children; the leaf nodes are the tokens and are also called terminals in the grammar). So it looks up states. The definition is state | states state. Since the second part hits a recursion at this point, and we don't already have a complete states at this point, the second part is not entered, so it takes the first branch. It looks up state, which is DEFAULT state_definition | STATE IDENTIFIER state_definition. Since the current token is DEFAULT and not STATE, the second branch doesn't apply. But the first branch is a match, so DEFAULT is accepted, and the next token is read, which is '{'. We're still in the first branch of state, but after DEFAULT, i.e. next thing to look up would be state_definition. That's defined as '{' events '}'. The '{' matches, so it reads the next token which is STATE_ENTRY. And so and so on, until either the whole sequence of tokens is read and accepted, or something that doesn't follow the rules is found (for example, if the first token was not DEFAULT or STATE, it would have immediately failed).

After this "Dragon book in 3 minutes" crash course on parsing, we're ready to understand what's going on with the second error :)

The problem here is that the grammar doesn't check what the identifier is and whether it's defined while it is checking syntax; as I said that's done at a latter stage. What the grammar sees is the output of the lexical analyzer, which starting at line 10 is IDENTIFIER, IDENTIFIER, '(', STRING, ')'...

The first identifier is valid syntax, because an identifier alone is a valid expression, which is a valid statement. You can have garbage = 0; or even garbage; and it would be valid syntax (then the semantics would fail when it turns out that garbage is not in the symbol table). But, from the point of view of the grammar, what is not valid syntax is IDENTIFIER followed by IDENTIFIER. In no instance where it finds an identifier can another identifier follow. Thus the error: it's complaining about the second IDENTIFIER because that's the point where an unexpected token was found.

The same would happen if garbage was actually an existing identifier. The parser can't tell the difference between an existing symbol and a non-existing one, because that's done at a latter stage, and the syntax error fires first.

With all of that taken into account, I'd say the garbage problem is not a problem. I'll now look into the off-by one in the warnings.

XenHat commented 7 years ago

Wow! thank you for the explanation. that is very clear now.

About warnings being off by one line; if you do fix the behavior, a switch would be appreciated to toggle "start at 0" on or off, as not all editors begin at line 0. Then I can ensure the linter gets results that start at 1. SublimeLinter has a parameter to offset the hint's location (line_col_base = (1, 1) is the default as most linters do output this way, but then the line output would then be wrong (to the user):

As long as the linter behaves somewhat consistently, I can tweak the offset and modify my regex to visually fix the quirks, but I cannot have a different line_col_base for warnings and errors.

XenHat commented 7 years ago

@Sei-Lisa If perspective helps, this is the setting I have access to. http://www.sublimelinter.com/en/latest/linter_attributes.html#line-col-base. As most linters start counting at 1, all I need is a consistent behavior.

Sei-Lisa commented 7 years ago

I misunderstood, sorry. It's not off by one line, it's reporting the correct line. The SL viewer is quite unusual in starting the line with 0, that's not a convention we should follow in this case, so that part is OK.

It's currently reporting the statement that causes the problem, rather than the identifier that causes the problem. That can be fixed.

For the jump case, in lslmini.cc change the macro HERE to IN(id) as follows: Change

      ERROR( HERE, W_MULTIPLE_JUMPS_FOR_LABEL, id->get_symbol()->get_name() );

to:

      ERROR( IN(id), W_MULTIPLE_JUMPS_FOR_LABEL, id->get_symbol()->get_name() );

(sorry not to format it as a patch, I'm waiting for the resolution of #7 because a patch would conflict).

For the other one reported:

diff --git a/lslmini.cc b/lslmini.cc
index 6f82499..c3bf612 100644
--- a/lslmini.cc
+++ b/lslmini.cc
@@ -186,13 +188,13 @@ void LLASTNode::define_symbols() {

 void LLScriptDeclaration::define_symbols() {
    LLScriptIdentifier *identifier = (LLScriptIdentifier *)get_children();
-   identifier->set_symbol( new LLScriptSymbol(identifier->get_name(), identifier->get_type(), SYM_VARIABLE, SYM_LOCAL, get_lloc()) );
+   identifier->set_symbol( new LLScriptSymbol(identifier->get_name(), identifier->get_type(), SYM_VARIABLE, SYM_LOCAL, identifier->get_lloc()) );
    define_symbol(identifier->get_symbol());
 }

 void LLScriptGlobalVariable::define_symbols() {
    LLScriptIdentifier *identifier = (LLScriptIdentifier *)get_children();
-   identifier->set_symbol( new LLScriptSymbol(identifier->get_name(), identifier->get_type(), SYM_VARIABLE, SYM_GLOBAL, get_lloc()));
+   identifier->set_symbol( new LLScriptSymbol(identifier->get_name(), identifier->get_type(), SYM_VARIABLE, SYM_GLOBAL, identifier->get_lloc()));
    define_symbol(identifier->get_symbol());

    // if it's initialized, set it's constant value

Edit: It's likely that there are more. It's a lot to check, so maybe it's better to fix those as problems are detected.

XenHat commented 7 years ago

Nice, thank you. Applied the code changes, and it seems to fix the warning problem for the jump case.

I notice that @Unused has the warning cursor set on the @ instead of Unused, is this another parser gatcha because there is no recognized identifier before (such as key on the line below)?

The warning in question is WARN:: ( 8, 9): [E20009] labelUnused' declared but never used.`

Also to note that the first apostrophe is a backtick instead, breaking github code formatting . :D

Sei-Lisa commented 7 years ago

Sorry, I forgot that one. Here's the patch:

diff --git a/lslmini.cc b/lslmini.cc
index 6f82499..c3bf612 100644
--- a/lslmini.cc
+++ b/lslmini.cc
@@ -261,7 +263,7 @@ void LLScriptEvent::define_symbols() {

 void LLScriptLabel::define_symbols() {
    LLScriptIdentifier    *identifier = (LLScriptIdentifier*)get_children();
-   identifier->set_symbol( new LLScriptSymbol(identifier->get_name(), identifier->get_type(), SYM_LABEL, SYM_LOCAL, get_lloc()) );
+   identifier->set_symbol( new LLScriptSymbol(identifier->get_name(), identifier->get_type(), SYM_LABEL, SYM_LOCAL, identifier->get_lloc()) );
    define_symbol( identifier->get_symbol() );
 }

XenHat commented 7 years ago

Works!

Sei-Lisa commented 7 years ago

state is also affected. Test case:

default{timer(){}}
state other{timer(){}}

Reports the unused state problem in state rather than in the state name. Fix:

@@ -212,7 +214,7 @@ void LLScriptState::define_symbols() {
       return;

    identifier = (LLScriptIdentifier *)node;
-   identifier->set_symbol( new LLScriptSymbol(identifier->get_name(), identifier->get_type(), SYM_STATE, SYM_GLOBAL, identifier->get_lloc()) );    
+   identifier->set_symbol( new LLScriptSymbol(identifier->get_name(), identifier->get_type(), SYM_STATE, SYM_GLOBAL, identifier->get_lloc()) );
    define_symbol( identifier->get_symbol() );
 }

XenHat commented 7 years ago

Uh, isn't that patch replacing with the same?

Sei-Lisa commented 7 years ago

Global functions also affected. Test case:

integer x(){return 0;}
default{timer(){}}

Fix:

@@ -221,7 +223,7 @@ void LLScriptGlobalFunction::define_symbols() {

    // define function in parent scope since we have our own
    identifier->set_symbol(
-         new LLScriptSymbol( identifier->get_name(), identifier->get_type(), SYM_FUNCTION, SYM_GLOBAL, get_lloc(), (LLScriptFunctionDec*)get_child(1) )
+         new LLScriptSymbol( identifier->get_name(), identifier->get_type(), SYM_FUNCTION, SYM_GLOBAL, identifier->get_lloc(), (LLScriptFunctionDec*)get_child(1) )
          );
    get_parent()->define_symbol(identifier->get_symbol());
 }

Sei-Lisa commented 7 years ago

The pattern is very similar. As the header says, one affects LLScriptState::define_symbols() and the other LLScriptGlobalFunction::define_symbols().

I guess I should start making branches and submitting pull requests. I didn't want to give up my current fork from pclewis/lslint, but I guess I'll have to. This is getting too out of control.

XenHat commented 7 years ago

Ah. What I meant is that the patch for LLScriptState::define_symbols() doesn't seem to change anything.

Sei-Lisa commented 7 years ago

Sorry, I get what you mean. Yeah, definitely out of control :D

I better have some sleep now before I make more mistakes. I'll look into that tomorrow.

EDIT: Yeah, state was not affected. I definitely need some sleep.

XenHat commented 7 years ago

All right :)

Sei-Lisa commented 7 years ago

I'm unsure about the jump fix. The warning says: "Multiple jumps for label `X' - only the last will execute." It's warning about the fact that the jump statement itself won't execute. I guess that ideally it would cover the whole statement, both jump and label, and that this is not possible, but wouldn't it be more correct to point to the jump that won't execute, rather than to the label?

XenHat commented 7 years ago

True... Perhaps lslint should only return the line number without a column? SublimeLinter has a setting for this to highlight the entire line when column is missing.

Sei-Lisa commented 7 years ago

I don't think that would be adequate. Making an exception for this case seems going a bit over the top about it, not to mention that it can be a bit confusing; consider for example:

if (longish_condition) jump BreakLoop;

Marking the whole line would be somewhat confusing in this case. My vote is for leaving it as-is, reporting the jump.

XenHat commented 7 years ago

all right. makes sense when put like that. jump it is.

Makopo commented 7 years ago

Sorry for late response, I read this discussion now. I'll review the PR and merge soon. Thank you for you both effort!

Makopo commented 7 years ago

It seems ok and merged.

Makopo / lslint

Warnings cursor position is offsetted & error types have inconsistent cursor offset #9