[Feature debate] Remove column constraint on assignments and macro calls

ISSOtm commented 4 years ago

We've all had this before.

my_macro: MACRO
    PRINTT "Wheeeeeeee"
ENDM

my_macro

ERROR: a.asm(5):
    'my_macro' already defined in a.asm(1)

 label:

ERROR: b.asm(1):
    Macro 'label' not defined

Countless users, newbies or not, have been bitten by this before. It would be much better if labels could be declared at not-column-1, and macros called at column 1.

HOWEVER! This is there for a reason: RGBASM allows declaring labels without a colon. This cannot be distinguished from a macro call, so the column is used to decide. In practice, labels are usually defined using colons, lifting the ambiguity; there is an exception for local labels, but the presence of a dot removes the ambiguity too (see #423). Even if the dot is in the middle of the label, it's still unambiguous.

tl;dr: the column restriction can be removed, potentially saving a lot of headaches, but then it will break code that declares labels without colons. Is this worth it?

PS: I realized that #362 being fixed may remove a use case for the feature in sights., so that's good.

pinobatch commented 4 years ago

If anything, this would reduce the need for my dedenter script (which was the motivation for #305).

it will break code that declares labels without colons.

Only if a label shares a name with a macro or mnemonic, correct? Otherwise, a construction analogous to ca65 .feature could be used to make the removal of indentation sensitivity opt-in.

Would the following decision rules be helpful?

Colon or period at expected point in the line: Assume label (though watch out for ambiguity caused by floating-point literals)
Column 1: Assume label first, then macro or mnemonic
Other columns: Assume macro or mnemonic first, then label

ISSOtm commented 4 years ago

I looked into the yacc conflicts, and this would solve one (related to set being either an instruction or a variable creation).

aaaaaa123456789 commented 4 years ago

I'd go one step further and completely disallow colonless non-locals. You might want to have a compatibility mode for a while, but the benefits of this alleged feature are so minimal and the fix is so trivial that you might as well get rid of it.

ISSOtm commented 4 years ago

Maybe I wasn't clear enough, but I am in favor of removing colonless non-local label declarations. My question is more about how much usage it is currently getting.

It might be possible to warn users of colon-less label declarations for a while, but I'm sadly not sure how to implement it on a grammar level.

AntonioND commented 4 years ago

Can't you print the warning when you add the label to the list of symbols?

ISSOtm commented 4 years ago

How so? Isn't the presence of the colon detected at parsing time?

AntonioND commented 4 years ago

Isn't the colon saved along with the rest of the string? I have never actually checked.

ISSOtm commented 4 years ago

Nope.

I have looked more into this, and it appears that a big chunk of the problem is that the lexer returns a different token depending on the column number. I have tried changing this for a quick try, and even our own test suite fired back—it uses a ton of colon-less labels. Considering this would be a big change and we already have a bunch of PRs open (and I don't want to have to resolve conflicts), I rolled back everything.

ISSOtm commented 4 years ago

This is actually impossible to perform with the current grammar, if only as the following proves:

; Define a variable
variable equ 0

uniqdef: MACRO
variable\@ \1 \2
ENDM
; Call a macro
uniqdef equ 0

There is an ambiguity in the language itself—not just in the grammar.

@aaaaaa123456789 suggested replacing the X equ VAL syntax with equ X VAL (which is apparently more common in other assemblers); both should be able to be supported for at least one version, with deprecation warnings.

I implemented this for a shot, and we have a problem: set. This token is highly ambiguous (is it an instruction with a label declaration, or a variable definition? Even moving it to the beginning of the line keeps the ambiguity until you see a comma). While technically the syntax is unambiguous, Yacc considers it a conflict. If there is a way to tell it how to lift the ambiguity, I would love to hear it. If that's not possible, set as a variable declarator could be deprecated entirely in favor of =.

aaaaaa123456789 commented 4 years ago

As I said before, when you break existing code, it doesn't really matter by how much you break it. Something like @set foo, 42 will do.

ISSOtm commented 4 years ago

If we break it, might as well remove set entirely, given how confusing it is.

AntonioND commented 4 years ago

If we break it, might as well remove set entirely, given how confusing it is.

Agreed.

ISSOtm commented 4 years ago

After more implementation tweaking and discussion, it looks like this might be a hard barrier for RGBASM.

Here is a summary of what transpired:

Goal

The reason why this fix is wanted is because column-1 rules create a newbie trap, with obscure error messages (ranging from the infamous terse syntax error to macro not defined), and the limit seems really arbitrary anyways.

The problem that invariably arises when attempting to remove this limitation is a bunch of grammar conflicts. But those occur because of ambiguities in the language itself! Consider the following line:

sym EQU val

Under current RGBASM, this is without a doubt defining immutable variable sym. But if we lift the restriction, is it still a variable definition, or are we calling macro sym with argument EQU val? Sure, one could be prioritized over the other if needed, but what if the user wants the other one for whatever reason? Worse still—disambiguation is impossible, because the lexer state is switched as soon as the T_ID for the macro invocation is encountered. So lookahead is impossible, and either macro invocations or symbol definitions are possible, but not both.

Worse still, label declarations themselves become unreliable. Consider this line:

 bar:
; Note the whitespace!

Under current RGBASM, this does call macro bar with argument : (obviously: the lexer sees the T_ID token ending at the colon). But it's non-ambiguous, given that the whitespace enforces this as a macro call. But if the restriction is lifted, then...

Solutions

(╯°□°)╯︵-┻━┻

Maintain the statu quo. After all, it's not that bad, right? Maybe the error messages can be improved. (Especially the syntax error, but that one has a broader scope)

Label rework

A suggested resolution would be to require colons to be stitched to the T_ID, and if so turn it into a T_LABEL [getting rid of colon-less labels]. This solves the issue with defining labels, but not the other one. This could be fixed by changing the definition syntax: EQU sym val. (The current syntax would be deprecated using a warning for at least one release, also allowing to spot even placed where the definition is generated, eg. in a macro).

The two parts I don't like about this are more mucking with the lexer, which I would be okay with if it was a Flex definition, but it's not, it's hand-modified generated C now. (~~I should open an issue about my plans regarding this, but I digress.~~ See #485) The second part is that the macro invocations are too "broad", and cause conflicts as soon as a T_ID is encountered first thing on a line. (Or after a label.)

Macro invocation token

The solution to that would be to require all macro calls to be prefixed with a special token (I was thinking of a #, for example). The upside is that all is disambiguated in one fell swoop, the downside would be the number of macro call sites that would require changing in large projects.

As with the previous option, there would be a need for changing a lot of locations in the code, but I believe that the bulk could be automated by parsing the warning message output. I would personally prefer this option because it allows more leniency in future extensions to the grammar.

Alternatives

@pinobatch made ca83, a macro pack for ca65 that allows it to process RGBDS-inspired syntax (still using ca65 features, expression syntax, macro invocations, and so on). There is also HGBASM by @Hawkbat, with very good compatibility. If someone isn't comfortable with RGBDS' limitations it's possible to look for alternatives, without going as far as switching to eg. wla-dx.

As for the specific issue we're dealing with, @pinobatch also made a preprocessor that, assuming all label declarations end with a colon, de-indents them to column 1 where RGBASM can process them.

Considering alternatives exist, do we want to break compatibility to fix this?

AntonioND commented 4 years ago

A suggested resolution would be to require colons to be stitched to the T_ID, and if so turn it into a T_LABEL [getting rid of colon-less labels].

I actually like this idea. It's really annoying to fix code that doesn't comply with the new syntax, but it can be fixed automatically for the most part, and it gets rid of that inconsistency (I never liked the idea that you can omit the colon from labels).

The solution to that would be to require all macro calls to be prefixed with a special token

This is annoying, and not a great solution because many macros try to hide the fact that they are macros. For example, several codebases have macros to emulate djnz by having two instructions. Needing a token would break this illusion. On the other hand, maybe it's a good idea to not hide things...

Overall I think this is too restrictive.

ISSOtm commented 4 years ago

I personally heavily disagree with pseudo-op macros, but people should be able to use them if they want to. So then we'd have to fall back to integrating colons in labels (which is something I'm implementing as part of re-doing #437 better), which means editing the lexer and thus this is blocked by #485 at least for me.

AntonioND commented 4 years ago

Oh, yeah, I dislike the pseudo-op macros, for sure, but people are going to use them anyway so why not allow them...

ISSOtm commented 4 years ago

To clarify the commit just above: after discussion in #423, it was decided that only labels could be scoped. (If we want to switch to more general, e.g. ca65 scoping, we'll consider this later)

Thus, it made the T_LOCAL_LABEL token kind of redundant T_LOCAL_ID; thus, the "column-1" restriction has been lifted on those. (Though, this might not work with the Global.local form, I should check.)

JL2210 commented 4 years ago

Similar happens for SIZE = 11; you can't have whitespace before it or it's treated as a macro.

JL2210 commented 4 years ago

If this is fixed, some tutorials might have to be changed, like https://eldred.fr/gb-asm-tutorial/syntax.html and maybe https://teamlampoil.se/book/gbasmdev.pdf

Rangi42 commented 3 years ago

The column constraint has been removed on labels, but not on constant definitions or macro calls. Relevant discussion is in #635.

Rangi42 commented 3 years ago

Taking this as the "new prefix-keyword constant definition syntax" issue:

As I said in Discord #asm, I'd prefer a prefix DEF syntax to the other suggestions. EQU FOO 42 would change the order of an existing keyword, and reads awkwardly compared to "foo equals 42". EQU FOO = 42 and SET FOO = 42 would be worse, combining the previously-different-meaning EQU and =. Whereas prefix DEF goes with 0.5.0's REDEF, and could easily be added to existing code:

DEF FOO EQU 42
DEF BAR EQUS "spam"
REDEF BAR EQUS "eggs"
DEF x = 10
REDEF x = x + 1
DEF ID RB 3
DEF WORD RW
DEF square(x) = x**2  ; upcoming user-defined functions

Regarding =, I think it would be worth allowing that without a DEF or REDEF. Constants are defined only once, but SET values often get redefined and are used for quick small things like counters in a REPT block. The 0.5.0 lexer already checks for a trailing : to tell labels apart from macros, so it could also check for a single =. This would mean that mac = 42 gets lexed as an assignment, not a macro usage, but there are plenty of workarounds: mac /*hack*/ = 42, or mac {hack} = 42 with hack EQUS "", or mac \ = 42 (line continuation). (And it's very unusual to need such a macro anyway.)

ISSOtm commented 3 years ago

Actually, {hack} would not work, but /* hack */ and consorts may not, depending on future changes and/or refactorings. A more reliable workaround would be the addition of a dummy first argument (possibly empty), that then gets either ignored, or shifted away.

aaaaaa123456789 commented 3 years ago

You once suggested adding a character to indicate a macro call. While I'm strongly against requiring such a character, adding an optional character to signal a macro call would solve all of these problems — just mark the line as a macro call and there's no parsing ambiguity. Since all of the parsing ambiguities like this one are corner cases, that's not really an issue.

Rangi42 commented 3 years ago

mac = ... is a rare enough use case that, if ident = ... is special-cased to lex as T_LABEL T_OP_EQUALS ... assignment instead of T_ID T_OP_EQUALS ... macro usage, then I'd rather not build new punctuation syntax into the language just for that. At least one of the workarounds (/*hack*/, \ line continuation, dummy first argument) should always be possible if you really need to pass = to a macro.

Edit: a workaround could also be encapsulated in its own macro which would serve the same purpose as such a punctuation character.

macro apply
    def ___mac equs "\1"
    shift
    {___mac} /*hack*/ \#
    purge ___mac
endm

macro mac
    println "6 * 7 \1"
endm

; would print "6 * 7 = 42"
apply mac = 42

Edit 2: Then again, there's no use case for character escapes outside of strings, and \mac = 42 could unambiguously be handled in normal mode. The backslash could fall into the default case to handle identifiers that's already below it, after setting a local flag to make it not be returned as a T_LABEL.

        /* Handle escapes */

        case '\\':
            c = peek(0);

            if (c == ' ' || c == '\r' || c == '\n') {
                readLineContinuation();
                break;
            } else if (c == EOF) {
                error("Illegal backslash at end of input\n");
                break;
            } else {
                preventLabel = true;
            }
            // fallthrough

        /* Handle identifiers and escapes... or error out */

        default:
            if (startsIdentifier(c)) {
                ...

                if (tokenType == T_ID && !preventLabel &&
                    (lexerState->atLineStart || peek(0) == ':'))
                    return T_LABEL;

                return tokenType;
            }

Even before removing the column constraint, this would be potentially useful to pass a first argument starting with : without whitespace:

macro mac
    println "<\#>"
endm

    mac + hello  ; prints <+ hello>
    mac : hello  ; this would be a label followed by an identifier
    \mac : hello ; would print <: hello>

ISSOtm commented 3 years ago

Despite that b809384 mentions closing this issue only once the old syntax is removed, it's one of the core components of RGBASM, and thus is scheduled to stay for a while. Thus, I'll be closing this, and we'll open a separate issue to discuss deprecation and/or removal of the old syntax when we get there.

As far as I'm concerned, due to how breaking a change it would be to remove that syntax, I would prefer to keep it as long as it doesn't become a significant blocker / maintenance burden. (As it happens now, it's just T_LABEL support in the lexer, and a few extra parser rules: not much.)

Rangi42 commented 3 years ago

I'm also fine with keeping the old syntax for EQU/EQUS/=/etc, now that DEF allows them to be defined with indentation.

The only limitation of keeping them around is that it prevents macros from being invoked at column 1, but there's less demand for that than there is for indented definitions anyway.

ISSOtm commented 3 years ago

Only thing to keep track of is updating the FAQ for 0.5.0, to mention the new syntax for macro definitions, and the fact that they fix the problem.

Rangi42 commented 3 years ago

Both the old and new syntaxes for defining macros, mac: MACRO and MACRO mac, can be indented. But then mac must be indented to use it.

gbdev / rgbds