alan-if / alan

ALAN IF compilers and interpreters
https://alanif.se

Compiler Bugs with Block Comments #30

Closed tajmone closed 3 years ago

tajmone commented 3 years ago

@thoni56, I've come across some odd bugs with block comments.

Using ALAN 3.0beta8 build 2207 under Win 10, tested with both CMD and Bash for Windows (same result); source file is UTF-8 BOM, using native CRLF EOL.

Compiled using both alan sample.alan and alan -encoding utf8 sample.alan, same results (the problem is not autodetection of UTF-8 via the BOM).

The problem is due to the source file having CRLF line endings: if I switch to Unix-style LF it works fine (but only if I add a blank line at the beginning).

So there seem to be two distinct problems at play here: incorrect handling of CRLF EOL (even under CMD), and a bug when a source file starts with a block comment.

Below are the actual error reports. They don't really pinpoint the problem, but they might help you gain insight into what goes on behind the scenes...

Error 1

With a source file starting with this block comment:

//// "The Barracks" | Alan Beta8 ///////////////////////////////////////////////
A sample ALAN adventure, by Tristano Ajmone,
////////////////////////////////////////////////////////////////////////////////

I get the following compiler errors:

sample.alan

    1.  //// "The Barracks" | Alan Beta8 ///////////////////////////////////
        ////////////
=====>  1    2              3 4

  *1*   103 E : Syntax error. Replacing "/ / / /" with "start here .".
  *2*   211 E : Adventure must start at an instance inheriting from 'location'.
  *3*   102 F : Syntax error. Ignoring "Unknown Token".
  *4*   102 E : Syntax error. Ignoring "<identifier> <identifier> / / / ...".

    3.  ///////////////////////////////////////////////////////////////////////
        /////////
=====>  1

  *1*   155 E : Unterminated block comment. Must end with a line consisting of
                at least four slashes and nothing but slashes.

        5 error(s).
        No detected warnings.
        2 informational message(s).

Note that the error at line 1 contains a  char, so it seems

Error 2

If I add an empty line at the source start, right before the comment block:


//// "The Barracks" | Alan Beta8 ///////////////////////////////////////////////
A sample ALAN adventure, by Tristano Ajmone,
////////////////////////////////////////////////////////////////////////////////

the compiler error changes slightly:

sample.alan

    2.  //// "The Barracks" | Alan Beta8 //////////////////////////////////////
        /////////
=====>  1

  *1*   155 E : Unterminated block comment. Must end with a line consisting of
                at least four slashes and nothing but slashes.

   66.
=====>  1

  *1*   101 E : Syntax error. Inserting "start here ." before this token.
  *1*   211 E : Adventure must start at an instance inheriting from 'location'.

        3 error(s).
        No detected warnings.
        1 informational message(s).

tajmone commented 3 years ago

As for the error:

  *1*   155 E : Unterminated block comment. Must end with a line consisting of
                at least four slashes and nothing but slashes.

It seems like the compiler considers the CR that precedes LF in the EOL as an additional character, making the closing delimiter invalid.
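
A minimal Python sketch of that hypothesis follows. It is not the ALAN compiler's code: `is_closing_delimiter` is a hypothetical helper that applies the rule from the error message (at least four slashes and nothing but slashes) to a line obtained by splitting on LF only, which leaves the CR attached.

```python
def is_closing_delimiter(line: str) -> bool:
    """True if the line is at least four slashes and nothing but slashes."""
    return len(line) >= 4 and set(line) == {"/"}


# A block comment saved with CRLF line endings.
source = "//// title ////\r\nA comment line\r\n////////\r\n"

# Splitting on LF only leaves a trailing CR on every line.
lines = source.split("\n")
closing_raw = lines[2]                    # "////////\r"
closing_clean = closing_raw.rstrip("\r")  # "////////"

print(is_closing_delimiter(closing_raw))    # False: the CR is not a slash
print(is_closing_delimiter(closing_clean))  # True
```

With the CR still attached, the last line no longer consists of slashes only, which would be consistent with the "Unterminated block comment" report.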

thoni56 commented 3 years ago

After just a brief look, I think this is caused by the regex for block comments including a leading newline, as they should only be allowed in the first column. But this does not fly on the first line of a file.

If you start the block comment after an empty line it compiles. I'll figure out a way to handle this case.

I'll investigate the CRLF problems.
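
The first-line problem can be sketched in Python. Both patterns below are hypothetical stand-ins, not the expressions used in the ALAN scanner: one requires a newline before the four slashes, the other also accepts the start of the input. `opens_comment` is likewise an illustrative helper.

```python
import re

# Hypothetical pattern: the opening delimiter is only recognized right after a newline.
NEEDS_NEWLINE = re.compile(r"\n/{4}[^\n]*")
# Hypothetical alternative: accept either the start of input or a preceding newline.
START_OR_NEWLINE = re.compile(r"(?:\A|\n)/{4}[^\n]*")

comment_at_top = "//// title ////\ntext\n////\n"
comment_after_blank = "\n" + comment_at_top


def opens_comment(pattern: re.Pattern, source: str) -> bool:
    """Report whether the pattern matches at the very start of the source."""
    return pattern.match(source) is not None


print(opens_comment(NEEDS_NEWLINE, comment_at_top))       # False
print(opens_comment(NEEDS_NEWLINE, comment_after_blank))  # True
print(opens_comment(START_OR_NEWLINE, comment_at_top))    # True
```

A comment on the very first line has no newline in front of it, so the first pattern cannot match there, while a leading blank line supplies the missing newline.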

thoni56 commented 3 years ago

This should be fixed in build 2209.

tajmone commented 3 years ago

> After just a brief look, I think this is caused by the regex for block comments including a leading newline, as they should only be allowed in the first column. But this does not fly on the first line of a file.

In my ALAN syntaxes I've used the following RegExs for the opening and closing delimiters: ^\/{4}.*$ and ^\/{4,}$, relying on the line beginning anchor ^ instead of the \n.

> I'll investigate the CRLF problems.

If the RegEx uses \n to match the line-end, then it might fail with CRLF in some RegEx engines (including PCRE), for \n might match LF only, which would make the preceding CR an extra char that disqualifies the closing delimiter. Using $ (or [$\n]) should be safer.
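
How the anchors behave with CRLF depends on the engine. As one data point, here is a sketch using Python's `re` module (not PCRE, and not the ALAN scanner generator) with the two delimiter patterns quoted above. `CLOSE_CR_TOLERANT` is an added variant, shown only as one possible way to accept an optional CR.

```python
import re

# The opening and closing delimiter patterns quoted above, in multiline mode.
OPEN = re.compile(r"^/{4}.*$", re.MULTILINE)
CLOSE_STRICT = re.compile(r"^/{4,}$", re.MULTILINE)
# Added variant (an assumption, not from the thread): allow an optional CR.
CLOSE_CR_TOLERANT = re.compile(r"^/{4,}\r?$", re.MULTILINE)

lf_text = "//// title ////\nbody\n////////\n"
crlf_text = lf_text.replace("\n", "\r\n")

print(CLOSE_STRICT.search(lf_text) is not None)         # True
print(CLOSE_STRICT.search(crlf_text) is not None)       # False
print(CLOSE_CR_TOLERANT.search(crlf_text) is not None)  # True
print(OPEN.search(crlf_text) is not None)               # True
```

In this engine `$` matches just before an LF, so a CR in front of it still counts as an extra character on the closing line. The opening pattern keeps working because `.*` swallows the CR, which matches the reported symptom of a comment that opens but never terminates.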

thoni56 commented 3 years ago

This is in the scanner generator so not all "standard" regex symbols are supported.

tajmone commented 3 years ago

I confirm that now everything is working as expected!

> This is in the scanner generator so not all "standard" regex symbols are supported.

I imagined so. No idea how you worked around the lack of a ^ then.

I would have thought that the scanner would strip away the EOL sequence, or at least normalize CRLF to LF, since dragging around the extra CR could potentially break up things in various places.

thoni56 commented 3 years ago

> I confirm that now everything is working as expected!

Good! Thanks.

> This is in the scanner generator so not all "standard" regex symbols are supported.
>
> I imagined so. No idea how you worked around the lack of a ^ then.

Programming! ;-)

> I would have thought that the scanner would strip away the EOL sequence, or at least normalize CRLF to LF, since dragging around the extra CR could potentially break up things in various places.

The scanner is not a text processor but a tokenizer, and there are no tokens containing any newline characters, so there are no CRs to be dragged around. All CRs and LFs are effectively stripped from the input. Strings, which may span lines, are stripped of them before being returned as tokens to the parser.
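
As a rough illustration of that idea (a toy tokenizer, not the ALAN scanner), the sketch below treats CR and LF as ordinary whitespace between tokens and cleans line breaks out of string tokens. The `TOKEN` pattern and the `tokenize` function are hypothetical, and replacing a line break inside a string with a single space is an assumption made here for readability.

```python
import re

# Hypothetical token pattern: quoted strings, words/numbers, single punctuation marks.
TOKEN = re.compile(r'"[^"]*"|\w+|[^\s\w]')


def tokenize(source: str) -> list[str]:
    """Return tokens; whitespace (including CR and LF) never reaches the output."""
    tokens = []
    for match in TOKEN.finditer(source):
        text = match.group()
        if text.startswith('"'):
            # Assumption: a line break inside a string becomes one space.
            text = re.sub(r"\s*[\r\n]+\s*", " ", text)
        tokens.append(text)
    return tokens


lf_source = 'The room.\n  Description "A dusty\n  barracks."\n'
crlf_source = lf_source.replace("\n", "\r\n")

print(tokenize(crlf_source))
print(tokenize(lf_source) == tokenize(crlf_source))  # True
```

Because no token can contain a CR or LF, the LF and CRLF versions of the same source produce identical token lists.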

There are technical reasons why the file cannot be read in "text mode" (which would otherwise automatically convert any encountered CRLF to \n). Reading it in binary mode therefore exposes the CR to the scanner (which will match and remove it).
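
The difference between the two reading modes can be seen with Python's `open`, used here only as an analogy for the compiler's file reading. The temporary file and its contents are made up for the demonstration.

```python
import os
import tempfile

# A small source file written with CRLF line endings.
raw = b"//// title ////\r\nbody\r\n////////\r\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".alan") as handle:
    handle.write(raw)
    path = handle.name

try:
    # Binary mode: bytes arrive exactly as stored, CR included.
    with open(path, "rb") as f:
        binary_view = f.read()
    # Text mode with default newline handling: CRLF is translated to "\n".
    with open(path, "r", encoding="utf-8") as f:
        text_view = f.read()
finally:
    os.remove(path)

print(b"\r" in binary_view)  # True
print("\r" in text_view)     # False
```

A reader working in binary mode has to deal with the CR itself, which is the situation described above.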