DEVSENSE / Parsers

https://www.devsense.com
Apache License 2.0

MacOS/Linux Specific Problem #6

Closed nabsul closed 7 years ago

nabsul commented 7 years ago

Hi,

First off, great library! I've found it extremely useful. I've run into a rather strange issue that only happens on macOS/Linux machines: the Lexer gets stuck in an infinite loop on the piece of PHP code listed below.

On Windows the parser works fine and moves on to a T_CURLY_OPEN token, but on Linux, right after "closure" => Closure, I get stuck in an infinite loop of T_ENCAPSED_AND_WHITESPACE tokens.

Any ideas why this would happen?

        $this->assertStringMatchesFormat(
            <<<EOTXT
array:24 [
  "number" => 1
  0 => &1 null
  "const" => 1.1
  1 => true
  2 => false
  3 => NAN
  4 => INF
  5 => -INF
  6 => {$intMax}
  "str" => "déjà\\n"
  7 => b"é\\x00"
  "[]" => []
  "res" => stream resource {@{$res}
%A  wrapper_type: "plainfile"
    stream_type: "STDIO"
    mode: "r"
    unread_bytes: 0
    seekable: true
%A  options: []
  }
  "obj" => Symfony\Component\VarDumper\Tests\Fixture\DumbFoo {#%d
    +foo: "foo"
    +"bar": "bar"
  }
  "closure" => Closure {{$r}
    class: "Symfony\Component\VarDumper\Tests\CliDumperTest"
    this: Symfony\Component\VarDumper\Tests\CliDumperTest {{$r} …}
    parameters: {
      \$a: {}
      &\$b: {
        typeHint: "PDO"
        default: null
      }
    }
    file: "{$var['file']}"
    line: "{$var['line']} to {$var['line']}"
  }
  "line" => {$var['line']}
  "nobj" => array:1 [
    0 => &3 {#%d}
  ]
  "recurs" => &4 array:1 [
    0 => &4 array:1 [&4]
  ]
  8 => &1 null
  "sobj" => Symfony\Component\VarDumper\Tests\Fixture\DumbFoo {#%d}
  "snobj" => &3 {#%d}
  "snobj2" => {#%d}
  "file" => "{$var['file']}"
  b"bin-key-é" => ""
]

EOTXT
            ,
            $out
        );
    }

nabsul commented 7 years ago

I've been able to simplify exposing this issue with the following snippet:

    string stringToParse = "<?php  $x = <<<EOTXT\n  \"closure\" => Closure {{$r}\n  }";
    Lexer lexer = new Lexer(new StringReader(stringToParse), Encoding.UTF8, null, LanguageFeatures.Php71Set);
    Tokens token;
    while ((token = lexer.GetNextToken()) != Tokens.EOF)
    {
        Console.WriteLine($"Type: {token.ToString()}");
    }

jakubmisek commented 7 years ago

Thank you @nabsul !

nabsul commented 7 years ago

I'm definitely interested to learn what causes this issue, and please let me know if I can help in any way. Hopefully it's not something deep within the .NET core library.

Either way, as a temporary workaround in my project (to be published this week) I'm simply aborting when TokenPosition.Start gets stuck in the same place. In my situation this is fine because the project is analytics oriented, and a few failed parses out of several thousand is no big deal.
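
For anyone wanting to try the same stopgap, here is a minimal sketch of a "stuck position" guard. It is not nabsul's actual code: the class name, method name, and the maxRepeats threshold are hypothetical, and the token source is abstracted behind a delegate so the sketch does not depend on the parser library.

```csharp
using System;
using System.Collections.Generic;

public static class StuckLexerGuard
{
    // Pulls (token, start offset) pairs from 'next' until 'eof' is returned.
    // Returns false (abort) if the start offset fails to advance for more
    // than 'maxRepeats' consecutive tokens; returns true on a clean EOF.
    public static bool TryReadAll<T>(
        Func<(T Token, int Start)> next, T eof, int maxRepeats, List<T> tokens)
    {
        int lastStart = -1;
        int repeats = 0;
        while (true)
        {
            var (token, start) = next();
            if (EqualityComparer<T>.Default.Equals(token, eof))
            {
                return true;
            }

            // Count consecutive tokens that report the same start offset.
            repeats = start == lastStart ? repeats + 1 : 0;
            if (repeats > maxRepeats)
            {
                return false; // the lexer is probably looping; give up on this input
            }

            lastStart = start;
            tokens.Add(token);
        }
    }
}
```

With the lexer from this thread, the delegate would presumably wrap lexer.GetNextToken() together with the TokenPosition.Start value mentioned above; that wiring is an assumption and is not shown here. A small threshold rather than zero leaves room for any legitimate tokens that happen to share a start offset.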

michalbrabec commented 7 years ago

Hi @nabsul, I have tested the C# snippet you sent, and it terminates properly. It reported a syntax error (unfinished heredoc), but it did not loop indefinitely. Could you please attach the original file that caused the issue, with its encoding and character set preserved? It is difficult to reproduce otherwise, because Mac files can have different line endings and sometimes a different encoding. Thanks.

nabsul commented 7 years ago

@michalbrabec It might take a day or two, but I'll whip up a .NET Core command-line project that demonstrates the issue, with a PHP file included and everything.

nabsul commented 7 years ago

@michalbrabec Here's a .NET Core project that should very easily reproduce the error I'm talking about: https://github.com/nabsul/devsense-parser-test

I followed exactly those steps to run this program on my MacBook Pro and got the infinite loop behavior. Does this run fine on your Mac?

I'm on dotnet core version 1.0.1. Can you confirm your version?

It'll take me a bit more time to dig for the exact file causing me the trouble. If you still need it I'll look for it later today.

petrroll commented 7 years ago

The issue is reproducible on WSL (Win10 1703 with Ubuntu 16.04) with dotnet --version 1.0.4.

PS: This might (or not) come in handy.

michalbrabec commented 7 years ago

Thanks @nabsul, that would help us a lot. Thanks @petrroll for the link, I will look into it.

nabsul commented 7 years ago

@michalbrabec File added to the repo: https://github.com/nabsul/devsense-parser-test

petrroll commented 7 years ago

Still reproducible on .NET Core 2.0 with up-to-date Parser (1.3.49) (on Linux via WSL)

petrroll commented 7 years ago

The token it gets stuck on is the start of the heredoc, and it's "solvable" by putting a single space between it and the \n newline.

Similarly, the FileTest case can be "fixed" by adding a single space after each <<<EOTXT. So the bug seems to lie in the lexer's end-of-line detection on Linux/macOS while it looks for the end of T_START_HEREDOC (not that surprising given the CRLF vs. LF differences).

The "fix" makes the lexer completely skip T_START_HEREDOC, however, which is kinda logical since it is illegal to have a space after it AFAIK.

nabsul commented 7 years ago

Editing the input text before feeding it into the parser seems kind of hacky. Is there really no way to solve this in the parser?

petrroll commented 7 years ago

I'm not saying it's a solution to the problem. It's most surely not. Just wanted to provide some more info for the owners to help them fix it.

While GitHub says I'm a contributor I'm definitely not the person to fix it. I've just contributed a bit of code required for the peachpie project.

On Sep 3, 2017 21:43, "Nabeel Sulieman" notifications@github.com wrote:

Editing the input text before feeding it into the parser seems kind of hacky. Is there really no way to solve this in the parser?


jakubmisek commented 7 years ago

\0 characters are "ignored" on Linux (https://github.com/dotnet/coreclr/issues/2051)

Fixed in https://github.com/DEVSENSE/Parsers/commit/d86d5e4c861cd582c898e4ad009ca1df06db48fe by doing EndsWith properly
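
For readers who hit similar behaviour, here is a minimal sketch of the kind of difference involved. It is not the code from the commit above: the class name, helper method, and sample buffer are hypothetical, and it assumes the problem came from a culture-sensitive string.EndsWith call, which on .NET Core for Linux/macOS may treat \0 as an ignorable character. An ordinal comparison checks the UTF-16 code units exactly and does not depend on the platform.

```csharp
using System;

public static class EndsWithSketch
{
    // Ordinal comparison: code units are compared one by one,
    // so a trailing '\0' is never skipped.
    public static bool EndsWithOrdinal(string text, string suffix)
    {
        return text.EndsWith(suffix, StringComparison.Ordinal);
    }

    public static void Main()
    {
        // Hypothetical buffer: a heredoc start followed by a newline and a '\0' filler.
        string buffer = "<<<EOTXT\n\0";

        // Culture-sensitive overload: the result may differ between
        // Windows and Linux/macOS, so no expected value is stated here.
        bool cultureSensitive = buffer.EndsWith("\n");

        // Ordinal overload: the buffer ends with '\0', not '\n', on every platform.
        bool ordinal = EndsWithOrdinal(buffer, "\n");

        Console.WriteLine("culture-sensitive: " + cultureSensitive);
        Console.WriteLine("ordinal: " + ordinal);
    }
}
```

Running this on both Windows and Linux is a quick way to check whether a given runtime shows the discrepancy described in the linked coreclr issue.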

nabsul commented 7 years ago

Great news, thanks @jakubmisek !