antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

ATN cannot be deserialized in PHP-runtime when using example XML-language #4075

Open martinmolema opened 1 year ago

martinmolema commented 1 year ago

Hello, I found an issue with the PHP-target language using the supplied XML-example language. All info below.

The error seems to originate from the Unicode characters in the grammar (https://github.com/antlr/grammars-v4/blob/master/xml/XMLLexer.g4):


fragment
NameChar    :   NameStartChar
            |   '-' | '_' | '.' | DIGIT
            |   '\u00B7'
            |   '\u0300'..'\u036F'
            |   '\u203F'..'\u2040'
            ;

fragment
NameStartChar
            :   [:a-zA-Z]
            |   '\u2070'..'\u218F'
            |   '\u2C00'..'\u2FEF'
            |   '\u3001'..'\uD7FF'
            |   '\uF900'..'\uFDCF'
            |   '\uFDF0'..'\uFFFD'
            ;

The error occurs in ATNDeserializer.php, line 175 ( $characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY); ), where preg_split returns false. This behaviour is described under the u modifier: https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

Effect: the ATN cannot be deserialized, and the error only surfaces in a completely different part of the code because there is no ATN data.
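To make the failure mode concrete, here is a minimal sketch of how preg_split behaves with the u modifier. The two input strings are made up for illustration; they are not the serialized ATN from the generated lexer, and the sketch does not claim to reproduce the exact bytes that tripped the runtime.

```php
<?php
// Illustrative inputs only (assumptions, not ANTLR's serialized ATN data).
$valid   = "a\u{00B7}\u{2070}"; // well-formed UTF-8: three characters
$invalid = "a\xFFb";            // the byte 0xFF can never occur in UTF-8

// Same call shape as the line quoted from ATNDeserializer.php.
$ok = preg_split('//u', $valid, -1, PREG_SPLIT_NO_EMPTY);

// With the u modifier the subject must be valid UTF-8; otherwise nothing matches
// and preg_split reports failure by returning false.
$bad       = preg_split('//u', $invalid, -1, PREG_SPLIT_NO_EMPTY);
$errorCode = preg_last_error();

var_dump(is_array($ok) ? count($ok) : $ok);
var_dump($bad);
var_dump($errorCode === PREG_BAD_UTF8_ERROR);
```

Checking preg_last_error() right after the call is what tells a malformed subject apart from other PCRE failures.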

composer.json: { "require": { "antlr/antlr4-php-runtime": "0.5.0" } }

My Test.php:

<?php

require_once 'vendor/autoload.php';
require_once './parser/XMLParserVisitor.php';
require_once './parser/XMLParserBaseVisitor.php';
require_once './parser/XMLLexer.php';
require_once './parser/XMLParser.php';

use Antlr\Antlr4\Runtime\CommonTokenStream;
use Antlr\Antlr4\Runtime\InputStream;

use parser\XMLLexer;
use parser\XMLParser;

$expression = '<html><p>Test</p></html>';

$stream = InputStream::fromString($expression);
$lexer  = new XMLLexer($stream);
$tokens = new CommonTokenStream($lexer);
$parser = new XMLParser($tokens);

$tree = $parser->document();

Solution: The simplest fix is to remove the Unicode characters from the example grammar, but that would be too simple, since these characters probably represent valid input. Instead, a proper warning or a catchable exception that points at this problem would have saved me a lot of time.
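The kind of catchable exception asked for here could look roughly like the sketch below. The function name splitUtf8OrFail is hypothetical and is not part of the ANTLR PHP runtime; it only shows the idea of failing loudly at the point where preg_split returns false.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not an ANTLR runtime API): splits a UTF-8 string into
 * characters and throws instead of silently returning false.
 *
 * @return string[]
 */
function splitUtf8OrFail(string $data): array
{
    $characters = preg_split('//u', $data, -1, PREG_SPLIT_NO_EMPTY);

    if ($characters === false) {
        // Surface the PCRE error code so the caller can see what went wrong.
        throw new InvalidArgumentException(
            'Input is not valid UTF-8 (preg_last_error() = ' . preg_last_error() . ')'
        );
    }

    return $characters;
}

// Usage: the invalid byte 0xFF triggers the exception right at the call site.
$message = null;
try {
    splitUtf8OrFail("\xFF");
} catch (InvalidArgumentException $e) {
    $message = $e->getMessage();
}
echo $message ?? 'no error', PHP_EOL;
```

With a guard like this, the problem would be reported where the data is read, rather than later when the missing ATN is first used.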

The PHP manual says: "Five and six octet UTF-8 sequences are regarded as invalid." I can't quite understand what that means, but maybe there's a hint of a solution there.
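For context on that manual sentence: current UTF-8 (RFC 3629) encodes every code point from U+0000 to U+10FFFF in one to four bytes, while the original design also allowed five- and six-byte forms, which are now rejected. A small sketch, assuming ext-mbstring is available (mb_chr needs PHP 7.2 or later); the code points are picked from the grammar ranges plus one character outside the 16-bit range:

```php
<?php
// Byte length of the UTF-8 encoding for a few sample code points.
$codePoints = [0x41, 0xB7, 0x2070, 0xFFFD, 0x1F600];

$lengths = [];
foreach ($codePoints as $cp) {
    $lengths[sprintf('U+%04X', $cp)] = strlen(mb_chr($cp, 'UTF-8'));
}
print_r($lengths);

// A five-octet sequence: lead byte 0xF8 plus four continuation bytes.
// This is the obsolete form the manual refers to.
$fiveOctets       = "\xF8\x88\x80\x80\x80";
$fiveOctetsIsUtf8 = mb_check_encoding($fiveOctets, 'UTF-8');
var_dump($fiveOctetsIsUtf8);
```

None of the ranges in the XML grammar need more than three bytes per character, so the manual's remark about five and six octets is probably not the direct cause here; it only describes one class of input that the u modifier rejects.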

In the meantime I removed these characters as I am only parsing HTML generated by CKEditor. Testing in progress....

martinmolema commented 1 year ago

In the meantime I found the HTML grammar, which is of course a better fit than XML for parsing actual HTML, but it has the same problem. Once I got things working, it turned out to be horribly slow: a normal document can take more than 30 seconds to parse, which will not work inside an HTTP call because of time-outs.

ericvergnaud commented 1 year ago

FYI: Unicode currently defines roughly 150k characters, and each character has a unique code point that fits in a 32-bit value. UTF-16 stores the first 65,536 code points in a single 16-bit unit by simply dropping the leading zeros (code points beyond that need a surrogate pair). This is what Windows uses. UTF-8 is an encoding of the full charset, where each character takes from 1 to 4 bytes (the original design allowed up to 6). This is what Java uses for string constants.

On 15 Jan 2023 at 09:47, Martin Molema wrote:

ANTLR4 runtime using antlr-4.9.3-complete.jar, with ANTLR PHP Runtime version 0.5.0. I am stuck in a vendor lock-in with a Laravel/Lumen version that will not upgrade to PHP 8, so I am using PHP 7.4.


ericvergnaud commented 1 year ago

Can you try the latest ANTLR and runtime? It may indirectly address your issue by changing the ATN serialization format.


kaby76 commented 1 year ago

The XML grammar doesn't work out of the box with the ANTLR 4.11.1 PHP target on PHP 8. There seems to be a symbol conflict with "XMLParser", I guess in the PHP 8 runtime.

But, if in the pom.xml the grammar is renamed (XML => MyXML), along with the grammar files (XMLLexer.g4 => MyXMLLexer.g4, XMLParser.g4 => MyXMLParser.g4, with a few changes within the grammars), the generated parser appears to work. Yes, PHP 8 is required for Antlr4.11.1 PHP.

01/15-19:40:38 ~/issues/issue-2988/grammars-v4/xml
$ trgen -t PHP
C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/
CSharp  XMLLexer.g4 success 0.0587564
CSharp  XMLParser.g4 success 0.0069608
Rendering template file from PHP/build.ps1 to Generated-PHP/build.ps1
Rendering template file from PHP/build.sh to Generated-PHP/build.sh
Rendering template file from PHP/clean.ps1 to Generated-PHP/clean.ps1
Rendering template file from PHP/clean.sh to Generated-PHP/clean.sh
Rendering template file from PHP/composer.json to Generated-PHP/composer.json
Rendering template file from PHP/makefile to Generated-PHP/makefile
Rendering template file from PHP/Test.php to Generated-PHP/Test.php
Rendering template file from PHP/test.ps1 to Generated-PHP/test.ps1
Rendering template file from PHP/test.sh to Generated-PHP/test.sh
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/XMLParser.g4 to Generated-PHP/XMLParser.g4
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/XMLLexer.g4 to Generated-PHP/XMLLexer.g4
01/15-19:40:45 ~/issues/issue-2988/grammars-v4/xml
$ cd Generated-PHP/
01/15-19:40:47 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make
bash build.sh
No composer.lock file present. Updating dependencies to latest instead of installing from lock file. See https://getcomposer.org/install for more information.
Loading composer repositories with package information
Info from https://repo.packagist.org: #StandWithUkraine
Updating dependencies
Lock file operations: 3 installs, 0 updates, 0 removals
  - Locking antlr/antlr4-php-runtime (0.8.0)
  - Locking phpunit/php-timer (5.0.3)
  - Locking psr/log (3.0.0)
Writing lock file
Installing dependencies from lock file (including require-dev)
Package operations: 3 installs, 0 updates, 0 removals
  - Installing antlr/antlr4-php-runtime (0.8.0): Extracting archive
  - Installing phpunit/php-timer (5.0.3): Extracting archive
  - Installing psr/log (3.0.0): Extracting archive
Generating autoload files
1 package you are using is looking for funding.
Use the `composer fund` command to find out more!
01/15-19:40:52 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make test
bash test.sh
PHP Fatal error:  Cannot declare class XMLParser, because the name is already in use in C:\msys64\home\Kenne\issues\issue-2988\grammars-v4\xml\Generated-PHP\XMLParser.php on line 24
PHP Stack trace:
PHP   1. {main}() C:\msys64\home\Kenne\issues\issue-2988\grammars-v4\xml\Generated-PHP\Test.php:0
PHP   2. require_once() C:\msys64\home\Kenne\issues\issue-2988\grammars-v4\xml\Generated-PHP\Test.php:6
Test failed.
mingw32-make: *** [makefile:10: test] Error 1
01/15-19:40:58 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ ls
build.ps1  clean.sh       makefile  test.sh      XMLLexer.interp  XMLParser.g4      XMLParser.tokens
build.sh   composer.json  Test.php  vendor/      XMLLexer.php     XMLParser.interp  XMLParserBaseListener.php
clean.ps1  composer.lock  test.ps1  XMLLexer.g4  XMLLexer.tokens  XMLParser.php     XMLParserListener.php
01/15-19:41:08 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ trgen -t PHP
C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/
CSharp  MyXMLLexer.g4 success 0.0483575
CSharp  MyXMLParser.g4 success 0.0066586
Rendering template file from PHP/build.ps1 to Generated-PHP/build.ps1
Rendering template file from PHP/build.sh to Generated-PHP/build.sh
Rendering template file from PHP/clean.ps1 to Generated-PHP/clean.ps1
Rendering template file from PHP/clean.sh to Generated-PHP/clean.sh
Rendering template file from PHP/composer.json to Generated-PHP/composer.json
Rendering template file from PHP/makefile to Generated-PHP/makefile
Rendering template file from PHP/Test.php to Generated-PHP/Test.php
Rendering template file from PHP/test.ps1 to Generated-PHP/test.ps1
Rendering template file from PHP/test.sh to Generated-PHP/test.sh
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/MyXMLParser.g4 to Generated-PHP/MyXMLParser.g4
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/MyXMLLexer.g4 to Generated-PHP/MyXMLLexer.g4
01/15-19:48:31 ~/issues/issue-2988/grammars-v4/xml
$ cd Generated-PHP/
01/15-19:48:34 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ ls
build.ps1  build.sh  clean.ps1  clean.sh  composer.json  makefile  MyXMLLexer.g4  MyXMLParser.g4  Test.php  test.ps1  test.sh
01/15-19:48:35 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make
bash build.sh
No composer.lock file present. Updating dependencies to latest instead of installing from lock file. See https://getcomposer.org/install for more information.
Loading composer repositories with package information
Info from https://repo.packagist.org: #StandWithUkraine
Updating dependencies
Lock file operations: 3 installs, 0 updates, 0 removals
  - Locking antlr/antlr4-php-runtime (0.8.0)
  - Locking phpunit/php-timer (5.0.3)
  - Locking psr/log (3.0.0)
Writing lock file
Installing dependencies from lock file (including require-dev)
Package operations: 3 installs, 0 updates, 0 removals
  - Installing antlr/antlr4-php-runtime (0.8.0): Extracting archive
  - Installing phpunit/php-timer (5.0.3): Extracting archive
  - Installing psr/log (3.0.0): Extracting archive
Generating autoload files
1 package you are using is looking for funding.
Use the `composer fund` command to find out more!
01/15-19:48:39 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make test
bash test.sh
PHP 0 ../examples/books.xml success 0.4387135
PHP 1 ../examples/web.xml success 0.0351489
Total Time: 1.1430899
dos2unix: converting file ../examples/books.xml.errors to Unix format...
dos2unix: converting file ../examples/books.xml.tree to Unix format...
dos2unix: converting file ../examples/web.xml.errors to Unix format...
dos2unix: converting file ../examples/web.xml.tree to Unix format...
Test succeeded.
01/15-19:48:48 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP

I think the Antlr PHP runtime uses XMLParser
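One plausible source of the clash, offered as an assumption rather than a confirmed diagnosis: since PHP 8.0 the xml extension exposes its parser handle as an object of a built-in global class named XMLParser, so a generated class with the same name in the global namespace cannot be declared when that extension is loaded. The original report generated its classes with -package parser, which places them in a namespace. The sketch below uses a stand-in class (not real generated code) to show that a namespaced XMLParser does not collide with the global name.

```php
<?php
namespace parser;

// Stand-in for a generated parser class; a real one would extend the ANTLR runtime's Parser.
final class XMLParser
{
    public function name(): string
    {
        return self::class;
    }
}

// Looks up the GLOBAL class name; expected to be true on PHP 8 with ext-xml loaded.
$builtinPresent = \class_exists('XMLParser', false);

// Inside this namespace the short name resolves to parser\XMLParser.
$generated     = new XMLParser();
$generatedName = $generated->name();

echo $generatedName, PHP_EOL;
\var_dump($builtinPresent);
```

If that assumption holds, generating with a package (namespace) would be an alternative to renaming the grammar to MyXML.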

martinmolema commented 1 year ago

I need to set up a VM or something similar to use PHP 8 and then test this.

martinmolema commented 1 year ago

OK, I have updated my local dev machine to use PHP 8.1 and updated the project to use the ANTLR PHP runtime 0.8.0. Current composer.lock below.

{
    "_readme": [
        "This file locks the dependencies of your project to a known state",
        "Read more about it at https://getcomposer.org/doc/01-basic-usage.md#installing-dependencies",
        "This file is @generated automatically"
    ],
    "content-hash": "cd9b019ee661e13d2c5e0c0fdd2f17d4",
    "packages": [
        {
            "name": "antlr/antlr4-php-runtime",
            "version": "0.8.0",
            "source": {
                "type": "git",
                "url": "https://github.com/antlr/antlr-php-runtime.git",
                "reference": "7de4181629faaa4f0b9399610689cd8338c52e2c"
            },
            "dist": {
                "type": "zip",
                "url": "https://api.github.com/repos/antlr/antlr-php-runtime/zipball/7de4181629faaa4f0b9399610689cd8338c52e2c",
                "reference": "7de4181629faaa4f0b9399610689cd8338c52e2c",
                "shasum": ""
            },
            "require": {
                "ext-mbstring": "*",
                "php": "^8.0"
            },
            "require-dev": {
                "ergebnis/composer-normalize": "^2.15",
                "phpstan/extension-installer": "^1.0",
                "phpstan/phpstan": "^1.4",
                "phpstan/phpstan-deprecation-rules": "^1.0",
                "phpstan/phpstan-strict-rules": "^1.1",
                "slevomat/coding-standard": "^7.0",
                "squizlabs/php_codesniffer": "^3.6"
            },
            "type": "library",
            "extra": {
                "branch-alias": {
                    "dev-master": "0.2-dev"
                }
            },
            "autoload": {
                "psr-4": {
                    "Antlr\\Antlr4\\Runtime\\": "src/"
                }
            },
            "notification-url": "https://packagist.org/downloads/",
            "license": [
                "BSD-3-Clause"
            ],
            "description": "PHP 8.0+ runtime for ANTLR 4",
            "keywords": [
                "antlr4",
                "php",
                "runtime"
            ],
            "support": {
                "issues": "https://github.com/antlr/antlr-php-runtime/issues",
                "source": "https://github.com/antlr/antlr-php-runtime/tree/0.8.0"
            },
            "time": "2022-09-04T21:10:52+00:00"
        }
    ],
    "packages-dev": [],
    "aliases": [],
    "minimum-stability": "stable",
    "stability-flags": [],
    "prefer-stable": false,
    "prefer-lowest": false,
    "platform": [],
    "platform-dev": [],
    "plugin-api-version": "2.3.0"
}

I renamed XMLParser.g4 and XMLLexer.g4 to MyXMLParser.g4 and MyXMLLexer.g4 and updated my project includes. After that I generated the files (see the bash file below) using antlr-4.11.1-complete.jar.

# Regenerate the PHP lexer and parser from the renamed grammars.
PARSER=MyXMLParser
LEXER=MyXMLLexer

BASE_DIR=/mnt/ssd/Develop/crisisgame/newlangdef
ANTLR_DIR="${BASE_DIR}/ANTLR4"

# Stop if the project directory is missing, so rm below never runs elsewhere
cd "${BASE_DIR}" || exit 1

OUTPUT_DIR="${BASE_DIR}/parser"
NAMESPACE=parser
JAR_FILE=/usr/local/lib/antlr-4.11.1-complete.jar
LANGUAGE=PHP

# Clean the output directory (-f: no error when nothing matches)
rm -f "${OUTPUT_DIR}/${PARSER}"*
rm -f "${OUTPUT_DIR}/${LEXER}"*

export CLASSPATH=".:$JAR_FILE:$CLASSPATH"
# Generate the lexer, then copy its .tokens file next to the grammars before generating the parser
java -jar "$JAR_FILE" -Dlanguage=$LANGUAGE -no-visitor -no-listener -package $NAMESPACE -o "$OUTPUT_DIR" -Xexact-output-dir "${ANTLR_DIR}/${LEXER}.g4"
cp "$OUTPUT_DIR/${LEXER}.tokens" "$ANTLR_DIR"
java -jar "$JAR_FILE" -Dlanguage=$LANGUAGE -visitor -no-listener -package $NAMESPACE -o "$OUTPUT_DIR" -Xexact-output-dir "${ANTLR_DIR}/${PARSER}.g4"

This way the lexer and parser will run without problems. Any chance this new ATN serialisation can be incorporated in older versions so I can benefit from it using PHP 7.4?

ericvergnaud commented 1 year ago

it's very unlikely, but you can fork a PHP 7 compatible Antlr runtime and backport the serialization bits, it's not a big deal

martinmolema commented 1 year ago

Ok, so I have been able to move from Laravel/Lumen 7 to Laravel/Lumen 9 in the meantime. After reading the upgrade guides it seems I use very little that is impacted. So now my project is on PHP8.2 and I can take full advantage of the new ANTLR4 stuff.

The reason I avoided this upgrade is that some time ago I started a new Laravel project on version 9 and found it very different. I never thought to investigate the upgrade because these are often complex processes. Looking at the timestamps, the basic upgrade took me 35 minutes. Of course intensive testing still needs to be done, but the first steps look promising.

I will close this ticket for now. In the future I will revisit the HTML parsing. For now I have reverted to the original language without the HTML part in it because of performance issues. Perhaps these upgrades of Laravel/Lumen, ANTLR4 (to 4.11.1) and PHP 8 will bring significant performance improvements?

kaby76 commented 1 year ago

Perhaps these upgrades of Laravel/Lumen, ANTLR4 (to 4.11.1) and PHP 8 will bring significant performance improvements?

Note, I've been making significant modifications to the PHP runtime, but I'm still working out why the parser is still so slow in certain situations (e.g., when there's lots of ambiguity in the grammar). If and when these changes get merged, a 25% speed-up should be seen. https://github.com/antlr/antlr-php-runtime/pull/34 https://github.com/antlr/antlr-php-runtime/issues https://github.com/antlr/antlr-php-runtime/issues/36

martinmolema commented 1 year ago

Great! Looking forward to this version @kaby76