Open martinmolema opened 1 year ago
In the meantime I found the HTML parser grammar, which of course is better than XML for parsing actual HTML, but it contains the same problem. After I got things working, it turned out to be horribly slow. A normal document can take more than 30 seconds to parse, which will not work in an HTTP call because of timeouts.
FYI: Unicode currently defines roughly 150k characters, each with a unique code point that fits in a 32-bit value. UTF-16 stores the first 65,536 code points in a single 16-bit unit and encodes the rest as surrogate pairs. This is what Windows uses. UTF-8 is an encoding of the full charset, where each character takes from 1 to 4 bytes (the original design allowed up to 6). Java uses a modified form of it for string constants in class files.

On 15 Jan 2023, at 09:47, Martin Molema wrote:

Hello, I found an issue with the PHP target language using the supplied XML example grammar. All info below.

Origin of the error seems to be the Unicode characters in the grammar (https://github.com/antlr/grammars-v4/blob/master/xml/XMLLexer.g4):

fragment NameChar : NameStartChar | '-' | '_' | '.' | DIGIT | '\u00B7' | '\u0300'..'\u036F' | '\u203F'..'\u2040' ;
fragment NameStartChar : [:a-zA-Z] | '\u2070'..'\u218F' | '\u2C00'..'\u2FEF' | '\u3001'..'\uD7FF' | '\uF900'..'\uFDCF' | '\uFDF0'..'\uFFFD' ;

The parser was generated with antlr-4.9.3-complete.jar and runs on ANTLR PHP runtime version 0.5.0. I am stuck in a vendor lock-in with a Laravel/Lumen version that will not upgrade to PHP 8, so I am using PHP 7.4.

The error occurs in ATNDeserializer.php, line 175 ( $characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY); ), which returns false. This is described under the u modifier at https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

Effect: the ATN cannot be deserialized, and this yields an error in a completely different part of the code because there is no ATN data.

composer.json: { "require": { "antlr/antlr4-php-runtime": "0.5.0" } }

My Test.php:

```
… Test';
$stream = InputStream::fromString($expression);
$lexer = new XMLLexer($stream);
$tokens = new CommonTokenStream($lexer);
$parser = new XMLParser($tokens);
$tree = $parser->document();
```

Solution: The simplest way is to remove the Unicode characters from the example, but that would be too simple. These characters probably represent valid characters. Instead, a proper warning or catchable exception with an indication of this problem would have saved me a lot of time.
The PHP manual says: "Five and six octet UTF-8 sequences are regarded as invalid." I can't quite understand what that means, but maybe there's a hint of a solution there.
In the meantime I removed these characters, as I am only parsing HTML generated by CKEditor. Testing in progress....
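The failure described above (`preg_split()` with the `u` modifier returning `false`) can be surfaced as a catchable exception at the point where it happens. A minimal sketch, assuming a hypothetical helper named `splitUtf8Chars` that is not part of the ANTLR PHP runtime; it relies only on `preg_split()`, `preg_last_error()` and `PREG_BAD_UTF8_ERROR`, all of which are also available on PHP 7.4:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of the ANTLR PHP runtime: split a UTF-8
 * string into characters and fail loudly when PCRE rejects the input.
 *
 * @return string[]
 */
function splitUtf8Chars(string $data): array
{
    // With the /u modifier, preg_split() returns false on malformed UTF-8.
    $characters = preg_split('//u', $data, -1, PREG_SPLIT_NO_EMPTY);

    if ($characters === false) {
        $code = preg_last_error();
        $reason = $code === PREG_BAD_UTF8_ERROR
            ? 'malformed UTF-8'
            : "PCRE error code $code";
        throw new InvalidArgumentException(
            "Cannot split string into characters: $reason"
        );
    }

    return $characters;
}

// Valid input: "é" is two bytes in UTF-8 but comes back as one element.
var_dump(count(splitUtf8Chars("h\u{00E9}llo"))); // int(5)

// The byte 0xFF never occurs in well-formed UTF-8, so this input is rejected.
try {
    splitUtf8Chars("abc\xFF");
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

Whether the 0.5.0 deserializer hit exactly this error code is not confirmed in the thread; the sketch only shows how such a failure could be reported where it occurs instead of in a later, unrelated part of the code.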
Can you try the latest ANTLR tool and runtime? It may indirectly address your issue by changing the ATN serialization format.
The XML grammar doesn't work directly with the ANTLR 4.11.1 PHP target on PHP 8. There seems to be a symbol conflict with "XMLParser", I guess in the PHP 8 runtime.
But if the grammar is renamed in the pom.xml (XML => MyXML), along with the grammar files (XMLLexer.g4 => MyXMLLexer.g4, XMLParser.g4 => MyXMLParser.g4, with a few changes within the grammars), the generated parser appears to work. Yes, PHP 8 is required for the ANTLR 4.11.1 PHP target.
01/15-19:40:38 ~/issues/issue-2988/grammars-v4/xml
$ trgen -t PHP
C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/
CSharp XMLLexer.g4 success 0.0587564
CSharp XMLParser.g4 success 0.0069608
Rendering template file from PHP/build.ps1 to Generated-PHP/build.ps1
Rendering template file from PHP/build.sh to Generated-PHP/build.sh
Rendering template file from PHP/clean.ps1 to Generated-PHP/clean.ps1
Rendering template file from PHP/clean.sh to Generated-PHP/clean.sh
Rendering template file from PHP/composer.json to Generated-PHP/composer.json
Rendering template file from PHP/makefile to Generated-PHP/makefile
Rendering template file from PHP/Test.php to Generated-PHP/Test.php
Rendering template file from PHP/test.ps1 to Generated-PHP/test.ps1
Rendering template file from PHP/test.sh to Generated-PHP/test.sh
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/XMLParser.g4 to Generated-PHP/XMLParser.g4
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/XMLLexer.g4 to Generated-PHP/XMLLexer.g4
01/15-19:40:45 ~/issues/issue-2988/grammars-v4/xml
$ cd Generated-PHP/
01/15-19:40:47 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make
bash build.sh
No composer.lock file present. Updating dependencies to latest instead of installing from lock file. See https://getcomposer.org/install for more information.
Loading composer repositories with package information
Info from https://repo.packagist.org: #StandWithUkraine
Updating dependencies
Lock file operations: 3 installs, 0 updates, 0 removals
- Locking antlr/antlr4-php-runtime (0.8.0)
- Locking phpunit/php-timer (5.0.3)
- Locking psr/log (3.0.0)
Writing lock file
Installing dependencies from lock file (including require-dev)
Package operations: 3 installs, 0 updates, 0 removals
- Installing antlr/antlr4-php-runtime (0.8.0): Extracting archive
- Installing phpunit/php-timer (5.0.3): Extracting archive
- Installing psr/log (3.0.0): Extracting archive
Generating autoload files
1 package you are using is looking for funding.
Use the `composer fund` command to find out more!
01/15-19:40:52 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make test
bash test.sh
PHP Fatal error: Cannot declare class XMLParser, because the name is already in use in C:\msys64\home\Kenne\issues\issue-2988\grammars-v4\xml\Generated-PHP\XMLParser.php on line 24
PHP Stack trace:
PHP 1. {main}() C:\msys64\home\Kenne\issues\issue-2988\grammars-v4\xml\Generated-PHP\Test.php:0
PHP 2. require_once() C:\msys64\home\Kenne\issues\issue-2988\grammars-v4\xml\Generated-PHP\Test.php:6
Test failed.
mingw32-make: *** [makefile:10: test] Error 1
01/15-19:40:58 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ ls
build.ps1 clean.sh makefile test.sh XMLLexer.interp XMLParser.g4 XMLParser.tokens
build.sh composer.json Test.php vendor/ XMLLexer.php XMLParser.interp XMLParserBaseListener.php
clean.ps1 composer.lock test.ps1 XMLLexer.g4 XMLLexer.tokens XMLParser.php XMLParserListener.php
01/15-19:41:08 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ trgen -t PHP
C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/
CSharp MyXMLLexer.g4 success 0.0483575
CSharp MyXMLParser.g4 success 0.0066586
Rendering template file from PHP/build.ps1 to Generated-PHP/build.ps1
Rendering template file from PHP/build.sh to Generated-PHP/build.sh
Rendering template file from PHP/clean.ps1 to Generated-PHP/clean.ps1
Rendering template file from PHP/clean.sh to Generated-PHP/clean.sh
Rendering template file from PHP/composer.json to Generated-PHP/composer.json
Rendering template file from PHP/makefile to Generated-PHP/makefile
Rendering template file from PHP/Test.php to Generated-PHP/Test.php
Rendering template file from PHP/test.ps1 to Generated-PHP/test.ps1
Rendering template file from PHP/test.sh to Generated-PHP/test.sh
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/MyXMLParser.g4 to Generated-PHP/MyXMLParser.g4
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/MyXMLLexer.g4 to Generated-PHP/MyXMLLexer.g4
01/15-19:48:31 ~/issues/issue-2988/grammars-v4/xml
$ cd Generated-PHP/
01/15-19:48:34 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ ls
build.ps1 build.sh clean.ps1 clean.sh composer.json makefile MyXMLLexer.g4 MyXMLParser.g4 Test.php test.ps1 test.sh
01/15-19:48:35 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make
bash build.sh
No composer.lock file present. Updating dependencies to latest instead of installing from lock file. See https://getcomposer.org/install for more information.
Loading composer repositories with package information
Info from https://repo.packagist.org: #StandWithUkraine
Updating dependencies
Lock file operations: 3 installs, 0 updates, 0 removals
- Locking antlr/antlr4-php-runtime (0.8.0)
- Locking phpunit/php-timer (5.0.3)
- Locking psr/log (3.0.0)
Writing lock file
Installing dependencies from lock file (including require-dev)
Package operations: 3 installs, 0 updates, 0 removals
- Installing antlr/antlr4-php-runtime (0.8.0): Extracting archive
- Installing phpunit/php-timer (5.0.3): Extracting archive
- Installing psr/log (3.0.0): Extracting archive
Generating autoload files
1 package you are using is looking for funding.
Use the `composer fund` command to find out more!
01/15-19:48:39 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make test
bash test.sh
PHP 0 ../examples/books.xml success 0.4387135
PHP 1 ../examples/web.xml success 0.0351489
Total Time: 1.1430899
dos2unix: converting file ../examples/books.xml.errors to Unix format...
dos2unix: converting file ../examples/books.xml.tree to Unix format...
dos2unix: converting file ../examples/web.xml.errors to Unix format...
dos2unix: converting file ../examples/web.xml.tree to Unix format...
Test succeeded.
01/15-19:48:48 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
I think the Antlr PHP runtime uses XMLParser
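One possible explanation for the clash, offered as an assumption rather than something this thread confirms: since PHP 8.0 the bundled xml extension defines a global class named `XMLParser` (it is what `xml_parser_create()` returns), so a generated class with that name in the global namespace cannot be declared. The sketch below checks whether the global name is taken and shows that a namespaced class with the same short name is still legal; the namespace `parser` and the empty class body are illustrative stand-ins, not generated code.

```php
<?php
// Sketch: a namespaced class may reuse a name that is taken globally.

namespace parser {
    // Stand-in for a generated parser class. Declaring it here is legal even
    // if a global \XMLParser exists, because its full name is \parser\XMLParser.
    class XMLParser
    {
    }
}

namespace {
    // Expected to be true on PHP 8 builds with the xml extension loaded
    // (an assumption about the build; the check works on any PHP version).
    $globalTaken = class_exists('XMLParser', false);

    echo 'global XMLParser defined: ', $globalTaken ? 'yes' : 'no', PHP_EOL;
    echo 'namespaced class: ', \parser\XMLParser::class, PHP_EOL;
}
```

If this is indeed the cause, generating the parser into a namespace with the ANTLR `-package` option would be an alternative to renaming the grammar.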
I need to set up a VM or something to use PHP 8 and then test this.
Ok, I have updated my local dev machine to use PHP 8.1 and updated the project to use the ANTLR PHP runtime 0.8.0. Current composer.lock below.
{
"_readme": [
"This file locks the dependencies of your project to a known state",
"Read more about it at https://getcomposer.org/doc/01-basic-usage.md#installing-dependencies",
"This file is @generated automatically"
],
"content-hash": "cd9b019ee661e13d2c5e0c0fdd2f17d4",
"packages": [
{
"name": "antlr/antlr4-php-runtime",
"version": "0.8.0",
"source": {
"type": "git",
"url": "https://github.com/antlr/antlr-php-runtime.git",
"reference": "7de4181629faaa4f0b9399610689cd8338c52e2c"
},
"dist": {
"type": "zip",
"url": "https://api.github.com/repos/antlr/antlr-php-runtime/zipball/7de4181629faaa4f0b9399610689cd8338c52e2c",
"reference": "7de4181629faaa4f0b9399610689cd8338c52e2c",
"shasum": ""
},
"require": {
"ext-mbstring": "*",
"php": "^8.0"
},
"require-dev": {
"ergebnis/composer-normalize": "^2.15",
"phpstan/extension-installer": "^1.0",
"phpstan/phpstan": "^1.4",
"phpstan/phpstan-deprecation-rules": "^1.0",
"phpstan/phpstan-strict-rules": "^1.1",
"slevomat/coding-standard": "^7.0",
"squizlabs/php_codesniffer": "^3.6"
},
"type": "library",
"extra": {
"branch-alias": {
"dev-master": "0.2-dev"
}
},
"autoload": {
"psr-4": {
"Antlr\\Antlr4\\Runtime\\": "src/"
}
},
"notification-url": "https://packagist.org/downloads/",
"license": [
"BSD-3-Clause"
],
"description": "PHP 8.0+ runtime for ANTLR 4",
"keywords": [
"antlr4",
"php",
"runtime"
],
"support": {
"issues": "https://github.com/antlr/antlr-php-runtime/issues",
"source": "https://github.com/antlr/antlr-php-runtime/tree/0.8.0"
},
"time": "2022-09-04T21:10:52+00:00"
}
],
"packages-dev": [],
"aliases": [],
"minimum-stability": "stable",
"stability-flags": [],
"prefer-stable": false,
"prefer-lowest": false,
"platform": [],
"platform-dev": [],
"plugin-api-version": "2.3.0"
}
I renamed XMLParser.g4 and XMLLexer.g4 to MyXMLParser.g4 and MyXMLLexer.g4 and updated my project includes. After that I generated the files (see bash file below) using antlr-4.11.1-complete.jar.
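#!/usr/bin/env bash
# Generate the PHP lexer and parser from the renamed grammars.
PARSER=MyXMLParser
LEXER=MyXMLLexer
BASE_DIR=/mnt/ssd/Develop/crisisgame/newlangdef
ANTLR_DIR="${BASE_DIR}/ANTLR4"
OUTPUT_DIR="${BASE_DIR}/parser"
NAMESPACE=parser
JAR_FILE=/usr/local/lib/antlr-4.11.1-complete.jar
LANGUAGE=PHP
cd "${BASE_DIR}" || exit 1
# Clean the output directory (-f: no error if nothing was generated yet)
rm -f "${OUTPUT_DIR}/${PARSER}"* "${OUTPUT_DIR}/${LEXER}"*
export CLASSPATH=".:${JAR_FILE}:${CLASSPATH}"
# Generate the lexer first, then copy its .tokens file next to the grammars
# so that the parser grammar can pick up the token vocabulary
java -jar "${JAR_FILE}" -Dlanguage="${LANGUAGE}" -no-visitor -no-listener -package "${NAMESPACE}" -o "${OUTPUT_DIR}" -Xexact-output-dir "${ANTLR_DIR}/${LEXER}.g4"
cp "${OUTPUT_DIR}/${LEXER}.tokens" "${ANTLR_DIR}"
# Generate the parser with a visitor but without a listener
java -jar "${JAR_FILE}" -Dlanguage="${LANGUAGE}" -visitor -no-listener -package "${NAMESPACE}" -o "${OUTPUT_DIR}" -Xexact-output-dir "${ANTLR_DIR}/${PARSER}.g4"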
This way the lexer and parser will run without problems. Any chance this new ATN serialisation can be incorporated in older versions so I can benefit from it using PHP 7.4?
It's very unlikely, but you can fork a PHP 7-compatible ANTLR runtime and backport the serialization bits; it's not a big deal.
Ok, so I have been able to move from Laravel/Lumen 7 to Laravel/Lumen 9 in the meantime. After reading the upgrade guides, it seems I use very little that is impacted. So now my project is on PHP 8.2 and I can take full advantage of the new ANTLR4 stuff.
The reason I avoided this upgrade is that some time ago I started a new Laravel project in version 9 and found it very different. I never thought to investigate the upgrade because these are often complex processes. Looking at the timestamps, the basic upgrade took me 35 minutes. Of course intensive testing needs to be done, but the first steps look promising.
I will close this ticket for now. In the future I will revisit the HTML parsing. For now I reverted to the original language without the HTML bit in it because of performance issues. Perhaps these upgrades of Laravel/Lumen, ANTLR4 (to 4.11.1) and PHP 8 will bring significant performance improvements?
> Perhaps these upgrades of Laravel/Lumen, ANTLR4 (to 4.11.1) and PHP 8 will bring significant performance improvements?
Note, I've been making significant modifications to the PHP runtime, but I'm still working out why the parser is still so slow in certain situations (e.g., when there's lots of ambiguity in the grammar). If and when these changes get merged, a 25% speed-up should be seen. https://github.com/antlr/antlr-php-runtime/pull/34 https://github.com/antlr/antlr-php-runtime/issues https://github.com/antlr/antlr-php-runtime/issues/36
Great! Looking forward to this version @kaby76