antlr / antlr-php-runtime

PHP Runtime for ANTLR4
BSD 3-Clause "New" or "Revised" License
81 stars 19 forks source link

Fix for #12 -- Allowed memory size of 134217728 bytes exhausted. #34

Open kaby76 opened 1 year ago

kaby76 commented 1 year ago

Summary

This is a fix for Issue #12. The main problem was due to some confusion over Map, Set, ===, hash(), and equals(), which is easy to get wrong because these data structures are pretty complicated. (In fact, I think I may have the equivalent() method wrong in the anonymous class in ATNConfigSet.)

Checklist

I've tested this with grammars-v4/aql. The parse completes for for.aql now, and the parse tree is the same between PHP and CSharp.

I will be checking the parser ATN trace for all grammars in grammars-v4 and examples.

kaby76 commented 1 year ago

There doesn't seem to be any assert() calls in the PHP runtime--at all. For example, here in C# tests for some sanity in the stream, but it is missing here in PHP. How can I guarantee code safe?

marcospassos commented 1 year ago

PHP does have support for assertions: https://www.php.net/manual/en/function.assert.php

But I'd recommend using exceptions for that case.

kaby76 commented 1 year ago

I'm trying to fix the ATNConfigSet configLookup data structure, and realizing that what I thought was a Map isn't a Map like Java and C#. Consequently, when I tried to update equivalent() to be something akind to pointer equivalence, nothing worked. So, I read the source code in the Antlr PHP runtime, OpenJDK HashMap, and Dotnet Dictionary.

Map uses an Equivalence, but this--strangly--inherits from Equatable.

interface Equivalence extends Equatable
{
    public function equivalent(Hashable $left, Hashable $right): bool;
    public function hash(Hashable $value): int;
}

interface Equatable
{
    public function equals(object $other): bool;
}

Now, HashMap<> in Java uses equals(); Dictionary<> in the Dotnet runtime also uses Equals(). There is no such thing as the "Equivalence" class in C# or Java. There is no equivalent() method to cloud which comparison method to use/define, and determine what is the difference between the two.

In the Antlr PHP runtime, when entering a value into a Map, it uses equivalent(), not equals() That's explains why my changes to equivalent() for phpstan broke my fix.

Can someone tell me what in the world equals() is supposed to do in relation to equivalent()? Are they different or the same? Map doesn't even use equals().

marcospassos commented 1 year ago

PHP does not have native support for logical value comparison. So, hash-based data structures need to know how to compare the value for equality. That's where Equivalence comes in.

For example, to reproduce Java’s behavior, one needs to implement an Equivalence that just delegates the check to the equals method. But it also allows to have custom logic in case the equals doesn't meet the requirement.

kaby76 commented 1 year ago

PHP does not have native support for logical value comparison. So, hash-based data structures need to know how to compare the value for equality. That's where Equivalence comes in.

For example, to reproduce Java’s behavior, one needs to implement an Equivalence that just delegates the check to the equals method. But it also allows to have custom logic in case the equals doesn't meet the requirement.

Thanks, got it.

So, I know what Equivalence.equivalent() is supposed to do, and I also know what Equivalence.hash() is supposed to do. But what is Equivalence.equals() supposed to do?

In fact, as a check, I just removed the inheritance in Equivalence to Equatable, removed the implementations for Equivalence.equals() and it all works fine. equals() is never called in a Map or Set. And that's the only places where Equivalence are used.

Actually, Equivalence.equals() is referenced in Set here https://github.com/antlr/antlr-php-runtime/blob/ac71eb377c816c886b674a7ef805b54ca881d96c/src/Utils/Set.php#L173. But, it can be removed, and it still all works fine.

I must say, phpstan and phpcs are really, really essential. Great tools..

marcospassos commented 1 year ago

Equivalence is an object as any other object. The equal method just tells whether an equivalence relation is equal to another. In other words, a relation R is said to be equal to another relation W if, and only if, they are the same class and impose the same equivalence relation. In practice, for the implementation shipped in the target, it means being in the same class (since they are final classes).

marcospassos commented 1 year ago

I must say, phpstan and phpcs are really, really essential. Great tools

PHPStan is one of the most advanced analysis tool for PHP. It provides static-time support for advanced types like generics, conditional types (like TS), and some other capabilities that even the Java compiler doesn't support. Reason why I maintain the PHP extension for Antlr.

marcospassos commented 1 year ago

In some places, we're using anonymous classes for implementing custom equivalences. That may be related to the problem. If so, it's just a matter of extracting the class into named classes.

The point is that two sets are only equivalent if they have the same elements and the same equivalence relation. That's why the equals method in Equivalence is essential.

kaby76 commented 1 year ago

I haven't yet found the bug with the circular pointer chain issue. But, I'm making my way through the g4 grammars. It seems that for many cases, the code is 20% faster, sometimes more. That said, there is some odd behavior in that some inputs for a grammar are very fast, e.g., ~0.3s, and then other inputs for the same grammar are really slow, like 5-100s depending on the grammar. I think it has to do with handling ambiguity.

For example, for agc input ASSEMBLY_AND_OPERATION_INFORMATION.agc, the parse is a reasonable 0.4s. But, for PINBALL_GAME_BUTTONS_AND_LIGHTS.agc, it takes a staggering 88s (or 120s for current "dev" branch). The lab.antlr.org tester says "line_instruction" alt 6 is very ambiguous, with "max k"=12, but executed ~2000 times.

kaby76 commented 1 year ago

The code is still having some ATN trace differences: css3, bootstrap-theme.min.css. Need to fix these first before continuing with performance.

kaby76 commented 1 year ago

I still do not know why, but I know where there's a divergence between C# and PHP. It is in the stack merging of sub-parsers (see Section 5.1 of https://www.antlr.org/papers/allstar-techreport.pdf). The parent chains between the two targets look the same, but are not. In other words, the merge caches differ between targets. The merge cache is implemented in CSharp as MergeCache, which contains a field Dictionary<PredictionContext, Dictionary<PredictionContext, PredictionContext>>. In PHP, it is implemented as DoubleKeyMap. This code is placed strangely in src/Utils/ as though it is a general data structure, but only stores PredictionContext, and only used for subparser merging (in C#, it is in the atn/ code). I am not confident whether the C# or PHP code is working correctly yet as this code is pretty complicated.

marcospassos commented 1 year ago

Maybe compare to a third, battle-tested target? I always take the Java target as the ground truth.

kaby76 commented 1 year ago

Maybe compare to a third, battle-tested target? I always take the Java target as the ground truth.

Java and CSharp ATN traces are in complete agreement (~446 MB output size).

kaby76 commented 1 year ago

I decided to just pull out the -trace option on the parsing test and just compare the parse trees between CSharp, Java, and PHP. Yes, it looks like PHP isn't working correctly sometimes. Java and CSharp are in agreement.

~/issues/issue-3462/kaby76/css3/Generated-PHP ~/issues/issue-3462/kaby76/css3
PHP 0 ../examples/mdn-at-counter-style.css success 14.5657256
Total Time: 14.6478964
dos2unix: converting file o.txt to Unix format...
~/issues/issue-3462/kaby76/css3
~/issues/issue-3462/kaby76/css3/Generated-CSharp ~/issues/issue-3462/kaby76/css3
CSharp 0 ../examples/mdn-at-counter-style.css success 0.1630958
Total Time: 0.2249674
dos2unix: converting file o.txt to Unix format...
~/issues/issue-3462/kaby76/css3
1c1
< (stylesheet ws (nestedStatement (counterStyle @counter-style (ws  ) (ident circled-alpha) (ws  ) { (ws \n  ) (declarationList (declaration (property_ (ident system) ws) : (ws  ) (expr (term (ident fixed) ws))) ws ; (ws \n  ) (declaration (property_ (ident symbols) ws) : (ws  ) (expr (term (ident Ⓐ) (ws  )) (term (ident Ⓑ) (ws  )) (term (ident Ⓒ) (ws  )) (term (ident Ⓓ) (ws  )) (term (ident Ⓔ) (ws  )) (term (ident Ⓕ) (ws  )) (term (ident Ⓖ) (ws  )) (term (ident Ⓗ) (ws  )) (term (ident Ⓘ) (ws  )) (term (ident Ⓙ) (ws  )) (term (ident Ⓚ) (ws  )) (term (ident Ⓛ) (ws  )) (term (ident Ⓜ) (ws  )) (term (ident Ⓝ) (ws  )) (term (ident Ⓞ) (ws  )) (term (ident Ⓟ) (ws  )) (term (ident Ⓠ) (ws  )) (term (ident Ⓡ) (ws  )) (term (ident Ⓢ) (ws  )) (term (ident Ⓣ) (ws  )) (term (ident Ⓤ) (ws  )) (term (ident Ⓥ) (ws  )) (term (ident Ⓦ) (ws  )) (term (ident Ⓧ) (ws  )) (term (ident Ⓨ) (ws  )) (term (ident Ⓩ) ws))) ; (ws \n  ) (declaration (property_ (ident suffix) ws) : (ws  ) (expr (term " " ws))) ; (ws \n)) } (ws \n))) <EOF>)
---
> (stylesheet ws (nestedStatement (counterStyle @counter-style (ws  ) (ident circled-alpha) (ws  ) { (ws \n  ) (declarationList (declaration (property_ (ident system) ws) : (ws  ) (expr (term (ident fixed) ws))) ws ; (ws \n  ) (declaration (property_ (ident symbols) ws) : (ws  ) (expr (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) ws))) ; (ws \n  ) (declaration (property_ (ident suffix) ws) : (ws  ) (expr (term " " ws))) ; (ws \n)) } (ws \n))) <EOF>)
marcospassos commented 1 year ago

@kaby76, I just didn't invest more time here to refine the implementation because you said there are still some differences.

If you could fix these differences, I'll gladly review, test and finish everything else. I want to make sure there are no differences left before merging this.

kaby76 commented 1 year ago

That's fine. I will come back to this after going through the tree diffing implementation for grammars-v4. We can then worry about improving performance.

marcospassos commented 1 year ago

Great! Do you have any ETA? I'm asking because I want to reserve a time slot on my side to work on this

kaby76 commented 1 year ago

I'm probably going to revisit this next week. The changes for https://github.com/antlr/grammars-v4/issues/3138 are nearly done, checking in today and tomorrow. The changes for https://github.com/antlr/grammars-v4/issues/2991 is really easy: just remaster all the .tree files and turn off any ports that aren't producing a correct tree.

kaby76 commented 1 year ago

@marcospassos Is it possible to get a release for the current 4.12.0 for the PHP runtime? That would in setting up the testing for PHP for grammars-v4 so I don't need to special case the PHP tests to use the older 4.11.x tool.

marcospassos commented 1 year ago

Sure!

marcospassos commented 1 year ago

@kaby76 done

kaby76 commented 1 year ago

Thanks!! It works!

marcospassos commented 1 year ago

@kaby76, could you please share the instructions for getting the remaining diffs? You probably didn't have time to return to this, so I'll try to take over. I just need to know what is still wrong and how to reproduce.

Also, did you try against the Java implementation that's the reference? We can't just compare PHP and C# and state that PHP is wrong. The Java target is the ground truth.

kaby76 commented 1 year ago

Sorry, I haven't had much time to recover where I was because there are so many other issues going on in grammars-v4 (stuck on the scala grammar).

The diff was in the trace for css3 as mentioned here: https://github.com/antlr/antlr-php-runtime/pull/34#issuecomment-1368313437

grammar css3 input bootstrap-theme.min.css

To test, parse trace must be on (PHP $parser->setTrace(true);, php -d memory_limit=1G Test.php; Java parser.setTrace(true);). I think I also tested using the internal atn tracing as well to check the set computations. I don't have diffs because in general the files would are gigabytes long.

To turn on atn tracing: PHP Antlr\Antlr4\Runtime\Atn\ParserATNSimulator::$traceAtnSimulation = true;; Java ParserATNSimulator.trace_atn_sim = true;). CSharp and Java always agreed, all gigabyte lengthed files. After identifying computation divergence, I used side-by-side stepping to find data structure computation errors in PHP. I used CSharp because the CSharp tools are far, far superior in almost every way over multiple Java debugging tools. For example, there is no set program counter and go in Java.

I expect to find more tree diff errors across targets for grammars-v4, but I haven't tried turning it on yet.

marcospassos commented 1 year ago

I just came up with an idea on how to help find the investigation starting point:

I'll test this approach. Provided it works, we can test it against any grammar and quickly identify under which conditions it happens.

What do you think, @kaby76?