congo-cc / congo-parser-generator

The CongoCC Parser Generator, the Next Generation of JavaCC 21, which in turn was the next generation of JavaCC
https://discuss.congocc.org/
Other
36 stars 11 forks source link

Internal parsers seem to be not usable #125

Closed vsajip closed 6 months ago

vsajip commented 6 months ago

I wrote this small program to exercise one of the internal parsers:

import org.congocc.parser.Node;
import org.congocc.parser.python.PythonParser;

import java.io.IOException;
import java.nio.file.Paths;

public class PyParse {
    public static void main(String[] args) throws IOException {
        PythonParser parser = new PythonParser(Paths.get(args[0]));
        Node module = parser.Module();

        System.out.println(module);
    }
}

But when this program is run on a Python file, it fails with a ClassCastException:

$ java -cp ~/projects/congocc/congocc.jar:. PyParse pythonparser/__init__.py
Exception in thread "main" java.lang.ClassCastException: class org.congocc.parser.python.PythonLexer cannot be cast to class org.congocc.parser.CongoCCLexer (org.congocc.parser.python.PythonLexer and org.congocc.parser.CongoCCLexer are in unnamed module of loader 'app')
    at org.congocc.parser.tree.BaseNode.setTokenSource(BaseNode.java:102)
    at org.congocc.parser.python.PythonParser.openNodeScope(PythonParser.java:24609)
    at org.congocc.parser.python.PythonParser.Module(PythonParser.java:308)
    at PyParse.main(PyParse.java:10)

Here's the relevant section (in generated source, so can't link to it):

    private CongoCCLexer tokenSource;

    public CongoCCLexer getTokenSource() {
        if (tokenSource == null) {
            for (Node child : children()) {
                if (child.getTokenSource() instanceof CongoCCLexer) tokenSource = (CongoCCLexer) child.getTokenSource();
                if (tokenSource != null) break;
            }
        }
        return tokenSource;
    }

    public void setTokenSource(TokenSource tokenSource) {
        this.tokenSource = (CongoCCLexer) tokenSource;
    }

There might be a number of ways of addressing this, but I think the best person to suggest a good fix is @revusky - I'm not sure why the token source needs to be a CongoCCLexer.

revusky commented 6 months ago

Wow, that seems to be a pretty major bug in the code generation. I'm utterly confused, since I thought I had tested the internal parsers separately. Or did I?

Okay, first of all, the difference between the internal parsers and the ones generated from examples/* is that the internal parsers reuse base node API that was generated for CongoCC. This is technically necessary to be able to put Python tokens/nodes into the same tree. This is how we're going to (eventually) get INJECT working for non-Java languages. Also, more generally, it will allow us to build polyglot AST's where the nodes come from different grammars but co-exist in the same tree. So if you look at PythonInternal.ccc, it sets the ROOT_API_PACKAGE so that it can inherit from there rather than define those API's.

What I am puzzled by is that the base node class for the Python AST should be org.congocc.parser.python.tree.PythonNode, I think. Note how no PythonNode is being generated, but a PythonToken is. Hold on. Surely I can get at the root of this problem...

revusky commented 6 months ago

It seems like the line that is causing the problems is this one: https://github.com/congo-cc/congo-parser-generator/blob/b94681c3ba0cbdc734e3b9c3cb5aadeceaeb08f3/src/grammars/PythonInternal.ccc#L5

If you comment that line out, then the PythonNode is generated and the various concrete node subclasses in org.congocc.parser.python.ast subclass from that. And the strange thing is that if you look at CSharpInternal.ccc the corresponding line is not there. And if you generate the parser from that grammar, it generates a CSNode.java (corresponding to the fact that the BASE_NAME is CS and the various tree node classes descend from that.

So, basically, comment out line 5 in PythonInternal.ccc and see if you can get it working.

vsajip commented 6 months ago

OK. We should probably have some smoke tests in the core tests suite which exercise the internal compilers, at least minimally.

revusky commented 6 months ago

So, you got it working, no?

Yes, there should be a test of this -- though, ultimately, we'll be testing the internal parsers by having INJECT working and so on. But, yes, of course, we should test them meanwhile. I'm pretty sure I was testing the internal parsers at some point, like I switched from testing the examples/python parser to the one generated for internal use, the difference being that the one for internal use does not generate its own Node.java etc. but reuses the one generated for the CongoCC parser. But, for some reason, I went back to using the one in examples/*. I don't know how that erroneous line got in there. Maybe I was experimenting and unintentionally committed the line, which, of course, doesn't work.

vsajip commented 6 months ago

Yes, that worked - though I had to do an ant clean first, so perhaps the dependencies need checking.

vsajip commented 6 months ago

I've made some changes and a pull request #127 based on that - the Python internal parser doesn't throw up any errors, but the C# one does - @revusky could you take a quick look and see if you can spot any flaws in my testing logic?

revusky commented 6 months ago

Latest commit should address the problem. The issue is that both the settings USES_PREPROCESSOR and ENSURE_FINAL_EOL have to be set for the preprocessor functionality to work. The latter one ENSURE_FINAL_EOL is necessary because, on occasion, a file ends with #endif and there is no final newline. But anyway, the base CSharp.ccc specified those settings and they need to be repeated in the CSharpInternal.ccc because they won't be picked up from an included grammar.

vsajip commented 6 months ago

The issue is that both the settings USES_PREPROCESSOR and ENSURE_FINAL_EOL have to be set for the preprocessor functionality to work

Should we ensure, then, that the use of the former automatically enables the latter?

revusky commented 6 months ago

The issue is that both the settings USES_PREPROCESSOR and ENSURE_FINAL_EOL have to be set for the preprocessor functionality to work

Should we ensure, then, that the use of the former automatically enables the latter?

Well, maybe. I was actually just thinking that maybe ENSURE_FINAL_EOL could be true by default, regardless of whether USES_PREPROCESSOR is set. So I'm thinking about the practical question of how often you really want to have the final line in a text file not end in a newline. Of course, much of the time, one might be indifferent to it, but the fact remains that there are these rather annoying situations where you have line-based logic and the fact that the final line in the file may or may not end with a newline... So, just quietly adding the trailing newline if it's not there might just save people some bother really. Of course, if you take the purist position that the input to the parser should be character for character what is in the file, so if the final newline is not there... But actually, we already violate that "principle" since I made PRESERVE_LINE_ENDINGS false by default, which means that, by default, all the line endings are normalized into `\n'. I just tend to think that we should try to have defaults that tend to cause less trouble for people on average, if you see what I mean...