(RFC) Serializable ParseTree

HSorensen commented 9 years ago

I was looking into serializing the ParseTree and much to my surprise only 7 classes need to implement the Serializable interface:

public class RuleContext implements RuleNode, Serializable { }
public class TerminalNodeImpl implements TerminalNode, Serializable { }
public abstract class ATNSimulator implements Serializable { }
public abstract class Lexer extends Recognizer<Integer, LexerATNSimulator> implements TokenSource, Serializable { }
public class CommonTokenFactory implements TokenFactory<CommonToken>, Serializable { }
public class ANTLRInputStream implements CharStream, Serializable { }
public class IntegerList implements Serializable { }

With these changes it becomes straight forward to store the parse tree:

        parser.setBuildParseTree(true);
        ParseTree tree = parser.program();
        OutputStream outputStream = new FileOutputStream("program.parsetree");
        ObjectOutputStream objectOutputStream = new ObjectOutputStream(outputStream);
        objectOutputStream.writeObject(tree);
        objectOutputStream.close();

and retrieving the parse tree

        InputStream inputStream = new FileInputStream("program.parsetree");
        ObjectInputStream objectInputStream = new ObjectInputStream(inputStream);
        ParseTree tree=(ParseTree) objectInputStream.readObject();
        objectInputStream.close();

Some background: I need to run several static source code analysis using many different tree visitors and tree walkers, some of the analysis if done in foreground other in background. The language I am parsing supports inclusion of source files, and it is common to have literally 100's of files that all need to be parsed together, with the serialized parse-tree there is just one file to maintain. Alternatively to serializing the parse tree I would have to maintain all these files together and recreate the parse tree every time I need it.

I realize that this feature probably make most sense for the use case that need to support inclusion of source code see my ANTLR fork or commit ddf9a331

That said if there is interest I will prepare a new branch and pull-request including test cases etc.

HSorensen commented 9 years ago

When reusing a compressed parse tree it can be significant faster to use the serialized parse tree instead of recreating the parse tree from the source code:

Creating the parse tree: 31s
Compressed parse tree (2.3mb) write 21s / read 2s
Uncompressed parse tree (25.6mb) write 88s / read 22s

Using the Apache Commons Compression routines can yield really good compression ratio of 90%. Write compressed parse tree

        parser.setBuildParseTree(true);
        ParseTree tree = parser.program();
        OutputStream xzOutputStream = new FileOutputStream("program.parsetree.xz");
        CompressorOutputStream xzOut = new CompressorStreamFactory().createCompressorOutputStream(CompressorStreamFactory.XZ, xzOutputStream);
        ObjectOutputStream xzObjectOutputStream = new ObjectOutputStream(xzOut);
        xzObjectOutputStream.writeObject(tree);
        xzObjectOutputStream.close();

and retrieve a compressed parse tree

        InputStream inputStream = new FileInputStream("program.parsetree.xz");
        CompressorInputStream input = new CompressorStreamFactory().createCompressorInputStream(CompressorStreamFactory.XZ, inputStream);
        ObjectInputStream objectInputStream = new ObjectInputStream(input);
        ParseTree tree=(ParseTree) objectInputStream.readObject();
        objectInputStream.close();

KvanTTT commented 9 years ago

Parsing Tree serialization to JSON, XML and other formats is a good idea. One of use cases is unit testing. For example: In my current work project I have several stages of source code processing:

Parsing (C#, Java, PHP) code to Language Dependent (LD) AST (with ANTLR and Roslyn).
Converting LD AST to Language Independent (LI) AST.
Pattern Matching (PM) in LI AST.

At present parsing stage have a lowest performace (for ANTLR) сompared to other stages. And If I want to test PM stage, I still have to parse source code. It will be great to serialize parsing tree one time and use it for second or third stage testing many times.

HSorensen commented 9 years ago

@KvanTTT Currently only Java is supported, I would be interested to know how it would work for you. Do you have a setup where it is possible to change the antlr runtime for the Java part of your parsing?

NullTerminat0r commented 8 years ago

This seems like a good idea. I too am finding that my JUnit test cases are running slowly because my parse (of some existing code I'm trying to analyse) takes 70s. Walking the resulting parse tree has sub-second response times.

JitCompiler commented 8 years ago

Any idea if these changes to the Java classes will be incorporated into the release build?

HSorensen commented 8 years ago

I can probably prepare a release build of my ANTLR fork, it is based on mainstream v4.5.x

JitCompiler commented 8 years ago

thanks, that's great

HSorensen commented 8 years ago

@parrt or @sharwell would it be possible to host a "Community Build" somewhere under the antlr.org web-pages, maybe have a "Community Builds" section under the download page ? Also the http://www.antlr.org/download/index.html page could have a sub-directory communitybuilds. Any other name may work of course."Experimental Builds" could also work. Thoughts ?

Groostav commented 8 years ago

Bump, I'm also interested in this feature from a unit-testing perspective. I'm not sure I want to go off-mainline for a serializable parse tree, but exploring the concept of reducing the parse tree to something like XML with some level of built-in facilities would be very useful for testing.

I should mention I did accidentally point XStream at an object that had a reference to the root of a parse tree, causing XStream to serialize the whole thing, and it actually produced some not-terrible XML.

sharwell commented 8 years ago

Have you considered instead serializing the sequence of decisions made by the parser's prediction algorithm? Deserialization would consist of parsing using a modified ParserATNSimulator which replays those values from calls to adaptivePredict (via a queue).

HSorensen commented 8 years ago

Sam, what you suggest sounds like a more portable approach. can you outline where to do this and how to replay the sequence to recreate the parsetree.

Henrik

On Friday, August 5, 2016, Sam Harwell notifications@github.com wrote:

Have you considered instead serializing the sequence of decisions made by the parser's prediction algorithm?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/1006#issuecomment-237836672, or mute the thread https://github.com/notifications/unsubscribe-auth/AEztNSyxrVVpPwUAxTFEQphBofgARmrgks5qcystgaJpZM4GGOTA .

wsdookadr commented 8 years ago

Hi,

As part of a project I'm working on, I wrote an XML serializer for an ANTLR4 parse tree. In summary, this is how it works:

every node(or parse rule if you will) in the parse tree corresponds to an XML node in the serialized XML form. The XML node is named after its respective rule in the parse tree
a stack is used(and updated along the traversal) to track the parent of the node that's about to be inserted into the XML document
the XML nodes are equipped with attributes for start and end line number of the parse tree node
the XML nodes should be equipped with attributes for start and end offset of the parse tree node (not in place at this time, may be added later on, but I feel like it should be included in the RFC)

I feel it would be useful to have the file name (associated with the parse tree) in the top-level node, or some way of refering back to the document that was parsed (I need this in my project, but I think other people may find that useful too).

The line and offset attributes discussed above may be optional, but I believe they are useful because they allow to measure the "source"-size that a subtree carries, and by that I mean, how much of the actual source document(that was parsed) does a subtree describe.

Best regards, Stefan

KvanTTT commented 8 years ago

Is it possible to use JSON instead of verbose XML?

HSorensen commented 8 years ago

Stefan, do you also have code that can take the XML and recreate a ParseTree ?

Henrik

On Mon, Sep 5, 2016 at 4:48 AM, Stefan Corneliu Petrea < notifications@github.com> wrote:

Hi,

As part of a project I'm working on, I wrote an XML serializer https://github.com/wsdookadr/mdetect/blob/master/src/main/java/com/mdetect/ParseTreeDOMSerializer.java for a parse tree. In summary, this is how it works:

every node(or parse rule if you will) in the parse tree corresponds to an XML node in the serialized XML form

a stack is used(and updated along the traversal) to track the parent of the node that's about to be inserted into the XML document

attributes involving start and end line number are inserted for node

attributes such as start and end offset of rule should be added (not in place at this time, may be added later on)

I feel it would be useful to have the file name (associated with the parse tree) in the top-level node, or some way of refering back to the document that was parsed (I need this in my project, but I think other people may find that useful too).

Best regards, Stefan

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/1006#issuecomment-244728177, or mute the thread https://github.com/notifications/unsubscribe-auth/AEztNX31d42B1X0jgmAA6Bowo7_G78NRks5qnAGNgaJpZM4GGOTA .

wsdookadr commented 8 years ago

@HSorensen I don't

TimYi commented 7 years ago

Maybe generate a json like parse tree data, and wrap it into a ParseTree class. A ParseTree walk through the json data internal, and have a create method use this json data. And the json data should be compatible to all target language ParseTree class.

antlr / antlr4

(RFC) Serializable ParseTree #1006