antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.22k stars 3.29k forks source link

Making Parse Tree Serializable #233

Open JitCompiler opened 11 years ago

JitCompiler commented 11 years ago

Is it possible to save up the parse tree generated by ANTLR4? For example, by making ParserRuleContext serializable.

I'd like to use ANTLR4 to parse the source files in my project. My project is quite large and contains hundreds of source files. Usually, I need to walk through the parse trees of several source files to get my desire info. It takes a while to parse even just 1 source file. So, it would be impractical to parse all the source files again every time I start my tool in order to get 1 piece of information (e.g. callers of a function). It would be nice if I can just parse once and save the parser output into the hard disk.

Another scenario is say I need to parse a cpp source file together and its related header files. e.g. to retrieve the function declarations from the header files for each function definition in the cpp file. But each header file is also used by another cpp file so it means its parse tree would be needed again when we examine the parse tree of a second cpp file.

sharwell commented 11 years ago

I'm going to defer this issue for now, since I'm not confident how I want to handle it and other issues are consuming our time leading up to the 4.1 release.

HSorensen commented 11 years ago

Just a datapoint. On my version of antlr4 (see pull request 306) I toyed a bit with the java.io.ObjectOutputStream and java.io.ObjectInputStream to write the ParseTree to a file and by adding implements Serializable to the following classes:

org.antlr.v4.runtime.misc
  IntegerStack.java
  Pair.java

org.antlr.v4.runtime.tree
  TerminalNodeImpl.java

org.antlr.v4.runtime
  ANTLRInputStream.java
  CommonTokenFactory.java
  IncludeStrategyImpl.java
  Lexer.java
  RuleContext.java

The following works on my initial test case:

  ParseTree tree = myparser.start();            
  OutputStream outputStream = new FileOutputStream("afile.data");
  ObjectOutputStream objectOutputStream = new ObjectOutputStream(outputStream);
  objectOutputStream.writeObject(tree);
  objectOutputStream.close();

and reading the parse tree with

  ParseTree tree2;
  InputStream inputStream = new FileInputStream("afile.data");
  ObjectInputStream objectInputStream = new ObjectInputStream(inputStream);
  tree2=(ParseTree) objectInputStream.readObject();
  objectInputStream.close();

Thanks Henrik

HSorensen commented 9 years ago

I have a solution that works for me. See commit in my fork

JitCompiler commented 8 years ago

Are there any plans to incorporate all these into the release build? A simple but useful feature that has been requested for 3 years

JPMoresmau commented 7 years ago

I would like that too. Is there any philosophical objections from the maintainers? Is it just a matter of submitting a PR with some tests?

HSorensen commented 7 years ago

the current implementation is very Java centric, but works well. Sam had some ideas for a different implementation. I am currently looking at storing the parse tree in a graph database. Looks really promising.

On Tue, May 9, 2017 at 2:51 AM, JP Moresmau notifications@github.com wrote:

I would like that too. Is there any philosophical objections from the maintainers? Is it just a matter of submitting a PR with some tests?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/233#issuecomment-300116801, or mute the thread https://github.com/notifications/unsubscribe-auth/AEztNUh7LWzdkYCvfaY-9WGScnx3wywRks5r4DcggaJpZM4AmoVN .

KvanTTT commented 7 years ago

I suggest to use JSON or a simple indented tree format (toStringTree method returns similar tree).

HSorensen commented 7 years ago

Do you have a way to traverse the JSON parse tree ? One design goal of storing the parse tree was to re-use the generated tree walkers on the stored tree.

On Tue, May 9, 2017 at 10:40 AM, Ivan Kochurkin notifications@github.com wrote:

I suppose to use JSON or a simple indented tree format (toStringTree https://github.com/antlr/antlr4/blob/4.7/runtime/Java/src/org/antlr/v4/runtime/tree/ParseTree.java#L39 method returns similar tree).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/233#issuecomment-300244841, or mute the thread https://github.com/notifications/unsubscribe-auth/AEztNQXqAsEsiQHUma6QYTqVV0ihZOcCks5r4KUIgaJpZM4AmoVN .

KvanTTT commented 7 years ago

JSON is a standard format and there are plenty libraries that support it.

HSorensen commented 7 years ago

Yeah, that I understand. But can any of these libraries be used in the same manner as the generated tree walkers ?

JitCompiler commented 7 years ago

the current implementation is very Java centric, but works well. Sam had some ideas for a different implementation. I am currently looking at storing the parse tree in a graph database. Looks really promising.

4 years after making the first request I, at last, realize what the problem was. Seems like the developers were looking for some "generic" but inevitably complex solutions at the expense of an extremely simple solution for 1 specific platform - to declare a few classes implementing the Serializable interface in the Java platform.

This is the same mistake that developers have been making all over the world. Ideas from having database independent, OS independent, language independent to xxx-independent. At the end, we are sacrificing the powerful facilities offered by 1 specific platform in order to be independent from that platform.

JPMoresmau commented 7 years ago

Well I've written my own little system (for my company, so not open source) to serialize the Parse Tree into JSON, and then add a JSON tree walker that you can register call backs on a specific rule name. So it's not a transparent system (the API for callbacks is different than Antlr listeners or visitors) but it serves my purpose well.

sharwell commented 7 years ago

@JitCompiler Actually it's a bit different. Since users can modify tree nodes arbitrarily with new fields, it's not safe to claim that this type is serializable. Any solution would be at best application-specific. My hope was the overall performance of ANTLR 4 meant the need for serialization was sufficiently sparse that in the end the few people who would ever need it end up implementing it independently anyway. I believe that this is a reasonable reflection of the current reality.

HSorensen commented 7 years ago

@Sam, I don't quite follow your logic here. Wouldn't be more natural for the users that modify the tree nodes that THEY loose the capability to serialize the parse tree, or at least make THEM responsible for implementing serialization, instead of remove this all together for users that do not modify the tree nodes. The maintenance overhead is negligible, at least for the java target part I cannot speak for the targets.

On Fri, Jun 23, 2017 at 2:35 PM, Sam Harwell notifications@github.com wrote:

@JitCompiler https://github.com/jitcompiler Actually it's a bit different. Since users can modify tree nodes arbitrarily with new fields, it's not safe to claim that this type is serializable. Any solution would be at best application-specific. My hope was the overall performance of ANTLR 4 meant the need for serialization was sufficiently sparse that in the end the few people who would ever need it end up implementing it independently anyway.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/233#issuecomment-310778436, or mute the thread https://github.com/notifications/unsubscribe-auth/AEztNeT09nEd-pRU4pz-UYHOKF-OjVgnks5sHC-agaJpZM4AmoVN .

rkumar-cognam commented 7 years ago

Since users can modify tree nodes arbitrarily with new fields, it's not safe to claim that this type is serializable.

@sharwell With all due respect, does this mean trees are mutable now in ANTLR4 ? or can we just update the fields or insert new ones into the nodes ? Sorry if it was a stupid question.

sharwell commented 7 years ago

@rkumar-cognam The locals, parameters, and returns features all add fields to parse tree nodes which are not guaranteed to be serializable.

The maintenance overhead is negligible, at least for the java target part I cannot speak for the targets.

Currently only the generated code and serialized ATN can be produced with one version of ANTLR and then attempted to be used with a different version of the runtime. Serialization means all the trees become subject to this scenario. Avoiding marking these trees as serializable moves the burden of maintaining this compatibility, if needed, to the application which needs it. Over time, when it comes to serialization this is a highly-desirable maintenance characteristic.

ftomassetti commented 4 years ago

Personally I had this need several times, the why I solved it is by using a library to complement ANTLR. Basically I translate the ANTLR's parse-tree into an AST written in this library, where I can do all sort of manipulations and transformations and solve serialize the AST to JSON or XML.

In case someone is interested, the library is called Kolasu. While the documentation is not great we have used it in several open-source and commercial projects, so it works quite decently.

parrt commented 2 years ago

Hi @ftomassetti I just started a branch: https://github.com/antlr/antlr4/pull/3773 to gen json. Is json enough for you to make a json->parse tree converter for your needs?

ftomassetti commented 2 years ago

Yes, thank you @parrt !

parrt commented 2 years ago

Cool. This is expanding into a full serialization of an entire parse (I'm playing with antlr server how is a grammar playground). Any requirements you have would be welcome sir!

bigstar119 commented 1 year ago

Hi,i have the same problem. my input is 889Byte and after parse,the ParseTree is 63KB, expansion factor nearly 6x. i tried java's native serialization/deserialization. but it's slow then reparse the input use antlr again.

Anyone has the quick way to serialization/deserialization?

HSorensen commented 1 year ago

@bigstar119 I had good results using XZ compression of the parse trees.