RomanYankovsky / DelphiAST

Abstract syntax tree builder for Delphi
Mozilla Public License 2.0
271 stars 116 forks source link

What are the aims of this project? #3

Open vincentparrett opened 10 years ago

vincentparrett commented 10 years ago

Hi All

A Delphi parser that produces a usable AST is long overdue and I'm excited to see the beginnings of one.. but I have to ask, what are the aims of or plans for this project?

I ask because the design of the parser will very much influence what is achievable. I know there are not exactly a lot of open source (or closed source/commercial) parsers out there, so it's understandable that this is based on the SimpleParser. The problem I see with this parser is that it has zero error recovery, it bails out on the first error it encounters. To make this useful for scenarios such as codeinsight, code analysis, refactoring etc, it needs to be able to recover from syntax errors (ideally without skipping too much code) and continue parsing. The lexer & parser are also a little too intertwined, for example the lexer parses compiler directives rather than the parser.

What sort of contributions are you happy to accept from the community? I'd very much like to see a more complex AST, with typed ast nodes,visitor support etc, error recovery in the parser, and perhaps a symbol table.

Not a small task though... I guess the question is where to start?.

I'm working on an IDE plugin for DUnitX (to enable running tests from within the IDE), and of course, I need a parser.

Here's some food for thought.

There are a bunch of IDE plugins that make use of syntax parsers (most likely a derivitive of the simpleparser), so for example if you have GExperts, CNPack, Castilia etc all installed, your code is getting parsed a lot on top of the IDE's own (somewhat broken) code insight parser. What if there were an IDE plugin that provided the parser as a service, which other IDE plugins could then use. Instead of multiple parsers from all the plugins, there would be one. The parser plugin would be loaded as an expert, and have an api so that other plugins can subscribe to be notified when the AST changes, and an api to force a reparse. Ideally the parse would happen in a background thread, and built it's own symbol table (could then replace error insight!)

Fix a bug in the parser (or add new language features) and all the plugins benefit. Imagine the possibilities!

Ideally, this would be something that Embarcadero would have provided as part of the tools api, but we've been asking for this for years and I'm not holding my breath.

Interested to read peoples thoughts on this.

RomanYankovsky commented 10 years ago

My first intention was to write some code that will produce a single unit's AST as an XML. It gives me an opportunity to use standard XML tools (e.g. XPath) to do some simple static analysis. For me XPath is the best way to query the tree. With XML and XPath I can easily ask questions like "give me a list of variables which start with 'abc'".

This project was not intended to be in open source, but now it is.

That's why class structure rather primitive (only TSyntaxNode class). Though more complex AST with typed AST nodes, visitor support and so forth is a good direction for future development.

Error recovery is my pain too. But I still have no time to deal with it. If somebody wish to enhance the parser in order to make it able to recover from errors, it would be great.

At this moment the main weakness is the processing of unit's interface section (data types declarations and so forth) and project/package files (I was paying the most attention to unit). Parser is in good shape but syntax tree builder still ignores many details. So in near future I plan to concentrate on filling this gap. Any contributions are appreciated. We need to have complete AST tree before we could start thinking about a symbol table.

Your idea about a common parser for all IDE experts sounds promising. Maybe one day it will become a reality.

malcolmgroves commented 10 years ago

HI Vincent and Roman,

I was also thinking along the lines of a more strongly typed representation. I would be interested in helping out with it if needed. I have this dream of being able to for..in across all functions in a class, querying for matching names, parameter types, etc.

I also ran it across some of my code that failed to parse. I'll try and narrow down a minimal failing case later this week and see what I can find.

Cheers Malcolm

vincentparrett commented 10 years ago

I don't have much time to work on it at the moment (trying to finish FB8) but I'll try and put some ideas to words and post here for comment. I'd hate to see multiple people all working on the same thing which then becomes a merge nightmare. I guess it would be good to define some goals for the project, what are we trying to achieve, what types of usage are envisaged etc.

RomanYankovsky commented 9 years ago

It is hard to define specific goals... How about "to produce a fully qualified abstract syntax tree"? Maybe you have more specific ideas?

I think we could start with making the tree representation typed.

quartexNOR commented 9 years ago

Before XMAS i finished the proposal called LDEF, a language definition intermediate format, better known as an AST. The purpose of LDEF was to provide first and foremost a binary, portable format suitable for source to source compilation, analysis; being able to extract not just information about classes and their members, but also about per-line operations (read: the actual code).

LDEF is more or less a bog standard in-memory class hiearchy capable of representing most languages and their features, such as object pascal, C# and to some extent C++ (no support for multiple ancestors).

My personal interest is source-to-source compilation, so naturally I am biased towards that. But the information gathering is exactly the same as that required for analysis of code and/or "IDE plugins".

LDEF was planned for the coming spring, but since DelphiAST already does most of what i wanted, I think I may save a lot of time by building on that instead.

One thing that would be extremely cool would be to use lex/yacc with this. That way you would not just be able to parse object pascal, but also C# and all the other languages and produce the same compatible AST. With a codegen capable of reading the XML you would in fact have a 1-to-many compiler chain.

Although it would be useless without an RTL

bkisdi commented 9 years ago

My vision about this project is a two (or three) phase parsing:

The first phase is tokenizing: separate keywords, operators, literals, identifyers, comments, white space etc. - and syntax errors (such as unterminated string literals). This produces a linear list of tokens, with line and col position, token kind and user-defined data. This list is useful (and should contain enough information) for a syntax-highlighter, for example. The program should be able to export this list separately. Concatenating the tokens gives back the original source code.

The second phase works on the list above, and builds a syntax tree from the tokens, revealing their structure. This phase should keep the difference between Beep; and begin Beep; end;, for example. Similarly, all compiler directives and semantic errors (such as undeclared identifyers) are kept in this level. This tree is useful for a source-code reformatter. This level should be (and more or less is) the AST. Traversing this tree should give back a syntactically equivalent version of the original source code.

The third phase would be a semantic tree. In this level interface and implementation can be merged, current compiler directives are applied and structures are expressed in language (Delphi)-independent concepts as much as possible. This level is similar to the LDEF proposed by @quartexNOR. This tree is useful for an interpreter or compiler (source-to-source, or source-to-machine code). Traversing this tree should give back a semantically equivalent version of the original source code.

Put it together, and you have Delphi... I think all these levels (or a mix of them) are present in the Delphi environment; it would be great (and more accurate) to extract it somehow from there. For an efficient compiler and IDE these levels are mixed and a lot of language-specific assumptions are built into, but for our project these three should be separated, I think.

bkisdi commented 9 years ago

My task is to migrate our company's projects (hundreds of units) from Delphi to C#... :X

It seems to be a mission impossible (https://www.linkedin.com/groups/How-migrate-Delphi-application-C-40949.S.252441991), but I think that having an AST (or LDEF) is half of the way to an automated conversion - I hope it can do most of the typing of something that looks C#.

To start with a simpler task (see my fork at https://github.com/bkisdi/DelphiAST), I wanted to restore the original Delphi unit from the AST as much as possible. I'm not ready with this, but now I see a number of issues with missing information. What do you think about fixing these?

The first issue I faced with is: keeping comments. (Coincidence with issue #39 of @LaKraven.) I know, comments are "around" the syntax tree, not in it. More precisely, the syntax tree contains comments, but the semantic tree doesn't. (See my comment above.) The current AST is a strange mix of these, with a lot of anomalies. (Well, it is the closest one to a good solution, anyway.)

The ASSIGN node, for example, contains LHS and RHS subnodes, which are semantic, and those in turn contain IDENTIFIER and EXPRESSION subnodes, which are the actual syntactic items. But the FOR, WHILE etc. nodes lack these levels.

I'd like to see a complete list of the XML nodes produced by DelphiAST.

Some actual coding problems:

to, downto or in is missing from the FOR node. Unfortunately TmwSimplePasPar.ForStatement can't access FStack.Peek in TPasSyntaxTreeBuilder.ForStatement.

The kind attribute is missing from the METHOD node in the interface section.

private, protected, public, published etc. should be added to the METHOD node as attribute.

Please use unambiguous node names - it helps navigating in the AST. For example, UNIT was used in two different senses - I had to change one of them to USEDUNIT. TYPE and CALL are another examples of ambiguous nodes. I think the constants in DelphiAST.Consts should be converted to an enumeration.

TmwBasePasLex.BorProc; and part of TmwBasePasLex.BraceOpenProc; seems to do the same work; similarly TmwBasePasLex.AnsiProc; and TmwBasePasLex.RoundOpenProc;, but the formers are never executed (in my examples).

I don't see the role of the TmwPasCodeInfo enumeration. It seems to be unused.

LaKraven commented 9 years ago

@bkisdi once the Symbol Table is ready (to resolve an identifier back to its original definition between units), DelphiAST would be suitable for use in making a language-to-language conversion tool. It would, of course, be up to you to handle the casing for generating your output syntax, but all of the necessary information should be there to do just that.

I've been busy with other work the last few weeks, which is the reason why the Symbol Table isn't already completed... but (fortunately) my assignment for this weekend REQUIRES a working Symbol Table in DelphiAST, so as you can imagine I'll have to complete it in order to do the work the client is paying me for!

RomanYankovsky commented 9 years ago

@bkisdi you are right, unfortunately some information is missing from AST. It will be better to open separate issues for them.

I'll take a look at ForStatement implementation. I think I can fix this soon.

quartexNOR commented 9 years ago

bkisdi: The biggest problem with converting to a particular language and framework, is that most delphi applications relies heavily on the VCL. Some tidbits of the VCL can be easily ported/recognized, such as showmessage() etc.. which have their equivilents in the .net framework, but take something like database access and custom DB objects (ORM) and it will be harder.

But object pascal as a language is more than capable of being ported to C# or something else.