AST Mode - Githubissues

arithy / packcc

A parser generator for C

Other

347 stars 28 forks source link

AST Mode #54

Closed apaz-cli closed 2 years ago

apaz-cli commented 3 years ago

Continues the discussion from #51.

The main use case for a parser-generator is to build an Abstract Syntax Tree, or AST. Therefore, it would be nice if there were a cleaner way to do so built into packcc. Right now the user must, for each grammar rule which should be part of the structure, write boilerplate code to allocate and return an AST node using the return values from the other grammar rules that make it up. There is also not an easy way to associate extra information with these nodes, which would be useful for type checking and semantic analysis. These concerns could be taken care of by the parser-generator.

It's still unclear exactly how this should be done.

apaz-cli commented 3 years ago

When memory is allocated for an AST node, should it call PCC_MALLOC()? Or should the user be able to specify a different allocator for AST nodes than packcc uses internally?

arithy commented 3 years ago

It might be good to use PCC_MALLOC() also for the AST node allocator by default, and to enable users to introduce some independent allocator for it optionally. That is:

#ifndef PCC_AST_MALLOC
#define PCC_AST_MALLOC(auxil, size) PCC_MALLOC(auxil, size)
#endif

#ifndef PCC_AST_REALLOC
#define PCC_AST_REALLOC(auxil, ptr, size) PCC_REALLOC(auxil, ptr, size)
#endif

#ifndef PCC_AST_FREE
#define PCC_AST_FREE(auxil, ptr) PCC_FREE(auxil, ptr)
#endif

here, PCC_AST_MALLOC(), PCC_AST_REALLOC(), and PCC_AST_FREE() are the AST node allocator macros that can be replaced with user-defined allocator functions if necessary.

apaz-cli commented 3 years ago

I've made some progress. Take a look at the following. Not done with it yet. Tell me what you do and don't like about it so far. https://github.com/apaz-cli/packcc/blob/master/src/ast.h

Some design decisions now.

1. Is it okay if the maps for extra data are a fixed size?

More detail into how I did that in the link above. Fixed size avoids an extra call to malloc(), and I don't think people will need to put arbitrarily many items into these maps.

2. Should the nodes store pointers to their children as an array based list or a linked list?

Linked List:
- Fast to create, avoids an extra call to malloc()
- Very slow to traverse
Array List:
- Extra call to malloc()
- Fast to traverse
- Easy to grow as many children are added
Flexible Array Member List:
- Fastest to create and traverse, avoids the extra malloc() call.
- Growing the list of children requires reallocating the node
  - The size of a node being reallocated could vary drastically depending on if there's a map inside of it and how large its buffers are.
  - Reallocating a node invalidates all references to it and will probably lead to memory fragmentation.
  - Multiple confounding factors make node reallocation very slow and complicated. Not actually sure if it's possible without keeping a list of all nodes in the AST.

For now I'm leaning toward an Array list. I'm in a rush implementing things, and I think I'm going to write a bug if I go with the FAM list. An array list would be the cleanest, and I'll switch to that above. A FAM could be worth looking into in the future though, especially since there would be no source API break.

Other changes:

Add flags to the command line parser to enable/tune features.
Generate different code based on which features are on. As I've implemented it, the main difference here is number of arguments in node allocation.
When a grammar rule calls another grammar rule in AST mode, it needs to pass its node as an argument so it can be used as a parent node inside the new rule.
When a grammar rule fails, it needs to clean up its own node. This should happen recursively as the stack unwinds by checking if any rules that were called returned NULL, or on a failed direct match.

I'm not sure that I know enough about how packcc does its goto jumps to make these changes. When I was messing around earlier I just hooked into PCC_DEBUG. I'll have to study it more.

arithy commented 3 years ago

Is it okay if the maps for extra data are a fixed size?

I don't think it is good. It would be better for users to be freed from estimating the possible maximum size.

Should the nodes store pointers to their children as an array based list or a linked list?

I think a simple array based list is better as you think.

I have some more comments:

PCC_NODE_*(): I think PCC_AST_*() is better since it lets users understand this is used only for AST Mode. 'NODE' might be ambiguous.
PCC_CAPTURE_LOC: In existing documentation and source, 'position' is used rather than 'location'. To keep consistency, PCC_CAPTURE_POS is better.
GET(), SET(), ADD(), and REMOVE(): I want to add the prefix PCC_ to these macros. In PackCC, all macros defined by PackCC have it to keep the independent name space.
pcc_node__new(): It's not bad to use the same argument list in any cases.

apaz-cli commented 3 years ago

It would be better for users to be freed from estimating the possible maximum size. I think a simple array based list is better as you think.

Make up to four calls to malloc() per node? Am I hearing this right? One for the node, one for the array list of children, one for the capture sting, and one for the map?. If the answer is yes, then I'll do it. Performance may become an issue though. I'm not aware of a way to make this work without three calls to malloc(), or some wrapper.

It's not bad to use the same argument list in any cases.

The motivation behind the decision was that the macro _0 expands to a call to pcc_get_capture_string(), which expands to a call to pcc_strndup_e(), which is pretty expensive because it contains a strnlen(), a PCC_MALLOC(), and a memcpy(). We don't want to allocate, copy, and pass the value where it isn't going to be used (and potentially leaked). Users probably don't realize that $0 makes a copy every time that they type it, and I thought it might be incorrect for packcc to silently make its own mandatory copies as well.

I agree with your other assessments, and I'll start making the modifications.

arithy commented 3 years ago

For a fixed arity node, we just embed a fixed length array as a member in the node structure. We need only few children in most cases (ex. arithmetic operators, some loop statements). It depends on how a user designs the abstract syntax. For a variable arity node, we should allocate the list of children separately from the node data itself so that we can preserve the address of the node data even after expanding the list dynamically by reallocation. So, I think the FAM technique is not suitable here.

As for the macros for turning on and off the features, can we determine the states by whether they are defined or not rather than their values? That is like:

- /* Flags for turning on and off features */
- #define PCC_RULE_NAME 1
- #define PCC_CAPTURE_LOC 1
- #define PCC_CAPTURE_TEXT 1
- #define PCC_EXTRA 1

...

- #if PCC_RULE_NAME
+ #ifdef PCC_AST_RULE_NAME
    size_t rule_name;
  #endif
...

To clarify whether the macros are specific to AST Mode or not, I inserted AST_ after the prefix PCC_ (ex. PCC_RULE_NAME -> PCC_AST_RULE_NAME).

Regarding the arguments of pcc_node__new(), especially the $0 issue, I agree with you. We can deal with it using the following macros:

#ifdef PCC_AST_CAPTURE_TEXT
#define PCC_AST_GET_CAPTURE_TEXT() _0
#else
#define PCC_AST_GET_CAPTURE_TEXT() NULL
#endif

...

/* function definition */
static pcc_node_t *pcc_node__new(   /* <-- fixed argument list */
    pcc_auxil_t auxil, pcc_node_t *parent, char *rule_name, 
    size_t capture_start, size_t capture_end, char *capture_text
) {
    ...
}

...
    /* function call */
    pcc_node_t *const node = pcc_node__new(
        auxil, parent,
        PCC_AST_GET_RULE_NAME(),
        PCC_AST_GET_CAPTURE_START(),
        PCC_AST_GET_CAPTURE_END(),
        PCC_AST_GET_CAPTURE_TEXT()
    );
...

arithy commented 3 years ago

I've reflected upon the customizable payload on the AST nodes. Your solution is 'map'. I think the customization can be realized by introducing %ast_value directive that specifies the data type of the payload values. This data is embedded as a value member (not a pointer) to the node structure. Its default is nothing. To initialize and finalize it, the customizing macros PCC_AST_VALUE_INIT() and PCC_AST_VALUE_TERM() should also be introduced. The default definitions of them are ((void)0). This is a primitive solution but more general one.

In *.peg:

...

%ast_value "my_ast_data_t"

%header {
typedef struct my_ast_data_tag {
    ...
} my_ast_data_t;
}

%source {
#define PCC_AST_VALUE_INIT(auxil, ptr) my_ast_data__init(auxil, ptr) /* initializes a my_ast_data_t value pointed by ptr */
#define PCC_AST_VALUE_TERM(auxil, ptr) my_ast_data__term(auxil, ptr) /* finalizes a my_ast_data_t value pointed by ptr */
}

...

How do you think?

apaz-cli commented 2 years ago

@arithy I don't like the idea of nodes of fixed arity. There are of course parser-generators that do fixed-arity nodes, but they generally create a unique struct type for every rule based on which rules reference which. They also generate visitors for you. With arity-specific nodes, when it comes time to read the AST a lot of error prone casting is necessary, which is difficult or impossible to maintain. Whenever you change the grammar file, you must also update how and where you cast things, and the tree becomes very difficult to traverse.

This was the original idea of switching to a map. It's flexible, readable, traversable, and doesn't require difficult to maintain casting to nodes of the correct arity. If traversal isn't updated as the grammar changes it's also easy to provide an error message of exactly what needs to be changed, where, and why. However, the more that I thought about maps, the more I realized they're probably unnecessary. At first I thought "It would be cool if the user could define their own members." After that, I realized "Why shouldn't they be able to define their own node type?" So, I totally agree with you.

All AST mode has to do to be incredibly useful is to automatically allocate and free nodes. Having to write it out manually in the grammar sucks, and that's all that needs fixing. Fancy features like maps and captures are unnecessary. The user can implement their own. So, upon further reflection, I totally agree with you. I think.

When you're allocating an AST, one way or another you need to do memory allocation. They can be allocated with PCC_AST_MALLOC(). INIT and TERM doesn't make as much sense to me as NEW and FREE.

So, the sum of changes I would make are:

Add support for %ast_value
'PCC_AST_MALLOC(auxil, size)(Defaults to callingPCC_MALLOC`)
'PCC_AST_FREE(auxil, ptr)(Defaults to callingPCC_FREE`)
PCC_AST_VALUE_NEW(auxil) (Defaults to an error to force the user to define the macro)
PCC_AST_VALUE_FREE(auxil) (Defaults to an error to force the user to define the macro)
Add an AST mode command line flag
Insert call to PCC_AST_VALUE_NEW at the beginning of every rule
Insert call to PCC_AST_VALUE_FREE wherever the parser fails to accept a rule.

However, something comes to mind that I' m not sure how I would handle. The user is the one who puts things into the AST. We don't know anything about how they're stored. Now suppose a rule is called and fails, but in the process of failing successfully allocates more nodes. Those need to be freed as well, and we don't know how. One way we could deal with this is to force the user to define PCC_AST_VALUE_FREE recursively. That's what I'm leaning towards. But it may be worth thinking about a bit more.

Thanks, apaz

arithy commented 2 years ago

To introduce the AST node types of the corresponding rules, can you share your concrete implementation image? Are there as many pcc_ast_node_???_t types as rules? Or is there one type pcc_ast_node_t that has the union to hold various node data types? Due to my poor experience, I cannot have the image.

When you're allocating an AST, one way or another you need to do memory allocation. They can be allocated with PCC_AST_MALLOC(). INIT and TERM doesn't make as much sense to me as NEW and FREE.

What I thought is shown below:

When the following specification is in the PEG source,
```
%ast_value "struct my_ast_data"
```
then pcc_ast_node_t is generated like the following code.
```
typedef struct pcc_ast_node_tag pcc_ast_node_t;
struct pcc_ast_node_tag {
    ...
    struct my_ast_data extra_data;
}
```
Here, struct my_ast_data is embedded as a value not a reference, so no memory allocation is needed for creation of the struct my_ast_data object itself. Of course, it may have members that need memory or some other system resource allocation. They should be initialized by PCC_AST_VALUE_INIT(&node->extra_data), and finalized by PCC_AST_VALUE_TERM(&node->extra_data). Since NEW and INIT might imply that they allocate and free memory for the object itself, I left the naming as INIT and TERM.

apaz-cli commented 2 years ago

There would be as many pcc_ast_node_???_t types as rules, yes. And they would each store pointers of each other's types, no unions involved. This approach produces a very fast parser, but is inflexible, and I wouldn't recommend this path. What you've got going on is much more interesting than yet another static AST parser.

What you describe sounds like it would solve the problem. But, where does the node variable come from? What's inside that ...? What's wrong with using pcc_value_t in place of pcc_ast_node_t?

An idea that comes to mind is that if the answer to "What's inside pcc_ast_node_t besides the user defined type" is "nothing," then very few changes need to be made. All that's required is to add a call to PCC_AST_VALUE_INIT() which returns a pcc_value_t at the beginning of every rule, and a call to PCC_AST_VALUE_TERM() at the end, inside the L0000:; segment. Then inside of the L0001:; segment, you return the node allocated by the init macro.

apaz-cli commented 2 years ago

I have since implemented my own parser-generator, and did it a different way. I have no need to keep the issue open.