arithy / packcc

A parser generator for C
Other
347 stars 28 forks source link

Questions not answered in the README #51

Closed apaz-cli closed 3 years ago

apaz-cli commented 3 years ago

Questions

These are things that I've been wondering. Other people probably have the same questions, so they should probably be covered in the README.

  1. What is the storage duration of $$? There's actually a StackOverflow question about this. https://stackoverflow.com/questions/66145396/how-to-use-the-value-returned-by-a-packcc-parser

  2. How can I read from a file instead of stdin?

  3. Suppose I want to generate an AST. Am I supposed to generate it with $$ inside the rules manually? Is there some way to make this easier? Or is generating an AST beyond the scope of this parser generator?

  4. Can I define an action and an error action on the same rule?

  5. What's the deal with whitespace? It seems to be ignored. What if my language were whitespace-sensitive? Does it suck the whitespace out of string literals?

  6. My language has C style single and multi-line comments in it. How can I ignore those like whitespace seems to be ignored?

Thanks, and you've got a very impressive project.

arithy commented 3 years ago

I answer your questions.

  1. What is the storage duration of $$? There's actually a StackOverflow question about this. https://stackoverflow.com/questions/66145396/how-to-use-the-value-returned-by-a-packcc-parser

$$ values of respective rules are kept so that parsing can be done without any problem. Users don't need to care about their storage duration. Note that it is responsibility of users to release system resources like memory blocks associated with $$ after parsing.

As for the question in stackoverflow, the problem is not of the storage duration of $$ but that of the string $1 created by PackCC. The explanation about it is already in README, which says 'Note that the string data held by $n and $0 are discarded immediately after evaluation of the action'.

  1. How can I read from a file instead of stdin?

Use the function macro PCC_GETCHAR(auxil).

#define PCC_GETCHAR(auxil) get_character((auxil)->input)

Here, the user function get_character((auxil)->input) returns a character retrieved from (auxil)->input, which holds the information of the text input source. The argument auxil is a pointer to a user-defined structure.

As instructed in README, its data type can be specified by %auxil directive and its value can be fed via the argument auxil of the API function pcc_create().

  1. Suppose I want to generate an AST. Am I supposed to generate it with $$ inside the rules manually? Is there some way to make this easier? Or is generating an AST beyond the scope of this parser generator?

An AST node can be retained by using $$ as a pointer to it.

Simplified example is shown below.

Specification of the data type of $$:

%value "MyASTNode *"

Here, MyASTNode is some structure type defined by a user.

AST node creation in an action:

$$ = MyASTNode_create(auxil);

Here, the function MyASTNode_create() returns a pointer to an AST node newly created. To free the memory used by the AST nodes later, a user-defined structure auxil should retain their pointers.

  1. Can I define an action and an error action on the same rule?

Yes. There is no restriction.

  1. What's the deal with whitespace? It seems to be ignored. What if my language were whitespace-sensitive? Does it suck the whitespace out of string literals?

Yes. Use string literals, where white-spaces are not ignored. Strings without quotation are interpreted as names of rules.

  1. My language has C style single and multi-line comments in it. How can I ignore those like whitespace seems to be ignored?

Use the rules as below.

_  <- ( space / comment )*
comment <- '/*' ( / !'*/' . )* '*/'
space <- blank / end_of_line
blank <- [ \t\v\f]
end_of_line <- '\r\n' / '\n' / '\r'

Here, the rule _ represents white-spaces and comment blocks.

--

I think that your questions arise mainly because of lack of some practical example. I'm preparing an example to create some AST now.

ccleve commented 3 years ago

Yes, examples would be really helpful.

I'm currently struggling with the right syntax to create an iterator that supplies input over an array of ints (unicode codepoints) or bytes (UTF-8). This is a C-code thing, not a PEG thing, but it would speed development quite a bit. I'm thinking I have to create a ParseState structure with the input data and a pointer to the right getCharacter() function, and then pass it in as auxil. Haven't got it straight yet.

Example AST-generation code would be great.

You might find some use-cases for a parser here: https://github.com/antlr/grammars-v4 Pick some interesting ones and re-implement them in packcc.

arithy commented 3 years ago

You might find some use-cases for a parser here: https://github.com/antlr/grammars-v4 Pick some interesting ones and re-implement them in packcc.

Thanks for the valuable information. It's a wonderful grammar world!

apaz-cli commented 3 years ago

Thank you very much for your answers. And for an excellent parser generator.

I've taken some more time to familiarize myself with the project. I think that the way forward for me if I want to generate an AST is to fork the project and insert my own hooks into the generated parser. I want every rule to have its own $$ of type <Rule Name>Node*, with memory allocated for it in advance. That prevents me from having to write my own allocation and cast to the correct node type every time I want to store something inside the grammar. Even if $$ were consistently void* I would still have to write all the node type boilerplate anyway.

If you know a better way to do this, I'm looking forward to the AST example.

arithy commented 3 years ago

@apaz-cli, @ccleve, I have added a more practical example in examples/ast-tinyc. I hope it would be helpful to you.

ccleve commented 3 years ago

Thank you. The example is really complete. You've done quite a bit of work here. It will answer a lot of questions.

I do have one comment, though: the .peg file is hard to read. Mixing the grammar specification and C code makes it tough to see what the grammar is doing.

You might consider finding a way to do what Antlr does: completely separate the grammar and the code. Perhaps the code that creates the .c output could create stubs for every expression, or you could just put a tiny flag on those expressions that needed a stub. Then the developer could fill them in in a separate .c file.

One benefit of separating grammars and code is that your project could have more than one target language in the future. Antlr does Java, C#, Python, and several others.

On Sat, Oct 23, 2021 at 3:27 PM Arihiro Yoshida @.***> wrote:

@apaz-cli https://github.com/apaz-cli, @ccleve https://github.com/ccleve, I have added a more practical example in examples/ast-tinyc https://github.com/arithy/packcc/tree/master/examples/ast-tinyc. I hope it would be helpful to you.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/arithy/packcc/issues/51#issuecomment-950208588, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAITHKRPZUAL4HQFO2WQTELUIMLDRANCNFSM5GCLDDEA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

apaz-cli commented 3 years ago

@ccleve To the contrary, I think that code and grammar being in the same file is a good idea. Probably the biggest appeal to me. I'm handling the readability issue in a different way.

First, syntax highlighting goes a really long way towards making it readable. The two editors that I use, GNU nano and VSCodium, have syntax highlighting that you can configure for .peg files. They highlight operators like /, *, +, and ?, but all the syntax highlighting really needs to be do to be invaluable is highlight the names of rules where they're being declared. Marking the token a <- points to makes the grammar a lot easier to parse visually, to the point where I don't really have a problem reading the grammar for TinyC.

The other thing to do is to take drastic steps to reduce the amount of C that you're writing inside the grammar. If it's more than one or two lines per case, you've got a problem.

@arithy Very helpful, thanks. It answers a whole lot of questions.

In my fork for my language, I'm injecting a macro into the generated parser that allocates an AST node automatically and assigns it to __pcc_out, so you don't need to type $$ = ... for every grammar rule.

These nodes are maps which you can stuff arbitrary values into. I have macros set(Type, key, value) and get(Type, key) which support this. These are potentially common names, so there would be a check to make sure these macros are not already defined. Another common operation which is to store the location of the capture. The parser generator also generates a set(size_t, "start", _0s); and set(size_t, "end", _0e); to store the location of the capture. You never have to type $$ to assign to return. set() becomes the way to pass information up the call stack. I'm not sure how I would handle error handling yet. I haven't thought about it much.

I could submit a pull request that adds this stuff, or the parts of it that you like, as an option. But it's a pretty radical change, and packcc doesn't already have any features like this. One of the main draws to the project is that it's so small, so I also understand wanting to keep it minimal. Still though it would be nice to have an "AST Mode" that makes AST generation easy, since that tends to be the main use case for a parser-generator.

Thanks, apaz

arithy commented 3 years ago

I do have one comment, though: the .peg file is hard to read. Mixing the grammar specification and C code makes it tough to see what the grammar is doing.

@ccleve, I can understand your comment, but there are pros and cons as told by @apaz-cli. Different users might have different preferences regarding it. I prefer the mixed format, so I continue to adopt only the current format.

arithy commented 3 years ago

@apaz-cli, I'm glad to know you use PackCC by customizing it to fit your project. Your idea using generic maps is interesting.

I agree that building ASTs tends to be the man use case, so "AST Mode" would be very helpful. However, there might be many options to realize ASTs. For example, some AST has nodes with various data types like your project, and some AST has nodes with one data type like my example. Each has its different merits. Since I'm wondering what it should be, I'll pass on introducing "AST Mode" now.

apaz-cli commented 3 years ago

@arithy There definitely are many ways to do it. I was very confused for a few moments when I saw your AST implementation. It's probably more commonplace than I thought, but I've never seen an AST where the sibling nodes are linked before. I've only seen ASTs where the nodes are maps, ones where pointers to the children are in array based lists, and ones where the nodes are a unique type of struct for each grammar rule. I suppose that linked lists are exactly as functional a way to do it as any other. More intuitive for sure, slower than keeping an array list in the parent node, faster than an equivalent map, but less flexible.

An AST node needs to know about its children. This is the main design choice. Usually it's some kind of list. But a node also optionally has to know about:

* Its parent? (almost always yes but technically not required)
* Its siblings? (traditionally through its parent)
* Its capture location?
* Its captured text?
* Extra stuff that the programmer can throw in

I've thought about it a bit more now. What I did with my map strategy was fixate on the "extra stuff" part. I threw everything into it, including children. I no longer think this is correct.

I'm probably going to rewrite everything in much the way you did it, with single kind of struct holding a map and a linked list of children. Since the AST node struct is fixed size, both the node and the map for "extra stuff" that goes inside of it can be allocated in a single call to malloc(). Using an array based list instead of a linked list for children would be faster to iterate over, but far slower to construct because it is not fixed size and would incur another call to malloc().

In conclusion, I propose the following. When it comes time for "AST Mode," use the same base data structure that you were before. Just add a generic map so we can do intermediate calculations, type checking, and semantic analysis inside the AST more easily. Also let us turn off features like the map and capture locations if we don't need them.

arithy commented 3 years ago

Thanks for very helpful suggestions.

Regarding the tree implementation, I just brought a basic and generic one as in text books on data structures. I agree with its performance drawback and redundancy.

As for "AST Mode", let me just add it to the want list now. You can submit a new issue for it.

By the way, to close this issue, is there anything left I should do?

apaz-cli commented 3 years ago

Nothing more to do, very happy with the answers and the feedback. Making new issue.