LPeter1997 / CppCmb

A generic C++17 parser-combinator library with a natural grammar notation.
MIT License

can this library support customized Token parsing? #10

Open asmwarrior opened 1 year ago

asmwarrior commented 1 year ago

Hi, this library is nice! Good work! It does use a lot of advanced C++ template code, though, which is a bit hard for me to follow.

Back to my question: in my application I have my own lexer (mainly a C preprocessor) that I implemented myself. Can the library support parsing rules on top of a custom lexer?

For example, the lexer supplies some kind of std::vector<Token>, where the Token class may be defined like this:

class Token
{
public:
    TokenKind kind;
    std::string lexeme;
};

So I would like this library to parse the Token stream, not the char stream.

Any suggestions? Thanks.

LPeter1997 commented 1 year ago

Hello!

It is certainly possible; the library is completely input-type agnostic. To parse non-character elements, you pass the sequence of tokens to the reader. The source is expected to be readable with an indexing operator, so a std::vector<Token> is perfectly fine to pass in. Note that the reader does not own the source it reads from, so you need to keep the vector alive.

Once you have your reader, you define a basic helper combinator that matches a certain token kind. The combinator pc::one works out of the box and consumes one token from the input unconditionally. You can then define your match combinator using the built-in transformations, something like this:

template <TokenKind Kind>
bool has_token_kind(Token const& t) { return t.kind == Kind; }

template <TokenKind Kind>
inline constexpr auto match = pc::one[pc::filter(has_token_kind<Kind>)];

This gives you a parser that matches a certain kind of token (for example, match<TokenKind::identifier>).
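For readers who find the combinator notation hard to follow, the same idea can be sketched without the library. This is a minimal hand-written illustration, not CppCmb's implementation or API: the TokenCursor type, the free functions one and match, and the TokenKind enumerators are all assumptions made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical token kinds; the thread never lists the real enumerators.
enum class TokenKind { identifier, number, star, semicolon };

struct Token {
    TokenKind kind;
    std::string lexeme;
};

// A non-owning cursor over the token vector (the vector must outlive it).
struct TokenCursor {
    std::vector<Token> const* source;
    std::size_t pos = 0;
};

// Consume one token unconditionally, or fail at the end of the input.
inline std::optional<Token> one(TokenCursor& cur) {
    assert(cur.source != nullptr);
    if (cur.pos >= cur.source->size()) return std::nullopt;
    return (*cur.source)[cur.pos++];
}

// Consume one token only if it has the requested kind; on a mismatch
// the cursor is restored to where it was.
template <TokenKind Kind>
std::optional<Token> match(TokenCursor& cur) {
    std::size_t const saved = cur.pos;
    auto tok = one(cur);
    if (tok && tok->kind == Kind) return tok;
    cur.pos = saved;  // backtrack
    return std::nullopt;
}
```

Calling match<TokenKind::identifier>(cur) on a cursor positioned at an identifier returns that token and advances the cursor; on any other token it returns an empty optional and leaves the cursor untouched.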

asmwarrior commented 1 year ago

Thanks for the reply. My final goal is to parse C++ source files in the Code::Blocks editor, because the C++ grammar has become too complex for me to cover with many hand-written ParseXXX functions. Do you think the CppCmb library is suitable for this kind of thing? I mean I would like to use CppCmb to generate the parser, while the lexer and the preprocessor remain the same.

For some constructs I need a type check while parsing, for example x*y: if x is a type, then it is a pointer declaration. This is what this cpp file does: PEGParser/type_checker.cpp at master · TheLartians/PEGParser
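The x*y check can be illustrated with a tiny lookup against the set of type names seen so far. This is only a sketch of the idea under simplifying assumptions: the names StarMeaning and classify_star are hypothetical, it is not taken from PEGParser's type_checker.cpp, and real C++ name lookup (scopes, templates, shadowing) is far more involved.

```cpp
#include <cassert>
#include <set>
#include <string>

// The two readings of `x * y` discussed above.
enum class StarMeaning { pointer_declaration, multiplication };

// Decide how to read `lhs_name * something` from the set of names
// already known to be types.
inline StarMeaning classify_star(std::set<std::string> const& known_types,
                                 std::string const& lhs_name) {
    assert(!lhs_name.empty());
    return known_types.count(lhs_name) != 0 ? StarMeaning::pointer_declaration
                                            : StarMeaning::multiplication;
}
```

The parser would have to add a name to known_types whenever it sees a class, struct, typedef, or using declaration, which is exactly the kind of feedback between semantic information and parsing that makes C++ context sensitive.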

I think it is still a bit complex for me to tweak the CppCmb compiler. If you have time, can you help to add such sample code for a Token class?

My actual Token class could be:

class Token
{
public:
    wxString lexeme;
    TokenKind kind;
};

It uses wxString rather than std::string.

Another question is: is it possible to add some operator precedence climbing feature (maybe it is called a Pratt parser, see: Operator-precedence parser - Wikipedia)? I mean it would be better for handling expressions like a+b/c+d*e. See also: https://github.com/yhirose/cpp-peglib#parsing-infix-expression-by-precedence-climbing

I have a similar discussion/request for another PEG-based parser here: Any method to integrate a customized lexer in your library?

LPeter1997 commented 1 year ago

To be honest, I don't think I'd use any generator or library for parsing C++, simply because it's context sensitive (well, in the case of C++ it's even more complex than context sensitive). Most parsing tools primarily target context-free languages, ones that can be structured without any semantic analysis, and CppCmb is no exception. The parser has no direct ability to feed information back or forward to other elements of a front-end (well, technically one could do it with some transformation combinators, but I have never played around with the idea in practice).

If you just need a "good-enough" parser, or a parser that can produce ambiguous trees that you can disambiguate with post-processing steps, you might want to take a look at TreeSitter, which was made especially for editors for incremental parsing, and parsing ambiguous constructs that can be disambiguated on the fly.

I think it is still a bit complex for me to tweak the CppCmb compiler. If you have time, can you help to add such sample code for a Token class?

I'll try to find some time during the weekend.

Another question is: is it possible to add some operator precedence climbing feature(...)?

When I wrote this library, I simply didn't feel the need, as one can encode precedence directly in the grammar, as shown in the last example of the wiki. It should be totally possible; it just needs more template metaprogramming that unrolls the precedence list into the appropriately nested grammars.
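To make "encode precedence directly in the grammar" concrete, here is a minimal hand-written sketch over a token vector, with one rule per precedence level. It is not the wiki example and does not use CppCmb; the ExprParser type, its member names, and the TokenKind enumerators are assumptions for illustration, and it evaluates integers directly instead of building a tree.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token kinds for a tiny arithmetic language.
enum class TokenKind { number, plus, minus, star, slash };

struct Token {
    TokenKind kind;
    std::string lexeme;
};

// One member function per precedence level: the deeper a rule sits in
// the call chain, the tighter its operators bind.
struct ExprParser {
    std::vector<Token> const& toks;  // not owned; must outlive the parser
    std::size_t pos = 0;

    bool at(TokenKind k) const {
        assert(pos <= toks.size());
        return pos < toks.size() && toks[pos].kind == k;
    }

    // factor := number
    int factor() {
        if (!at(TokenKind::number)) throw std::runtime_error("expected a number");
        return std::stoi(toks[pos++].lexeme);
    }

    // term := factor (('*' | '/') factor)*
    int term() {
        int lhs = factor();
        while (at(TokenKind::star) || at(TokenKind::slash)) {
            bool const is_mul = toks[pos++].kind == TokenKind::star;
            int const rhs = factor();
            lhs = is_mul ? lhs * rhs : lhs / rhs;  // no division-by-zero check
        }
        return lhs;
    }

    // expr := term (('+' | '-') term)*
    int expr() {
        int lhs = term();
        while (at(TokenKind::plus) || at(TokenKind::minus)) {
            bool const is_add = toks[pos++].kind == TokenKind::plus;
            int const rhs = term();
            lhs = is_add ? lhs + rhs : lhs - rhs;
        }
        return lhs;
    }
};
```

Because expr only ever combines complete term results, an input such as a+b/c+d*e groups as a+(b/c)+(d*e) without any precedence table. A precedence-climbing feature would generate this nesting from a list of operators instead of spelling out each level by hand.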

asmwarrior commented 1 year ago

Hi, thanks for the detailed reply.

I agree that parsing every component of C++ code is too complex. But my idea is a simple C++ parser that only extracts the tags of a source file, not the full code. For example, function bodies could be skipped when collecting the tags. This is similar to the tool universal-ctags/ctags: A maintained ctags implementation.

I have read the sample code in your example A full expression parser. It is complex, and I can't find any documentation on how a client can attach a semantic action function (or functor) to the matched items. If you have time, can you point me in the right direction on how to add such functions? This may be useful for building an AST.
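Skipping a function body when collecting tags amounts to jumping over a balanced pair of braces in the token vector. The following is a minimal sketch of that step only; skip_braced_block and the TokenKind enumerators are hypothetical names, and it assumes the preprocessor has already run so that braces in the token stream are real braces.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds; only the two brace kinds matter here.
enum class TokenKind { identifier, open_brace, close_brace, other };

struct Token {
    TokenKind kind;
    std::string lexeme;
};

// Precondition: toks[pos] is an open brace.
// Returns the index just past the matching close brace, or toks.size()
// if the braces are unbalanced.
inline std::size_t skip_braced_block(std::vector<Token> const& toks,
                                     std::size_t pos) {
    assert(pos < toks.size() && toks[pos].kind == TokenKind::open_brace);
    std::size_t depth = 0;
    for (; pos < toks.size(); ++pos) {
        if (toks[pos].kind == TokenKind::open_brace) {
            ++depth;
        } else if (toks[pos].kind == TokenKind::close_brace && --depth == 0) {
            return pos + 1;  // one past the matching close brace
        }
    }
    return toks.size();
}
```

A tag collector would record the function name and signature, call this at the opening brace of the body, and continue from the returned index.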

int to_num(std::vector<char> const& chs) {
    int n = 0;
    for (auto c : chs) n = n * 10 + (c - '0');
    return n;
}
...
...
cppcmb_def(num) =
      (+digit) [to_num]
    ;

It looks like to_num is the callback invoked when +digit matches, but I'm not sure what parameter the function would receive if we use a custom Token lexer.
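To show what I mean, here is a library-free sketch of the behaviour I am hoping for: a semantic action that receives the matched Token and returns an AST node. Everything in it is an assumption on my side, including the names NumberNode, to_number_node, and match_and_apply; I do not know whether CppCmb would pass a Token, a std::vector<Token>, or something else to the callback.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical token kinds.
enum class TokenKind { identifier, number };

struct Token {
    TokenKind kind;
    std::string lexeme;
};

// Hypothetical AST leaf produced by the semantic action.
struct NumberNode {
    int value;
};

// The semantic action: turns one matched number token into a node.
inline NumberNode to_number_node(Token const& t) {
    return NumberNode{std::stoi(t.lexeme)};
}

// If the token at `pos` has the requested kind, consume it and return
// the result of the action; otherwise leave `pos` alone and return empty.
template <class Action>
auto match_and_apply(std::vector<Token> const& toks, std::size_t& pos,
                     TokenKind kind, Action action)
    -> std::optional<std::invoke_result_t<Action, Token const&>> {
    assert(pos <= toks.size());
    if (pos >= toks.size() || toks[pos].kind != kind) return std::nullopt;
    return action(toks[pos++]);
}
```

With this shape, match_and_apply(toks, pos, TokenKind::number, to_number_node) plays the role that (+digit)[to_num] plays in the character-based example.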

Thanks.