Question: How to use only syntax rules

cflaviu commented 1 year ago

Is it possible to create and use a grammar that parses a vector of already-parsed tokens? In that case, lexical parsing of terminals is not necessary; only syntax rules would be used, something like:

enum class token { aa, bb, cc };
auto rule = token::aa >> token::bb >> -token::cc;
std::vector<token> tokens{token::aa, token::bb, token::cc};
auto ok = parse(tokens.cbegin(), tokens.cend(), rule);

axilmar commented 1 year ago

Yes, it is absolutely possible.

A member of an enumeration can be used to create a rule (or an expression), like this:

terminal(token::aa) << terminal(token::bb) << -terminal(token::cc)

Alternatively, since the enumeration is strongly typed, the appropriate operators could be created.
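
As a rough illustration of the "create the appropriate operators" idea, here is a minimal, self-contained sketch of token-level rules over a std::vector<token>. It does not use parserlib: the names parser, terminal, and parse_all are hypothetical helpers defined only for this example, and parserlib's own types and operators may look different.

```cpp
#include <cassert>
#include <functional>
#include <vector>

enum class token { aa, bb, cc };

using iter = std::vector<token>::const_iterator;

// A parser tries to match at `it`. On success it advances `it` and returns
// true; on failure it leaves `it` unchanged and returns false.
using parser = std::function<bool(iter& it, iter end)>;

// Matches exactly one token of the given kind.
parser terminal(token t) {
    return [t](iter& it, iter end) {
        if (it == end || *it != t) return false;
        ++it;
        return true;
    };
}

// Sequence: both parsers must match, one after the other.
parser operator>>(parser a, parser b) {
    return [a, b](iter& it, iter end) {
        iter saved = it;
        if (a(it, end) && b(it, end)) return true;
        it = saved;  // undo partial progress
        return false;
    };
}

// Optional: always succeeds, consuming input only if `a` matches.
parser operator-(parser a) {
    return [a](iter& it, iter end) {
        a(it, end);
        return true;
    };
}

// Operators on the strongly typed enum itself, so that
// `token::aa >> token::bb >> -token::cc` builds a rule directly.
parser operator>>(token a, token b) { return terminal(a) >> terminal(b); }
parser operator>>(parser a, token b) { return a >> terminal(b); }
parser operator-(token a) { return -terminal(a); }

// Succeeds only if the rule consumes the whole token sequence.
bool parse_all(iter begin, iter end, const parser& rule) {
    return rule(begin, end) && begin == end;
}
```

With these definitions, the rule from the first post can be written as token::aa >> token::bb >> -token::cc and passed to parse_all together with tokens.cbegin() and tokens.cend().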

cflaviu commented 1 year ago

Thanks!! I will check it.

asmwarrior commented 10 months ago

The OP needs a high-level parser, which may call for a code generator like bison, while he can keep his own lexer, something like flex.

Maybe we could have some example code for this. I mean we need two levels of parsers: one that turns characters into tokens, and another that parses based on token kinds.

axilmar commented 10 months ago

You can have any level of parsers with parserlib, it can parse anything.

axilmar commented 1 month ago

I added a compiler-front-end class, which takes a source input, converts it to tokens, then parses the tokens into an AST.

asmwarrior commented 1 month ago

> I added a compiler-front-end class, which takes a source input, converts it to tokens, then parses the tokens into an AST.

Nice work, I will study this code.

If I have any questions, I will post them here.

asmwarrior commented 1 month ago

Hi, in your compiler front end example, it looks like the lexer and the parser are mixed.

Is it possible to implement a pure parser, as the OP described in the first post here: https://github.com/axilmar/parserlib/issues/14#issue-1931658488

Suppose we have a very simple lexer, or even a C preprocessor, and the token class looks something like this:

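class Token
{
public:
    TokenKind kind;      // token category, e.g. TokenKind::Identifier
    std::string lexeme;  // the matched source text
    int row;             // line of the token in the source
    int column;          // column of the token in the source
};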

TokenKind may be an enum class.

So what I need is a parser rule like the one below (an assignment rule, for example):

terminal(TokenKind::Identifier) << terminal(TokenKind::Assignment) << terminal(TokenKind::Identifier)

Or even better:

terminal(TokenKind::Identifier) << terminal("=") << terminal(TokenKind::Identifier)

Here terminal("=") would be a shortcut for terminal(TokenKind::Assignment), or, even better, just "=".

Thanks.
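
The two styles of terminal in the rules above (match by token kind, or match by lexeme such as "=") can be sketched as two overloads that each return a predicate over a single token. This is only an illustration under assumptions: TokenPredicate and both terminal overloads are hypothetical names defined here and are not parserlib's API.

```cpp
#include <cassert>
#include <functional>
#include <string>

enum class TokenKind { Identifier, Assignment };

struct Token {
    TokenKind kind;
    std::string lexeme;
};

// A token-level terminal is just a predicate over one token.
using TokenPredicate = std::function<bool(const Token&)>;

// Matches by token kind.
TokenPredicate terminal(TokenKind kind) {
    return [kind](const Token& t) { return t.kind == kind; };
}

// Matches by lexeme, so terminal("=") can stand in for the Assignment kind.
TokenPredicate terminal(std::string lexeme) {
    return [lexeme](const Token& t) { return t.lexeme == lexeme; };
}
```

Matching by lexeme keeps the grammar readable, while matching by kind avoids string comparisons; which one fits better depends on how much information the existing lexer already puts into each token.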

axilmar commented 1 month ago

What you are asking exists in the CFE class.

I can separate the two parts of the CFE, one for lexing, and the other for parsing.

asmwarrior commented 1 month ago

> What you are asking exists in the CFE class.
>
> I can separate the two parts of the CFE, one for lexing, and the other for parsing.

Good idea. For my needs, I think a separate high-level parser is what I want, thanks.

I already have a low-level hand-written lexer and some kind of C preprocessor implemented, so I only need a high-level parser.

axilmar commented 1 month ago

I added an example of parsing based on custom tokens and AST nodes here:

https://github.com/axilmar/parserlib?tab=readme-ov-file#Parsing

asmwarrior commented 1 month ago

> Added example of parsing based on custom tokens and AST nodes here:
>
> https://github.com/axilmar/parserlib?tab=readme-ov-file#Parsing

Thanks.

Some more questions:

When a token pattern is matched, is there a callback function that can be called? For example, if I need to invoke a semantic checker?

Another question: do I need to supply the whole TokenStream to the parser? I mean, an incremental lexer should supply tokens only when they are needed.

axilmar commented 1 month ago

1) When a token pattern is matched, is there a callback function that can be called? For example, if I need to invoke a semantic checker?

No, there is not, and this is an explicit choice for this library: since the parser is recursive-descent, a successful parsing of an inner rule can be cancelled by a failed parsing of an outer rule. Thus, semantic actions shall only be run after parsing is finished, in order to finalize which rules have been successfully parsed.

This is different from LALR(1) parsers which, upon successful parsing, guarantee that the parsing cannot be cancelled by further parsing, thus allowing semantic actions to run as soon as a rule is successfully parsed.

But the point of this library is to use recursive-descent parsing, which can host all grammars, and not only LALR or LR or SLR, and thus not have restrictions on what can be parsed.

In practice, why do you want to run semantic actions as soon as something is parsed? Why not parse everything, then process the parsed tree and run the semantic actions then?

2) Another question: do I need to supply the whole TokenStream to the parser? I mean, an incremental lexer should supply tokens only when they are needed.

The library uses STL container semantics.

The parser needs to know the end of the STL sequence.

In theory, you could create a lazy STL container sequence where end() returns some end-of-stream designator.

The library doesn't support lazy STL container sequences out of the box, and maybe it never will. It's not a priority and, in my humble opinion, it is not needed; I just load the whole source file into memory and process it there. In this day and age, memory is plentiful; we don't need to complicate the code because of a lack of memory.
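
To make the first point concrete, the following is a minimal sketch of why a backtracking recursive-descent parser defers semantic actions: matches are only recorded while parsing, a failing outer rule discards the matches of its inner rules, and the actions run over whatever survives. It is hand-written for illustration and does not use parserlib; Match, Context, and the two rule functions are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenKind { Identifier, Assignment, Number, Semicolon };

// A recorded match: which rule matched and over which token indices.
struct Match {
    std::string rule;
    std::size_t begin;
    std::size_t end;
};

struct Context {
    explicit Context(const std::vector<TokenKind>& t) : tokens(t) {}
    const std::vector<TokenKind>& tokens;
    std::size_t pos = 0;
    std::vector<Match> matches;  // filled during parsing, consumed afterwards
};

bool accept(Context& ctx, TokenKind kind) {
    if (ctx.pos < ctx.tokens.size() && ctx.tokens[ctx.pos] == kind) {
        ++ctx.pos;
        return true;
    }
    return false;
}

// operand := Identifier | Number   (inner rule, records a match)
bool operand(Context& ctx) {
    std::size_t start = ctx.pos;
    if (accept(ctx, TokenKind::Identifier) || accept(ctx, TokenKind::Number)) {
        ctx.matches.push_back({"operand", start, ctx.pos});
        return true;
    }
    return false;
}

// assignment := Identifier Assignment operand Semicolon   (outer rule)
bool assignment(Context& ctx) {
    std::size_t start = ctx.pos;
    std::size_t saved_matches = ctx.matches.size();
    if (accept(ctx, TokenKind::Identifier) && accept(ctx, TokenKind::Assignment) &&
        operand(ctx) && accept(ctx, TokenKind::Semicolon)) {
        ctx.matches.push_back({"assignment", start, ctx.pos});
        return true;
    }
    // The outer rule failed: cancel the inner matches recorded on the way.
    ctx.pos = start;
    ctx.matches.resize(saved_matches);
    return false;
}

// Runs the "semantic actions" (here: logging rule names) only after parsing,
// over the matches that were not cancelled.
std::vector<std::string> parse_then_act(const std::vector<TokenKind>& tokens) {
    Context ctx(tokens);
    assignment(ctx);
    std::vector<std::string> log;
    for (const Match& m : ctx.matches) log.push_back(m.rule);
    return log;
}
```

If the input lacks the final semicolon, operand still matches first, but its match is dropped when assignment fails. An action attached directly to operand would already have run by then, which is the situation the answer above wants to avoid.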

asmwarrior commented 1 month ago

> 1) When a token pattern is matched, is there a callback function that can be called? For example, if I need to invoke a semantic checker?
>
> No, there is not, and this is an explicit choice for this library: since the parser is recursive-descent, a successful parsing of an inner rule can be cancelled by a failed parsing of an outer rule. Thus, semantic actions shall only be run after parsing is finished, in order to finalize which rules have been successfully parsed.
>
> This is different from LALR(1) parsers which, upon successful parsing, guarantee that the parsing cannot be cancelled by further parsing, thus allowing semantic actions to run as soon as a rule is successfully parsed.
>
> But the point of this library is to use recursive-descent parsing, which can host all grammars, and not only LALR or LR or SLR, and thus not have restrictions on what can be parsed.
>
> In practice, why do you want to run semantic actions as soon as something is parsed? Why not parse everything, then process the parsed tree and run the semantic actions then?

I have maintained the parser code in Code::Blocks' CodeCompletion plugin for many years. It is a very simple hand-written parser that parses C++ code and provides some kinds of code completion, but it can't parse modern C++ code very well. We all know that parsing C++ is very difficult. The current code is very complex and hard to maintain. My guess is that I could use parserlib to parse C++ source code and simplify the logic; this might look like the tool tree-sitter/tree-sitter-cpp (C++ grammar for tree-sitter, https://github.com/tree-sitter/tree-sitter-cpp). While parsing, I need to record the symbols I find, such as class declarations, function definitions, and variable declarations.

> 2) Another question: do I need to supply the whole TokenStream to the parser? I mean, an incremental lexer should supply tokens only when they are needed.
>
> The library uses STL container semantics. The parser needs to know the end of the STL sequence. In theory, you could create a lazy STL container sequence where end() returns some end-of-stream designator. The library doesn't support lazy STL container sequences out of the box, and maybe it never will. It's not a priority and, in my humble opinion, it is not needed; I just load the whole source file into memory and process it there. In this day and age, memory is plentiful; we don't need to complicate the code because of a lack of memory.

OK, let me explain my situation: in fact, I can generate all the tokens in the low-level lexer (preprocessor) up front, so that the high-level parser can parse them. But my lexer class has two interfaces (PeekToken and ConsumeToken), so the lexer can also supply tokens on demand when the high-level parser needs them.

axilmar commented 1 month ago

1) A parserlib version of a C++ parser can parse the code and produce an AST. You can then drive the code completion from that AST.

2) You should wrap PeekToken and ConsumeToken in an STL-like container.
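
One possible shape for such a wrapper is sketched below. Everything in it is an assumption made for illustration: the Lexer class is a stand-in for the existing hand-written lexer, EndOfFile is an assumed end marker, and LazyTokens is a hypothetical name. Whether parserlib accepts this exact iterator type has not been checked. The wrapper pulls tokens on demand and caches them, because a backtracking parser needs to revisit earlier positions.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

enum class TokenKind { Identifier, Assignment, EndOfFile };

struct Token {
    TokenKind kind;
    std::string lexeme;
};

// Stand-in for an existing hand-written lexer (hypothetical interface).
class Lexer {
public:
    explicit Lexer(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}
    Token PeekToken() const {
        if (pos_ < tokens_.size()) return tokens_[pos_];
        return Token{TokenKind::EndOfFile, ""};
    }
    void ConsumeToken() { if (pos_ < tokens_.size()) ++pos_; }
private:
    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};

// Pulls tokens from the lexer only when an iterator reaches them, and caches
// them so that earlier positions stay valid for backtracking.
class LazyTokens {
public:
    class iterator {
    public:
        using iterator_category = std::forward_iterator_tag;
        using value_type = Token;
        using difference_type = std::ptrdiff_t;
        using pointer = const Token*;
        using reference = const Token&;

        iterator() = default;  // default-constructed iterator == end-of-stream
        iterator(LazyTokens* owner, std::size_t index) : owner_(owner), index_(index) {}

        reference operator*() const { return owner_->at(index_); }
        pointer operator->() const { return &owner_->at(index_); }
        iterator& operator++() { ++index_; return *this; }
        iterator operator++(int) { iterator old = *this; ++index_; return old; }

        // Equal if both are at end-of-stream, or both point at the same index.
        friend bool operator==(const iterator& a, const iterator& b) {
            bool a_end = a.at_end(), b_end = b.at_end();
            return (a_end || b_end) ? (a_end == b_end) : (a.index_ == b.index_);
        }
        friend bool operator!=(const iterator& a, const iterator& b) { return !(a == b); }

    private:
        bool at_end() const { return owner_ == nullptr || owner_->is_end(index_); }
        LazyTokens* owner_ = nullptr;
        std::size_t index_ = 0;
    };

    explicit LazyTokens(Lexer& lexer) : lexer_(lexer) {}
    iterator begin() { return iterator(this, 0); }
    iterator end() { return iterator(); }
    std::size_t tokens_read() const { return cache_.size(); }

private:
    // Fills the cache up to `index`, stopping once EndOfFile has been cached.
    void fill(std::size_t index) {
        while (cache_.size() <= index &&
               (cache_.empty() || cache_.back().kind != TokenKind::EndOfFile)) {
            cache_.push_back(lexer_.PeekToken());
            lexer_.ConsumeToken();
        }
    }
    const Token& at(std::size_t index) { fill(index); return cache_[index]; }
    bool is_end(std::size_t index) {
        fill(index);
        return index >= cache_.size() || cache_[index].kind == TokenKind::EndOfFile;
    }

    Lexer& lexer_;
    std::deque<Token> cache_;  // deque: push_back keeps element references valid
};
```

Comparing an iterator against end() is what triggers the lexer here, so tokens are read only as far as the parser has actually looked. The trade-off is that every token seen so far stays in the cache.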

asmwarrior commented 1 month ago

> 1) A parserlib version of a C++ parser can parse the code and produce an AST. You can then drive the code completion from that AST.
>
> 2) You should wrap PeekToken and ConsumeToken in an STL-like container.

I will try it, but as you know, C++ has an ambiguous grammar, for example:

x*y;

This could be a pointer definition or a multiplication statement, depending on what x and y are, so if I could run some semantic checks during parsing, the logic might be much simpler. The C++ grammar is context-dependent, so I can't build the final AST just by parsing all the tokens; I may still need some incremental parsing method.

int x;
x*y;

Here it would be a multiplication statement, because int x has already been parsed, so x is a variable.

axilmar commented 1 month ago

You can always put a multiplication-or-pointer-declaration node in the AST, then do a second pass and change it to either a multiplication or a pointer declaration.
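
The second-pass idea can be sketched as follows. This is a deliberately simplified illustration, not parserlib code: the "AST" is a flat list of statements, Node and NodeKind are hypothetical types, and the only context tracked is the set of names that were declared as types.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

enum class NodeKind {
    TypeDeclaration,          // e.g. "struct x;" introduces a type name
    VariableDeclaration,      // e.g. "int x;"
    MulOrPointerDeclaration,  // "x*y;" before the ambiguity is resolved
    Multiplication,
    PointerDeclaration
};

struct Node {
    NodeKind kind;
    std::string left;   // the declared name, or `x` in "x*y;"
    std::string right;  // `y` in "x*y;", empty otherwise
};

// Second pass: walk the statements in source order, remember which names are
// types, and rewrite every ambiguous node into one of the two concrete kinds.
void resolve_ambiguities(std::vector<Node>& statements) {
    std::set<std::string> type_names;
    for (Node& node : statements) {
        switch (node.kind) {
        case NodeKind::TypeDeclaration:
            type_names.insert(node.left);
            break;
        case NodeKind::MulOrPointerDeclaration:
            node.kind = type_names.count(node.left) != 0
                            ? NodeKind::PointerDeclaration
                            : NodeKind::Multiplication;
            break;
        default:
            break;  // other nodes need no rewriting in this sketch
        }
    }
}
```

For "int x; x*y;" the name x is not in the set of type names, so the ambiguous node becomes a multiplication; for "struct x; x*y;" it becomes a pointer declaration. A real C++ front end would also need scopes, typedefs, and templates, which this sketch leaves out.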

axilmar commented 1 month ago

I forgot to mention that my plan is to get major grammars parsed with parserlib.

For that, what I am currently doing is writing an EBNF parser, which will allow me to convert any grammar to a parserlib program (a compiler front end) that parses the given grammar, and provide that as a library.

asmwarrior commented 1 month ago

> Forgot to mention that my plan is to get major grammars to be parsed with parserlib. For that, what I am currently doing is writing an EBNF parser, which will allow me to convert any grammar to a parserlib program (a compiler front end) that parses the given grammar, and provide that as a library.

This is a great idea.

You have an EBNF grammar definition file. The "EBNF parser" parses it and generates a new parser. We then use this generated parser to parse the source files.

> You can always put a multiplication-or-pointer-declaration in the AST, and do a 2nd pass and change it to either a multiplication or a pointer declaration.

This should work, but then I need to let the parser parse the whole source file. I mean the parser could have some mechanism so that, for example:

int x;
int y;
int z;
x*y;

The parser should not go back once "int x;" has already been parsed. I hope you can understand my idea: unlimited backtracking (moving back) is not a good idea.

axilmar commented 1 month ago

Actually, there is no need for backtracking at all.

The problem of ambiguity can be solved at the AST creation level, after parsing is finished: the function 'parse' has a parameter which accepts a custom function for creating an AST node, and that function can be used to create the appropriate AST node for the parsed ambiguity. For example, the id of the match can be 'Ambiguous-Grammar-1', but the created AST node can be a 'resolved-ambiguity-1' or 'resolved-ambiguity-2' or anything else.

I will try to add an example in the documentation for this.
