alecthomas / participle

A parser library for Go
MIT License

Document how to parse indentation based grammars #20

Open MTRNord opened 6 years ago

MTRNord commented 6 years ago

Hi, I want to check whether the previous char is a :. The exact example I have is this:

Prefix: !

Basically, I want to be able to check whether the previous char is a :, and if it is, the ! should be captured as a string. Later I want to reuse the same look-back logic to check whether I am inside an array, which is defined as in YAML, to make sure the last 2 lines of the following example don't end up in the array:

Commands:
    - Trigger: afk
      Response: "{{user}} just went afk"
    - Trigger: [join_event]
      Response: "Welcome to the Room {{last_joined_user}}"
Homeserver: matrix.org
HomeRoom: {{name}}

(Commands defines an array with 2 objects in it, each of which has two key-value pairs. Homeserver and HomeRoom are outside of that array.)

alecthomas commented 6 years ago

You cannot perform look-behind with Participle.

If you're trying to parse YAML, you will almost certainly need a custom lexer that includes indentation as a first-class token.

If you could describe and give examples of both situations where the ! is used that require the look-behind, I might be able to help further.

MTRNord commented 6 years ago

The ! is just an example; it could be any character. I wanted to stringify anything after the : up to the next newline. But that's something I can live without if it is impossible.

The YAML-like array is something I would like to be able to have. For those arrays, a look-behind would be nice, or, as you mentioned, a way to check the indentation to see whether I am inside an array.

alecthomas commented 6 years ago

What I meant was that if you show me an example of what you're trying to achieve, I may be able to provide you with a grammar that will parse it.

MTRNord commented 6 years ago

Oh OK. I can do that later today.

MTRNord commented 6 years ago

So my current code is at https://github.com/Nordgedanken/Matrix-DSL. I tried to write a custom lexer based on participle, but as you mentioned, the look-behind failed (nil pointers^^). You can find the grammar I am using at https://github.com/Nordgedanken/Matrix-DSL/blob/master/cmd/lexer/ast.go

The full example I currently have is:

[[BOT]]
Name: Test Bot
Description: Some Demo Bot
Prefix: !
Autojoin: true
Commands:
    - Trigger: afk
      Response: {{user}} just went afk
    - Trigger: [join_event]
      Response: Welcome to the Room {{last_joined_user}}
Homeserver: matrix.org
HomeRoom: {{name}}

So let's break down how I want this to be parsed (a rough sketch of the target structures follows the list):

  1. [[BOT]] is a header for the type of config. This already works.
  2. From "Name" to "Autojoin" there are 4 key-value pairs which I want to parse into strings. This currently fails because the default lexer requires me to put " around strings.
  3. "Commands" starts an array. This is exactly the same as YAML defines it, but with the same kind of strings as mentioned in 2. The first line of an object has a - as a prefix. The following key-value pairs inside the object are on the same "level" (have the same amount of spaces in front).
  4. [join_event] is a special, predefined key.
  5. Strings with {{}} around them are also special keys, but they reflect either predefined variables or variables from the config. (These can be parsed after participle has done its work just fine, as they get replaced anyway.)
  6. "Homeserver" and "HomeRoom" are like the pairs in 2, but they come after the array, which means I need some way of tracking whether we have left the array.

MTRNord commented 6 years ago

I hope that's detailed enough, or at least the right kind of thing.

alecthomas commented 6 years ago

If I understand correctly, this will not be possible to achieve with Participle because it isn't a context-free grammar (CFG). It looks similar to YAML, which requires context across lines in order to track the current indentation level.

I will update the README to clarify that Participle can only be used to parse CFGs.

Sorry about that :(

alecthomas commented 6 years ago

Actually I'm going to correct myself here and go back to my original suggestion: this will require a custom lexer.

Basically, each time the text is indented the lexer will need to emit an Indent token and update its state to track that indentation level. When unindentation occurs the lexer will need to emit an Unindent token.

The grammar will then need to match against these tokens.

e.g. (this will not work as-is, but might give you a starting point):

type Entry struct {
  Key string `@Ident ":"`
  Value *Value `@@`
}

type Value struct {
  Array []*Value `Indent @@ { @@ } Unindent`
  Boolean *bool `| ( @"true" | "false")`
}

You might also need #3 to be solved.

MTRNord commented 6 years ago

@alecthomas I will try that. Thanks for the suggestion and for the help!

apparentlymart commented 6 years ago

In case it's useful, I've written lexers for a few indentation-sensitive languages and arrived at a pattern that I found helpful:

I usually first write a "normal" lexer which (unusually) considers a newline as an explicit token and considers a sequence of consecutive other whitespace characters as an explicit token. This might give you a token stream like this for part of the example above:
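
For instance, for the two lines "Commands:" and "    - Trigger: afk", such a lexer (with assumed token names) might emit:

Ident   "Commands"
Punct   ":"
Newline "\n"
Spaces  "    "
Punct   "-"
Spaces  " "
Ident   "Trigger"
Punct   ":"
Spaces  " "
Ident   "afk"
Newline "\n"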

With that lexer working, you can then wrap it in another lexer that consumes tokens from the first while tracking a stack of currently-open indent levels, passing through verbatim any tokens that aren't Spaces and synthesizing Indent and Dedent tokens as needed between Newline tokens and whatever follows them.

It's important to keep a stack of indent levels that are "open" so you can generate the right number of synthetic Dedent tokens if multiple indented blocks all close at once.
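
The per-newline dedent handling can boil down to a small helper like this (a sketch only; the helper name and the incremental-indent representation of the stack are assumptions):

package indentstack

import "strings"

// dedents pops every indent level that the new line's indentation no longer
// belongs to, and reports how many Dedent tokens should be synthesized.
// "open" holds the incremental indent string introduced by each open level.
func dedents(open []string, lineIndent string) ([]string, int) {
    count := 0
    for len(open) > 0 && !strings.HasPrefix(lineIndent, strings.Join(open, "")) {
        open = open[:len(open)-1]
        count++
    }
    return open, count
}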

I've not tried to do the above with participle.Lexer yet, but since its interface just pulls one token at a time, implementing the above shouldn't be too tricky: you need the indent stack plus a state flag to deal with the need to "look behind" and see whether the previous token was a newline.

I'm sure some or all of the above will be familiar to some readers, but I thought I'd leave it here in case it's useful to someone who finds this issue in the future. Doing the indent processing separately means that a more "normal" lexer (e.g. EBNF-based) can be used for most of the scanning, as long as it doesn't filter out the whitespace when generating tokens.

alecthomas commented 6 years ago

@apparentlymart Nice idea about composing the lexers 👍

It seems like it should be possible to provide a generic IndentAwareLexer using an approach like that, assuming the underlying lexer emits Newline and Spaces tokens as you describe.

apparentlymart commented 6 years ago

That could be useful, indeed! Another detail I meant to mention: indent-aware grammars often have certain syntax constructs where indentation-awareness is temporarily suspended.

For example, Python's lexer suspends generation of Indent and Dedent when there is a nonzero number of bracketing delimiters open.

This can also be handled in a wrapping lexer like I was describing, but it requires a little more state tracking (a stack of open brackets, or just a count if you're not concerned about handling incorrect nesting) and recognizing those open and close bracket tokens. The hypothetical IndentAwareLexer could potentially handle this too by taking in its configuration a set of token pairs that are to be treated as suspending the synthetic Indent and Dedent token emission.
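
In code, that extra tracking could be as small as this (a sketch; the helper name and the count-based approach are assumptions):

package indentwrap

// updateBracketDepth adjusts the number of currently-open bracketing
// delimiters after seeing a token and reports whether the synthetic
// Indent/Dedent emission should currently be suspended.
func updateBracketDepth(depth int, tokenValue string) (int, bool) {
    switch tokenValue {
    case "(", "[", "{":
        depth++
    case ")", "]", "}":
        if depth > 0 { // tolerate mismatched closers rather than going negative
            depth--
        }
    }
    return depth, depth > 0
}

The wrapping lexer would call this for every token it pulls from the parent and skip the Indent/Dedent synthesis whenever the second result is true.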

pyrossh commented 4 years ago

@apparentlymart thanks. It took me a few tries to figure it out. If someone wants to know how a basic one is done, it's here: https://github.com/pyros2097/pine/blob/master/compiler/lexer.go. It's still a work in progress, though, and not final.

alecthomas commented 11 months ago

I have a need for an indenting lexer in a side project I'm working on. It seems to work quite well so far. Once it's a bit more battle tested I think I'll include it in Participle itself, but for now here it is:

// Package indenter provides a lexer that inserts INDENT and DEDENT
// tokens based on indentation.
//
// It relies on the parent lexer to provide an "NL" token that matches
// newlines followed by whitespace.
package indenter

import (
    "io"
    "maps"
    "strings"

    "github.com/alecthomas/participle/v2/lexer"
)

func New(parent lexer.Definition) lexer.Definition {
    out := maps.Clone(parent.Symbols())
    if _, ok := out["NL"]; !ok {
        panic("parent lexer must have an NL symbol similar to \"[\\n\\r]+[\\s\\t]*\"")
    }
    if _, ok := out["Indent"]; ok {
        panic("parent lexer must not have an Indent symbol")
    }
    if _, ok := out["Dedent"]; ok {
        panic("parent lexer must not have a Dedent symbol")
    }
    // Pick token types for Indent and Dedent that do not collide with any
    // token type used by the parent lexer.
    next := lexer.TokenType(-1)
    for _, t := range out {
        if t <= next {
            next = t - 1
        }
    }
    out["Indent"] = next
    out["Dedent"] = next - 1
    return &indentLexerDef{parent: parent, symbols: out}
}

type indentLexerDef struct {
    parent  lexer.Definition
    symbols map[string]lexer.TokenType
}

func (i *indentLexerDef) Symbols() map[string]lexer.TokenType {
    return i.symbols
}

func (i *indentLexerDef) Lex(filename string, r io.Reader) (lexer.Lexer, error) {
    lex, err := i.parent.Lex(filename, r)
    if err != nil {
        return nil, err
    }
    return &indentLexer{
        lexer:      lex,
        nlType:     i.symbols["NL"],
        indentType: i.symbols["Indent"],
        dedentType: i.symbols["Dedent"],
    }, nil
}

var _ lexer.Definition = (*indentLexerDef)(nil)

type indentLexer struct {
    nlType     lexer.TokenType
    indentType lexer.TokenType
    dedentType lexer.TokenType
    indents    []string
    buffered   []lexer.Token
    lexer      lexer.Lexer
}

func (i *indentLexer) Next() (lexer.Token, error) {
    if len(i.buffered) > 0 {
        t := i.buffered[0]
        i.buffered = i.buffered[1:]
        return t, nil
    }

    t, err := i.lexer.Next()
    if err != nil || t.Type != i.nlType {
        return t, err
    }

    pos := t.Pos
    i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.nlType, Value: "\n"})
    pos.Line++
    pos.Column = 1

    // At this point we have an NL symbol. Figure out if we need to
    // insert Indent tokens or Dedent tokens.
    indent := strings.TrimLeft(t.Value, "\n\r")
    for j, ind := range i.indents {
        if strings.HasPrefix(indent, ind) {
            indent = indent[len(ind):]
            pos.Column += len(ind)
            pos.Offset += len(ind)
        } else {
            // The new line does not continue this indent level: close every
            // remaining level and emit one Dedent token per closed level.
            k := len(i.indents) - j
            i.indents = i.indents[:j]
            for ; k > 0; k-- {
                i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.dedentType, Value: "⇤"})
            }
            break
        }
    }
    if indent != "" {
        i.indents = append(i.indents, indent)
        i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.indentType, Value: "⇥"})
    }
    t = i.buffered[0]
    i.buffered = i.buffered[1:]
    return t, nil
}

var _ lexer.Lexer = (*indentLexer)(nil)
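
To give an idea of how it plugs in, a parser over it might look roughly like this (the grammar, token patterns and the indenter import path are illustrative, not part of the package above):

package main

import (
    "fmt"

    "github.com/alecthomas/participle/v2"
    "github.com/alecthomas/participle/v2/lexer"

    "example.com/yourmodule/indenter" // hypothetical path to the package above
)

// A toy grammar: a key, optionally followed by an indented block of nested
// entries.
type File struct {
    Entries []*Entry `@@+`
}

type Entry struct {
    Key      string   `@Ident ":" NL`
    Children []*Entry `(Indent @@+ Dedent)?`
}

func main() {
    // The parent lexer provides the NL token the indenter requires: one or
    // more newlines plus the indentation of the following line.
    parent := lexer.MustSimple([]lexer.SimpleRule{
        {"NL", `[\n\r]+[ \t]*`},
        {"Ident", `[A-Za-z_][A-Za-z0-9_]*`},
        {"Punct", `[:\-\[\]]`},
    })

    parser := participle.MustBuild[File](participle.Lexer(indenter.New(parent)))

    ast, err := parser.ParseString("", "Commands:\n    Trigger:\n    Response:\nHomeserver:\n")
    fmt.Println(ast, err)
}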

metaleap commented 1 week ago

Thanks already for that @alecthomas — and a question on this:

Once it's a bit more battle tested I think I'll include it in Participle itself

How viable do you feel this is for inclusion in Participle by now?