bzick / tokenizer

Tokenizer (lexer) for golang
MIT License

Handle some special characters as a part of TokenKeyword #6

Closed ryszard-swierczynski closed 7 months ago

ryszard-swierczynski commented 1 year ago

A TokenKeyword may contain only letters and, if configured, underscores or digits. What about other special characters that can occur in some strings, for example:

"name:some-value" or even "name:@value"

Assuming those special characters are not part of the grammar itself, it would be nice to have a way to automatically fold them into the keyword. Maybe there is already a way to do that, but I haven't found one. What I can suggest is changing:

parser.go file

func (p *parsing) parseKeyword() bool {
    // ...
    if unicode.IsLetter(r) ||
        (p.t.flags&fAllowKeywordUnderscore != 0 && p.curr == '_') ||
        (p.t.flags&fAllowNumberInKeyword != 0 && start != -1 && isNumberByte(p.curr)) ||
        p.t.<someKindOfListContains>(r) {

or something like that. As a temporary solution, to check the idea, I have added something like this locally and I am receiving keywords like "some-value" and "@value".
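The proposed check can be sketched in isolation outside the library. This is a minimal, hedged illustration: `isKeywordRune` and the boolean/set parameters are hypothetical names standing in for the library's flag checks and the proposed `<someKindOfListContains>` lookup, not the actual `parsing` internals.

```go
package main

import (
	"fmt"
	"unicode"
)

// isKeywordRune reports whether r may appear inside a keyword.
// extra holds user-configured symbols (e.g. '@' and '-') that should
// be treated like letters; this mirrors the proposed list lookup.
// The names here are illustrative, not the library's API.
func isKeywordRune(r rune, allowUnderscore bool, extra map[rune]bool) bool {
	return unicode.IsLetter(r) ||
		(allowUnderscore && r == '_') ||
		extra[r]
}

func main() {
	extra := map[rune]bool{'@': true, '-': true}
	for _, r := range "a_-@:" {
		fmt.Printf("%q -> %v\n", r, isKeywordRune(r, true, extra))
	}
}
```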

bzick commented 1 year ago

Custom delimiters in keywords - that makes sense.

But be aware that no sequence rules will work. The tokenizer doesn't support rules for token sequences because it is just a simple parser. For example, a rule like "after a colon, only @ or a keyword may follow" won't work.

Currently you need to write method for your complex keyword, like

colonTokenKey := TokenKey(1) // :
atTokenKey := TokenKey(2)    // @
tokenizer := New()
// ..
stream := tokenizer.ParseString(str)
for stream.IsValid() {
    keyword := []string{}
    if stream.CurrentToken().Is(TokenKeyword) {
        keyword = append(keyword, stream.CurrentToken().ValueString())
        if stream.IsNextSequence(colonTokenKey, atTokenKey, TokenKeyword) {
            stream.GoNext() // now on ':'
            stream.GoNext() // now on '@'
            stream.GoNext() // now on the keyword
            keyword = append(keyword, ":", "@", stream.CurrentToken().ValueString())
        } else if stream.IsNextSequence(colonTokenKey, TokenKeyword) {
            stream.GoNext() // now on ':'
            stream.GoNext() // now on the keyword
            keyword = append(keyword, ":", stream.CurrentToken().ValueString())
        }
        // ...
    }
    stream.GoNext()
}

(code not tested, it is just an example)

ryszard-swierczynski commented 1 year ago

I am sorry, I think there is a small misunderstanding. I would like to avoid extra logic to determine whether something is an "@" or a "-". In my example I wanted to use ":" as a normal separator, which is tokenized, and I have done this:

    parser := tokenizer.New()
    parser.DefineTokens(TColon, []string{":"})

The remaining characters like "@" and "-" I want to treat as part of the keyword, with no special meaning, so I wrote a simple, ugly piece of code to verify the idea; it looks like this:

if unicode.IsLetter(r) ||
    (p.t.flags&fAllowKeywordUnderscore != 0 && p.curr == '_') ||
    (p.t.flags&fAllowNumberInKeyword != 0 && start != -1 && isNumberByte(p.curr)) ||
    r == '@' || r == '-' {

It seems to be working correctly, as I am receiving tokens:

{ TokenKeyword  :  name }
{ TColon  :  : }
{ TokenKeyword  :  some-value }
{ TokenKeyword  :  name }
{ TColon  :  : }
{ TokenKeyword  :  @value }

as intended. The "-" and "@" behave the same way as the underscore character, which is exactly what I need.
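The desired tokenization can be reproduced with a stdlib-only sketch: a tiny scanner that emits ":" as its own token and treats '-' and '@' as keyword runes. This is an illustration of the intended behavior under those assumptions, not the library's code; the `token` type and `scan` function are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

type token struct {
	kind  string // "TokenKeyword" or "TColon"
	value string
}

// scan splits src into colon tokens and keywords, where '-' and '@'
// count as keyword runes, just like '_' with the underscore flag.
func scan(src string) []token {
	keywordRune := func(r rune) bool {
		return unicode.IsLetter(r) || unicode.IsDigit(r) ||
			r == '_' || r == '-' || r == '@'
	}
	var tokens []token
	runes := []rune(src)
	for i := 0; i < len(runes); {
		switch {
		case runes[i] == ':':
			tokens = append(tokens, token{"TColon", ":"})
			i++
		case keywordRune(runes[i]):
			j := i
			for j < len(runes) && keywordRune(runes[j]) {
				j++
			}
			tokens = append(tokens, token{"TokenKeyword", string(runes[i:j])})
			i = j
		default: // skip whitespace and anything else
			i++
		}
	}
	return tokens
}

func main() {
	for _, tok := range scan("name:some-value name:@value") {
		fmt.Printf("{ %s : %s }\n", tok.kind, tok.value)
	}
}
```

Running it prints the same six tokens listed above.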

bzick commented 1 year ago

I added AllowKeywordSymbols. Please try it and give me feedback.

bzick commented 7 months ago

No feedback