ThatsJustCheesy / parsem

Parsec-like parser combinators for Crystal
MIT License

How to parse “anything that is not this Token or that Token” #1

Open notramo opened 1 year ago

notramo commented 1 year ago

There are several tokens and Parsers, e.g. single_quoted_string, double_quoted_string, semicolon, etc. How can I parse any char that is not the start of one of these?

It would basically be #none_of, but taking Parsers rather than Tokens. Or maybe some combination using #not_ahead?

ThatsJustCheesy commented 1 year ago

Currently, there's no way to "peek" into parsers and check the first character; either the parser runs, or it doesn't. (The implementation is entirely closures/Procs, which, of course, can't be inspected after construction.)

If you're OK with it running the entire parser for the error case, then yes, #not_ahead would be the way to do it:

single_quoted_string = token('\'') >> not('\'').repeat(..) << token('\'')
double_quoted_string = token('"') >> not('"').repeat(..) << token('"')
semicolon = token(';')

forbidden = single_quoted_string | double_quoted_string | semicolon
allowed_char = forbidden.not_ahead >> any(Char)

language = allowed_char.repeat(..).join

puts language.parse("this is ok")
puts language.parse("'invalid'")
puts language.parse("still \"invalid\"")
puts language.parse("in;valid")

Output:

this is ok
ParseError: expected end of input, but found 'invalid'
ParseError: expected end of input, but found "invalid"
ParseError: expected end of input, but found ;valid

I don't consider this that bad, since there's only a performance hit on erroneous input.

But if possible, I would suggest restructuring your parsers so you don't need this at all. e.g., instead of this:

string = single_quoted_string | double_quoted_string

var_name_char = string.not_ahead >> any(Char)
var_name = var_name_char.repeat(..).join

language = var_name | string

Flip the choice order for language:

string = single_quoted_string | double_quoted_string

var_name = any(Char).repeat(..).join

language = string | var_name

| effectively checks the first token and switches to var_name when necessary, without additional checks in var_name.
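
To illustrate the idea of ordered choice outside of parsem, here is a minimal sketch in plain Crystal. The three methods are hypothetical stand-ins that return the matched text or nil; they are not parsem's types or API, and the quoted-string check is deliberately simplified.

```crystal
# Hypothetical stand-in for a quoted-string parser (not parsem API):
# returns the text between matching quotes, or nil if no quoted string starts here.
def quoted_string(input : String) : String?
  return nil if input.size < 2
  quote = input[0]
  return nil unless quote == '\'' || quote == '"'
  if closing = input.index(quote, 1)
    input[1...closing]
  end
end

# Hypothetical stand-in for var_name: accepts any non-empty input, with no extra checks.
def var_name(input : String) : String?
  input.empty? ? nil : input
end

# Ordered choice: try the more specific alternative first, fall back otherwise.
def language(input : String) : String?
  quoted_string(input) || var_name(input)
end

puts language("'quoted'") # handled by quoted_string
puts language("plain")    # falls through to var_name
```

Because the specific alternative is tried first, the fallback never needs to know what a string looks like.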

If you still really need the checks, but don't want the performance penalty for the error case, you'll have to manually construct a new parser, likely with none_of:

single_quoted_string = token('\'') >> not('\'').repeat(..) << token('\'')
double_quoted_string = token('"') >> not('"').repeat(..) << token('"')
semicolon = token(';')

allowed_char = none_of(['\'', '"', ';'])

language = allowed_char.repeat(..).join

puts language.parse("this is ok")
puts language.parse("'invalid'")
puts language.parse("still \"invalid\"")
puts language.parse("in;valid")

This produces the same output as the first example, although the semantics are slightly different: in the first example, allowed_char actually accepts a quote character if the string it opens is never closed, because the forbidden parser fails at that position.
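
The difference between the two approaches on an unclosed string can be sketched in plain Crystal. The helper names below are hypothetical and do not call parsem; for brevity, only the single-quoted case of the lookahead is modelled.

```crystal
# Hypothetical helper (not parsem API): does a complete single-quoted string start at `pos`?
def complete_single_quoted?(input : String, pos : Int32) : Bool
  return false unless input[pos]? == '\''
  !input.index('\'', pos + 1).nil?
end

# Mirrors the not_ahead approach: a char is allowed unless a *complete*
# forbidden construct starts at this position.
def allowed_by_lookahead?(input : String, pos : Int32) : Bool
  !complete_single_quoted?(input, pos)
end

# Mirrors the none_of approach: a char is allowed unless it is itself forbidden.
def allowed_by_char_set?(input : String, pos : Int32) : Bool
  !{'\'', '"', ';'}.includes?(input[pos])
end

unclosed = "'never closed"
puts allowed_by_lookahead?(unclosed, 0) # true: no complete string starts here
puts allowed_by_char_set?(unclosed, 0)  # false: the quote itself is rejected
```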

notramo commented 1 year ago

The full concept is that I want to use this library for a shell interpreter. It would have constructs like $variables, subshell commands: (exa -l -h), output redirection: > out.txt, command separator: ;, and so on. Then there would be the bareword, which (to make things simpler for the user) would be any sequence of characters that are neither whitespace nor the start of a special language construct. String concatenation would work as in the Elvish shell, by writing two operands contiguously without a space (not the ^ character, like in BASH). This would enable the following syntax:

# variable interpolation: close bareword token with a $variable start
mv /tmp/$filename[0..-4] output/$filename".jxl"

# multiple commands: close bareword token with semicolon
exa -l -h; pijul status

# subshell: close bareword with subshell open (outer command), and subshell close (inner command)
kak (fd -t f src/)

Basically the parser would look like:

any_token = whitespace | single_string | double_string | variable | semicolon | subshell | closure # and any other tokens except bareword

bareword = # How to do it? 

source_code = bareword | any_token

Here is more detailed code I wrote for this (currently only the various strings, no subshell or variables), but it somehow doesn't work, and I don't know what I should change. It seems like the Crystal type system doesn't like more complex parsers.

require "parsem"

include Parsem

single_quote = token '\''
double_quote = token '"'

quote = single_quote | double_quote

# quoted string literals
single_string = single_quote >> not(single_quote).repeat(..).join << single_quote
double_string = double_quote >> not(double_quote).repeat(..).join << double_quote
quoted_string = single_string | double_string

# anything that is not bareword
any_other = quoted_string | whitespace

bareword_string = (any_other.not_ahead >> any(Char)).repeat(..).join

string_literal = bareword_string | quoted_string

sourcecode = (string_literal << whitespace).repeat(..).extend <=> string_literal

pp sourcecode.parse %(bareword "this is a string" "this is another" )

ThatsJustCheesy commented 1 year ago

Some notes about why your code wasn't compiling:

Also: whitespace is a single-character parser only. I suggest using ws, which is just a shortcut for whitespace.repeat(..). Perhaps the naming could be improved here.

Here is something that I think does roughly what you want:

SINGLE_QUOTE = '\''
DOUBLE_QUOTE = '"'

# quoted string literals
single_string = token(SINGLE_QUOTE) >> not(SINGLE_QUOTE).repeat(..).join << token(SINGLE_QUOTE)
double_string = token(DOUBLE_QUOTE) >> not(DOUBLE_QUOTE).repeat(..).join << token(DOUBLE_QUOTE)
quoted_string = single_string | double_string

bareword_char = none_of [SINGLE_QUOTE, DOUBLE_QUOTE, *" \t\r\n".chars]
bareword_string = bareword_char.repeat(1..).join

string_literal = quoted_string | bareword_string

# Not using the pattern from the CSV parser
# because the delimiter (whitespace) is valid at the end as well.
# If there's any whitespace after the last string, that pattern
# eats the final whitespace, then requires another string after,
# which, of course, fails.
sourcecode = ws >> (string_literal << ws).repeat(..) << ws

puts sourcecode.parse %()
puts sourcecode.parse %(bareword "this is a string" "this is another" )
puts sourcecode.parse %( lots of bareword 'single quoted')

It's unfortunate that I had to write out all the whitespace chars. This could also work:

bareword_char = whitespace.not_ahead >> none_of [SINGLE_QUOTE, DOUBLE_QUOTE]

although it's ever so slightly less efficient.
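
For comparison, the token sequence this grammar is meant to accept can be sketched without parsem, using StringScanner from Crystal's standard library. The tokenize method is a hypothetical helper for illustration only; it is a rough equivalent, not how parsem works internally.

```crystal
require "string_scanner"

# Hypothetical helper: splits source into quoted-string contents and barewords.
def tokenize(source : String) : Array(String)
  tokens = [] of String
  scanner = StringScanner.new(source)
  until scanner.eos?
    scanner.skip(/\s+/) # whitespace separates tokens and may trail the last one
    break if scanner.eos?
    if scanner.scan(/'([^']*)'/) || scanner.scan(/"([^"]*)"/)
      tokens << scanner[1] # contents without the surrounding quotes
    elsif word = scanner.scan(/[^\s'"]+/)
      tokens << word # bareword: one or more non-quote, non-whitespace chars
    else
      # Reached on an unclosed quote; raising avoids looping forever.
      raise "unexpected character at offset #{scanner.offset}"
    end
  end
  tokens
end

pp tokenize(%(bareword "this is a string" "this is another" ))
pp tokenize(%( lots of bareword 'single quoted'))
```

As in the parser above, a bareword needs at least one character, so the loop always makes progress or fails.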

Also, please note that I've just released version 1.1.2, which fixes an infinite loop bug in repeat that I found while writing this code. I'll work on the other fixes/improvements this issue helped identify at some later time.