Open notramo opened 1 year ago
Currently, there's no way to "peek" into parsers and check the first character; either the parser runs, or it doesn't. (The implementation is entirely closures/Procs, which, of course, can't be inspected after construction.)
If you're OK with it running the entire parser for the error case, then yes, #not_ahead would be the way to do it:
single_quoted_string = token('\'') >> not('\'').repeat(..) << token('\'')
double_quoted_string = token('"') >> not('"').repeat(..) << token('"')
semicolon = token(';')
forbidden = single_quoted_string | double_quoted_string | semicolon
allowed_char = forbidden.not_ahead >> any(Char)
language = allowed_char.repeat(..).join
puts language.parse("this is ok")
puts language.parse("'invalid'")
puts language.parse("still \"invalid\"")
puts language.parse("in;valid")
Output:
this is ok
ParseError: expected end of input, but found 'invalid'
ParseError: expected end of input, but found "invalid"
ParseError: expected end of input, but found ;valid
I don't consider this that bad, since there's only a performance hit on erroneous input.
But if possible, I would suggest restructuring your parsers so you don't need this at all. e.g., instead of this:
string = single_quoted_string | double_quoted_string
var_name_char = string.not_ahead >> any(Char)
var_name = var_name_char.repeat(..).join
language = var_name | string
Flip the choice order for language:
string = single_quoted_string | double_quoted_string
var_name = any(Char).repeat(..).join
language = string | var_name
| effectively checks the first token and switches to var_name when necessary, without additional checks in var_name.
If you still really need the checks, but don't want the performance penalty for the error case, you'll have to manually construct a new parser, likely with none_of:
single_quoted_string = token('\'') >> not('\'').repeat(..) << token('\'')
double_quoted_string = token('"') >> not('"').repeat(..) << token('"')
semicolon = token(';')
allowed_char = none_of(['\'', '"', ';'])
language = allowed_char.repeat(..).join
puts language.parse("this is ok")
puts language.parse("'invalid'")
puts language.parse("still \"invalid\"")
puts language.parse("in;valid")
This has the same output as the first example. (Although the semantics are slightly different: The first example actually succeeds if e.g. the string is never closed)
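To make that semantic difference concrete, here is a minimal hand-rolled sketch in plain Crystal. It deliberately does not use Parsem: `closed_string_at?`, `accepts_with_lookahead?` and `accepts_with_none_of?` are hypothetical helpers invented for illustration, and they only model the accept/reject decision, assuming the lookahead fails exactly when a complete quoted string or a semicolon starts at the current position.

```crystal
# Hypothetical helpers for illustration only; this is not how Parsem is implemented.

# True if a complete quoted string (opening quote, body, matching closing quote)
# starts at index `i`.
def closed_string_at?(chars : Array(Char), i : Int32) : Bool
  quote = chars[i]
  return false unless quote == '\'' || quote == '"'
  !chars.index(quote, i + 1).nil?
end

# Models `forbidden.not_ahead >> any(Char)`: a position is rejected only if a
# *complete* forbidden construct starts there.
def accepts_with_lookahead?(input : String) : Bool
  chars = input.chars
  (0...chars.size).none? { |i| chars[i] == ';' || closed_string_at?(chars, i) }
end

# Models `none_of(['\'', '"', ';'])`: any of these characters is rejected outright.
def accepts_with_none_of?(input : String) : Bool
  input.chars.none? { |c| c == '\'' || c == '"' || c == ';' }
end

puts accepts_with_lookahead?("this is ok") # true
puts accepts_with_none_of?("this is ok")   # true
puts accepts_with_lookahead?("'unclosed")  # true: no complete string ahead
puts accepts_with_none_of?("'unclosed")    # false: the quote itself is forbidden
```

In other words, the lookahead variant only objects to well-formed forbidden constructs, while the character-set variant objects to their first character regardless of what follows.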
The full concept is that I want to use this library for a shell interpreter. It would have constructs like $variables, subshell commands: (exa -l -h), output redirection: > out.txt, command separator: ;, and so on. Then there would be the bareword, which (to make things simpler for the user) would be any run of characters other than spaces and special language constructs. String concatenation would work as in the Elvish shell, by writing two operands contiguously without space (not the ^ character, like in BASH). This would enable the following syntax:
# variable interpolation: close bareword token with a $variable start
mv /tmp/$filename[0..-4] output/$filename".jxl"
# multiple commands: close bareword token with semicolon
exa -l -h; pijul status
# subshell: close bareword with subshell open (outer command), and subshell close (inner command)
kak (fd -t f src/)
Basically the parser would look like:
any_token = whitespace | single_string | double_string | variable | semicolon | subshell | closure # and any other tokens except bareword
bareword = # How to do it?
source_code = bareword | any_token
Here is more detailed code I wrote for this (currently only the various strings, no subshell or variables), but it somehow doesn't work, and I don't know what I should change. It seems the Crystal type system doesn't like more complex parsers.
require "parsem"
include Parsem
single_quote = token '\''
double_quote = token '"'
quote = single_quote | double_quote
# quoted string literals
single_string = single_quote >> not(single_quote).repeat(..).join << single_quote
double_string = double_quote >> not(double_quote).repeat(..).join << double_quote
quoted_string = single_string | double_string
# anything that is not bareword
any_other = quoted_string | whitespace
bareword_string = (any_other.not_ahead >> any(Char)).repeat(..).join
string_literal = bareword_string | quoted_string
sourcecode = (string_literal << whitespace).repeat(..).extend <=> string_literal
pp sourcecode.parse %(bareword "this is a string" "this is another" )
Some notes about why your code wasn't compiling:
not requires a Token as input, but you've passed a Parser(Token, Token): use not('\'') instead of not(token '\'').
not_ahead has a bug! It's not compiling when the parser's output has already been transformed into non-tokens. I'll fix that soon.
Also: whitespace is a single-character parser only. I suggest using ws, which is just a shortcut for whitespace.repeat(..). Perhaps the naming could be improved here.
Here is something that I think does roughly what you want:
SINGLE_QUOTE = '\''
DOUBLE_QUOTE = '"'
# quoted string literals
single_string = token(SINGLE_QUOTE) >> not(SINGLE_QUOTE).repeat(..).join << token(SINGLE_QUOTE)
double_string = token(DOUBLE_QUOTE) >> not(DOUBLE_QUOTE).repeat(..).join << token(DOUBLE_QUOTE)
quoted_string = single_string | double_string
bareword_char = none_of [SINGLE_QUOTE, DOUBLE_QUOTE, *" \t\r\n".chars]
bareword_string = bareword_char.repeat(1..).join
string_literal = quoted_string | bareword_string
# Not using the pattern from the CSV parser
# because the delimiter (whitespace) is valid at the end as well.
# If there's any whitespace after the last string, that pattern
# eats the final whitespace, then requires another string after,
# which, of course, fails.
sourcecode = ws >> (string_literal << ws).repeat(..) << ws
puts sourcecode.parse %()
puts sourcecode.parse %(bareword "this is a string" "this is another" )
puts sourcecode.parse %( lots of bareword 'single quoted')
Unfortunate that I've had to write out all the whitespace chars. This could also work:
bareword_char = whitespace.not_ahead >> none_of [SINGLE_QUOTE, DOUBLE_QUOTE]
although it's ever so slightly less efficient.
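As a cross-check on what the sourcecode parser above is meant to accept, the same splitting can be sketched by hand in plain Crystal. This is only an illustration under stated assumptions: `tokenize` is a hypothetical helper, not part of Parsem, and it assumes that tokens are either quoted strings (returned without their quotes) or runs of non-whitespace, non-quote characters, with any amount of whitespace allowed between and after them.

```crystal
# Hypothetical hand-written tokenizer, for illustration only (not Parsem).
def tokenize(input : String) : Array(String)
  tokens = [] of String
  chars = input.chars
  i = 0
  while i < chars.size
    c = chars[i]
    if c.whitespace?
      # Skip delimiters, including leading and trailing whitespace.
      i += 1
    elsif c == '\'' || c == '"'
      # Quoted string: everything up to the matching closing quote.
      close = chars.index(c, i + 1)
      raise ArgumentError.new("unterminated string at index #{i}") if close.nil?
      tokens << chars[(i + 1)...close].join
      i = close + 1
    else
      # Bareword: one or more characters that are neither whitespace nor quotes.
      start = i
      while i < chars.size && !chars[i].whitespace? && chars[i] != '\'' && chars[i] != '"'
        i += 1
      end
      tokens << chars[start...i].join
    end
  end
  tokens
end

p tokenize(%(bareword "this is a string" "this is another" ))
# prints ["bareword", "this is a string", "this is another"]
```

Because whitespace is skipped wherever it appears, trailing whitespace never forces another token to follow, which is the failure mode described in the comment on the CSV-style pattern.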
Also, please note that I've just released version 1.1.2, which fixes an infinite loop bug in repeat that I found while writing this code. I'll work on the other fixes/improvements this issue helped identify at some later time.
There are several tokens and Parsers, e.g. single_quoted_string, double_quoted_string, semicolon, etc. How to parse any char that is not the start of these?
It would be basically #none_of, but with Parser, not Token. Or maybe some combination of #not_ahead?