Lexer implementation - Githubissues

42-Ikole-Systems / TMK-SH

An awesome POSIX compliant shell.

MIT License


Lexer implementation #7

Open mraasvel opened 1 year ago

mraasvel commented 1 year ago

Support tokenizing every token listed in the standard from a line that has been read: https://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_10

Out of scope: reading and tokenizing here-documents.
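
As a rough illustration of the operator part of that task, here is a minimal Python sketch of longest-match operator recognition. The operator spellings and the names for the multi-character operators are taken from the POSIX shell grammar. The function name `match_operator`, the table layout, and labeling single-character operators by their own text are assumptions made for this example, not the project's actual design. Words, IO_NUMBER, quoting and here-documents are not handled.

```python
# Multi-character operator tokens from the POSIX shell grammar.
# Single-character operators are labeled by their own text (an assumption).
OPERATORS = {
    "&&": "AND_IF", "||": "OR_IF", ";;": "DSEMI",
    "<<": "DLESS", ">>": "DGREAT", "<&": "LESSAND",
    ">&": "GREATAND", "<>": "LESSGREAT", "<<-": "DLESSDASH",
    ">|": "CLOBBER",
    "|": "|", "&": "&", ";": ";", "<": "<", ">": ">", "(": "(", ")": ")",
}


def match_operator(src: str, pos: int):
    """Return (text, name) for the longest operator starting at pos, or None."""
    # Try the longest candidates first so that "<<-" wins over "<<" and "<".
    for length in (3, 2, 1):
        candidate = src[pos:pos + length]
        if len(candidate) == length and candidate in OPERATORS:
            return candidate, OPERATORS[candidate]
    return None
```

For example, `match_operator("cat <<- EOF", 4)` yields `("<<-", "DLESSDASH")`, while a position that does not start an operator yields `None`.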

mraasvel commented 1 year ago

A backtick command substitution should be scanned until the next non-escaped backtick is encountered. Even when command substitutions are nested, the backticks in the nested sequences must be escaped, as defined by the standard.

echo `echo   $( echo \`echo 1234 \` ) `

Each further level of nested backticks should be prefixed by one additional escaped backslash. The Lexer only has to worry about the outer backticks; the expander will handle the complex execution logic.

echo ` \`  \\\` \\\`  \` `
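
The scanning rule described in this comment could be sketched as follows in Python. The function name `find_closing_backtick` and the convention of returning `None` when the input ends before the closing backtick are assumptions for illustration. The sketch deliberately ignores quoting inside the substitution and only skips escaped characters.

```python
def find_closing_backtick(src: str, open_pos: int):
    """Return the index of the unescaped backtick that closes the one at
    open_pos, or None if the input ends first (more input is needed)."""
    assert src[open_pos] == "`"
    i = open_pos + 1
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            # Skip the backslash and the character it escapes, so that
            # \` and \\ never terminate the substitution.
            i += 2
            continue
        if ch == "`":
            return i
        i += 1
    return None
```

Applied to both example lines in this comment, starting at the first backtick, the function returns the index of the final backtick on the line, so the whole substitution ends up in one token and the nested levels are left for the expander.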

mraasvel commented 1 year ago

An unquoted backslash followed by a newline should be removed from the input and not added to the history.

\ <newline>

To achieve this, the Lexer should be given the reader, so that the Lexer can remove characters from the input stream itself.
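
A minimal Python sketch of the removal step, working on an already-read string rather than on a real reader object. The function name `remove_line_continuations` is hypothetical. The sketch assumes that a backslash-newline inside single quotes is literal and must be kept, and that one inside double quotes is removed like an unquoted one. Comments and here-documents are not considered.

```python
def remove_line_continuations(text: str) -> str:
    """Drop every backslash-newline pair that is not inside single quotes."""
    out = []
    i = 0
    in_single = False
    in_double = False
    while i < len(text):
        ch = text[i]
        if in_single:
            # Everything is literal until the closing single quote.
            if ch == "'":
                in_single = False
            out.append(ch)
            i += 1
        elif ch == "\\" and i + 1 < len(text):
            if text[i + 1] == "\n":
                i += 2  # line continuation: drop both characters
            else:
                out.append(text[i:i + 2])  # keep the escape pair untouched
                i += 2
        else:
            if ch == "'" and not in_double:
                in_single = True
            elif ch == '"':
                in_double = not in_double
            out.append(ch)
            i += 1
    return "".join(out)
```

In a design where the Lexer owns the reader, the same check would run as characters are pulled from the stream, so the removed pair never reaches the token text or the history.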

mraasvel commented 1 year ago

[ ] Newline handling inside the () parenthesis operator state, and maybe also the {} braces state, e.g. (\n) and {\n}. The standard doesn't mention these, but bash handles them, so we might want to support them too.
[ ] Make a graph for all the state transitions

Tishj commented 1 year ago

I feel like these grammar rules cover this behavior?

subshell         : '(' compound_list ')'
                 ;
compound_list    :              term
                 | newline_list term
                 |              term separator
                 | newline_list term separator

Also, looking at this, it makes no sense that this doesn't use linebreak; it's literally an optional newline_list.

mraasvel commented 1 year ago

Hmm, I'm not sure. It seems like Bash at least has the lexer logic to look for a closing brace. For example, $( ( .. ) would expand to a single word of the form $( ( ... ) and result in a syntax error in the expander instead of the lexer requesting additional input.

In this case the result is not the operators ( and ) but a word, because the $( ... ) is an expansion that is part of a word. In a non-nested layer the parser would probably notice it and request more tokens.
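
One way to picture lexer-side tracking of the closing parenthesis is a depth counter, sketched here in Python. The name `find_substitution_end` and the `None` return value meaning "request more input" are assumptions for this example. Quoting, `case` patterns and `$(( ))` arithmetic are ignored, so this is only a simplified illustration and not how Bash actually implements it.

```python
def find_substitution_end(src: str, start: int):
    """start is the index of '$' in '$('. Return the index of the matching
    ')' or None if the substitution is still open at the end of src."""
    assert src.startswith("$(", start)
    depth = 0
    i = start + 1
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            i += 2  # an escaped character never opens or closes a level
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return None
```

With this counter, `$( ( .. )` is reported as incomplete because the outer level is still open, whereas `$( ( .. ) )` is closed by its last character and can be emitted as part of a single word.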