firrtl does not have context free grammar (specification 0.2.0)

Hello! I'm currently writing a firrtl lexer/parser for yosys using flex/bison and am a bit stuck because the grammar of firrtl is not context-free.

This is because firrtl allows an id to be a reserved firrtl keyword or an integer literal. (an id can be named mem, inst, b1, h32, o3)

Example of problems this causes:

field of bundle named with a keyword:
```
input a: {flip flip: UInt}
```
This makes it impossible to write the following grammar:
```
bundle_type:
    '{' bundle_fields '}'
bundle_fields:
    bundle_fields ',' bundle_field |
    bundle_field ;
bundle_field:
    flip_opt TOK_ID ':' type ;
flip_opt:
    TOK_FLIP |
    /*empty*/ ;
```
where TOK_FLIP is a token representing the keyword flip, where TOK_ID is a token representing an id (= a string)

The following bundle definition would fail this grammar:
```
input a: {flip: UInt}
```
id named after a keyword used by a stmt, e.g. inst ( signal inst: UInt<1> )
```
inst <= UInt(1)
```
The following context-free gammar does not apply, and the parser will fail.
```
stmt:
    TOK_INST TOK_ID TOK_OF TOK_ID opt_info ;
```
where TOK_INST is a token representing the keyword inst, where TOK_ID is a token representing an id (=string), where TOK_OF is a token representing the keyword of, where opt_info is a rule that matches the optional @[""] info.

This is a problem because the grammar above conflicts with:
```
stmt:
    exp <= exp ;
exp:
    id ;
```
The lexer/parser does not know how to interpret the signal inst: When the lexer matches the string inst, should it interpret it as TOK_INST or as TOK_ID (with value inst)?
id named that collides with the integer rules e.g.
```
input b1: UInt
```
b1 should be interpreted as the string/name b1 but it can also be interpreted as the 1-bit literal 0b1 by the lexer.

Root cause: This is the same problem as Java/C++/python disallowing use of reserved keywords as variables. That's why I propose to force escaping of reserved keywords by firrtl.

My proposal LLVM IR requires variables to be prefixed by %, which would solve this problem completely. I believe firrtl was designed in the image of LLVM IR, so why not copy this property too? It would make it much clearer when reading code that something is a keyword or an id:

mem     ; <= reserved keywordis a literal
%mem    ; <= id

Thanks for writing this up and providing concrete use cases.

This was discussed at the dev meeting today. There's no conclusion, yet, but people are amenable to either (in serialized form) always require % for identifiers or require % if there's a keyword collision. The latter requirement being analogous to how Scala will let you use any identifier for a val, but if you do something weird, you need to place it in backticks (val `class` = 1). It was brought up that requiring % if there's a keyword collision means that further IR API additions would cause backwards-incompatibilities with existing FIRRTL IR that uses colliding identifiers. If the latter approach is adopted, then @ekiwi suggested pre-emptively reserving keywords.

@jackkoenig had a point that ingestion of protobuf would be an easier format to work with and is available now. However, it's not clear how cleanly this would integrate with Yosys.

Two high-level goals were identified:

FIRRTL IR should be easy to write
Chisel names, which conflict with FIRRTL keywords and do not conflict with Verilog keywords, should be preserved in generated Verilog

This will need a bit more discussion.

I have been thinking about this problem again today and I believe that the following might be a good compromise between how easy firrtl is to parse and how nice it is to write and read.

plain identifiers can contain numbers, characters + some special chars, but may not be reserved keywords and may not start with a digit
all other identifiers need to be escaped using a backtick, backticks are the only character not allowed in any identifier

input a: {flip flip: UInt}

becomes:

input a: {flip `flip`: UInt}

inst <= UInt(1)

becomes

`inst` <= UInt(1)

I am not sure if input b1: UInt is really a problem since the spec specifies that literal bits need to be enclosed in ". So this should be allowed in firrtl:

node b1 = UInt<2>("b1")

Is there a bug where a firrtl producer does not properly enclose these into "?

SystemVerilog defines its own notion of escaped identifiers (IEEE 1800-2017 §5.6.1):

Escaped identifiers shall start with the backslash character (\) and end with white space (space, tab, newline). They provide a means of including any of the printable ASCII characters in an identifier (the decimal values 33 through 126, or 21 through 7E in hexadecimal).

Neither the leading backslash character nor the terminating white space is considered to be part of the identifier. Therefore, an escaped identifier \cpu3 is treated the same as a nonescaped identifier cpu3.

While backticks would be nicer (IMO), adopting the same convention than SystemVerilog would make the translation of identifiers between the two languages much simpler. Would this be considered acceptable?

I keep meaning to loop back to this issue. We've fixed all of this except for flip. Note that fixing this is from the perspective of simplifying parsing and not entirely removing context-sensitive lexing. Specifically, the places where identifiers can occur are now obvious.

All classes of (2) are replaced with operations which can't alias with identifiers, e.g., a <= b is now connect a, b.

For (3) I don't know if this was actually a problem at the time given that the string-encoded integer literals would be lexed as strings not as integer literals. The conversion to integer literal would then happen in parsing which is more acceptable to have some context sensitive information. That said, all the integers were converted to radix-specified integer literals which have a leading 0b/0h/0o/0d or decimal integer literals. The quotes were then dropped. Any number was made illegal as an identifier unless you escape it with backticks. Previously, FIRRTL allowed identifiers to be numbers. E.g., { 0 : UInt<1> } used to be legal and now must be { `0` : UInt<1> }.

The one lingering thing is flip. However, that is likely easiest by moving the flip after the ::

input a: { b : flip UInt<1> }

This may be confusing as flip is not part of the type. However, that's a bit of a nit.

chipsalliance / firrtl-spec

firrtl does not have context free grammar (specification 0.2.0) #5