Nadrieril closed this 5 years ago
I think this is reasonable. AFAIK, RFC 5234 does not give any advice on an upper limit, but supporting UTF-8 makes sense :-)
Okay, so I think the upper limit is the US-ASCII range. The RFC also has descriptions in the form "%c##-##". If we support UTF-8 here, don't we run into problems with terminals, prose-vals, and case-insensitivity? At the very least, the library becomes inconsistent. What do you think?
2.3. Terminal Values
...
NOTE:
ABNF strings are case insensitive and the character set for these
strings is US-ASCII.
I believe this limit only applies to the string literals, which are the ones that have to deal with case-insensitivity.
I think that just means that the only way to specify a Unicode terminal is using the %x.XXXX literals. The string literals will stay limited to US-ASCII. This can be annoying to write, but I think that's consistent with the RFC.
Here is an example of a grammar that uses those literals for Unicode chars: https://github.com/dhall-lang/dhall-lang/blob/1c8335d9362342c64d3b4ffaa2afac0eecdff209/standard/dhall.abnf#L309
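The distinction drawn above can be sketched in Rust as follows. This is only an illustration under my own assumptions: `matches_literal` and `matches_code_point` are hypothetical helpers, not functions from the library. Quoted string literals stay US-ASCII and compare case-insensitively, while numeric terminals compare code points exactly, so case-insensitivity never has to be defined for non-ASCII text.

```rust
/// Hypothetical helper (not part of the library): matches an ABNF quoted
/// string literal against `input`. Following the RFC 5234 note, literals
/// must be US-ASCII and are compared case-insensitively.
pub fn matches_literal(literal: &str, input: &str) -> bool {
    // Reject non-ASCII literals outright, then compare ignoring ASCII case.
    literal.is_ascii() && literal.eq_ignore_ascii_case(input)
}

/// Hypothetical helper (not part of the library): matches a numeric
/// terminal such as `%x3BB` against a single character. The comparison is
/// an exact code point comparison, with no case folding.
pub fn matches_code_point(value: u32, input: char) -> bool {
    value == input as u32
}
```

With these helpers, `matches_literal("abc", "ABC")` holds, whereas a numeric terminal for U+03BB matches only that exact code point and not its uppercase form U+039B.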
Okay good. So let's keep the u32 :+1:
To allow for specifying any Unicode char or range (%x0-10FFFF), I believe we need at least a u32 in the Range enum.
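A minimal sketch of what this could look like, assuming a simplified `Range` enum. The variant names `Single` and `Span` and the helper `any_unicode` are hypothetical and may not match the library's actual definitions. The point is only that 0x10FFFF does not fit in a u8 or u16, so u32 is the smallest standard unsigned integer type that can hold every Unicode code point.

```rust
/// Simplified, hypothetical model of a numeric terminal. The variant names
/// are illustrative and are not taken from the library.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Range {
    /// A single value, e.g. `%x41`.
    Single(u32),
    /// An inclusive range of values, e.g. `%x0-10FFFF`.
    Span(u32, u32),
}

impl Range {
    /// Returns true if the code point of `c` is covered by this terminal.
    pub fn contains(&self, c: char) -> bool {
        let cp = c as u32;
        match self {
            Range::Single(value) => *value == cp,
            Range::Span(lo, hi) => *lo <= cp && cp <= *hi,
        }
    }
}

/// The full range mentioned in the issue, `%x0-10FFFF`.
pub fn any_unicode() -> Range {
    Range::Span(0x0, 0x10FFFF)
}
```

For example, `any_unicode().contains('\u{10FFFF}')` holds, while the same upper bound could not even be written down if the fields were u8 or u16.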