duesee / abnf

A nom-based ABNF parser.
Apache License 2.0

Allow numeric values larger than 1-byte #3

Closed: Nadrieril closed this issue 5 years ago

Nadrieril commented 5 years ago

To allow specifying any Unicode character or range (%x0-10FFFF), I believe we need at least a u32 in the Range enum.
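To illustrate why a one-byte field is not enough, here is a minimal sketch. The `Range` enum and its variant names below are hypothetical stand-ins, not the crate's actual definition; the only point is that the bounds of `%x0-10FFFF` need a `u32`.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the crate's `Range` type (names are assumptions).
#[derive(Debug, Clone, PartialEq)]
enum Range {
    /// A single numeric value, e.g. `%x3BB`.
    Single(u32),
    /// An inclusive range of numeric values, e.g. `%x0-10FFFF`.
    Between(u32, u32),
}

impl Range {
    /// Returns true if the code point of `c` is covered by this value or range.
    fn matches(&self, c: char) -> bool {
        let v = c as u32;
        match self {
            Range::Single(x) => v == *x,
            Range::Between(lo, hi) => (*lo..=*hi).contains(&v),
        }
    }
}

fn main() {
    // The upper bound of the Unicode code space does not fit into one byte.
    assert!(u8::try_from(0x10FFFFu32).is_err());

    // With u32 bounds, the full range from the issue is representable.
    let any_unicode = Range::Between(0x0, 0x10FFFF);
    assert!(any_unicode.matches('λ')); // U+03BB

    // A range capped at one byte cannot cover non-Latin-1 characters.
    let one_byte = Range::Between(0x0, 0xFF);
    assert!(!one_byte.matches('λ'));
}
```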

duesee commented 5 years ago

I think this is reasonable. AFAIK, RFC 5234 does not give any advice on an upper limit, but supporting UTF-8 makes sense :-)

duesee commented 5 years ago

Okay, so I think the upper limit is the "US-ASCII" range. The RFC also has descriptions of the form "%c##-##". If we support UTF-8 here, don't we run into problems with terminals, prose values and case-insensitivity? At the very least, the library would become inconsistent. What do you think?

2.3.  Terminal Values

...

   NOTE:

      ABNF strings are case insensitive and the character set for these
      strings is US-ASCII.

Nadrieril commented 5 years ago

I believe this limit only applies to the string literals, which are the ones that have to deal with case-insensitivity. I think that just means the only way to specify a Unicode terminal is with the %xXXXX numeric literals. The string literals will stay limited to US-ASCII. This can be annoying to write, but I think it's consistent with the RFC.
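A rough sketch of that split, assuming a hypothetical `Terminal` type and `matches_prefix` helper (neither is the crate's API): quoted literals are compared as case-insensitive US-ASCII, while numeric values are compared as exact code points, so case folding never touches non-ASCII input.

```rust
/// Hypothetical terminal kinds (illustrative only, not the crate's types).
enum Terminal {
    /// Quoted string literal such as "abc": US-ASCII, case-insensitive.
    Literal(String),
    /// Numeric value such as %x3BB: one exact code point.
    Value(u32),
}

/// Returns true if `input` starts with the given terminal.
fn matches_prefix(t: &Terminal, input: &str) -> bool {
    match t {
        Terminal::Literal(s) => {
            // Only ASCII literals are accepted; comparison ignores ASCII case.
            s.is_ascii()
                && input.len() >= s.len()
                && input.as_bytes()[..s.len()].eq_ignore_ascii_case(s.as_bytes())
        }
        // Numeric values compare the first code point exactly, without case folding.
        Terminal::Value(v) => input.chars().next().map(|c| c as u32) == Some(*v),
    }
}

fn main() {
    // String literals ignore case.
    assert!(matches_prefix(&Terminal::Literal("abc".to_string()), "ABCdef"));

    // A Unicode terminal is expressed as a numeric value (here U+03BB).
    assert!(matches_prefix(&Terminal::Value(0x3BB), "λx"));

    // Numeric values are exact: %x41 ('A') does not match 'a'.
    assert!(!matches_prefix(&Terminal::Value(0x41), "a"));
}
```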

Nadrieril commented 5 years ago

Here is an example of a grammar that uses those literals for Unicode chars: https://github.com/dhall-lang/dhall-lang/blob/1c8335d9362342c64d3b4ffaa2afac0eecdff209/standard/dhall.abnf#L309

duesee commented 5 years ago

Okay good. So let's keep the u32 :+1: