edn-format / edn

Extensible Data Notation
2.6k stars 96 forks source link

Proposed formal syntax for edn #56

Open bpsm opened 11 years ago

bpsm commented 11 years ago

My purpose in writing this is to get a discussion started which will hopefully lead to hashing out a formal syntax for edn.

(This grammar is written in slightly extended version of Wirth Syntax Notation. A description is appended to the end of this document. I'm not married to using WSN. I chose it because it's familiar to me, is easily explained and is compact, but translating these productions to some other notation would be no great hardship once errors and ambiguities have been inroned out.)

Points to Consider

This syntax considers code point NUL (U+0) unacceptable in valid edn input.

This syntax forbids supurfluous leading zeros in numeric literals. (Unlike current implementations.)

This syntax does not allow unicode characters outside of the US-ASCII range to appear in symbol or keyword names.

This syntax does not allow < and > to appear in symbol or keyword names.

This syntax supports \backspace and \formfeed, "\b" and "\f", as do current implementations.

Elements

START = elements.

elements = { s element }.

element = nil | boolean | symbol | keyword | number | character
      | string
      | "{" s { element s element s } "}"
      | "#{" elements "}"
      | "[" elements "]"
      | "(" elements ")"
      | "#" tagSymbol s element.

White space is allowed between elements, but not always required. For example: {}{} parses as two empty maps. "MeaningOfLife"42 parses as a String followed by an integer. However, a:b parses as a single symbol, not as the symbol a followed by the keyword :b.

White Space

For parsing we treat comments and discarded elements to be as we do whitespace. This production describes some (possibly zero-length) run of text that is ignored by the parser:

s = { whitespace | comment | discardedElement }.

Edn, like Clojure, considers the comma (,) to be white space.

whitespace = " " | HT | LF | CR | ",".

(The syntax is written to be quite narrow here. Perhaps we should just accept "," and anything in the range U+1 … " " as white space.)

Issue#31 states that only \newline (i.e. U+A) terminates comments.

comment = ";" { commentContent } LF.

Comments can contain any code points, except LF (which terminates the comment). This is compatible with both unix (U+A) and windows (U+D U+A) style line breaks. It's incompatible with the line break style of classic Mac OS (U+D). Presumably that's no great loss.

commentContent = U+1 | … | U+9 | U+B | … | MaxCodePoint.
    (* all characters except NUL, LF. *)

discardedElement = "#_" s element.

HT  = U+9.
LF  = U+A.
CR  = U+D.

MaxCodePoint = U+10FFFF.

Symbols

symbol = "/" | [name  "/"]  name.

Issue#30 states that "tag symbols must begin with an alphabetic character", so we introduce a production for this:

tagSymbol = letter [ name "/" ] name.

Issue#32 says that :/ is not a valid keyword. Otherwise keywords adhere the naming rules of symbols following the initial :.

keyword = ":" [name "/"] name.

Names beginning with a nameStart2 must continue with a letter if they continue at all. This avoids possible ambiguity with numeric literals.

README.md, does not allow < and > to appear in symbol (or keyword) names. clojure.edn/read and clojure.core/read both accept < and > as members of nameStart1. Is this deliberate?

name = nameStart1 { nameConstituent }
     | nameStart2 [ letter { nameConstituent } ].

nameStart1 = "!" | "*" | "?" | "_" | "$" | "%" | "&" | "=" | letter.

nameStart2 = "." | "-" | "+".

nameConstituent = nameStart1 | nameStart2 | digit | "#" | ":".

letter = "a" | … | "z" | "A" | … | "Z".

digit = "0" | nonZeroDigit.

nonZeroDigit = "1" | … | "9".

The words true, false and nil look syntactically like symbols, but are not parsed as such. I've given them their own productions to hint at this:

boolean = "true" | "false".

nil = "nil".

(But, that makes this grammar ambiguous since there are now two ways of parsing "true", "false" and "nil".)

Numbers

number = integer | float.

This syntax does not allow supurfluous leading zeros in integers.

README.md (the current informal standard) does allow leading zeros.

clojure.core/read and clojure.edn/read both allow leading zeros here but interpret the remaining digits in base 8! See also Issue#33.

edn-java allows leading zeros, but gives them no special meaning.

integer = [ "+" | "-" ] cardinal [ "N" ].

The float syntax disagrees with the formal syntax from README.md, but does so in order to comply with "In addition, a floating-point number may have the suffix M to indicate that exact precision is desired."

The grammar in README.md does not allow leading zeros in the integer and exponent portions of a float. clojure.core/read, clojure.edn/read and edn-java all accept leading zeros in these cases, but given them no special interpretation.

float = [ "+" | "-" ] cardinal
        ((frac [exp] ["M"]) | (exp ["M"]) |  "M").

The fractional portion can consist of only a "." (not followed by any digits). This is consistent with the current spec and with the behavior of clojure.core/read and clojure.edn/read

frac =   "." { digit }.

exp = ("E" | "e") ["+" | "-"] cardinal.

cardinal = digit | nonZeroDigit { digit }.

Characters

character = "\" (characterName | printableCharacter).

Certain characters can be refered to by names. In their literal form these characters would display as whitespace, making the resulting edn quite confusing for a person to read or edit.

The specification only mentions the first four of these explicitly; backspace and formfeed are included for symmentry with string and because clojure.edn/read and clojure.core/read both support them.

characterName = "newline" | "space" | "tab" | "return"
              | "backspace" | "formfeed".

Characters literals for other code points are written by simply including the literal character immediate following the "\".

printableCharacter = "!" | … | "~" | U+A1 | … | MaxCodePoint.

This definition of printableCharacter is sloppy. There are probably other Unicode code points that we don't want to use in character literals because they have no printed representation or perform some control function.

Strings

string = """" {stringChar | "\" stringEscape} """".

stringChar = U+1 | … | "!" | "#" | … | "[" | "]" | … | MaxCodePoint.
    (* all code points except NUL, " and \ *)

README.md only mentions "Standard C/Java escape characters \t \r \n are supported", but clearly \\ and \" must be included. \b \f are included because they are supported by clojure.core/read and clojure.edn/read. \' is excluded despite being supported by Java because clojure.core/read and clojure.edn/read reject \'.

stringEscape = """" | "b" | "t" | "n" | "f" | "r" | "\".

Wirth Syntax Notation

Wikipedia provide as a brief description of WSN. The formal Syntax of WSN in WSN follows:

SYNTAX     = { PRODUCTION } .
PRODUCTION = IDENTIFIER "=" EXPRESSION "." .
EXPRESSION = TERM { "|" TERM } .
TERM       = FACTOR { FACTOR } .
FACTOR     = IDENTIFIER
           | LITERAL
           | "[" EXPRESSION "]"
           | "(" EXPRESSION ")"
           | "{" EXPRESSION "}" .
IDENTIFIER = letter { letter } .
LITERAL    = """" character { character } """" .

The only oddity for modern readers is that string literals do not provide anything like the \ escaping convention we have grown used to C-like languages. One embeds a double quote in a string by doubling it. All other characters in the string literal are taken literally, as written.

An Extension to WSN to Represent Unicode Codepoints

Edn's syntax is specified in terms of unicode code points, the first 128 of which are identical to US-ASCII. (Edn is always serialized as UTF-8, which can represent the full set of unicode code points.)

For the purposes of this grammar, we'll use the following extension to represent a unicode code point:

HexDigit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
         | "A" | "B" | "C" | "D" | "E" | "F".

CodePoint = "U+" HexDigit { HexDigit }.

For example: NUL is written as U+0, " " can be written as U+20 and "~" can be written as U+7F.

To represent large contiguous subsets of the unicode codepoints, we use an elipsis as follows:

HexDigit = "0" | … | "9" | "A" | … | "F".
bpsm commented 11 years ago

See also https://gist.github.com/bpsm/5951638 which is just the plain WSN with comments.

zaiquirii commented 11 years ago

This syntax disagrees with the specification with regards to whitespace before and after { } [ ]() as well as "s. As far as I can tell it does not allow opening brackets to be directly followed by a space, Nor does your example of "MeaningOfLife"42 actually work. I know you state that whitespace isn't always necessary in the comment, but is there a way to remove all ambiguity. I tried myself, using the [ ] in WSN means optional. but I was always ignoring one of the cases.

changing elements to be defined as

elements = element { s element }.

would take care of spaces around { } [ ( ), leaving just the spaces between strings and other elements and maps

bpsm commented 11 years ago

@hubrys Thanks for taking the time to review the syntax. I'll try to respond to your objections one at at time:

it does not allow opening brackets to be directly followed by a space

Let's take the production for vectors as an example:

"[" elements "]".

If we substitute in the definition of elements we get:

"[" { s element } "]".

The production for s, allows any amount of white space (including no white space, since {x} means 0 or more repetitions of x in WSN):

"[" { { whitespace | comment | discardedElement } element } "]".

Here are a few examples that would match this production: [ nil], [nil] and [ #_ nil].

But, there is a bug, the following don't match: [] and [nil ].

I believe this fixes it by allowing s following elements in collections.

element = nil | boolean | symbol | keyword | number | character
      | string
      | "{" s { element s element s } "}"
      | "#{" elements s "}"
      | "[" elements s "]"
      | "(" elements s ")"
      | "#" tagSymbol s element.

With this change [] and [nil ] are both legal.

Nor does your example of "MeaningOfLife"42 actually work.

I don't follow, "MeaningOfLife"42 parses as:

elements
    s
        <empty>
    element
        string
            "MeaningOfLife"
    s
        <empty>
    element
        number
            integer
                "42"

Am I missing something?

zaiquirii commented 11 years ago

@bpsm Ahhhh, makes sense now. For some reason I was under the impression that the { } brackets implied that the item inside of them would have to be there at least once, not 0 or more times. That actually clears up and answers all the questions I had. Thanks.

tomjakubowski commented 9 years ago

A few things I've discovered tools.reader.edn accepts, which disagree with this grammar and which disagree with the README or are not specified there.

aadrian commented 7 years ago

How about an ANTLR https://github.com/antlr/antlr4 grammar for EDN of this to be even more precise?

There seems to be already one for Clojure https://github.com/antlr/grammars-v4/tree/master/clojure

Thank you.

Marti2203 commented 4 years ago

Hello, I just made a grammar for the EDN format. If you are still interested, please check it at https://github.com/antlr/grammars-v4/pull/1831

aadrian commented 4 years ago

@Marti2203 Thank you for the grammar file !

Marti2203 commented 4 years ago

:) With pleasure, please check it and give a review as I had never heard of EDN before seeing the issue in the grammar repo