anko / eslisp

un-opinionated S-expression syntax and macro system for JavaScript

Replace transform macros with proper reader macros #24

Open · anko opened this issue 8 years ago

anko commented 8 years ago

At the moment, there is no programmatic access to the parser (currently sexpr-plus), which would be necessary to customise syntax (e.g. for square-bracket array notation).

Transform macros can be abused to do this to some degree (@whacked has toyed with it), but it's cumbersome and fragile, and any whitespace or other characters that the default parser swallows remain inaccessible to them.
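For context, a transform macro is essentially a module exporting a whole-tree rewrite over the parsed S-expression AST. A rough sketch, assuming the `{ type, content }` node shape used further down this thread; the dash-to-camelCase rule is only for illustration:

```js
// Sketch only: walk the whole tree, rewriting atoms as we go.
module.exports = function transform(node) {
  if (node.type === "list") {
    return { type: "list", content: node.content.map(transform) };
  }
  if (node.type === "atom") {
    return {
      type: "atom",
      content: node.content.replace(/-([a-z])/g, function (_, c) {
        return c.toUpperCase();
      })
    };
  }
  return node; // strings etc. pass through untouched
};
```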

@lhorie has written a read-table-based parser that we could first modify to pass sexpr-plus' tests, then allow user code to register new read table entries.

vendethiel commented 8 years ago

That'd be amazing, but you need to decide how you "hook" readtables in (per-file, say, like Racket's #lang).

anko commented 8 years ago

@vendethiel I'm thinking compiler flag, like transform macros are now.

```
eslc -r eslisp-square-bracket-array -r eslisp-clojureish-quotes < code.esl > code.js
```

—where eslisp-square-bracket-array and eslisp-clojureish-quotes are fictional modules that define read macros.

I only know a little Racket, but #lang seems to affect the whole language. I'd instead like it to be possible to turn features on one by one.

vendethiel commented 8 years ago

It sure would be more composable, but different modules will want different transformers (obviously), so that needs to be integrated correctly with the module system/builder.

dead-claudia commented 8 years ago

My recent refactorings in anko/sexpr-plus#4 seem to be promising in this area...

anko commented 8 years ago

I've been thinking hard about this feature and sketching implementations. I think I finally have a plan for reader macros that are powerful enough to be worthwhile, yet simple enough to implement and maintain. It departs from the Lisp tradition of readtable-based parsers.

The following is a design overview, for critique and posterity.

For a working proof-of-concept of this type of parser, refer to the expose-subparsers branch of the sexpr-plus module (which is currently eslisp's parser), especially these tests for the API that allows modification of the parser.

Summary

The direction I'm taking is to port the parser from PEG.js to Parsimmon. Parsimmon parsers are built by composing other sub-parsers. With this hack it's possible to replace a sub-parser's behaviour without changing its identity.
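Roughly, the hack could work like this; an illustration only, assuming Parsimmon's internal convention of keeping a parser's behaviour in its `_` action function (this is not sexpr-plus's actual code):

```js
// Overwriting `_` changes what a parser does while its identity stays
// the same, so every parser that references it sees the new behaviour.
var P = require("parsimmon");

var atom = P.regexp(/[^\s()]+/).map(function (content) {
  return { type: "atom", content: content };
});

// `expr` captured its reference to `atom` at construction time:
var expr = P.alt(P.string("()"), atom);

function replace(target, replacement) {
  target._ = replacement._;
}

// Swap in an uppercasing atom parser; `expr` picks it up automatically.
replace(atom, P.regexp(/[^\s()]+/).map(function (content) {
  return { type: "atom", content: content.toUpperCase() };
}));

// expr.parse("hi").value is now { type: "atom", content: "HI" }
```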

Before parsing, eslisp would load a user-provided JavaScript configuration file and call it, passing all of these sub-parsers as an argument. User code can then make arbitrary modifications to how parsing works, and eslisp uses the modified parser to parse text into an AST.
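On the compiler side, that amounts to something like the following sketch. It assumes a sexpr-plus build that exposes its sub-parsers on the module object and exports a `parse` function; the flag handling is hand-waved:

```js
var fs = require("fs");
var path = require("path");
var parser = require("sexpr-plus"); // the expose-subparsers branch

var configPath = process.argv[2]; // whatever --config resolved to
if (configPath) {
  require(path.resolve(configPath))(parser); // user code mutates the parser
}

var ast = parser.parse(fs.readFileSync("/dev/stdin", "utf8"));
// ...then hand `ast` to macro expansion and code generation as before.
```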

Just to be clear, by parsing I mean the process of converting a text file representing S-expressions into a data structure representing S-expressions (an abstract syntax tree, of the same format that we already use).

This would deprecate transform macros: reader macros generalise them.

Details

How would this work?

Sub-parsers would include things like list openers and closers (default `(` and `)`), expressions, atoms, strings, escaped characters, comments, and so on.
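The object handed to the configuration function might be laid out something like this (subject to change; the authoritative version is in the expose-subparsers branch):

```js
// Hypothetical layout only, to make the example below easier to read.
var p = {
  replace: function (target, replacement) { /* swap behaviour in place */ },
  clone:   function (parser)              { /* independent copy to build on */ },
  sub: {
    whitespace: null, // what counts as whitespace
    shebang:    null, // the #! line format
    composite: {
      list:   { sub: { opener: null, closer: null } }, // default ( and )
      atom:   { main: null },
      string: { sub: { escapedCharacter: null } }
    }
  }
};
```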

Rough example of how the user configuration file would look:

```js
module.exports = function(p) {
  // The argument `p` is an object passed in by the compiler.

  // Replace the atom parser with a clone of the atom parser, but mapped through
  // a function that reverses its contents.
  // This effectively makes writing (abc def) compile as if it were (cba fed).
  p.replace(
    p.sub.composite.atom.main,
    p.clone(p.sub.composite.atom.main)
      .map(function(atomAst) {
        atomAst.content = atomAst.content.split("").reverse().join("");
        return atomAst;
      })
  );

  // Could `require` modules here too, and pass the parser to them.

  // No need to return anything; the compiler retains a reference to
  // the parser object.
};
```

One would compile this with something like—

```
eslc --config=eslcConfig.js input.esl > output.js
```

This also means the parsing step remains separate from the code generation step.

Why would this be good?

It allows all of the following:

- defining new syntax, such as the square-bracket array notation or Clojure-ish quotes mentioned earlier;
- everything transform macros can currently do, since any sub-parser's output can be mapped over;
- reaching the parts of the input that transform macros can't, like whitespace and comments;
- turning features on one by one, by composing small modules that each modify one sub-parser.

How does this compare to macros in readtable-based parsers?

Most Lisp parsers (e.g. Common Lisp's and Racket's) are readtable-based, meaning they operate primarily by keeping a mapping from characters to functions that parse something starting with that character. Hence a read macro registers an interest in a particular start character, from which it takes over parsing (see this tutorial on using set-macro-character).
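In miniature, and in JavaScript rather than Lisp (a toy, not eslisp code), the readtable idea looks like this:

```js
// The reader dispatches on the next character; a read macro is just a
// new entry in the character-to-function map.
var readtable = {};

function read(input, i) {
  var handler = readtable[input[i]];
  if (handler) return handler(input, i + 1);
  var m = /^[^\s()\[\]]+/.exec(input.slice(i));
  return { node: { type: "atom", content: m[0] }, next: i + m[0].length };
}

// set-macro-character amounts to claiming a start character:
readtable["["] = function (input, i) {
  var content = [{ type: "atom", content: "array" }];
  while (input[i] !== "]") {
    if (input[i] === " ") { i++; continue; }
    var result = read(input, i);
    content.push(result.node);
    i = result.next;
  }
  return { node: { type: "list", content: content }, next: i + 1 };
};

// read("[1 2 3]", 0).node is the AST for (array 1 2 3)
```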

The readtable approach has these disadvantages that I dislike:

- A read macro can only claim a start character; it can't target a named part of the grammar, so extensions are unstructured.
- Two read macros that want the same start character conflict, with no good way to compose them.
- It lacks full generality: whole-tree rewrites of the kind transform macros do are out of its reach.

The readtable approach does have advantages: it's efficient, and read macros are easy to implement on top of it. I haven't run sufficient benchmarks of the sub-parser approach to be sure yet, but I expect no performance problems.

Although a readtable-based parser would support stuff like the `[ ... ]` → `(array ...)` shorthand, it wouldn't be able to replace transform macros (e.g. eslisp-camelify and eslisp-propertify), because they need more than a readtable. I don't want to keep both transform macros and readtable-based read macros: their sets of capabilities partially intersect, and even their union is a subset of this sub-parser system's. I see no way to extend a readtable-based parser to also cover what transform macros do.

In contrast, being able to switch out parsers by specifically targeting how particular parts of the language are parsed is structured, resistant to conflicts, and has full generality.


Ideas for improvement? Alternatives I haven't considered?

Apologies to @isiahmeadows, who independently wrote a readtable-based parser which is unfortunately incompatible with this sub-parser idea. My thoughts have been too uncertain and incomplete until now to question your direction. If we end up going this different way, I hope you don't take it personally.

dead-claudia commented 8 years ago

@anko My parser isn't a readtable parser, but is somewhat similar. I felt compelled to address that nit.

I have no problem with you taking it in that direction, as long as anko/sexpr-plus#1 (I'll patch it myself if it isn't handled) and anko/sexpr-plus#3 (a prerequisite for solving several ES6 macro-related problems in eslisp) are taken care of. I actually like that direction better.

That also gives me ideas for later as well. (I've been thinking of implementing my own S-expression-based language that isn't JS-compatible due to multiple inheritance, and that library looks a lot simpler than everything else out there, and a hell of a lot simpler than PEG.js.)

dead-claudia commented 8 years ago

Another thing: you may want to open up certain parts of sexpr-plus to allow transform macros to hook into that more effectively and natively (e.g. don't re-implement string parsing). Also, it would be a good idea to create a syntax-walker-style API for transform macros, so they can hook into the parsing more efficiently, which would be good for larger projects.
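Something like this, say; the `walk`/visitor names are invented here just to show the shape:

```js
// A transform macro as a visitor: it gets handed each node of a given
// type during the compiler's own traversal, instead of re-walking (or
// re-parsing) the tree itself.
module.exports = function (walk) {
  walk({
    // Visit every string node without re-implementing string parsing:
    string: function (node) {
      node.content = node.content.replace(/\r\n/g, "\n");
      return node; // returned node replaces the original in place
    }
  });
};
```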

vendethiel commented 8 years ago

Well, now there's a need to find a way to write the config files with eslisp :P.

The biggest issue with Common Lisp's approach is late binding:

```lisp
(call-function)
%%%{1 2 3} ;; maybe this is gonna parse after call-function
```

You can only report syntax errors when you get to executing that form, never before, because the forms before it might change the readtable, making the syntax valid.

dead-claudia commented 8 years ago

@vendethiel There is that, too. And that's yet another reason why I support @anko's idea of using Parsimmon with subparsers. It's all at compile-time, and there's not nearly as much magic you have to worry about (which late binding almost always generates).

anko commented 8 years ago

@vendethiel

> Well, now there's a need to find a way to write the config files with eslisp :P.

:laughing: Have to bootstrap somehow!

> The biggest issue with Common Lisp's approach is late binding

I agree, but on the other hand, late binding is why you can write CL read macros in CL, instead of in a JavaScript config file… Design tradeoffs.

@isiahmeadows

> Another thing: you may want to open up certain parts of sexpr-plus to allow transform macros to hook into that more effectively and natively (e.g. don't re-implement string parsing).

Exactly where I'm going with this!

For example, the anko/sexpr-plus#1 feature (ASCII/Unicode string escapes) could be implemented as a separate module like this (20 lines, heavily commented). The p.sub.composite.string sub-parser has its own .sub.escapedCharacter sub-parser that you can replace with a parser that also accepts an alternative. There's no need to touch the rest of the string-parsing logic; just that little part of it.
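A rough sketch of how such a module might read, with the `\uXXXX` rule standing in for whatever anko/sexpr-plus#1 settles on:

```js
var P = require("parsimmon");

module.exports = function (p) {
  var target = p.sub.composite.string.sub.escapedCharacter;
  var original = p.clone(target);

  // \uXXXX escapes map to the corresponding character.
  var unicodeEscape = P.regexp(/\\u([0-9a-fA-F]{4})/, 1).map(function (hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });

  // Accept either the new escape or whatever the default already handled.
  p.replace(target, P.alt(unicodeEscape, original));
};
```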

The parser even exports sub-parsers for customising what the whitespace characters are or how the shebang line is formatted, but you'd probably have to be insane to want to change them!

dead-claudia commented 8 years ago

@anko And one other thing with Parsimmon: it's easier to test code correctness with a monadic style.

I do have a question about how it would relate to anko/sexpr-plus#3, which is IMHO more important here than expanded string parsing (although that is easier to implement): would you be okay with having a separate data type for each quote, and using those natively in eslisp? It would simplify macro writing a lot when you're using quoted operators to disambiguate a symbol (e.g. a static map key) from a standard identifier (e.g. a computed key).
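To illustrate what I mean (node shapes invented):

```js
// Status quo, presumably: 'x parses to a plain list headed by the atom
// `quote`, indistinguishable from one the user wrote out by hand.
var asList = {
  type: "list",
  content: [
    { type: "atom", content: "quote" },
    { type: "atom", content: "x" }
  ]
};

// Proposal: each quote gets its own node type, so a macro can tell a
// quoted symbol from an ordinary list at a glance.
var asQuote = { type: "quote", content: { type: "atom", content: "x" } };
```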

anko commented 8 years ago

@isiahmeadows Replied in https://github.com/anko/sexpr-plus/issues/3, here to keep the thread intact.

dead-claudia commented 8 years ago

@anko Thanks for the input. I just closed that bug, because I hadn't realized there was already a way to check for it (the one you mentioned).

I can't wait to see how your patch ends up, though.