Extensions - Githubissues

Drup commented 9 years ago

I investigated the extension mechanism and I find it insufficient for my needs.

Imagine I would like to write a plugin where you include dot source and it is replaced by an SVG. For that I need:

- To define a new type of block; inline commands like extern, link and image are not appropriate.
- To have knowledge of the output format, to be able to output the raw SVG.

In this example, I could probably get away with using a picture pointing to a dot file, but I have other examples where it's not possible to go through an external file.

darioteixeira commented 9 years ago

Right now, extern is a block-level simple command. For added flexibility, I could offer an environment version of extern, for example:

\begin{extern}
dot commands go here
\end{extern}

The extension would have to take care of generating an SVG and outputting a picture element linking to it. But alas, outputting SVG as a textual element within the Html5 markup would indeed not be possible within the current architecture.

Drup commented 9 years ago

How would I recognize that it's a dot thing? In an ideal world, I would like

\begin{dot}

\end{dot}

I think the right way is to add a constructor to block_t of the form Extension of string (* name *) * string list (* options *) * string (and something similar in inline_t), and to add some functions to the Extension signature for functors that handle arbitrary extensions. Maybe we can also provide a fallback in the constructor in case an extension is not recognized by an output.
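
A minimal sketch of the constructor proposed above, assuming simplified stand-in types. The names block, Paragraph and write_html are hypothetical and are not Lambdoc's actual API; the point is only to show a generic Extension constructor together with a writer-side fallback for extensions the writer does not recognize.

```ocaml
(* Hypothetical stand-in for block_t; not Lambdoc's real type. *)
type block =
  | Paragraph of string
  | Extension of string * string list * string  (* name, options, raw content *)

(* A writer handles the extensions it knows and falls back for the rest. *)
let write_html = function
  | Paragraph txt -> "<p>" ^ txt ^ "</p>"
  | Extension ("dot", _options, _source) ->
    (* A real writer would run dot here; this is only a placeholder. *)
    "<svg></svg>"
  | Extension (name, _options, raw) ->
    (* Fallback: show the raw content of an unrecognized extension. *)
    "<pre class=\"ext-" ^ name ^ "\">" ^ raw ^ "</pre>"
```

With this shape, each writer decides on its own what to do with an extension name it has never seen.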

darioteixeira commented 9 years ago

For environment commands, it would indeed be easy to allow for custom-named blocks such as \begin{dot}...\end{dot}. For simple commands, things are a bit more complicated because of macros: there's no way of knowing whether an undeclared \foo{...} refers to a macro or an extension. However, I think most extensions only make sense in environment commands anyway.

Anyway, I'll add support for extensible environment commands to my TODO list...

Drup commented 9 years ago

I can implement it too, if you are happy with the way I plan to do it.

The other question is: what syntax should the other readers use (Markdown in particular)?

darioteixeira commented 9 years ago

It might be more efficient if I implement it, as I'm more familiar with the code. (Something like this feature was already on my mind for a while, so it's nothing unexpected.)

As for the other readers, it depends. For Markdown, I'm using OMD for the grunt work, so it would have to fit within whatever mechanisms OMD offers. For Lambwiki, however, I may consider skipping it altogether, as Lambwiki is supposed to be a minimalistic language which covers only a subset of Lambdoc anyway.

darioteixeira commented 9 years ago

@drup: I've just pushed the new extension mechanism. It's still not documented, so I recommend taking a look at the tutorial in the examples directory. Basically, it allows the definition of custom inline or block commands (within some limits, especially for the latter). The example in part 5 of the tutorial illustrates how you can create a new \banner block command which feeds its argument to the banner utility, producing a verbatim-like block with the bannerised version of the argument.

Please check the Lambdoc_core.Extcomm module for the various supported syntaxes. For your \begin{plot}...\end{plot} example, it seems the appropriate syntax would be Synblk_envraw (i.e., an environment block command taking only raw text as argument).

Also, note that for now only the Lambtex parser has support for extensions. Supporting them in Markdown and the other languages is next on my list.

Anyway, let me know if this fits your needs!

Drup commented 9 years ago

Ah, nice! Thank you. :)

I took a look at it, and I mostly like the main design. The names you chose are terrifying, though (it took me 5 minutes to realize "syn" was for "syntax") :D

My main grudge is that the ext*_t types are too complicated and syntax-focused. In the core (extinl_t and extblk_t) we only need two variants, "raw" and "not raw". You can encode all the other things in that (basically, string list * string list and string list * Inline.seq_t list for "the options, then the content").

This syntactic information means something for Lambtex, but not for the other formats. The core is quite format-agnostic (not as much as I would like, but still), so I don't think it's a good idea to introduce all those syntactic specificities.

I don't have much time to try to use it just now, but I'll try.

darioteixeira commented 9 years ago

@Drup: Thanks for the feedback. I'm well aware that because Lambtex is the only markup which currently supports extensions, there is the danger of introducing Lambtexisms into the core. This danger should be mitigated once I add extension support for other markups, though. Consider the extension mechanism a work in progress...

In the ext*_t types, the simple/environment distinction is one of those Lambtexisms. I wanted to have some way of telling Lambtex whether some extension was a simple or environment command, and that's why it exists. However, I'm considering abandoning this distinction altogether, and allowing, for example, a command banner to take the form \banner{...} or \begin{banner}...\end{banner}. For consistency's sake, the two forms should also be allowed for the built-in commands such as verbatim, of course.

Nevertheless, I don't agree that only "raw" and "not raw" variants are needed. I really do want to make the distinction between block commands which require, for instance,

- only raw text
- only an inline sequence
- raw text followed by an optional inline sequence

Granted, the AST could very well support only a generic form as you suggested, but then it would be the extension's responsibility to report misuse (eg: "you gave me an extra inline sequence, but I don't know what to do with it"). I think the AST compiler should do this job, which is why I would prefer for extensions to report the exact type of parameters they expect. Moreover, note that this is a core issue, and not a Lambtexism.

(Also, allow me to page @edwintorok, as his feedback would also be welcome.)

Drup commented 9 years ago

I think abandoning the distinction simple/env is indeed a good idea. Of course the distinction inline/block should still be there. "Raw or not raw" was a bit overly simplistic, yes. :p

I think a nice general solution would be to have something of the form

type content = Raw of string | Seq of Inline.seq_t (* we can potentially add other things *)
and extblk_t = content list

Then you can just pattern match on it and easily decide the shape you want (with any interleaving of Raw and Seq). The extension would provide a witness indicating the expected shapes to the parser (as in your current solution): type shape = [Seq | Raw ] list (and the witness is a shape list). It preserves the current features while being simpler and giving more freedom.
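
The idea can be sketched as follows, under stated assumptions: inline_seq is a stand-in for Lambdoc's inline sequence type, and kind, accepted, banner_witness and banner are hypothetical names introduced only for illustration. The parser side checks the argument kinds against the witness, so the extension only sees shapes it declared.

```ocaml
(* Stand-in for Lambdoc's inline sequence type; an assumption of this sketch. *)
type inline_seq = string list

type content = Raw of string | Seq of inline_seq
type kind = Kraw | Kseq
type shape = kind list

let kind_of = function Raw _ -> Kraw | Seq _ -> Kseq

(* True when the arguments match one of the shapes the extension declared. *)
let accepted (witness : shape list) (args : content list) : bool =
  List.mem (List.map kind_of args) witness

(* An extension taking raw text, optionally followed by an inline sequence. *)
let banner_witness : shape list = [ [Kraw]; [Kraw; Kseq] ]

let banner = function
  | [Raw txt] -> Ok (String.uppercase_ascii txt)
  | [Raw txt; Seq caption] ->
    Ok (String.uppercase_ascii txt ^ ": " ^ String.concat " " caption)
  | _ -> Error "unexpected argument shape"
```

Note that the final catch-all case in banner is still needed: nothing in the types ties the witness to the patterns, so the two can drift apart.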

It could be done with a GADT giving the shape of the accepted values and indicating the type (I might take a shot at that, just for the fun of it, but it's not necessarily a good idea :p). It would enforce that the witness and the accepted shapes do match.
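
One possible encoding of that GADT, given here as a sketch rather than as the design that was eventually adopted: the shape is indexed by the type of the handler it accepts, so a handler with the wrong arity or argument types fails to type-check. The names inline_seq, arg and apply are assumptions of this example.

```ocaml
(* Stand-in for Lambdoc's inline sequence type; an assumption of this sketch. *)
type inline_seq = string list

(* A shape witness indexed by the type of the handler it expects. *)
type _ shape =
  | Done : string shape
  | Raw : 'k shape -> (string -> 'k) shape
  | Seq : 'k shape -> (inline_seq -> 'k) shape

(* Untyped arguments as a parser might produce them. *)
type arg = Araw of string | Aseq of inline_seq

(* Feed the parsed arguments to the handler; a mismatch is reported here,
   by the framework, and never reaches the extension. *)
let rec apply : type k. k shape -> k -> arg list -> (string, string) result =
  fun shape handler args ->
    match shape, args with
    | Done, [] -> Ok handler
    | Raw rest, Araw s :: tl -> apply rest (handler s) tl
    | Seq rest, Aseq q :: tl -> apply rest (handler q) tl
    | _ -> Error "arguments do not match the declared shape"

(* Raw text followed by an inline sequence. *)
let caption_shape = Raw (Seq Done)

let render =
  apply caption_shape (fun txt seq -> txt ^ " / " ^ String.concat " " seq)
```

Here the handler contains no assert false and no catch-all case: the only place where a shape mismatch can occur is inside apply.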

In omd, the extension mechanism is done by the node X which contains output functions. We of course can't adopt this mechanism because it would hardcode the possible writers, which is precisely what we want to avoid! :)

edwintorok commented 9 years ago

@darioteixeira thanks for the extension support, it is a good start! Here is my feedback:

Inline vs block extensions

Given the answer on #26 I would agree with abandoning the distinction between simple and environment commands, i.e. Extinl_sim* would be simplified to Extinl_* (inline commands are always simple), and Extblk_sim* would be merged with Extblk_env* (block commands can be either simple or environment).

Why result_t?

There is a BatResult.t, is there a reason for defining your own?

Type for read+write extension

There is a Lambdoc_reader.Extension.S and Lambdoc_writer.Extension.S but there isn't a combined one, and combining them involves some boilerplate. Could you provide a signature for the combined extension type too?

raw vs seq and ext_t

Regarding the discussion above, I think there should be a short example for each as a comment. Just by looking at their types I'm not sure what the Lambtex input should look like. simraw and simseq are clear, but if I also have order/label/style parameters then how does that map to extinl_t and extblk_t? And what if I want to have multiple parameters like macros do?

I think there should be a sample extension that just dumps its input as sexp, and a sample document that exercises all the extinl_t and extblk_t variants. That might also come in handy when developing/debugging other extensions.

Support for multiple extensions

The extension type is good for a low-level extension, but only allows me to define one extension. If I want more than one extension I have to write a functor that composes multiple extensions, but that requires hardcoding all possible extensions at build time. To support dynamically selected extensions there could be a combine function that takes a list of first class modules. Maybe the extension types should be further split based on simseq vs simraw? I'd prefer that you provide something equivalent to combine, but please keep the Extension module too, as that allows overriding the Monad as well (I thought about putting the Monad signature inside the extensions in combine, but I haven't figured out how to write the type constraint for the first-class modules).

Writing my first extension

There is no better way to evaluate how well an extension mechanism works than by actually writing an extension for something realistic. So I started writing an extension for parsing Org mode tables using mlorg. Although it is not complete yet -- Orgmode's table features don't map 1:1 to Lambdoc's (column groups are missing from lambdoc) -- I was able to parse a simple table already, and I'm quite happy with the Lambdoc side: I could find my way around inline.mli and block.mli quite easily. (I hit more problems on the mlorg side (it doesn't quite parse the full org-mode table syntax, and I had to patch the build system to be able to use it as a library) than on the lambdoc side.)

I'll try to write a full mlorg-to-lambdoc extension, see what problems I hit (heading labeling/numbering seems complicated for example), and report back.

darioteixeira commented 9 years ago

@edwintorok:

Why result_t?

Well, one of the things to do before a 1.0 release is to inventory all Batteries-specific functions. If their number/complexity is sufficiently low, then it might be worth putting them in a Lambdoc_util module, and thus remove the (big) dependency on Batteries. Hence, I'd rather not expose anything Batteries-specific in the API.

Type for read+write extension

The rationale behind the separation between reader and writer extensions is not obvious: basically, I want to support an architecture where the reading may take place in a different process from the writing. For resolving links and images, the reader/writer separation is crucial, but that's not the case for inline/block extensions, which can easily be done entirely on either side. Hence, I am now considering a different approach to inline/block extensions.

As for providing a combined module signature, that's easy to do once the API is settled. Until then it's just extra work that will be rendered obsolete anyway.

raw vs seq and ext_t

It's still not documented because the extension mechanism is in a state of flux...

Support for multiple extensions

Yeah, I've been thinking about the same thing. Instead of users providing a single extension, it might be better if they provide a list of independent extensions (using first class modules). This will help with combining extensions from multiple origins.
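
A rough sketch of what a list of independent extensions packed as first class modules could look like. The EXTENSION signature, the Shout and Rev modules and the dispatch function are all hypothetical, and far simpler than anything Lambdoc would need; List.find_map requires OCaml 4.10 or later.

```ocaml
(* Hypothetical extension signature; Lambdoc's real one differs. *)
module type EXTENSION = sig
  val name : string
  val expand : string -> string
end

module Shout : EXTENSION = struct
  let name = "shout"
  let expand = String.uppercase_ascii
end

module Rev : EXTENSION = struct
  let name = "rev"
  let expand s =
    let n = String.length s in
    String.init n (fun i -> s.[n - 1 - i])
end

(* Look up the extension registered under [cmd] in a list chosen at runtime. *)
let dispatch (exts : (module EXTENSION) list) (cmd : string) (arg : string)
  : string option =
  List.find_map
    (fun (module E : EXTENSION) ->
       if E.name = cmd then Some (E.expand arg) else None)
    exts

let extensions = [ (module Shout : EXTENSION); (module Rev : EXTENSION) ]
```

Because the list is an ordinary value, extensions from different origins can be concatenated without writing a composing functor.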

Writing my first extension

Please do go ahead and play with it. Just be aware that the API will change, so be prepared to adapt your code in the future...

Drup commented 9 years ago

Maybe we could rely on open types instead of first class modules ? It would make things as safe but much much easier to write.

darioteixeira commented 9 years ago

@Drup:

It could be done with a GADT giving the shape of the accepted values and indicating the type (I might take a shot at that, just for the fun of it, but it's not necessarily a good idea :p). It would enforce that the witness and the accepted shapes do match.

Yes, I agree that using GADTs and a type witness would be a good way of avoiding the silliness of forcing the extension to pattern match against a variant when only one of the cases is relevant (the example extensions currently use assert false for the non-relevant cases...)

Maybe we could rely on open types instead of first class modules ? It would make things as safe but much much easier to write.

Are you referring to the open extensible types introduced in 4.02? I haven't played with that feature yet. I'll have to investigate it better...

Drup commented 9 years ago

Are you referring to the open extensible types introduced in 4.02?

Yes. Basically we would have an open type extblk. A new extension would add a new variant to the type and register a function of type extblk -> ... option (writer) or ... -> extblk option (reader) that matches only this variant, returns Some ... in this case or None otherwise.

Extension non-interference would be ensured by the fact that the new variant is not exposed and is kept private, so only the defined functions know about it.

The extension handler would possess the list of functions and try them all successively until one (or none) returns an element.
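
The scheme described above might look roughly like this, with the writer side only. The names writers, register, write and the Dot module are assumptions made for this sketch, and the output is a placeholder string rather than real SVG; List.find_map requires OCaml 4.10 or later.

```ocaml
(* Open type for block extensions (extensible variants, OCaml >= 4.02). *)
type extblk = ..

(* Writer functions registered by extensions; each one returns None
   for variants it does not own. *)
let writers : (extblk -> string option) list ref = ref []
let register w = writers := w :: !writers

(* Try every registered function in turn until one returns a result. *)
let write (blk : extblk) : string option =
  List.find_map (fun w -> w blk) !writers

module Dot : sig
  val make : string -> extblk
end = struct
  (* The constructor is hidden by the signature, so no other
     extension can match on it or build it directly. *)
  type extblk += Dot_source of string
  let make src = Dot_source src
  let () =
    register (function
      | Dot_source src -> Some ("[dot: " ^ src ^ "]")
      | _ -> None)
end
```

A value built with Dot.make can only be interpreted by the function registered inside Dot, which is what gives the non-interference property.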

darioteixeira commented 9 years ago

@Drup: that's pretty interesting, thanks! Could you recommend any paper/software that explores open extensible types, btw?

darioteixeira commented 9 years ago

@Drup and @edwintorok: I'm considering another major change to the extension mechanism: for inline/block command extensions, the reader/writer split would be eliminated. Instead, these extensions would only be available on the reader side. Moreover, instead of outputting values of type Inline.seq or Block.frag, the extensions would output raw Reader.AST values. This approach has the advantage that extensions may use elements (notes, bib entries) that require processing by the compiler. What do you think? (This approach also has some disadvantages, of course.)

edwintorok commented 9 years ago

If you deal with just Reader.AST as output then you won't need the custom internal datatypes for extensions, so could extensions be just functions (paired with an identifier) instead of modules? If so, that might simplify the extension interface, or at least the composition of extensions. As you noted, it would also allow full access to the same features you have in a Lambtex input document: extensions could even define and call macros, which is not possible with the current extensions.

So I like this idea for block and inline extensions. I'm not sure about image and link extensions as those seem more like a post-processing transformation (as opposed to defining new commands) and better fit for the current Extension module. In fact you could take this distinction further (edited):

Command extensions allow:
- defining new block and inline commands as extensions
- defining them similarly to read_extblk and read_extinl, except that they output a Lambdoc_reader.Ast.t value directly
Filter extensions allow:
- defining post-processing filters that act on Inline.seq_t and Block.frag_t
- filters that map inline sequences to inline sequences
- filters that map block fragments to block fragments (a superset of the previous)
- convenience functions that define link/image extensions as filter extensions under the hood
- generic document transforms, like automatic insertion of soft hyphens for browsers that don't support automatic hyphenation
- generic translation support, by defining a dummy language which actually outputs gettext-like format strings and parameters and leaves it up to the extension to format them according to the current language
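
To make the filter idea concrete, here is a small sketch using simplified stand-ins for Inline.seq_t and Block.frag_t. The types inline and block, and the functions lift and map_text, are hypothetical; they only illustrate why a block filter is a superset of an inline filter.

```ocaml
(* Simplified stand-ins for Inline.seq_t and Block.frag_t. *)
type inline = Text of string | Bold of inline list
type block = Paragraph of inline list | Quote of block list

type inline_filter = inline list -> inline list

(* Lift an inline filter to block fragments by applying it to every
   inline sequence found inside the fragment. *)
let rec lift (f : inline_filter) (frag : block list) : block list =
  List.map
    (function
      | Paragraph seq -> Paragraph (f seq)
      | Quote inner -> Quote (lift f inner))
    frag

(* Build an inline filter from a function on plain text. *)
let rec map_text (g : string -> string) : inline_filter =
  List.map
    (function
      | Text s -> Text (g s)
      | Bold seq -> Bold (map_text g seq))
```

A transform such as soft-hyphen insertion would then be a string -> string function passed to map_text and lifted over the whole document.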

Drup commented 9 years ago

how would you define an inline construction that maps to some custom HTML that way ?

Drup commented 9 years ago

@Drup: that's pretty interesting, thanks! Could you recommend any paper/software that explores open extensible types, btw?

Sorry, I don't have any example right now. You can search in the mailing list archive a bit, but I think you will mostly find some simple tricks (to encode universe types, for example).

darioteixeira commented 9 years ago

@edwintorok:

Yes, as I mentioned, eliminating the reader/writer split would only apply to inline/block command extensions. Resolving links/images would remain split between reader and writer, as I want to support the case that those two stages reside in different processes.

@Drup:

how would you define an inline construction that maps to some custom HTML that way ?

You wouldn't. However, note that in the current extension mechanism you can't do that anyway: extensions must output either Inline sequences or Block fragments. Note that extensions should be generic, and not tied to a particular writer like HTML. This of course limits them somewhat, but is there any concrete example where this limitation is a show-stopper?

Sorry, I don't have any example right now. You can search in the mailing list archive a bit, but I think you will mostly find some simple tricks (to encode universe types, for example).

Alright. I'll post a message to the caml-list if any doubts show up.

darioteixeira commented 9 years ago

@edwintorok: Yes, the filter extension mechanism you suggest may indeed be the best way forward. For convenience's sake, it could also offer an AST mapper like that provided by the new extension points feature in the OCaml compiler.

Drup commented 9 years ago

This of course limits them somewhat, but is there any concrete example where this limitation is a show-stopper?

Yes, basically all the ones I want to implement. Of course the intermediate representation (in the core IR) is HTML-agnostic, but it's also non-encodable in the rest of the IR (afair), so I really need a custom variant.

darioteixeira commented 9 years ago

I've pushed a preliminary version of the new extension mechanism. Highlights/caveats:

- Command extensions are supposed to produce Ast values. I reckon the advantages of this approach outweigh its disadvantages.
- Extensions are now parameterised solely by their monad. This allows for much easier bundling of different extensions.
- The distinction between inline and block command extensions remains.
- Each command extension should fit into one of the syntactic patterns defined in Lambdoc_reader/Extension. Yes, some more general mechanism is probably desirable, but I reckon this suffices for now.
- For the most part I got rid of the simple/env Lambtexism. However, there is actual value in distinguishing between block commands that take as parameter a brief sequence of raw text and those that take multiline verbatim-like raw text. Blksyn_raw should be used for the former and Blksyn_lit for the latter.
- The chaining mechanism suggested by @drup is used for the link/image extensions. Basically, a link_reader or image_reader is supposed to return Some result if it can handle a link/image, and None otherwise. In the latter case, the next extension in the list is given a chance.
- The results produced by inline and block command extensions are verified to make sure they satisfy the document sanity rules. There are good arguments in favour of relaxing this restriction for block command extensions though, and I'm planning a mechanism for it.
- Command extensions don't yet have access to the ghost blocks (notes and bibs), so implementing inline footnotes as an extension as requested by @edwintorok is not yet possible. However, this is next on my plate and should be relatively simple.
- I've already ported the examples to the new extension mechanism, so check out the tutorial and lambcmd_with_bookaml for a practical demonstration.
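
The chaining of link readers mentioned in the list above can be illustrated with a small sketch. The type below is a deliberate simplification and does not reproduce Lambdoc's actual link_reader signature; wiki_reader, mailto_reader, resolve and the wiki.example URL are made up for the example.

```ocaml
(* Simplified, hypothetical type: a reader either resolves an href
   or declines by returning None. *)
type link_reader = string -> string option

let wiki_reader : link_reader = fun href ->
  let prefix = "wiki:" in
  let n = String.length prefix in
  if String.length href >= n && String.sub href 0 n = prefix
  then Some ("https://wiki.example/" ^ String.sub href n (String.length href - n))
  else None

let mailto_reader : link_reader = fun href ->
  if String.contains href '@' then Some ("mailto:" ^ href) else None

(* Give each reader a chance, in list order; if none of them handles
   the link, keep the href unchanged. *)
let rec resolve (readers : link_reader list) (href : string) : string =
  match readers with
  | [] -> href
  | reader :: rest ->
    (match reader href with
     | Some result -> result
     | None -> resolve rest href)
```

The order of the list matters: the first reader that returns Some wins, so more specific readers should come first.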

As always, feedback is welcome!

darioteixeira commented 9 years ago

In the meantime, the extension mechanism now also supports extensions that add new ghost blocks. The most obvious application is solving @edwintorok's request about inline declaration of endnotes, and I've added a new instalment to the tutorial illustrating precisely this case.

darioteixeira commented 9 years ago

I've made some more tweaks to the API and implementation of the extension mechanism. I reckon it is ready for more extensive testing, so please let me know what you think!

By the way, an interesting side-effect of the new extension mechanism is that it makes it trivial to embed markups within markups. I've added a new instalment to the tutorial illustrating this. Please see also this Lambtex source-file which embeds all four markups.

darioteixeira commented 8 years ago

I think this ticket may be closed for now. The extension mechanism is fairly flexible and powerful already, and I reckon it won't require any further changes before 1.0 (famous last words...). Okay with you, @Drup and @edwintorok ?

edwintorok commented 8 years ago

OK to close, I think the extension mechanism is general enough now. Might have to write a few convenience functions on top of foldmapper to make it easier to write certain kinds of extensions, but those don't necessarily have to come with lambdoc, they can be part of the extension itself. The only way to know for sure is to actually try and write some extensions. Do you have a timeframe in mind for 1.0?

darioteixeira commented 8 years ago

@edwintorok: I definitely want to refactor the Lambtex and Lambwiki parsers before 1.0. I'm just waiting for the next Menhir version to come out, which according to François Pottier should happen soon (the next Menhir version has features which should make on-the-fly lexer switching much easier and cleaner). I would also like to finalise the Lambtex language itself (fix issues #29 and #33). There are other smaller issues, but none of those are really show-stoppers.

darioteixeira commented 8 years ago

@Drup and @edwintorok: I'm closing this issue, as I'm reasonably happy with the current extension mechanism. Feel free to reopen the issue if new ideas pop up!

darioteixeira / lambdoc

Extensions #7
