commonmark / cmark

CommonMark parsing and rendering library and program in C
Other
1.6k stars 535 forks source link

Extension support in libcmark #100

Closed MathieuDuponchelle closed 8 years ago

MathieuDuponchelle commented 8 years ago

Hello, I always see "extensions" mentioned on discussions of features in CommonMark (for example the discussion about tables).

Does libcmark itself support actual extensions, and if so is there any guide on how to implement one, and an index of common extensions, or are extensions purely conceptual extensions to the specification, up to individual implementations to add ?

I'm pretty sure I could just read the code and find out but

Thanks for all your work on CommonMark / pandoc!

jgm commented 8 years ago

libcmark currently has no mechanism for extensions.

MathieuDuponchelle commented 8 years ago

Are there any plans for this ?

jgm commented 8 years ago

No definite plans at the moment. Of course you can do quite a lot of customization already by walking the AST with the iterator interface. See jgm/lcmark for a system that makes this easy to do with lua filters.

MathieuDuponchelle commented 8 years ago

Interesting, thanks, however afaiu some extensions will need to define rules at the parsing stage? ie for example the table extension would need to define new node types, and set state such as IN_TABLE when the beginning of a table is encountered?

MathieuDuponchelle commented 8 years ago

If that was not clear, I like tables :)

MathieuDuponchelle commented 8 years ago

By the way if you're interested in documenting libcmark's API with CommonMark using the tool I'm developing (here's a demo video of it I made two weeks ago -> https://www.youtube.com/watch?v=jQd1zJIJQRE) I can try and give it a go :)

jgm commented 8 years ago

Interesting, thanks, however afaiu some extensions will need to define rules at the parsing stage? ie for example the table extension would need to define new node types, and set state such as IN_TABLE when the beginning of a table is encountered?

Yes, that's much harder to accomplish. (Though currently you could do it in a sort of hackish way, for example, by including the table in a processing instruction

<?table
| my | table |
|----|-------|
| 2  | 3     |
?>

which a filter can intercept, process, and replace with raw HTML or whatever is the target format.)

The javascript library markdown-it contains a flexible extension mechanism; you might look at it.

MathieuDuponchelle commented 8 years ago

The javascript library markdown-it contains a flexible extension mechanism; you might look at it.

I kind of want to use libcmark itself as it's hell of fast, also for my customization needs I currently use commonmark-py , as my tool is written in python

MathieuDuponchelle commented 8 years ago

To be more specific: the bits I need to parse and slightly modify I parse with commonmark-py, the rest I directly render using libcmark.

jgm commented 8 years ago

The API is already documented using CommonMark! libcmark is used to convert CommonMark code comments in cmark.h into the cmark (3) man page.

+++ Mathieu Duponchelle [Feb 04 16 11:13 ]:

By the way if you're interested in documenting libcmark's API with CommonMark using the tool I'm developing (here's a demo video of it I made two weeks ago -> [1]https://www.youtube.com/watch?v=jQd1zJIJQRE) I can try and give it a go :)

— Reply to this email directly or [2]view it on GitHub.

References

  1. https://www.youtube.com/watch?v=jQd1zJIJQRE
  2. https://github.com/jgm/cmark/issues/100#issuecomment-180007808
MathieuDuponchelle commented 8 years ago

Cool :)

My tool is a bit higher-level though, as it joins source code parsing (done with clang) and documentation parsing (done with CommonMark), and might be a bit more convenient for documenting an API :)

MathieuDuponchelle commented 8 years ago

Oh one more thing btw, adding gobject-introspection support to libcmark would be extremely freaking great ! I suppose you could do so by simply encapsulating data types as "boxed types", but a full port to gobject might even make more sense? I know this kind of deviates from the original subject, but if you're interested I can open a separate issue?

MathieuDuponchelle commented 8 years ago

To be clear: it would make it easy to expose the full API of libcmark to javascript and python.

jgm commented 8 years ago

Sorry, I don't know what gobject is.

+++ Mathieu Duponchelle [Feb 04 16 11:27 ]:

Oh one more thing btw, adding gobject-introspection support to libcmark would be extremely freaking great ! I suppose you could do so by simply encapsulating data types as "boxed types", but a full port to gobject might even make more sense? I know this kind of deviates from the original subject, but if you're interested I can open a separate issue?

— Reply to this email directly or [1]view it on GitHub.

References

  1. https://github.com/jgm/cmark/issues/100#issuecomment-180013103
MathieuDuponchelle commented 8 years ago

https://developer.gnome.org/gobject/stable/gobject-The-Base-Object-Type.html

You don't know that GObject? We must be living on two very separate islands of the Open Source archipelago!

You seem to be writing somewhat object-oriented C in this project (with new constructors and what not).

GObject is the most standard implementation of the object concept in C, I would strongly advise giving it a good look :)

gobject-introspection generates "gir files" based on C source code and annotated comments in the source, these files are then read at runtime by projects like pygobject, which make the API transparently available from python (with some caveats, for example it's impossible to "call" function macros from python obviously, but you can still write manual "overrides" file if you really want to)

I'd really love having access to the full extent of the CommonMark API from python, instead of the very limited wrapper that currently exists (cmarkpy only exposes "markdown_to_html"), and this solution is the best one I know from a maintenance perspective.

jgm commented 8 years ago

+++ Mathieu Duponchelle [Feb 04 16 13:04 ]:

[1]https://developer.gnome.org/gobject/stable/gobject-The-Base-Object-T ype.html

You don't know that GObject? We must be living on two very separate islands of the Open Source archipelago!

That's undoubtedly true. I'm mostly a Haskell programmer; this is my first C project in a long while.

You seem to be writing somewhat object-oriented C in this project (with new constructors and what not).

GObject is the most standard implementation of the object concept in C, I would strongly advise giving it a good look :)

gobject-introspection generates "gir files" based on C source code and annotated comments in the source, these files are then read at runtime by projects like pygobject, which make the API transparently available from python (with some caveats, for example it's impossible to "call" function macros from python obviously, but you can still write manual "overrides" file if you really want to)

I'd really love having access to the full extent of the CommonMark API from python, instead of the very limited wrapper that currently exists (cmarkpy only exposes "markdown_to_html"), and this solution is the best one I know from a maintenance perspective.

It should be trivial to make the whole API available from python using ctypes. Take a look at wrappers/wrapper.py in the cmark repository. That makes markdown_to_html available with three lines; you'd have to write three lines for each function to get the whole API.

Alternatively you could probably use swig to generate bindings. That's what I did in cmark-lua. I haven't explored doing it for python, but it was pretty painless for lua.

jgm commented 8 years ago

I want to keep library dependencies minimal. And I don't think GObject is needed to expose the full API to python, etc.

MathieuDuponchelle commented 8 years ago

I want to keep library dependencies minimal.

Understandable, GObject and the GLib are very established projects though, known to build and work on all the platforms I can think of and then some.

I don't think GObject is needed to expose the full API to python

Of course it isn't, but experience taught me that this solution was the superior one :)

I'll certainly clone the library and have a better look at its implementation in the near future, for now I won't mind if you close that issue, thanks for the feedback !

MathieuDuponchelle commented 8 years ago

Closing

MathieuDuponchelle commented 8 years ago

Reopening after reading the code while working on https://github.com/jgm/cmark/pull/103 , I think I understand the parsing process a bit better now, and I'd be interested in your opinion on the best way to add "hooks" for extensions to plug upon during the parsing process.

I think the first step needed would be to document the block parsing algorithm, and the cmark_parser structure.

My "high-level" understanding of the algorithm is as follows:

I might very well be very wrong on some of my interpretations of the code, but I can kind of see how this process would create a kind of valid ast in the end :)

However, the code in there still needs a lot of cleanup and refactoring to make it easily legible, for example:

I am also not really clear on how the backtracking works exactly.

I have currently a limited intuition of how extending the block parsing part of the process could be done (I still haven't looked at either inlines or rendering), basically the extension would "register" a new block type, then be called during both the first and second part of the process to determine whether the processed line could match a block type defined by the extension, then whether this new block could be contained by the current container. The extension should also provide a way to make the check the other way around, ie could a new block be a child of its block type.

Am I correct in some of my assumptions here? When I manage to understand the process completely, and the proposed refactoring and documentation work are done, I'll be happy to help with designing and implementing an extension interface :)

jgm commented 8 years ago

There's a bit of high-level documentation of the parsing strategy here: http://spec.commonmark.org/0.24/#phase-1-block-structure Does that help?

I'd certainly be receptive to PRs that improve readability, e.g. by making functions for some of the complicated tests. (As long as they don't significantly affect performance.)

MathieuDuponchelle commented 8 years ago

Yep, that's pretty much what I tried to do, except better :)

My suggestions for now would be :

I may not find time to do this soon, but it really should be done in order to improve the overall maintainability of this specific chunk of code :)

MathieuDuponchelle commented 8 years ago

Hello, so as promised in https://github.com/jgm/cmark/pull/104 , I have implemented a first draft of "block parsing extensions".

Important disclaimer: the implementation linked below is not intended to be final, that's why I didn't make a pull request, this is initial work to help me figure out whether my understanding of the block parsing process was adequate, discover design issues that should be solved, and provide us with some talking points.

Here it is : https://github.com/MathieuDuponchelle/cmark/commits/extensions

I have provided an example extension, that parses simple piped tables.

To test it out, check out my branch, build it with the usual commands, create a file named "foo.md" containing :

| First *yes* Header | Second   |
| ------------------ | -------- |
| Something in [here](http://here.com) that is very long | And here |
And I make a new paragraph here, totally works

then run:

./build/src/cmark --extension piped-tables foo.md -t html

You should see an html table, woohoo :)

Note that after writing all this old-school C, I would strongly recommend using the glib in libcmark. My current implementation is full of shortcuts which can and will fail given the chance, I did not want to distract myself while developing this draft, as if we agree with using it the glib provides plenty of helper functions to solve these for us.

I will mention these shortcuts along my self-review.

The current implementation certainly leaks a little too btw

I'm now going to go over the commits, and detail the issues I've encountered and how they are currently solved, please comment here and not on the commits themselves so we can keep a trace of the process after future rebases :)

Extensions: first draft (https://github.com/MathieuDuponchelle/cmark/commit/70fa8a0035b0737e30905759a396fba9c4179c8b)

Making cmark link with "CMAKE_DL_LIBS"

Extensions need to be wrapped as plugins, as we want third party to be able to implement their own extensions once an API is decided. I dumbed down http://eli.thegreenplace.net/2012/08/24/plugins-in-c for this purpose, ideally I'd like to use https://developer.gnome.org/glib/stable/glib-Dynamic-Loading-of-Modules.html instead.

I define LIBDIR at compile-time in order to have a place to look for "standard extensions", in the future the cmark executable could expose a command-line argument to discover plugins in other places, and "discover_plugins" could also look in an environment variable defined path.

Some fiddling will have to be done for the statically linked cmark I suppose.

Exposing internals for use by the plugins

I've made quite a mess in there, the symbols I exposed are :

Node type approach

The issue here is that all the standard block types have specific cmark_node_types enumerated, whereas my current approach defines a single block type for all the new types of blocks defined by extensions, CMARK_NODE_EXTENSION_BLOCK. While I think this is the correct approach, I also think my solution for allowing an extension to discriminate between various extension blocks is very suboptimal, as I simply assign a pretty_name to all these new block types.

I think a better solution would be along the lines of the glib type system https://developer.gnome.org/gobject/unstable/gobject-Type-Information.html , which permits dynamic registration of types at run time, however I don't think we would want full-fledged objects for each block, and I'm not sure registering block types as "fundamental types" would make sense either, so that's still a pretty open question on my end.

As can_contain is called when adding a child to another, be it using cmark_parser_add_child or cmark_node_append_child, I had to separate creation of an open block (make_block) from add_child, in order to let the extension set the pretty name of the created extension block, I think it would be good to share more code between the parser and the nodes here, allowing to use append_child and obtain an open block or something, really didn't dig into the code further.

Extension discovery

For now, the extensions are discovered and stored in a simple list , a GHashTable would be more appropriate but it's not really critical.

The user can then attach extensions to a given parser with "cmark_parser_register_extension", the name is probably off but the approach seems pretty sane to me.

I've added some command-line switches to the cmark executable itself, once again naming might be off but the solution is pretty straightforward, one can list extensions and define zero or more extensions to be used for a given run.

In the future, I think a nice feature would be to let commonmark documents define which "extensions" they need to be parsed correctly, I think there's some discussion about this on talk.commonmark and I haven't given the issue more thought though.

The extension structure itself

While making blocks gobjects would IMO be pretty bad performance-wise, having extensions (and other object-like structures) derive from GObject would be my natural inclination if it was available, as I exposed what amounts to virtual methods on these.

I've documented the header file for the extensions at https://github.com/MathieuDuponchelle/cmark/blob/extensions/src/extension.h , and while incomplete (a virtual method to let the extension inform cmark of whether a block can "accept lines" would be needed), I think the interface is OKish.

I pass the parent_container to these functions to allow for "vertical rules" a la setext heading, the table extension makes use of it to requalify a paragraph as a table header, it is not enough to let the extension create new blocks, it also needs to be able to modify previously open ones.

It is for now up to the extension to advance the offset of the parser, I don't know if this is too-level or not.

Rendering

My approach is to define a single render_block virtual method, which passes a "format" argument to the extension. If it doesn't handle the given format, it should return false and the renderer should implement a fallback.

The issue I can think of is that for now, the internal state of the renderer is absolutely not taken into account, I don't have a good example for where this could be an issue yet though, but a solution would be to add an extra argument to that function, carrying that state over somehow.

Extension block

I tried to keep it as small as possible, the responsible extension and custom_data pointers seem mandatory to me.

This is pretty much all I can think of for now

Extensions: implement an example "core extension" (https://github.com/MathieuDuponchelle/cmark/commit/ea58c70d1be5412bcb56464cac5dca9f178d3d82)

The extension is responsible for creating its own scanner, I used re2c for that purpose as it was available, nothing should prevent an implementation to use flex or X as a scanner.

I'm not particularly good at writing regular expressions, and the ones I did write are certainly full of holes, not really the issue here but feel free to correct me :)

As exposed in "Node type approach" , the code is full of ugly strcmp to implement the "grammar" of a piped table (tables must start with a header, rows can contain cells etc).

Other than that, not much to say, I think the size of the code is not unreasonable and there's not that many low-level internals exposed.

Cheers!

nwellnhof commented 8 years ago

Just a couple of notes and opinions from my side.

The code for extensions should live in separate repositories, not within the cmark repository. This requires that extensions use their own build system.

Loading extensions via dlopen and dlsym is the right approach. But note that these functions aren't portable.

Most people embedding libcmark in other projects will be strongly against a runtime dependency like GLib (me included), especially regarding portability. There's really no need for that. Extensions can simply be based on a table of function pointers like in your draft. There's also no need for a hash table to store extensions. A linked list is perfectly fine for a small number of entries (maybe 1-3 in the typical case). Extensions should add no overhead or dependencies for people who don't need them.

The string APIs (buffer and maybe chunk) should probably be exposed to extension authors. ("Chunks" are basically an immutable string type that can point to string buffers it doesn't own.)

Regarding can_contain, I think a simple approach like we have for "custom" nodes might be better. For example: Every node that can contain paragraphs can also contain extension blocks. Extension blocks can contain any node type except document nodes.

The hardest problem is how to integrate extensions with the block parsing algorithm. I don't know enough about these internals, so I can't comment on that.

If you make internal functions like S_advance_offset or add_child public, don't change them throughout the cmark codebase, but simply add a single new function like:

cmark_node *cmark_parser_add_child(...) {
    add_child(...);
}

This reduces the size of diffs considerably.

MathieuDuponchelle commented 8 years ago

The code for extensions should live in separate repositories, not within the cmark repository. This requires that extensions use their own build system.

Possibly, I made this to ease testing but I have no strong opinions either way, a project like gstreamer does ship some "core plugins" (https://cgit.freedesktop.org/gstreamer/gstreamer/tree/plugins) for example, but the difference is pretty clear-cut there as a "core plugin" has to be "media-agnostic" and high quality to be included there, it does provide with an official example on how to implement a plugin.

Loading extensions via dlopen and dlsym is the right approach. But note that these functions aren't portable.

I'm aware of that :) Note that https://developer.gnome.org/glib/stable/glib-Dynamic-Loading-of-Modules.html would take care of portability.

Most people embedding libcmark in other projects will be strongly against a runtime dependency like GLib (me included), especially regarding portability. There's really no need for that. Extensions can simply be based on a table of function pointers like in your draft. There's also no need for a hash table to store extensions. A linked list is perfectly fine for a small number of entries (maybe 1-3 in the typical case). Extensions should add no overhead or dependencies for people who don't need them.

GLib is extremely portable. With respect to "embedding", bundling software is a poor practice and shouldn't be a criteria here. Depending on the GLib would make the project way more maintainable, reducing the amount of code and thus the surface for bugs.

Can you please be specific about your statement on portability?

The string APIs (buffer and maybe chunk) should probably be exposed to extension authors. ("Chunks" are basically an immutable string type that can point to string buffers it doesn't own.)

ack

Regarding can_contain, I think a simple approach like we have for "custom" nodes might be better. For example: Every node that can contain paragraphs can also contain extension blocks. Extension blocks can contain any node type except document nodes.

I don't agree with that, allowing extensions to specify which nodes the block types they define can and cannot contain seems extremely important to me: we want users of the node API to be prevented from adding "tables" to "table cells" for example.

The hardest problem is how to integrate extensions with the block parsing algorithm. I don't know enough about these internals, so I can't comment on that.

I have worked on https://github.com/jgm/cmark/pull/103 and https://github.com/jgm/cmark/pull/104 to gain some insight on the parsing algorithm, as this was indeed the actually hard problem to solve. I think my current solution is pretty adequate, a good exercise to ascertain that would be to try and implement all the core block types as extensions.

If you make internal functions like S_advance_offset or add_child public, don't change them throughout the cmark codebase, but simply add a single new function like:

cmark_node *cmark_parser_add_child(...) { add_child(...); }

This reduces the size of diffs considerably.

Indeed, that was dumb of me, I'll rework the patch tomorrow :)

Thanks for the feedback, cheers

jgm commented 8 years ago

I'm pretty swamped at the moment, so I only had time for a brief glance, but this looks promising to me!

I agree with @nwellnhof about avoiding a glib dependency.

One thing I'm not sure about is the inclusion of "rendering" in an extension. cmark was designed from the beginning to support multiple output formats. The way you currently have it set up, an extension defines HTML rendering, but for other formats you're out of luck. Have you thought about how this could/should be handled?

MathieuDuponchelle commented 8 years ago

I'm pretty swamped at the moment, so I only had time for a brief glance, but this looks promising to me!

Cool :)

I agree with @nwellnhof about avoiding a glib dependency.

So be it, even though more specific reasons would be nice :) This means that we'll have to figure out and implement a system to have dynamic loading work on windows too.

One thing I'm not sure about is the inclusion of "rendering" in an extension. cmark was designed from the beginning to support multiple output formats. The way you currently have it set up, an extension defines HTML rendering, but for other formats you're out of luck. Have you thought about how this could/should be handled?

That was the "rendering" section of my admittedly long summary previously. There aren't really that many solutions out of this. The solution I did adopt was the pragmatic one, which is :

Note that I only implemented the call to the render_block virtual method in the html renderer for now, but all the renderers should indeed implement this as well.

I chose this solution because it is the most flexible one, the only other solution would be to have something like pandoc, where for example commonmark's core would define block types for a table, even though it recognizes no syntax to create one "natively". Extensions would then only have to create "conforming" tables, and the renderers would know how to render them natively.

The advantages of this approach are that:

The disadvantages are :

One could imagine a mix of these two solutions maybe, at that point I think it needs to be your call as it's a pretty fundamental design decision, which will be important for the future of this API.

MathieuDuponchelle commented 8 years ago

Another thing I'd like to mention is that I did that initial work in part to just get the ball rolling, as this is something I consider as very important to have as soon as possible, if this triggers the interest of someone, by all means grab my branch and continue the work, just let me know about your branch here so I can cherry-pick stuff / rebase on it.

nwellnhof commented 8 years ago

On 26/02/2016 02:07, Mathieu Duponchelle wrote:

I'm aware of that :) Note that https://developer.gnome.org/glib/stable/glib-Dynamic-Loading-of-Modules.html would take care of portability.

It shouldn't be too hard to support dynamic loading on Windows. I can certainly help if we come to this point.

GLib is extremely portable. With respect to "embedding", bundling software is a poor practice and shouldn't be a criteria here. Depending on the GLib would make the project way more maintainable, reducing the amount of code and thus the surface for bugs.

Can you please be specific about your statement on portability?

It seems that GLib is portable enough, but on a platform like Windows it's not so easy to set up as it is on Linux, where you can just "apt-get libglib". You also have to think about GLib's dependencies. From a quick look, at least zlib and libFFI are hard requirements. On platforms like Windows you also need libintl. It's quite a hassle to compile all of that from scratch, especially on more exotic platforms.

If you want to build cmark with MSVC for example, all you need is the Windows platform SDK and CMake and you'll get a self-contained 300K binary by running a single command. (Personally, I'd even like to get rid of the CMake dependency.)

Also, you can't make a blanket statement that bundling software is poor practice. Embedding a small library like libcmark can be perfectly reasonable, especially given its unstable API. We do that for Apache Clownfish and Apple does the same with Swift.

MathieuDuponchelle commented 8 years ago

It shouldn't be too hard to support dynamic loading on Windows. I can certainly help if we come to this point.

:D

It seems that GLib is portable enough, but on a platform like Windows it's not so easy to set up as it is on Linux, where you can just "apt-get libglib". You also have to think about GLib's dependencies. From a quick look, at least zlib and libFFI are hard requirements. On platforms like Windows you also need libintl. It's quite a hassle to compile all of that from scratch, especially on more exotic platforms.

You don't really need to do that, binaries are available for windows, see http://ftp.gnome.org/pub/gnome/binaries/win32/glib/2.8/ for example. https://developer.gnome.org/glib/stable/glib-building.html lists the dependencies needed if you wish to compile the glib, I don't see the zlib or libFFI in there. AFAIK, libFFI is needed for the "gobject-introspection" layer, if you wish to expose your C API to other languages.

If you want to build cmark with MSVC for example, all you need is the Windows platform SDK and CMake and you'll get a self-contained 300K binary by running a single command. (Personally, I'd even like to get rid of the CMake dependency.)

+1 for getting rid of cmake

Also, you can't make a blanket statement that bundling software is poor practice. Embedding a small library like libcmark can be perfectly reasonable, especially given its unstable API. We do that for Apache Clownfish and Apple does the same with Swift.

Bundling libraries does present known security problems, see https://www.debian.org/doc/debian-policy/ch-source.html#s-embeddedfiles or https://fedoraproject.org/wiki/Archive:Bundled_Library_Packaging_Draft#Why_no_Bundled_Libraries for more opinions on that :)

kainjow commented 8 years ago

Dropping CMake has already been brought up at #69

kainjow commented 8 years ago

Also, if you want to add glib as a dependency, please make it optional. The beauty of cmark is it has no dependencies, so it makes it easier to integrate into just about anything.

MathieuDuponchelle commented 8 years ago

@kainjow , I suspect by "integrate" you really mean "bundle" :)

MathieuDuponchelle commented 8 years ago

Anyway I don't want this to devolve in a discussion about the pertinence of using the GLib to simplify and improve implementation tbh, I purposefully avoided adding that dependency while writing this code for that reason :)

MathieuDuponchelle commented 8 years ago

I will update the first draft now, @nwellnhof if you feel like improving the plugin discovery to work on Windows as well that would be awesome, I don't have a Windows handy :)

MathieuDuponchelle commented 8 years ago

The string APIs (buffer and maybe chunk) should probably be exposed to extension authors. ("Chunks" are basically an immutable string type that can point to string buffers it doesn't own.)

@nwellnhof , I'm not exactly sure how to best go about the chunk API to be honest, it is entirely composed of "forced-inline" functions and I have no idea whether exposing such functions as API is a good practice, any clues ?

Regarding strbuf, I've made some preliminary work to expose it, if you feel like having a look I've got a branch at https://github.com/MathieuDuponchelle/cmark/tree/expose_new_API

Cheers!

nwellnhof commented 8 years ago

On 27/02/2016 05:52, Mathieu Duponchelle wrote:

@nwellnhof https://github.com/nwellnhof , I'm not exactly sure how to best go about the chunk API to be honest, it is entirely composed of "forced-inline" functions and I have no idea whether exposing such functions as API is a good practice, any clues ?

Yes, exposing these inline functions would be bad practice. Not the inline functions per se, but we'd have to expose the chunk struct.

The major problem with the chunk API is that it makes no guarantees how long its internal pointer is valid. I'd really prefer to keep this internal. What are the use cases where extensions need to work with chunks?

Regarding strbuf, I've made some preliminary work to expose it, if you feel like having a look I've got a branch at https://github.com/MathieuDuponchelle/cmark/tree/expose_new_API

See my inline comments (mostly documentation).

We'll also have to add a real constructor and destructor for API users.

nwellnhof commented 8 years ago

I had a quick look at your sample extension and it seems that there's no need to expose cmark_chunk. Can't you simply change OpenBlockFunc and MatchBlockFunc to take a char *buf and bufsize_t size instead of a cmark_chunk *?

MathieuDuponchelle commented 8 years ago

@nwellnhof I addressed your review in separate commits on that branch to preserve your comments, let me know if you're happy with the changes and I'll submit a pull request with a fixed up commit, I'd rather have such changes be merged ASAP to minimize chances of conflict.

Thanks for your time, I'll answer your comments here a bit later.

MathieuDuponchelle commented 8 years ago

Additionally, I added a "cmark_strbuf_get" API in https://github.com/MathieuDuponchelle/cmark/commit/33f209fccaf7fcff07f09abe0d3ac0aca7987c65

MathieuDuponchelle commented 8 years ago

@nwellnhof I've just looked back on the extension, particularly _scan_at which is a strict copy paste from src/scanners.re, and indeed there is no reason to pass a chunk in the extension scanner at all. Note that this is also the case in src/scanners.re, but that doesn't really matter there.

I'm toying with the idea of cleaning up my patch completely (apart from the plugin loading abstraction), then try to reimplement all the block parsing in core as "extensions". This would finish validating the API, and would be a conceptually satisfying thing to do, but I'm unsure about the performance impact it may have. We'll see :)

MathieuDuponchelle commented 8 years ago

I pushed a bunch of commits:

The two issues that I still see are plugin loading on Windows, which isn't a design issue but needs to be done, and more importantly the decision of whether the current design where extensions implement and render their own block types is the correct one.

The more I think about it, the more I feel like the alternative design, where libcmark would define block types (CMARK_NODE_TABLE, CMARK_NODE_WHATEVER), even though there isn't (yet) any standard syntax rules in the specification that can lead to their creation, would be the superior one.

The major issue with my current design is really the interaction of multiple extensions, I can imagine all sorts of issues arising from the fact that an extension is unable to tell whether a node type it implements can contain or be contained by a node type implemented by another extension, and after thinking long and hard about it, I honestly think there's no general solution to this issue, and I'm not interested in a solution that "mostly works", as such solutions are already achievable by parsing and modifying the ast in a second phase.

I'll wait for feedback about this for the while :)

jgm commented 8 years ago

I haven't looked at your table parsing strategy in detail, but you should have a look at this thread:

http://talk.commonmark.org/t/parsing-strategy-for-tables/2027/4

jgm commented 8 years ago

The more I think about it, the more I feel like the alternative design, where libcmark would define block types (CMARK_NODE_TABLE, CMARK_NODE_WHATEVER), even though there isn't (yet) any standard syntax rules in the specification that can lead to their creation, would be the superior one.

Two worries about this: (a) it limits extensions to things that can be anticipated ahead of time, and (b) it seems awkward to have some officially defined node types that aren't handled by the library's own parsers or renderers.

The major issue with my current design is really the interaction of multiple extensions, I can imagine all sorts of issues arising from the fact that an extension is unable to tell whether a node type it implements can contain or be contained by a node type implemented by another extension, and after thinking long and hard about it, I honestly think there's no general solution to this issue, and I'm not interested in a solution that "mostly works", as such solutions are already achievable by parsing and modifying the ast in a second phase.

Suppose we introduce the notion of a "regular block element" and "regular block container." A regular block container can contain any regular block element. Here a TABLE would be a regular block element, and a CELL would be a regular block container. So, if someone introduces an extension with a BOOZUM node, then it can contain a TABLE if it is designated as a regular block container, and it can be contained by a CELL if it is designated as a regular block element.

If a block is not a regular block container, then the extension must specify exhaustively what it can contain. Thus, for example, a LIST can only contain ITEMs. If someone adds BOOZUM, then a LIST cannot contain a BOOZUM. A ROW can only contain a CELL, and a TABLE can only contain a ROW (to simplify a bit).

MathieuDuponchelle commented 8 years ago

Hey, I pushed a third branch that follows the alternative design I exposed here :

https://github.com/MathieuDuponchelle/cmark/commits/extensions_draft_3

I think this approach should be considered very seriously, the drastically simplified implementation is to me a sign of things being right.

Conceptually this makes syntax extensions (I renamed the structure) simple "syntax rules providers", which allows to continue strictly enforcing a correct node hierarchy in the output.

I must admit I'm not sure I fully understood your solution, from what I understand it still leaves room for "impossible things" to happen, generic classes of containers cannot be guaranteed to always be containable one in another. I must also admit I'm now a bit tired and may need more detailed examples to see the light. Also note that "syntax rule providers" can always resort to creating custom blocks, which already exist and afaiu overlap pretty significantly with what you're proposing, maybe a way forward could be to extend their API, but I'm not convinced.

The advantage of the approach I proposed is that while the specification of the input CommonMark format is (rightfully so), a slow process, with the goal of avoiding "syntax debt" as much as possible, there is no such limitation with respect to the output of libcmark: the formats which it outputs are well-defined, and as such we shouldn't be overly afraid of "committing" ourselves here, there aren't 50 correct ways to output a html table.

I initially also perceived the idea of defining node types with no standard syntax rules to allow creating them as awkward, but then I looked at it from another angle, if you consider these node types as "nice things we will want to have in the future", then they're not awkward, as they match the specification process, which is to state a desired output, then work on the best syntax to produce it, and they can actually be a valuable help in the process of defining the best syntax without committing to something immediately.

I'll look at the table parsing strategy link you shared, thanks, note that as I said initially my goal with this extension for now is simply to validate the API, not figure out the perfect syntax / parsing strategy for now :)

MathieuDuponchelle commented 8 years ago

@jgm , I came up with a silly idea for the table parsing problem exposed at http://talk.commonmark.org/t/parsing-strategy-for-tables/2027/5 , I'm pretty curious to know what you think of it :)

MathieuDuponchelle commented 8 years ago

After thinking back on it, not so silly actually ^^

MathieuDuponchelle commented 8 years ago

@jgm , found time to think about this ?

MathieuDuponchelle commented 8 years ago

I'm adding inline syntax extensions along the same lines, my test case will be strikethrough with:

I will ~strike that through~

This will allow me to validate the interface with the most complex aspect of inline parsing, the delimiter stack logic.

My intention for the short term is to bundle (I know :) my branch of libcmark in https://github.com/hotdoc/hotdoc , I will extend inline parsing to support syntax which by definition shouldn't make it in the specification, gtk-doc's syntax, this should help validate the API further. I of course don't want to bundle a fork of cmark in the long-term, that would be silly :)

MathieuDuponchelle commented 8 years ago

And I just pushed a new set of commits to implement initial inline parsing extensibility, still on https://github.com/MathieuDuponchelle/cmark/commits/extensions_draft_3. Some other methods of the "subject" will need to be exposed for full extension functionality, but the current state is sufficient to parse strikethrough tags in an appropriate way (through the delimiter stack process).