commonmark / cmark

CommonMark parsing and rendering library and program in C
Other
1.6k stars 527 forks source link

Introduce `cmake_node_convert_type` #522

Closed Ericson2314 closed 5 months ago

Ericson2314 commented 5 months ago

This is useful when constructing ASTs that contain a mix of computed and parsed data. "sub documents" that are parsed have a single root document node, and rather than having to extract its children, those can be placed as-is in a larger parse tree.

Ericson2314 commented 5 months ago

(I opened a thread asking about this at https://talk.commonmark.org/t/parsing-document-and-nesting-in-larger-ast/4583/2 before I decided to just make it myself.)

jgm commented 5 months ago

@nwellnhof do you have an opinion about this?

nwellnhof commented 5 months ago

I wouldn't want to allow violations of the tree structure just for convenience. Conceptually, a document node should always be the root node.

rather than having to extract its children

This doesn't seem like a good reason. You can simply write a helper function to extract children, or track subtrees in your own data structure, or use the "custom" node type or whatever.

Ericson2314 commented 5 months ago

Well, one way to deal with things conceptually is to rename the document node to "no op" node or something, and then it makes sense that it can appear anywhere.

If that is not desirable, then I would like to go with the custom node solution by being able to convert an existing document node to a root node. I don't like having to extra all the children for what should be a simple append-child embed, but converting the node first and them embedding it feels fine.

(FWIW I come from functional programming land where having special root notes is considered somewhat suspicious. Usually any type of node in a tree can appear anywhere in that tree, and if there has to be something special on the outside, it is a different type, not just a different data constructor. @jgm should know what I am talking about :) )

nwellnhof commented 5 months ago

I don't like having to extra all the children for what should be a simple append-child embed

But why? You can do that with a helper function in ~10 lines of C.

Ericson2314 commented 5 months ago

@nwellnhof Because I believe embedding a subtree should be an O(1) operation.

nwellnhof commented 5 months ago

Typically, you only append a subtree once and the result has to be processed at some point. So it doesn't matter whether appending is O(1) or O(n).

If you really need to append subtrees more than once and performance is important, you can move its children into a CMARK_NODE_CUSTOM_BLOCK. The conversion is O(n) where n is the number of top-level children, but it only has to be done once per subtree.

Ericson2314 commented 5 months ago

Whether or not there a practical problem for O(n) vs O(1) here, I still fine making embedding a subtree more complicated than it needs to be quite ugly.

I looked at https://github.com/commonmark/cmark/blob/76cbc2d9162c1154327a746b8ad0d3f1a72b1b30/src/node.h#L56-L85 and it doesn't seem like there is any information that would special about the root node. What makes you want to have a special root node none, and not a no-op node that is allowed to appear anywhere?

nwellnhof commented 5 months ago

What makes you want to have a special root node

It simply turned out this way and nobody complained in the last ~10 years. All DOM-like APIs that I know have a special root node.

But libcmark really is different in this regard. The document node doesn't hold any additional data. From a quick look, it almost is a "no-op" node. The only exception seems to be the XML serializer.

I still wouldn't change the API just to cater to the preferences of a single user. If you really want to push this through, please make sure that the serializers continue to work as expected and add some tests in api_test.

Ericson2314 commented 5 months ago

I still wouldn't change the API just to cater to the preferences of a single user. If you really want to push this through, please make sure that the serializers continue to work as expected and add some tests in api_test.

So just to be clear, if all these things happens:

  1. Rename node, but keeping CMARK_NODE_DOCUMENT as a compat alias
  2. Fixing xml parser (I would just move the <xml ...>...</xml> outside processing any node.)
  3. Testing parsed document embedding and serialization of the result
  4. Someone else chimes in that this is useful to them

Then you would be willing to reconsider? Sounds good to me! I'll gladly take a stab at 1-3.

jgm commented 5 months ago

But libcmark really is different in this regard. The document node doesn't hold any additional data. From a quick look, it almost is a "no-op" node. The only exception seems to be the XML serializer.

Of course, it would be quite reasonable to include different data in the document node (e.g. metadata), and maybe we'd want to do this in some future iteration.

So I'd be against renaming this as a NOOP node. (Speaking of functional programming, Pandoc's document model has a distinguished top-level node. And as @nwellnhof says, it's quite normal for there to be a distinguished top-level node in hierarchical models of documents.)

I'd say: just go with the 10 lines of C and be done with it. The alternative is a number of changes to this project that will take work, might have unanticipated consequences, could constrain future changes to the document model, and don't have a large benefit.

Ericson2314 commented 5 months ago

@jgm Oh, oops, I just pushed the the other changes before seeing this.

Of course, it would be quite reasonable to include different data in the document node (e.g. metadata), and maybe we'd want to do this in some future iteration.

So I'd be against renaming this as a NOOP node. (Speaking of functional programming, Pandoc's document model has a distinguished top-level node. And as @nwellnhof says, it's quite normal for there to be a distinguished top-level node in hierarchical models of documents.)

Well that is why I said

and if there has to be something special on the outside, it is a different type, not just a different data constructor.

Per https://github.com/jgm/pandoc-types/blob/master/src/Text/Pandoc/Definition.hs#L274 Pandoc has 3 separate types Pandoc, Block, and Inline. And if one wants to pull out the [Block] from a top-level Pandoc and plop them somewhere else, that is a nice easy O(1) operation.

C having just a single "node" type and not those 3 is understandable, but it does leave a bit ambiguous "types" vs "data constructors".

If you think the right way to think of the C node type is the union of all 3 conceptual types, and so the doc node has invariants just the same way as the "block" and "inline" nodes do, then I would suggest that I contribute a function to convert a documentation node to one of the other types of nodes that also is allowed to have block children. This keeps the operation O(1) and easy, whereas with the existing interface it has to be O(n).

Ericson2314 commented 5 months ago

(In addition to a cmark_node_new, it might also be nice to have a cmark_node_concat, for similar reasons of combining two doubly-linked lists in O(1) rather than O(n). It can move children from one node to the other so we don't have to worry about combining metadata / freeing nodes / etc..)

jgm commented 5 months ago

How theoretical is this concern about O(1) vs O(N)? Is this actually a performance problem in any real-world scenario?

Ericson2314 commented 5 months ago

Entirely theoretical, I freely admit. It's all aesthetics for me about making things that can be simple and cheap actually be simple and cheap.

jgm commented 5 months ago

I'm just not convinced it's worth it. Any changes of this scale have unanticipated consequences and may produce unanticipated bugs or constraint future development, and this software has a lot of users. So there's a high bar for making changes, and I don't think considerations of aesthetics are going to get you across it.

Ericson2314 commented 5 months ago

I hope the alternative plan of just having the new operations is a little easier to reason about, since there are no data structure invariant changes?

Also, I just realized actually the concat/move operator is enough, because one can create a new node, move all the children, and then delete the old new, so we don't need to worry about how converting the type effects the union fields.

This last observation makes me quite pleased that new version should be demonstrably safe ;).

Ericson2314 commented 5 months ago

Ah no, my concat thing isn't so special because updating the parent points is still O(n) :(. Still can do the node conversion however.

Ericson2314 commented 5 months ago

OK I pushed the version.

The first two commits are pure refactors, factoring out new private static functions, and does not change the interface.

The second commit introduces cmake_node_convert_type, which is fairly small thanks to the two new private static functions.

The 3rd commit is the my same subdocument test, but now using cmake_node_convert_type instead.

Ericson2314 commented 5 months ago

OK I am quite happy with the enum node_class from the first commit, which basically codifies the "conceptual types" used to define the "what is allowed to go where" invariants.

There are 4 "classes" (of node types):

and then None + Any adjoined top and bottom for subtyping purposes.

From those classes the correspondence to e.g. Pandoc types should be pretty clear (document root -> Pandoc, true blocks -> Block, inlines -> Inline, list item -> [Block] inside [[Block]] fields).

And finally, those classes make the S_can_contain logic nicely compact, because they work entirely on the level of node type classes, not individual node types.


If this first commit looks safe in isolation, then I think the other three "fall out" very simply.

jgm commented 5 months ago

I have been hinting strongly that I am unlikely to merge this, but let me make that more explicit.

Changing the node type is a tricky operation. Different nodes have different data associated with them, for example. I think you're trying to cover all the bases in your patch, but it ends up being a complex patch that might well introduce bugs. Given that there is no pressing need for this change, I'm inclined to be conservative and not make it.

In addition, I think that there are drawbacks to integrating an existing document by creating a custom block node and putting the document's children there. That creates a distinction between a document created this way and a document that would have been created from scratch, but do we want that distinction?

Ericson2314 commented 5 months ago

I understand that you are leaning against :). I am trying to make new revisions in hopes that they actually address concerns, not wear you down or something.

I think you're trying to cover all the bases in your patch, but it ends up being a complex patch that might well introduce bugs.

What I am trying to do is in the latest versions is have a bunch of no-op refactors such that the marginal complexity of the new function is quite low.

Right now, it is just this:

diff --git a/src/node.c b/src/node.c
index e5c83db..da43a2d 100644
--- a/src/node.c
+++ b/src/node.c
@@ -175,6 +175,23 @@ cmark_node *cmark_node_new(cmark_node_type type) {
   return cmark_node_new_with_mem(type, &DEFAULT_MEM_ALLOCATOR);
 }

+int cmake_node_convert_type(cmark_node *node, cmark_node_type type) {
+  // Only need to validate children if we have any
+  if (node->first_child) {
+    enum node_class
+      old = S_compute_allowed_child(node->type),
+      new = S_compute_allowed_child(type);
+    if (!S_node_class_le(old, new))
+      return 0;
+  }
+
+  // Success path
+  S_cmark_node_deinit(node);
+  node->type = type;
+  S_cmark_node_init(node);
+  return 1;
+}
+
 // Free a cmark_node list and any children.
 static void S_free_nodes(cmark_node *e) {
   cmark_mem *mem = e->mem;

which I think is quite good! (Much better than the version where you last left a comment.)

Different nodes have different data associated with them, for example.

And I tried to only reuse existing allocation / deallocation code (the latter just now in response to https://github.com/commonmark/cmark/pull/522#discussion_r1457841954) to make sure I am not breaking any new ground.

I could not refactor any new code, and just have pure additions. Maybe that would result in a simpler patch, but that would be quite scary to me as now a bunch of logic is duplicated!

If you're really sure there is no possible version of this you would accept. We can just close the PR.

jgm commented 5 months ago

@nwellnhof do you have opinions about whether this should go forward?

Ericson2314 commented 5 months ago

FWIW, I just thought of one more variation, which is a "parse into" parsing variant where instead of the parser creating the root node, it adds children to whatever root node you give it.

With this, node type conversion is not necessary to embed parsed documents in a larger.

jgm commented 5 months ago

The variation is interesting. It would be easy to factor that out of our existing function.

I guess one question is whether there's any utility in being able to reassign block node types besides your original use case. If not, then either your variation or a function that just converts a document into a custom node could do the trick, with less complexity than the above patch.

Ericson2314 commented 5 months ago

I am not sure there is. I realized "document root" and "list item" are both no op nodes. (I think those are extrinsic definitions of a list of nodes based on their location in the AST, rather than intrinsic definition of a "real" node.)

In my test (and my external usage of this) I was converting a doc root to list tem (no op -> no op), but if you don't want do that, e.g. are adding a new section with heading, you are out of luck as no-op nodes don't go elsewhere --- you might have nothing to convert to note to but a custom node.

(Pandoc, by allowing you to alias [Block] separate from any specific type of Block or Pandoc, does have affordances for this. Of course ++ is linear, but one could imagine using a different list type for which is wasn't.)

So given the "intrusive" design of the C where the [Block] is always part of the node, not a freestanding object in its own right, I think the parse-into one is actually more powerful.

I'll close this :) #524 I pushed just before you commented :)