Finding the character range for a given node?

keehun commented 3 years ago

I am quite new to the cmark community, but after some initial digging, it doesn't seem like cmark nodes keep a reference to which character range of the input text produced the node.

For example, given the input string "Hello **world**", I'd want something like "range: 6...14" (zero-based, inclusive) attached to the Strong node.

One use for this information would be to parse markdown "descriptively", without lossily rendering it into another form. For example, maybe I just want to find the Strong nodes and apply a particular style to the corresponding spans of the markdown source.

My first approach was to take the nodes and "reconstruct" the markdown source from them, but this process is not robust and leaves too much room for error, because the markdown formatting characters are lost along the way. If I could get the range for a node, I could keep the original markdown source and decorate it additively.

Is this even feasible given the architecture of cmark? I intend to keep looking myself, but I figured I'd ask in case anyone has thought about this already.

Thank you

jgm commented 3 years ago

A cmark_node has fields start_line, start_column, end_line, end_column.

jgm commented 3 years ago

Be sure to set CMARK_OPT_SOURCEPOS in options when parsing.
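
For anyone who wants to turn those four fields into a range in the original text, here is a minimal sketch in plain C. It does not call cmark itself: `source_pos` is a hypothetical holder for the values you would read from a node (cmark's public header offers accessors such as `cmark_node_get_start_line` and `cmark_node_get_end_column`), and `byte_offset` / `byte_range` are illustrative helper names. It assumes lines and columns are 1-based, columns count bytes, the end column is inclusive, and lines end with `\n`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for the four position values reported for a node. */
typedef struct {
    int start_line, start_column, end_line, end_column;
} source_pos;

/* Returns the 0-based byte offset of (line, column) in src, or -1 if the
   position falls outside the buffer. Assumes 1-based lines, 1-based byte
   columns, and '\n' line endings. */
long byte_offset(const char *src, size_t len, int line, int column) {
    assert(src != NULL);
    if (line < 1 || column < 1) return -1;

    size_t line_start = 0;
    for (int current = 1; current < line; current++) {
        /* Skip to the byte just past the next newline. */
        const char *nl = memchr(src + line_start, '\n', len - line_start);
        if (nl == NULL) return -1;
        line_start = (size_t)(nl - src) + 1;
    }

    size_t offset = line_start + (size_t)(column - 1);
    return offset < len ? (long)offset : -1;
}

/* Converts a position into a half-open byte range [start, end).
   Returns 0 on success and -1 if the position does not fit the buffer. */
int byte_range(const char *src, size_t len, source_pos pos,
               size_t *start, size_t *end) {
    long s = byte_offset(src, len, pos.start_line, pos.start_column);
    long e = byte_offset(src, len, pos.end_line, pos.end_column);
    if (s < 0 || e < s) return -1;
    *start = (size_t)s;
    *end = (size_t)e + 1; /* the end column is treated as inclusive */
    return 0;
}
```

With the source `Hello **world**` and a hand-written position of line 1, columns 7 to 15, `byte_range` yields the half-open range 6 to 15, which covers the nine bytes of `**world**`. The helper does not check that a column stays within its own line, so a real implementation would want to validate that too.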

keehun commented 3 years ago

Thank you. I totally missed that!

Am I reading it right that the columns it gives are byte counts within the line? For example, for a line containing # 👨‍👩‍👧‍👧, instead of a start column of 1 and an end column of 3, it gives a start column of 1 and an end column of 27, which is exactly the number of bytes in # 👨‍👩‍👧‍👧.

I'm guessing there's no recognition of grapheme clusters, and no configuration option to count "characters" that may each span many bytes?

jgm commented 3 years ago

I think you're right that it is counting bytes. I haven't looked at the code for some time. Obviously, this isn't ideal for all purposes, but it uses a number cmark has to keep track of anyway.

keehun commented 3 years ago

The more I think about it, the more it makes sense for cmark to stay neutral about the different ways that languages (and their standard libraries) count the "length" of a string. Bytes are the most fundamental, "unbiased" measurement. It just makes my job a little bit harder 😅
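
For reference, a minimal sketch of the conversion on the caller's side, in plain C. Given the first n bytes of a line (for example, everything up to a reported column), it counts the same prefix in Unicode code points and in UTF-16 code units. `text_units` and `measure_utf8` are hypothetical names, and the input is assumed to be valid UTF-8 that ends on a character boundary. Counting grapheme clusters is a different problem that needs Unicode segmentation data, so it is not attempted here.

```c
#include <assert.h>
#include <stddef.h>

/* Length of one UTF-8 prefix expressed in two other units. */
typedef struct {
    size_t code_points;
    size_t utf16_units;
} text_units;

/* Measures the first n bytes of s. Assumes s is valid UTF-8 and that
   byte n falls on a character boundary. */
text_units measure_utf8(const char *s, size_t n) {
    assert(s != NULL || n == 0);
    text_units out = {0, 0};
    for (size_t i = 0; i < n; i++) {
        unsigned char b = (unsigned char)s[i];
        if ((b & 0xC0) == 0x80) {
            continue; /* continuation byte: belongs to the previous code point */
        }
        out.code_points += 1;
        /* Lead bytes >= 0xF0 start 4-byte sequences, which encode code points
           above U+FFFF and therefore need a surrogate pair in UTF-16. */
        out.utf16_units += (b >= 0xF0) ? 2 : 1;
    }
    return out;
}
```

For the 27-byte heading line above, this gives 9 code points (the #, the space, four emoji, and three zero-width joiners) and 13 UTF-16 code units, whereas a grapheme-aware count would report 3.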

commonmark / cmark

Finding the character range for a given node? #375