dart-lang / markdown

A Dart markdown library
https://pub.dev/packages/markdown
BSD 3-Clause "New" or "Revised" License

Make it possible to ignore unbalanced tokens (**, *) #599

Open matthew-carroll opened 3 months ago

matthew-carroll commented 3 months ago

Consider a text segment like **something*. Currently, parsing that text with this package yields a literal * followed by "something" in italics.

I'm using this package to implement Markdown serialization as the user types. In the case of the user typing, the typical industry practice is to ignore non-matching Markdown tokens. Therefore, **something* would remain as-is - the Markdown wouldn't be applied.

Is it possible to tell this package not to apply unbalanced tokens? If that's not possible, can that be added to the syntax options? Perhaps this option should be made available in the constructor for TagSyntax or something like that.

srawlins commented 3 months ago

> In the case of the user typing, the typical industry practice is to ignore non-matching Markdown tokens.

Can you cite where this is standard? I'd like to play with it.

> Is it possible to tell this package not to apply unbalanced tokens? If that's not possible, can that be added to the syntax options?

Generally, no. We adhere to the CommonMark spec, and match our implementation to that standard. There isn't really a notion of "unbalanced tokens" so I think this would be a sizeable undertaking, to specify a lot of behavior. CommonMark has a notion of "delimiter runs" that are used in some syntaxes like emphasis and strong emphasis. Someone could maybe implement some notion of "unbalanced tokens" in that code.

Otherwise, I think you'd want to take an un-rendered Markdown concrete syntax tree, where you might have enough information to determine that the delimiter runs are unbalanced. But this package doesn't currently export a Markdown syntax tree.

matthew-carroll commented 3 months ago

> Can you cite where this is standard? I'd like to play with it.

It's not a standard, but you can go try inline Markdown styles with Notion, Linear, Slack, etc. There's a variety of behavior, but you'll see a number of behaviors that are difficult/impossible to implement with this package.

> Generally, no.

Is there another package you would suggest for this purpose? This package seems to have enshrined itself as the go-to place for parsing Markdown in the Dart/Flutter ecosystem. I don't think I'm aware of a meaningful alternative.

Moreover, is the described goal truly outside the mission of this package? I understand that historically this package has been used as a batch parser for a blob of Markdown. But in so doing, it enshrines all sorts of syntax and protocol details. It sure seems like a waste to go build a new package and re-invent all of that just to be able to apply inline Markdown the way numerous products do today. Can this package introduce a second top-level parser that uses the existing internals but is designed for use on text as the user types it? That wouldn't need to mess with the batch parser.

srawlins commented 3 months ago

> Notion, Linear, Slack, etc.

Hmm, I don't seem to be able to try any of these without signing up.

> Is there another package you would suggest for this purpose?

No, afaik this is the best-supported markdown parser/renderer package in Dart.

> Moreover, is the described goal truly outside the mission of this package?

No, I don't think so. I think that using the package more programmatically, like getting a Markdown syntax tree, is well within the scope of this package. We just don't have that feature yet. There is the flutter_markdown package which I think treats this package's output, a tree of HTML nodes, as if it were a tree of Markdown nodes. I haven't looked at the code, but I have to imagine this would be error prone, or a real pain to implement.

If you had access to the Markdown tree, you could maybe take Hello **Goodbye*, and look for nodes in the tree like:

And say, "Ah, there is an Emphasis node that follows a Text node ending with a '*'; that should disappear." Or, I guess your request from the top is that these should be treated as two Text nodes, with the delimiters put back where they were:
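The two tree shapes described here can be sketched with made-up node classes. This is only an illustration under stated assumptions: `MdNode`, `TextNode`, `EmphasisNode`, `plainText`, and `restoreUnbalanced` are hypothetical names, and package:markdown does not currently export a Markdown syntax tree. The input list is the CommonMark reading of Hello **Goodbye*, and the rewrite puts the consumed delimiters back into two Text nodes.

```dart
// Hypothetical node types for illustration only; package:markdown does not
// export a Markdown syntax tree, so none of these classes exist there.
abstract class MdNode {}

class TextNode extends MdNode {
  TextNode(this.text);
  final String text;
}

class EmphasisNode extends MdNode {
  EmphasisNode(this.delimiter, this.children);
  final String delimiter; // '*' or '_'
  final List<MdNode> children;
}

/// Concatenates the literal text under [nodes]. Simplification: delimiters of
/// nested emphasis are not restored.
String plainText(List<MdNode> nodes) {
  final buffer = StringBuffer();
  for (final node in nodes) {
    if (node is TextNode) buffer.write(node.text);
    if (node is EmphasisNode) buffer.write(plainText(node.children));
  }
  return buffer.toString();
}

/// Rewrites "Text ending in a delimiter, then Emphasis" into two Text nodes
/// with the consumed delimiters put back where they were.
List<MdNode> restoreUnbalanced(List<MdNode> nodes) {
  final out = <MdNode>[];
  for (final node in nodes) {
    final prev = out.isEmpty ? null : out.last;
    if (node is EmphasisNode &&
        prev is TextNode &&
        prev.text.endsWith(node.delimiter)) {
      out[out.length - 1] = TextNode(prev.text + node.delimiter);
      out.add(TextNode(plainText(node.children) + node.delimiter));
    } else {
      out.add(node);
    }
  }
  return out;
}

void main() {
  // The CommonMark reading of `Hello **Goodbye*`.
  final parsed = <MdNode>[
    TextNode('Hello *'),
    EmphasisNode('*', [TextNode('Goodbye')]),
  ];
  for (final node in restoreUnbalanced(parsed)) {
    print((node as TextNode).text); // "Hello **", then "Goodbye*"
  }
}
```

Balanced emphasis, where the preceding Text node does not end in a leftover delimiter, passes through this rewrite unchanged.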

> Can this package introduce a second top-level parser that uses the existing internals but is designed for use on text as the user types it? That wouldn't need to mess with the batch parser.

I'm not sure. I don't have a sense of what output you would want. I think it would need a lot of specification in order to see how you'd implement it. The CommonMark spec says that Hello **Goodbye* is perfectly legal markdown text. There is no sense of an error, or recovery. If you want to write "Hello " followed by "Goodbye*" in italic, this markdown text is precisely how you'd do it. 🤷

But maybe the CommonMark examples can give us examples of what you're going for, in a "user is typing" mode. I can look at adjacent examples of "This text would render as this syntax, but this text would not." Like for ATX headings, examples 71, 72, and 73 show that you can include a closing sequence of delimiters, like ## foo ##. Then example 75 shows that the closing sequence must be preceded by a space or tab, so in # foo#, the trailing # is not treated as a closing sequence. CommonMark specifies this is rendered as <h1>foo#</h1>. It sounds like you would rather not render a heading at all, and just render # foo# as text, in a "user is typing" mode.

Or example 79 shows that an empty ATX heading, like ##, should be rendered as an empty heading, like <h2></h2>. You might want this to instead render as ## in a "user is typing" mode? There are a lot of considerations, and I think a lot of details and case-by-case rules, for what you're going for.
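For reference, the ATX heading rules cited above can be sketched in a few lines of plain Dart. This is a simplified illustration, not this package's implementation: `AtxHeading` and `parseAtxHeading` are hypothetical names, backslash escapes and inline parsing are ignored, and `trim()` strips any whitespace rather than only spaces and tabs.

```dart
class AtxHeading {
  AtxHeading(this.level, this.content);
  final int level;
  final String content;
}

// Opening sequence: up to 3 spaces, 1-6 '#', then a space, tab, or end of line.
final _opening = RegExp(r'^ {0,3}(#{1,6})(?=[ \t]|$)');

// Closing sequence: a run of '#' that is either the whole remaining content
// or is preceded by at least one space or tab.
final _closing = RegExp(r'(^|[ \t]+)#+$');

/// Returns the heading level and raw content, or null if [line] is not an
/// ATX heading under the simplified rules above.
AtxHeading? parseAtxHeading(String line) {
  final open = _opening.firstMatch(line);
  if (open == null) return null;
  final rest = line.substring(open.end).trim();
  return AtxHeading(open.group(1)!.length, rest.replaceFirst(_closing, ''));
}

void main() {
  for (final line in ['## foo ##', '# foo#', '##', '#hashtag']) {
    final heading = parseAtxHeading(line);
    print(heading == null
        ? '"$line" is not a heading'
        : '"$line" is h${heading.level} with content "${heading.content}"');
  }
}
```

A "user is typing" mode would presumably layer a different policy on top of rules like these, for example declining to treat a bare ## as a heading until content follows.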

matthew-carroll commented 3 months ago

I do want to clarify that there are definitely a variety of legal Markdown tokens that are ignored under certain circumstances by these various apps. So the parsing goals that I'm implementing aren't really about legal vs erroneous syntax. Instead, it's about the UX of typing Markdown as you go.

UX Considerations

The fully isolated style cases are handled as expected in other apps, e.g., **bold** and *italics*. But here are some as-you-type Markdown serializations that I've observed in Notion and Linear.

"**this*" -> no style is applied

"**this* and *that*" -> changes to "**this* and that" with "that" in italics

Then, taking the above "**this* and that" (with "that" in italics) and typing two more trailing "*" results in "this and that", with the whole thing bold and "that" still in italics.

From a holistic parsing perspective, these examples likely seem strange. Sometimes a legal syntax is applied and other times it's not. But when you're the person typing the syntax, the desired rules are a bit different. For example, as I mentioned in the original post, when you're typing out the characters "**bold**" you must first type "**bold*", which would greedily turn "bold" into italics and lop off a pair of "*". The user doesn't want that.
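One way to express the first two observations is a "balanced runs only" rule: a style applies only when the run of "*" just typed has the same length as the opening run. The sketch below is an assumption about how such a rule could look, not how Notion or Linear implement it. `styleAtEnd` is a hypothetical helper, it only handles "*", and it does not allow a stray "*" inside the styled content.

```dart
// Opening run of 1 or 2 '*' (not part of a longer run), content without '*'
// or edge whitespace, then a closing run identical to the opening one.
final _balanced = RegExp(r'(?<!\*)(\*{1,2})([^*\s](?:[^*]*[^*\s])?)\1$');

/// Returns 'bold' or 'italic' when [typed] ends with a balanced pair of runs,
/// or null when the trailing tokens should be left alone.
String? styleAtEnd(String typed) {
  final match = _balanced.firstMatch(typed);
  if (match == null) return null;
  return match.group(1)!.length == 2 ? 'bold' : 'italic';
}

void main() {
  // Simulate typing "**bold**" one character at a time.
  const input = '**bold**';
  for (var i = 1; i <= input.length; i++) {
    final prefix = input.substring(0, i);
    print('$prefix -> ${styleAtEnd(prefix) ?? 'no style'}');
  }
}
```

With this rule the intermediate "**bold*" stays unstyled, and only the final keystroke produces bold.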

Performance

Performance may also be noteworthy here. Given that this parsing is taking place as the user types, it's probably not possible to re-parse the whole document on every edit. For this reason, for example, I'm only implementing recognition of Markdown syntax within a single paragraph/node. I'm not considering something like bold "**" spanning across paragraphs (I do apply bold across paragraphs when fully deserializing a document, but I don't look for it as the user types).

Second, even a single paragraph might be quite long, and might include a number of other styles and perhaps inline widget content. It may not be acceptable to re-parse even a full paragraph on every keystroke. But if a parser begins at a reported caret position and only considers a closing style token immediately upstream from the caret, such as "**bold**|", "*italics*|", or "~strikethrough~|", then the parser can bail quickly in most cases. Even when it does find Markdown syntax, it typically won't consume more than a dozen characters.
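A rough sketch of that caret-first scan, assuming single-character tokens and a fixed lookback budget. `styleTokens`, `findOpeningUpstream`, and `maxLookback` are all hypothetical names chosen for this illustration. The function returns immediately when the character before the caret is not a style token, and otherwise walks upstream over at most `maxLookback` characters looking for an opening run of the same length.

```dart
// Assumed set of single-character style tokens; an editor would pick its own.
const styleTokens = {'*', '~'};

/// Returns the offset of the opening run that matches the run ending at
/// [caret], or null when there is nothing to style.
int? findOpeningUpstream(String text, int caret, {int maxLookback = 200}) {
  if (caret < 1 || caret > text.length) return null;
  final token = text[caret - 1];
  // Most keystrokes are ordinary characters, so most calls return here.
  if (!styleTokens.contains(token)) return null;

  // Measure the closing run that ends at the caret.
  var closeStart = caret - 1;
  while (closeStart > 0 && text[closeStart - 1] == token) {
    closeStart--;
  }
  final runLength = caret - closeStart;

  // Walk upstream over earlier runs of the same token, giving up at the limit.
  final limit = closeStart > maxLookback ? closeStart - maxLookback : 0;
  var i = closeStart - 1;
  while (i >= limit) {
    if (text[i] != token) {
      i--;
      continue;
    }
    final runEnd = i + 1;
    while (i >= 0 && text[i] == token) {
      i--;
    }
    // Only a run of exactly the same length counts as the opener.
    if (runEnd - (i + 1) == runLength) return i + 1;
  }
  return null;
}

void main() {
  final paragraph = 'x' * 10000 + ' ~gone~ tail';
  print(findOpeningUpstream(paragraph, 10007)); // caret right after "~gone~"
  print(findOpeningUpstream(paragraph, 5000)); // caret after a plain character
}
```

The work per keystroke depends on the lookback budget, not on the paragraph length, which is the property the paragraph above is after.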

Multiple Parsers

Given that these rules are not about legal vs erroneous Markdown syntax, it's very possible that apps will want different rules. One way to handle that in this package would be to build a few different parsers with different policies. Or, something like an UpstreamMarkdownParser could be introduced with appropriate hooks so that each app can configure it to reflect their desired rules.
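To make the "hooks" idea concrete, here is one possible shape for it. Everything in this sketch is hypothetical: `UpstreamMarkdownParser` is only the name floated above, and `StyleRule`, `UpstreamMatch`, `matchAt`, and the `accept` hook are invented for illustration, not an API of this package. Each app supplies its own rules, and each rule can veto a match through its hook. Tokens are assumed to be one repeated character, such as "*" or "**".

```dart
typedef ContentPolicy = bool Function(String content);

bool _nonBlank(String content) => content.trim().isNotEmpty;

/// One style an app wants recognised, plus a hook to veto individual matches.
class StyleRule {
  const StyleRule(this.token, this.style, {this.accept = _nonBlank});
  final String token; // assumed to be one repeated character, e.g. '**'
  final String style;
  final ContentPolicy accept;
}

class UpstreamMatch {
  UpstreamMatch(this.rule, this.start, this.end, this.content);
  final StyleRule rule;
  final int start; // offset of the opening token
  final int end; // offset just past the closing token
  final String content;
}

class UpstreamMarkdownParser {
  // Longer tokens are tried first so '**' wins over '*'.
  UpstreamMarkdownParser(List<StyleRule> rules)
      : _rules = [...rules]
          ..sort((a, b) => b.token.length.compareTo(a.token.length));
  final List<StyleRule> _rules;

  /// True when the token at [start] is not part of a longer run.
  static bool _isolated(String text, int start, String token) {
    final ch = token[0];
    final end = start + token.length;
    final before = start > 0 && text[start - 1] == ch;
    final after = end < text.length && text[end] == ch;
    return !before && !after;
  }

  /// Looks upstream from [caret] for a balanced pair of one rule's token.
  UpstreamMatch? matchAt(String text, int caret) {
    final upstream = text.substring(0, caret);
    for (final rule in _rules) {
      final token = rule.token;
      if (!upstream.endsWith(token)) continue;
      final closeStart = caret - token.length;
      if (!_isolated(upstream, closeStart, token)) continue;
      final searchFrom = closeStart - token.length - 1;
      if (searchFrom < 0) continue;
      final openStart = upstream.lastIndexOf(token, searchFrom);
      if (openStart < 0 || !_isolated(upstream, openStart, token)) continue;
      final content = upstream.substring(openStart + token.length, closeStart);
      if (!rule.accept(content)) continue;
      return UpstreamMatch(rule, openStart, caret, content);
    }
    return null;
  }
}

void main() {
  final parser = UpstreamMarkdownParser([
    const StyleRule('*', 'italic'),
    const StyleRule('**', 'bold'),
    // App-specific policy: strikethrough only applies to a single word.
    StyleRule('~', 'strikethrough', accept: (c) => !c.contains(' ')),
  ]);
  for (final text in ['**this*', '**bold**', '~a b~', '~ab~']) {
    final match = parser.matchAt(text, text.length);
    print('$text -> ${match?.rule.style ?? 'no style'}');
  }
}
```

An app with different UX rules would pass a different rule list or different `accept` hooks rather than needing a separate parser.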

As the User Types

To be clear about what I mean by "as the user types": I just mean a policy that understands a caret position within the text. It essentially means "hey parser, look upstream from offset X". There's no suggestion in this proposal that this package have any knowledge of editing systems, such as the IME. It's just about when and where the parser does its work.