Closed segevfiner closed 1 year ago
this leads to weird side effects on a parse/serialize cycle.
There is no guarantee of round-trips through parsing/serializing Markdown leaving the Markdown text unchanged.
That being said, I guess if you configure markdown-it to parse HTML tags, this can be problematic. But you'll need to configure your MarkdownSerializer
for that anyway, and can use escapeExtraCharacters
to backslash-escape &
and <
.
No need to keep it unchanged of course, but I'm trying not to change its actual meaning when parsed by parsers that do process HTML tags, rather than treat them as plain text. So stuff that was escaped before and needs to be escaped to avoid having special meaning needs to remain escaped.
Have you tried escapeExtraCharacters
? It'll only affect content that was parsed as regular text.
Since the HTML is parsed as regular text, this will mean any HTML tags that were not escaped in the original Markdown will become escaped on output, losing their meaning.
I think we might have to prevent unescaping of those characters at the markdown-it level to preserve them. e.g.
import type MarkdownIt from "markdown-it";
import type StateCore from "markdown-it/lib/rules_core/state_core";
const PRESERVE = new Set(["<", ">", "&", '"', "'"]);
function preserveSpecial(state: StateCore) {
const blockTokens = state.tokens;
for (let j = 0, l = blockTokens.length; j < l; j++) {
if (blockTokens[j].type !== "inline") continue;
// eslint-disable-next-line @typescript-eslint/no-non-null-assertion
const tokens = blockTokens[j].children!;
const max = tokens.length;
for (let curr = 0; curr < max; curr++) {
if (
tokens[curr].type === "text_special" &&
PRESERVE.has(tokens[curr].content)
) {
tokens[curr].type = "text";
tokens[curr].content = tokens[curr].markup;
}
}
}
}
export default function preserveSpecialPlugin(md: MarkdownIt) {
md.core.ruler.before("text_join", "preserve_entity", preserveSpecial);
}
Since the HTML is parsed as regular text, this will mean any HTML tags that were not escaped in the original Markdown will become escaped on output, losing their meaning.
This can be said of any Markdown syntax extension that isn't turned on—that by normalizing the escapes we might be interfering with it. But accommodating for non-enabled features of any kind doesn't seem like a reasonable expectation of a markdown parser or serializer. So I think that adding this kind of extension to your configuration is your responsibility, if you want to treat HTML in this way (don't parse it, but preserve escapes).
markdown-it
currently parses HTML entities into text, this leads to weird side effects on a parse/serialize cycle.<b>Not Bold<b>
will parse into<b>Not Bold</b>
and serialize as<b>Not Bold</b>
, which leads to it being bold in Markdown parsers that do honor HTML (Or a subset thereof).  Foo
parses asFoo
and serializes asFoo
, but will parse back asFoo
, as Markdown trims normal spaces. The same applies to trailing 
, which might be further confused as an hard break.<
into prosemirror, serializes as<
but parses as<
.All of these behaviors can be seen here https://prosemirror.net/examples/markdown/ and switching between Markdown and WYSIWYM.