I'm using marked as a tokenizer to convert raw Markdown into tokens, and I’ve developed my own parser (based on ProseMirror’s official default Markdown parser) to transform these tokens into ProseMirror-compatible document nodes.
Background
My parser is closely modeled on the code provided by prosemirror-markdown. For example, similar to openMark() and closeMark() from prosemirror-markdown, I use the following methods to handle mark tokens:
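    // Adds the mark to the mark set of the currently active node.
    public activateMark(mark: ProseMark): void {
        const active = this.__getActive();
        active.marks = mark.addToSet(active.marks);
    }

    // Removes every mark of the given type from the active node's mark set.
    public deactivateMark(mark: ProseMarkType): void {
        const active = this.__getActive();
        active.marks = mark.removeFromSet(active.marks);
    }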
The Issue (Corner Case)
Consider this Markdown input: *This is *italic* text*.
The tokenized result from marked looks like this:
Paragraph [block]
    Em [inline]
        Text: "This is" [inline]
        Em: "italic" [inline]
            Text: "italic" [inline]
        Text: "text" [inline]
In the parser:
Each time a mark is activated (activateMark), it adds a mark to the active node’s markSet (in this case, the active node is the paragraph).
Each time a mark is closed (deactivateMark), the corresponding mark is removed from the markSet of the active node.
The Problem
When the same type of mark is activated consecutively (like the two em marks in this case), only one instance of the mark is added to the markSet. As a result, two activateMark calls still leave just one em mark in the markSet.
However, when deactivateMark is called twice (once for each nested em), the first deactivateMark removes the single em from the markSet, and the second deactivateMark is effectively removing a non-existent em mark.
Analysis of the Corner Case
In the nested case *This is *italic* text*, here’s how the bug manifests:
1. The first activateMark for *This is *italic* text* adds an em mark to the markSet.
2. The second activateMark for *italic* doesn’t add a second em mark because the markSet can only hold one instance of the same mark type.
3. When the first deactivateMark for *italic* is called, it removes the em mark from the paragraph.
4. When creating a text node with the text "text", the parser obtains the markSet from the current active node (the paragraph), but the first deactivateMark has already removed the em. As a result, the "text" part of the string no longer has an em mark, even though it should.
5. When the second deactivateMark for *This is *italic* text* is called, it tries to remove an em mark from the paragraph's markSet, but there is nothing there anymore due to step 3.
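The sequence can be reproduced without ProseMirror at all. The sketch below is only an illustration: `MarkNames`, `addToSet`, `removeFromSet`, and `parseNaively` are hypothetical stand-ins that model a mark set as an array holding at most one entry per mark type, which is the behaviour described above, not the real prosemirror-model API.

```typescript
// Simplified stand-in for a ProseMirror mark set: an array of mark type
// names that holds at most one entry per type. Not the real API.
type MarkNames = readonly string[];

function addToSet(name: string, set: MarkNames): MarkNames {
  // Adding a mark that is already present leaves the set unchanged.
  return set.indexOf(name) !== -1 ? set : set.concat(name);
}

function removeFromSet(name: string, set: MarkNames): MarkNames {
  // Removes the mark type entirely; a no-op if it is not present.
  return set.filter((m) => m !== name);
}

interface TextNode {
  text: string;
  marks: MarkNames;
}

// Replays the activate/deactivate calls for *This is *italic* text*.
function parseNaively(): TextNode[] {
  let marks: MarkNames = [];
  const out: TextNode[] = [];
  const addText = (text: string): void => {
    out.push({ text, marks });
  };

  marks = addToSet("em", marks);      // outer em opens
  addText("This is ");
  marks = addToSet("em", marks);      // inner em opens: set is unchanged
  addText("italic");
  marks = removeFromSet("em", marks); // inner em closes: em is gone
  addText(" text");                   // created without em
  marks = removeFromSet("em", marks); // outer em closes: nothing to remove
  return out;
}

const naiveResult = parseNaively();
```

In this model, the first two text nodes carry em and the trailing " text" node does not, which matches the behaviour described in steps 3 and 4.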
Conclusion
Due to this issue, the final part "text" in the parsed document incorrectly lacks the em mark. The problem arises because the parser does not handle nested activation and deactivation of the same mark type correctly.
Request and Question
I am aware that there are several ways to work around this issue without modifying ProseMirror’s core code, but they strike me as less elegant and a little messy in terms of coding style.
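For reference, one workaround of this kind might look like the sketch below: keep a nesting counter per mark type and only touch the mark set when the counter moves between 0 and 1. This is an assumption about what such a workaround could be, not code from prosemirror-markdown; `NestedMarkTracker` and its methods are hypothetical names, and mark types are modelled as plain strings rather than real ProseMirror marks.

```typescript
// Hypothetical helper: tracks how deeply each mark type is nested so that
// an inner close does not remove a mark that an outer open still needs.
class NestedMarkTracker {
  private depth: Record<string, number> = {};
  private marks: readonly string[] = [];

  activate(name: string): void {
    const d = this.depth[name] || 0;
    this.depth[name] = d + 1;
    // Only the outermost open actually adds the mark.
    if (d === 0) this.marks = this.marks.concat(name);
  }

  deactivate(name: string): void {
    const d = this.depth[name] || 0;
    if (d === 0) return; // unbalanced close: ignore
    this.depth[name] = d - 1;
    // Only the outermost close actually removes the mark.
    if (d === 1) this.marks = this.marks.filter((m) => m !== name);
  }

  current(): readonly string[] {
    return this.marks;
  }
}

// Replaying *This is *italic* text* against the tracker.
const tracker = new NestedMarkTracker();
tracker.activate("em");   // outer em opens
tracker.activate("em");   // inner em opens
tracker.deactivate("em"); // inner em closes, outer is still open
const marksForTrailingText = tracker.current();
tracker.deactivate("em"); // outer em closes
const marksAfterOuterClose = tracker.current();
```

With the counter in place, the marks seen by the trailing "text" node still include em, and the set is empty only after the outer em closes. The cost is extra state that has to be kept next to every active node, which is part of why this kind of fix feels messy.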