Bistard / nota

📕Next-Gen AI-powered markdown note-taking application.
MIT License
59 stars 5 forks source link

[UI][Editor] Investigation on Nested Marks in ProseMirror #236

Closed Bistard closed 3 weeks ago

Bistard commented 1 month ago

Introduction

I'm using marked as a tokenizer to convert raw Markdown into tokens, and I’ve developed my own parser (based on ProseMirror’s official default Markdown parser) to transform these tokens into ProseMirror-compatible document nodes.

Background

My parser is closely modeled on the code provided by prosemirror-markdown. For example, similar to openMark() and closeMark() from prosemirror-markdown, I use the following methods to handle mark tokens:

public activateMark(mark: ProseMark): void {
    const active = this.__getActive();
    active.marks = mark.addToSet(active.marks);
}

public deactivateMark(mark: ProseMarkType): void {
    const active = this.__getActive();
    active.marks = mark.removeFromSet(active.marks);
}

The Issue (Corner Case)

Consider this Markdown input: *This is *italic* text*.

The tokenized result from marked looks like this:

  1. Paragraph [block]
    1. Em [inline]
      1. Text: "This is" [inline]
      2. Em: "italic" [inline]
        1. Text: "italic" [inline]
      3. Text: "text" [inline]

In the parser:

The Problem

When activating the same type of mark consecutively (like two em marks in this case), only one instance of the mark is added to the markSet. As a result, two activateMark calls will still leave just one em mark in the markSet.

However, when deactivateMark is called twice (once for each nested em), the first deactivateMark removes the single em from the markSet, and the second deactivateMark is effectively removing a non-existent em mark.

Analysis of the Corner Case

In the nested case *This is *italic* text*, here’s how the bug manifests:

  1. The first activateMark for *This is *italic* text* adds an em mark to the markSet.
  2. The second activateMark for *italic* doesn’t add a second em mark because the markSet can only hold one instance of the same mark type.
  3. When the first deactivateMark for *italic* is called, it removes an em mark from the paragraph.
  4. When creating a text node with the text "text", it obtains the markSet from the current active node (the paragraph), but the first deactivateMark has already removed the em. As a result, the "text" part of the string no longer has an em mark, even though it should.
  5. When the second deactivateMark for *This is *italic* text* is called, it tries to remove an em mark from the paragraph's markSet, but there is nothing there anymore due to step 3.

Conclusion

Due to this issue, the final part "text" in the tokenized result incorrectly lacks the em mark. The problem arises because the parser is not handling the nested activation and deactivation of the same mark type correctly.

Request and Question

I am aware that there are several ways to solve this issue without modifying ProseMirror’s core code, but I think it is less elegant coding and a little bit messy in terms of coding style.

Bistard commented 1 month ago

The response from the author of ProseMirror.