fread-ink / epub-cfi-resolver

A simple parser, resolver and generator for the EPUB-CFI format
GNU Affero General Public License v3.0
18 stars 3 forks source link

html being decoded and triggering as side effect. #25

Open mbret opened 2 hours ago

mbret commented 2 hours ago

I have an epub with this content:

<section class="sect2" title="Figures">
                    <h3 class="title" id="_figures">Figures</h3>
                    <p>Coming up for a quick breath of fresh air before descending into another
                        accessibility attribute pain point, HTML5 introduces the handy new <code
                            class="literal">figure</code> element for encapsulating content
                        associated with an image, table, or code example. Grouping related content
                        elements together, as is becoming an old theme now, makes it simpler for a
                        reader to navigate and understand your content:</p>
                    <pre class="screen">&lt;figure&gt;
    &lt;img src="images/blob.jpeg" alt="the blob"/&gt;
    &lt;figcaption&gt;
        Figure 3.7 &amp;#x2014; The blob is digesting Steve McQueen in this
        unreleased ending to the classic movie.
    &lt;/figcaption&gt;
&lt;/figure&gt;</pre>
                    <p>Unfortunately, there is little support for these two new elements at this
                        time, so they get treated as no better than <code class="literal">div</code>
                        elements. That said, it’s still preferable to future-proof your data and do
                        the right thing, as support will catch up, especially since the only other
                        alternative is semantically meaningless <code class="literal">div</code>
                        elements.</p>
                </section>

The problematic is the figure section. When the method decodeEntities is called with the text node, the el.innerHTML = str instruction will actually load and run the html code that was in the figure. Unintentionally trigger http requests and other wort things.

Is it an actual bug or this side effect is expected and we should not work with live document ?

In the meantime I was trying to find a safer alternative and here what I came up with:

decodeEntities(dom: Document, str: string) {
    try {
      const parser = new DOMParser()
      const doc = parser.parseFromString(
        `<!DOCTYPE html><text>${str}</text>`,
        "text/html",
      )
      return doc.querySelector("text")?.textContent || str
    } catch (err) {
      // TODO fall back to simpler decode?
      // e.g. regex match for stuff like &#160; and &nbsp;
      return str
    }
  }

It does not crash and seems to be returning the correct node and CFI. I am not super confident regarding performance of using DOMParser vs the same document where we create an new temporary element

mbret commented 2 hours ago

Okay so another alternative I just though which could be faster is to create a temporary dom like this:

const newDom = document.implementation.createHTMLDocument()
      const el = newDom.createElement(`textarea`)
      el.innerHTML = str
      return el.value || ``

Since it's not the current dom, it does not trigger html resources. However I believe we could initialize it on the constructor and destroy it once the CFI is resolved, that could increase performances.

Additionally, why is this decoding needed to begin with ? Why isn't the string just enough ?