Ampersand escapes in text content handled differently to JSX

natevw commented 5 years ago

Reproduction

With the following code:

import htm from "https://unpkg.com/htm@2.1.1/dist/htm.mjs";
import { h } from "https://unpkg.com/preact@8.4.2/dist/preact.mjs";
const html = htm.bind(h);

console.log(html`<div>&lt;</div>`)

I get a VNode with .children = ["&lt;"], i.e. what I mean to render as < gets rendered as &lt; instead.

Expected results

Testing this in JSX via say https://jsx.egoist.moe/?mode=vue, <div>&lt;</div> gets transformed into h("div", ["<"]); as I originally expected.

I haven't poked into how they are doing this… seems like something that ultimately relies on a lookup table.

Workaround

I am able to escape via string interpolation (e.g. ${'<'} instead of &lt;). Changing the last line in my sample code to:

// …

console.log(html`<div>${'<'}</div>`)

Results in a VNode with .children = ["<"] as I need. Is this the recommended style?

developit commented 5 years ago

This is a tough one. As you pointed out, this requires a lookup table for a portable implementation. A browser-specific implementation could leverage the DOM to transform HTML entities, but there would be performance implications. My usual approach is to pull the text out into strings as you described, but it seems like we'll want a solution for this in order to maintain some semblance of parity with JSX.

natevw commented 5 years ago

Most of the rather large HTML entity set seems more for "typing on an ASCII keyboard" (convenience) than necessity. How about supporting only the "core" ones that are built into XML? They are the ones needed syntactically:

quot, amp, lt, gt, and apos*?

*iiuc &apos; was not standardized on the HTML side until the HTML5 spec, so maybe it could be left out?

There's also the decimal/hex forms; personally I don't see a strong need for those but they'd just be code rather than LUT entries.
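
As a rough sketch of what that could look like (decodeCoreEntities and CORE_ENTITIES are names made up for illustration, not anything htm provides), the five XML entities fit in a small table and the numeric forms are handled in code:

```javascript
// Hypothetical helper, not part of htm: decodes only the five XML-defined
// named entities plus decimal/hex numeric character references.
const CORE_ENTITIES = { quot: '"', amp: '&', lt: '<', gt: '>', apos: "'" };

function decodeCoreEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched rather than throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Names outside the table (e.g. &middot;) pass through unchanged.
    return Object.prototype.hasOwnProperty.call(CORE_ENTITIES, body)
      ? CORE_ENTITIES[body]
      : match;
  });
}
```

Because the replacement is a single pass, &amp;lt; comes out as the literal text &lt; rather than being decoded twice.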

microlancer commented 5 years ago

Until someone merges the correct solution, here's what I'm using temporarily.

In my class, I added:

    decode(str) {
        const s = "<b>" + str + "</b>";
        let e = document.createElement("decodeIt");
        e.innerHTML = s;
        return e.innerText;
    }

And to use it:

render() { 
  return html`<div>Hello ${this.decode('&middot;')} Goodbye</div>`;
}

It's probably not good code, but it works for me at least. Might help someone else too.

goranmoomin commented 5 years ago

@thorie7912 Wouldn't that be super slow, since it creates a DOM node every time you render? I'm not sure it's a good idea... It would be better to use a package like unescape.

microlancer commented 5 years ago

Yeah, it's very slow. But like I said, it's temporary. I hope this ticket gets resolved soon. I don't think I can use a package like unescape, because I'm not using Node.js. I'm directly pulling HTM from a CDN.

This could be optimized, if we keep one DOM node available for doing all conversions. Then we don't need to recreate a new DOM node every time. It's only using the text, and it can be replaced for each entity we want to convert.
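
Something along these lines is what I have in mind. It is an untested sketch: createDomDecoder is a name I made up, and it assumes a DOM is available, so it won't help in a web worker.

```javascript
// Sketch only: createDomDecoder is a hypothetical name, not an htm or Preact API.
// It needs a DOM, so it will not work in a web worker or in plain Node.
function createDomDecoder(doc = globalThis.document) {
  // One scratch element, created once and reused for every call.
  // A <textarea> treats its content as text, so markup inside `str`
  // is not turned into child elements.
  const scratch = doc.createElement('textarea');
  return function decode(str) {
    scratch.innerHTML = str; // the HTML parser resolves the entities
    return scratch.value;    // read the decoded text back
  };
}
```

You would call createDomDecoder() once, keep the returned function around, and use it in place of this.decode in the render() example above.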

natevw commented 5 years ago

@thorie7912 For character entities like that it would be better to manually decode them yourself. Assuming you can save/serve your file as UTF-8 then it will read well as simply:

render() { 
  return html`<div>Hello · Goodbye</div>`;
}

If your content can only be served as ASCII (and you also can't <meta charset="utf-8"> inline):

render() { 
  return html`<div>Hello \u00b7 Goodbye</div>`;
}

There are only a couple characters where you can't always do this; e.g. the < character would get parsed as the opening of a tag in some contexts, even if "escaped" [at the source-code level] as \u003c. For those, again rather than sending the entity out to the DOM for decoding, simply pre-convert and "escape" them via the original workaround above:

render() {
  return html`<div>Hello ${'<'} Goodbye</div>`;
}
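
To illustrate why the \u003c escape doesn't help while interpolation does, here is a small sketch. The inspect tag is hypothetical and only shows what any tagged template receives. It assumes htm reads the cooked strings, which is what the behavior described above suggests.

```javascript
// Hypothetical tag used only to inspect what a tagged template receives.
function inspect(strings, ...values) {
  return { cooked: Array.from(strings), raw: Array.from(strings.raw), values };
}

const escaped = inspect`<div>\u003c</div>`;
// escaped.cooked[0] is '<div><</div>': the escape is already resolved
// before the tag function runs, so a parser sees a bare '<'.
// escaped.raw[0] still holds the six characters \u003c.

const interpolated = inspect`<div>${'<'}</div>`;
// interpolated.cooked is ['<div>', '</div>'] and interpolated.values is ['<']:
// the '<' arrives as a separate value and is never part of the markup text.
```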

developit commented 5 years ago

Just wanted to note that I've read this and am pondering what we could do to move forward.

@natevw I like your point about which ones are needed (vs wanted for compat). From a purely design perspective, HTM's parser interprets <>, but does not offer a mechanism for escapement. That seems worth rectifying to me, but I worry about special-casing characters.

farskid commented 5 years ago

I also wanted the output in an unescaped way and as @pcr910303 mentioned, decode(htm(some jsx here)) worked very well.

matthewp commented 5 years ago

Ran into this as well. Please keep htm portable, I am using it in a web worker. 😀

radum commented 3 years ago

Hello, I was looking into this issue and I stumbled into this. My use case looks like this:

render(html`
<style>
.selector > * {
    padding-top: 0.75rem;
}
</style>

<h1>HTML</h1>
`);

And I use preact-render-to-string to convert it and use it in an 11ty file. Using ${'>'} still doesn't solve the problem and makes it impossible to use inline styles using >.

I understand the problem and I get why this should not really be fixed directly in HTM, but I wonder if there is a workaround for my issue above, or if we can import something extra to handle this scenario.

One way to fix my issue is to post-process the rendered output:

render(html`...`).replace(/&gt;/gi, ">");

But I would rather not change all of them in my output html.

Haroenv commented 1 year ago

For those use cases, as they are text, you likely could opt-out of htm, right?

render(html`
<style>
${`.selector > * {
    padding-top: 0.75rem;
}`}
</style>

<h1>HTML</h1>
`);
