KaTeX / KaTeX

Fast math typesetting for the web.
https://katex.org
MIT License

update buildHTML to output HAST #800

Open sandhose opened 7 years ago

sandhose commented 7 years ago

Hi,

We're doing a markdown engine using remark with the rehype-katex plugin. The thing is, rehype-katex is using KaTeX's renderToString function, and parses the output using rehype-parse (which uses parse5 to parse the HTML string into HAST, rehype's AST).

This is really inefficient (I haven't done benchmarks, but it adds ~250Kb to the bundle), and could be improved by allowing rendering through a hyperscript-like function. This would make it possible to render KaTeX far more efficiently into frameworks like React or Vue.js.

I'll write a proof of concept today; I might not have the time to write the docs or tests though.

I know there are a lot of links up here because of our use case, but TL;DR: it would be nice to be able to render KaTeX directly through a hyperscript-like function.

kevinbarabash commented 7 years ago

@sandhose in https://github.com/Khan/react-components/blob/master/js/tex.jsx we use dangerouslySetInnerHTML to sidestep this issue in React. I'm not sure whether Vue.js has a similar mechanism.

sandhose commented 7 years ago

I'm not using Vue.js in this case, but the issue is much the same. You could avoid dangerouslySetInnerHTML by passing React.createElement (with a small wrapper) to the renderer as a hyperscript function. By doing this, you're creating real React elements, and you take advantage of React's virtual DOM.

In my case, the output of the markdown engine is at one stage represented as a HAST node (HAST is an HTML AST). The plugin that parses the math elements in the markdown passes the raw math input to KaTeX through the renderToString function, and then uses rehype-parse to parse the HTML string into a HAST node. This parse is heavy, and the parser itself adds ~250Kb to the bundle.

It could be avoided by directly transforming the KaTeX tree to a HAST tree, but buildTree doesn't seem to be publicly exported, and thus has no stability guarantee. I'm proposing here to provide a renderHyperscript function that renders using the provided hyperscript function. This would totally cover my case (because hastscript provides a hyperscript function to create a HAST node), and would be beneficial in frameworks with some kind of virtual DOM like React or Vue.js (they all provide a hyperscript-like API, either officially or through the community).
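To make the proposal concrete, here is a minimal sketch of what rendering through a caller-supplied hyperscript function could look like. The node shape (`tag`, `attrs`, `children`) and the stand-in `hastH` function are assumptions made for illustration; they are not KaTeX's internal format or an existing KaTeX API.

```javascript
// Hypothetical input node shape (an assumption, not KaTeX's internal tree):
//   { tag: "span", attrs: { className: ["mord"] }, children: [nodes or strings] }
function renderHyperscript(node, h) {
  if (typeof node === "string") {
    return node; // text passes through; the hyperscript function decides how to wrap it
  }
  const children = (node.children || []).map((child) => renderHyperscript(child, h));
  return h(node.tag, node.attrs || {}, children);
}

// Stand-in hyperscript function that builds HAST-like element objects.
// A real caller might pass hastscript or a small wrapper around React.createElement.
function hastH(tagName, properties, children) {
  return {
    type: "element",
    tagName,
    properties,
    children: children.map((c) => (typeof c === "string" ? { type: "text", value: c } : c)),
  };
}

const tree = {
  tag: "span",
  attrs: { className: ["katex"] },
  children: [{ tag: "span", attrs: { className: ["mord"] }, children: ["x"] }],
};

const hast = renderHyperscript(tree, hastH);
console.log(JSON.stringify(hast, null, 2));
```

The renderer never builds or parses an HTML string: the same walk can target HAST, React elements, or any other virtual DOM just by swapping the function passed as `h`.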

kevinbarabash commented 7 years ago

@sandhose I think you should be able to take the output of buildHTML and use that to generate a HAST tree. I had a look at https://github.com/syntax-tree/hast and it may make sense to modify buildHTML to produce a HAST tree. I don't see any reason for us to have our own HTML tree structure when a more standard one exists.

sandhose commented 7 years ago

It might be a good idea (and @wooorm would be pleased). I'll try to do something.

FYI I have a branch that implements hyperscript rendering, with a react example here (built version here)

sandhose commented 7 years ago

Well, it just got a lot harder with #807 because it means I have to re-parse the innerHTML in spans to render them (and it is expensive).

kevinbarabash commented 7 years ago

@sandhose could we not add an innerHTML property which is a string? https://github.com/syntax-tree/hast#properties seems to indicate that both attributes and properties are supported.

ronkok commented 7 years ago

it just got a lot harder with #807

Would it help if we wrapped every <svg> with a span with a descriptive class, as in:

<span class="rightarrow"><svg>...</svg></span>

Or perhaps write the class into the SVG?

<svg class="rightarrow">...</svg>

kevinbarabash commented 7 years ago

@sandhose if you're not modifying the nodes within the SVG is there really any benefit from the SVG being described as virtual DOM nodes as opposed to a string?

wooorm commented 7 years ago

@kevinbarabash Hi! 👋

could we not add an innerHTML property which is a string? https://github.com/syntax-tree/hast#properties seems to indicate that both attributes and properties are supported.

HAST is for HTML, so think of it as only “attributes” being supported, not DOM properties like the innerHTML setter.

I suggest against using raw HTML inside a virtual DOM, for the same reasons that React uses the name dangerouslySetInnerHTML — it’s dangerous and slow.

I know it’s not always possible, but using an object structure (like HAST, or your own) instead of building strings makes things great for non-server-side rendering!

kevinbarabash commented 7 years ago

@wooorm good point about perf during client-side renders. @ronkok what are your thoughts on building an AST for the inline SVG bits and then using createElement and appendChild to render them?

I'd want to try the HAST format for the non-SVG parts of the tree first and see how that goes before putting in the effort to convert the SVG parts.

ronkok commented 7 years ago

@kevinbarabash I'm all in favor of what you suggest. I may not be the best person to implement it. Let me look into it and get back to you.

kevinbarabash commented 7 years ago

@sandhose some of our nodes output document fragments. Is there a way to model that with HAST? Would it just be an array of HAST nodes?

wooorm commented 7 years ago

@kevinbarabash Yup! You can opt for an array of nodes.

Or if you’d like, returning a root node ({type: 'root', children: [...]}) is also fine, but root nodes shouldn’t be inserted somewhere else in a tree (only the top node may be a root).
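As a rough illustration of the two options, the sketch below models the same fragment once as an array and once as a root node. The helper names (`text`, `span`, `appendFragment`) are hypothetical and exist only for this example.

```javascript
// Small constructors for HAST-style nodes (hypothetical helpers for this sketch).
const text = (value) => ({ type: "text", value });
const span = (className, children) => ({
  type: "element",
  tagName: "span",
  properties: { className: [className] },
  children,
});

// Option 1: a fragment as a plain array of sibling nodes.
const fragmentAsArray = [span("mord", [text("a")]), span("mbin", [text("+")])];

// Option 2: the same siblings wrapped in a root node (valid only at the top of a tree).
const fragmentAsRoot = { type: "root", children: fragmentAsArray };

// Splice a fragment into a parent element, unwrapping a root node
// so that a root never ends up nested inside another node.
function appendFragment(parent, fragment) {
  const nodes = Array.isArray(fragment) ? fragment : fragment.children;
  parent.children.push(...nodes);
  return parent;
}

const base = appendFragment(span("base", []), fragmentAsRoot);
console.log(base.children.map((child) => child.properties.className[0]));
```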

cjh9 commented 6 years ago

It would be awesome to have some kind of public API to build vnode trees. Right now I'm doing:

KaTeX renderToString => fast HTML parser => build vnode tree in MithrilJS with hyperscript calls (recursively walk the tree).

Is there a way I could skip the parsing? Should I look internally for buildTree and modify the katex source or is this feature coming soon?

edemaine commented 6 years ago

@cjh9 If you don't need MathML, you could probably call buildHTML directly; if you want both, buildTree would be good. If you can successfully import them (#954 might get in your way, but we'd appreciate a fix to that), then they should just work, and return the existing internal node tree data structures.

If they're helpful, I don't see any reason not to expose buildTree, buildHTML, and buildMathML in katex.js's module.exports, presumably prefixed with __ to make it just as scary/unsupported as __parse (though these methods are currently probably more stable than __parse). Any objections?

cjh9 commented 6 years ago

@edemaine Sorry for the late response; yes, it would be awesome if they could be exposed in the distribution! In what format would __buildHTML return the tree: real HTML nodes or a more lightweight JSON format? Only the latter seems to integrate well with web workers.

edemaine commented 6 years ago

@cjh9 This is now available in the master branch, thanks to #1017. The nodes are returned in a custom nested JavaScript data structure (objects containing children array fields). Hope that helps!

cjh9 commented 6 years ago

@edemaine Super great! 🎉 And it is also serializable to JSON :) Would it also be possible to expose __buildTreeHTML if I don't need MathML? Not super important though, I can work around it :)

ry-randall commented 6 years ago

@cjh9 I'm guessing you already know this, but you should be able to leverage buildTreeHTML via buildTree. Would that work? Exposing just the HTML would probably require extracting some of the default options (i.e. https://github.com/Khan/KaTeX/blob/master/src/buildTree.js#L19).

edemaine commented 6 years ago

I think pure HTML export makes sense when you're rendering it in a custom way (e.g. SVG), as you're unlikely to also be able to include MathML in that setting. I could see either

  1. factoring out the Settings to Options conversion in buildTree so that we can write a new buildHTMLTree, or
  2. adding an option in Settings to prevent MathML creation.

Thoughts?

ry-randall commented 6 years ago

Hmm, I think 1. sounds good. My concern with 2. is how that would affect items downstream (i.e. the buildTree). Perhaps I don't understand it well enough though.

edemaine commented 6 years ago

@cjh9 The master branch now has (via #1022) __renderToHTMLTree that outputs just the HTML part. Also, we renamed the method you were using to __renderToDomTree for more consistent naming. Hope this helps!

cjh9 commented 6 years ago

@edemaine Sorry for the late reply, super great! You guys are awesome :D

kevinbarabash commented 6 years ago

After investigating the hast format some more I've concluded that it's not appropriate for our use case, in particular:

I would like to simplify our current in-memory HTML objects to be plain objects instead of classes, but I think that storing classes as an array and styles as an object is superior, especially for checking for the presence of particular styles or CSS classes, or for modifying them.

After we refactor those objects (and extract non-HTML props into an intermediate representation) they should be stable enough (and simple enough) that writing a translator from our HTML objects to hast should be trivial.
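A translator along those lines might look like the sketch below. The input shape (`tag`, `classes`, `style`, `children`) is an assumption about what the refactored plain objects could look like, not KaTeX's actual structure, and style keys are assumed to already be CSS property names such as `margin-right`.

```javascript
// Serialize a style object into a CSS declaration string.
// Assumes keys are already CSS property names (e.g. "margin-right").
function styleToString(style) {
  return Object.entries(style)
    .map(([prop, value]) => `${prop}:${value}`)
    .join(";");
}

// Translate a hypothetical plain-object node into a hast element.
// Assumed input: { tag, classes: string[], style: { prop: value }, children }
function toHast(node) {
  if (typeof node === "string") {
    return { type: "text", value: node };
  }
  const properties = {};
  if (node.classes && node.classes.length > 0) {
    properties.className = [...node.classes]; // hast stores className as an array
  }
  if (node.style && Object.keys(node.style).length > 0) {
    properties.style = styleToString(node.style); // hast keeps style as a string
  }
  return {
    type: "element",
    tagName: node.tag,
    properties,
    children: (node.children || []).map(toHast),
  };
}

const example = toHast({
  tag: "span",
  classes: ["mord", "mathit"],
  style: { "margin-right": "0.05em" },
  children: ["x"],
});
console.log(JSON.stringify(example));
```

Keeping classes as an array and styles as an object on the internal side preserves cheap lookups and edits; only the final translation step flattens the style into a string.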

wooorm commented 6 years ago

This is not entirely true: fragments can be stored in a root too, and className is an array! Finally, style could be discussed. It used to be an object in fact, and could be mapped to that again, pending further discussion.

kevinbarabash commented 6 years ago

@wooorm thanks for pointing out className. I should've read the section on "Property values" more closely.

There’s no special format for style.

What does that mean? Is it a string or an object?

Are there any examples of how fragments are dealt with? We'll never return a fragment so fragment support isn't a deal breaker.

wooorm commented 6 years ago

There’s no special format for style.

What does that mean? Is it a string or an object?

It used to be under discussion, but was removed in 2016. There are downsides to storing style as an object, because in some cases you need to parse styles, which means including quite a large library. In other cases, you need to stringify it, which is less of a problem.

Are there any examples of how fragments are dealt with? We'll never return a fragment so fragment support isn't a deal breaker.

Any document, whether it’s a complete one or a fragment, is stored in a root node. There’s no other handling for it. To be honest, now I’m not entirely sure what your use case is!

wooorm commented 5 years ago

@kevinbarabash Is there still interest in doing this? Are there reservations?

If so, I may be able to work on it in the coming weeks. Could you estimate the time involved in changing the underlying objects to a different format?