Proposal: markdown parsing for react-native-render-html

Expensify / App

Welcome to New Expensify: a complete re-imagination of financial collaboration, centered around chat. Help us build the next generation of Expensify by sharing feedback and contributing to the code.

https://new.expensify.com

MIT License

3.47k stars 2.82k forks source link

Proposal: markdown parsing for react-native-render-html #3983

Closed jsamr closed 2 years ago

jsamr commented 3 years ago

This is a follow-up on a request from @roryabraham. I have investigated the options to integrate Markdown parsing a little further, and here are my findings:

Markdown parsing is hard. While XML or HTML markups are not context sensible, Markdown is. So implementing my own tokenizer is a no-go for a production-ready deliverable (although I might try the challenge as a hobby at some point).
Changing the structure of the DOM would bring too many breaking changes, so using remarkjs is a no-go.
The solution would be intermediary: use an open-source, well tested markdown tokenizer and plug that in htmlparser2 Tokenizer class to emit a DOM.

Chosing a Tokenizer

I've forked this benchmark to add micromark which I found very well structured and solid (via remark-html), and below are my findings (Intel i7-8809G, 32GB of RAM, Nodejs 14.16.0).

Average Ops per second

avg-ops-per-sec

Minmax parse time

minmax-parse-time

Average Throughput

avg-throughput

Conclusion

Markdown-it is the clear winner, since there is no official web assembly support in React Native. Other plus:

Great ecosystem with many plugins and GFM support including emojis;
Safe by default;
Great maintenance metrics (5 open issues).

Implementation Plan

Get inspiration from MarkdownIt.Renderer: consume a token tree from MarkdownIt.parse and invoke corresponding htmlparser2 callbacks while walking the tree.

I'll also need some help to assess which features you want to enable for Expensify.cash.

Package Design

I need to think of a new package design since I don't want @native-html/core to depend directly on markdown-it.

Testing Strategy

The parser will be tested against the official commonmark-spec repository.

MelvinBot commented 3 years ago

Triggered auto assignment to @johncschuster (AutoAssignerTriage), see https://stackoverflow.com/c/expensify/questions/4749 for more details.

quinthar commented 3 years ago

To help me understand, can you confirm:

Are you sure wasm isn't supported? @robertjchen is planning on using it for encryption, I think.
Can you benchmark this against our current custom markdown>html parser?
I thought the idea was translate directly from markdown to DOM, skipping HTML entirely. But this seems like it is just benchmarking various open source markdown>html converters. Can you restate the goal?

Thanks!

roryabraham commented 3 years ago

Nice, markdown-it looks like it has a very active community, which we love to see! I'm also excited about the rule customization and plugin architecture it offers.

I'll also need some help to assess which features you want to enable for Expensify.cash.

Let me know what questions you have and how I can help! I am envisioning a phased scope-of-work that might look something like this:

You work on a POC of the MD integration in react-native-render-html
Once you've got a working POC, you work on benchmarking MD parsing v.s. HTML parsing in react-native-render-html so we can get a clear idea of the performance implications there.
Next, assuming the benchmarks are promising, we work together on implementing direct MD rendering into Expensify.cash.
1. There's a good chance we'll need an internal engineer to work on this phase with you, so we can locally test any necessary API changes.
2. I think the most challenging aspect of that integration will be that we'll need to be careful to maintain backwards-compatibility so that any necessary API changes don't break older versions of the application.
Once we have a working implementation of MD parsing via react-native-render-html in Expensify.cash, we'll do benchmark testing of Expensify.cash before (with our current markdown <-> HTML converter) and analyze the performance implications before we decide to use it in production.

Also, as @quinthar has suggested above, if you could benchmark markdown-it, etc... against our current markdown -> HTML converter (which lives in our expensify-common repo) that would likely be a great first step before we dive into the phased implementation outlined above.

I thought the idea was translate directly from markdown to DOM, skipping HTML entirely.

This is my understanding as well.

jsamr commented 3 years ago

@quinthar

Are you sure wasm isn't supported? @robertjchen is planning on using it for encryption, I think.

My understanding is that WASM is not officially supported at the moment and there is an open ticket for Hermes engine, but you can use react-native-wasm to execute WASM in a WebView. Hence I'd rather wait for official support before considering markdown-wasm.

I thought the idea was translate directly from markdown to DOM, skipping HTML entirely. But this seems like it is just benchmarking various open source markdown>html converters. Can you restate the goal?

Yes, the goal is still to translate markdown to DOM! The benchmark is a quick glimpse at each parser capability, assuming that the "whichever-kind-of-syntax-tree-structure to HTML part" performance would be sensibly the same for every engine. I could rewrite the benchmark for accuracy, but honestly I looked for the fastest-to-implement path. To develop the statement in the OP, the plan is to use markdown-it tokenizer and plug it in a htmlparser2 Tokenizer derived class, and then pass this class reference to htmlparser2 Parser which in turns emits events consumed by domhandler to build up a DOM.

Can you benchmark this against our current custom markdown>html parser?

The thing is, all the tested markdown parsers claim compliance with CommonMark, and I doubt this ad-hoc ExpensiMark parser does. So to benchmark your parser, I would need to tailor tests scenarios and make sure the parsers are working in equal conditions. I'd be happy to do that in a preliminary "0" step!

@roryabraham

That sounds like a solid plan. As react-native-render-html development model is through consulting and sponsoring, would you go along with the idea of financing the initial exploratory steps? And by the way, if ExpensiMark turns out to be faster because it's tailored for your needs, I could also consider a POC based on it (however, its reliance on regex to identify tokens makes me doubt it will).

MelvinBot commented 3 years ago

Triggered auto assignment to @tgolen (Engineering), see https://stackoverflow.com/c/expensify/questions/4319 for more details.

robertjchen commented 3 years ago

My understanding is that WASM is not officially supported at the moment and there is an open ticket for Hermes engine, but you can use react-native-wasm to execute WASM in a WebView

To clarify, this is somewhat true- we won't be able to use pure WASM in React without the WebView workaround, but we can still compile markdown-wasm to the highly optimized asm.js JS subset (now confusingly under the umbrella of WebAssembly™️ ).

Theoretically this might net us performance results between markdown-it and pure WASM markdown-wasm. Just another option to consider 👍

jsamr commented 3 years ago

@robertjchen Thanks for pointing that out! I must confess that I'm not proficient in WASM nor asm.js. After reading markdown-wasm in surface, I realize that it's basically a wrapper around md4c, compiled in wasm. For our usecase, we would need to get the parse tree generated by md4c, and access the tree in memory with some JS bindings. I don't know if WASM offers this capability! Nonetheless, JSI (JavaScript Interface) offers this feature with C/C++ so that would probably be a better fit! However I am not a C/C++ developer (although I'd love to get my hands on), but I find the prospect very interesting. The Graal would be a transient render engine written in C++ with JSI bindings, but that is clearly a huge endeavor and obviously out of scope :rofl:

jsamr commented 3 years ago

@roryabraham @quinthar

Can you benchmark this against our current custom markdown>html parser?

I changed my mind and thought that if it does perform worse, that will be flagrant whether or not the same set of syntax rules are covered. I managed to incorporate ExpensiMark in the benchmark (although in a hacky way because it uses ES modules , I just copied the required sources).

As I had anticipated, ExpensiMark turns out to be pretty slow in comparison to other parsers (I'm still dumbfounded by the bad performance of remark). This is probably because of its regex-based parsing, which requires vastly more input traversals than a automaton "finite state machine" based parser.

quinthar commented 3 years ago

I would expect that over time we're going to want our own parser, to do nonstandard things -- we want to own our own grammar. But this is a huge testament to the power of WASM and we should absolutely be aiming for that. Can we just port ExpensiMark to C++ and compile to WASM? No need to use regexp.

jsamr commented 3 years ago

@quinthar If you can get your hands on C/C++, I'm not sure I see any advantage of using WASM over JSI. AFAIK WASM is still interpreted, while C/C++ will be run in an architecture-specific binary. Plus, JSI is the future standard for performant low-level react-native modules... An important thing to consider if you want to follow that scheme (porting ExpensiMark to C++ based on md4c), is converting back HTML to markdown. But there are good and fast C++ HTML parsers and that part should be much easier.

No hurry, but please let me know if you're still interested in markdown being supported directly in RNRH, most likely with markdown-it. mardown-it allows you to create new syntax rules via plugins so you shouldn't feel limited in that regard.

Hopefully at some point the whole transient render engine will be ported to C++ where we could use md4c for Markdown. In any case we would make sure you can reasonably easily plug-in your own parser implementation to generate a TRT and render that in RNRH.

MelvinBot commented 2 years ago

@jsamr, this Monthly task hasn't been acted upon in 6 weeks; closing.

If you disagree, feel encouraged to reopen it -- but pick your least important issue to close instead.