JohannesKaufmann / html-to-markdown

⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.
MIT License
843 stars 82 forks

🐛 Bug: Support MathJax custom tags #50

Open ljrk0 opened 2 years ago

ljrk0 commented 2 years ago

Describe the bug MathJax is a JavaScript library that lets authors embed math in HTML using custom delimiters such as $...$; the content is then rendered as, e.g., MathML or whatever the browser supports.

Depending on the Markdown implementation, math is either not supported at all or supported directly through the same syntax. Either way, it would probably make the most sense to keep $...$ expressions intact and not escape the strings inside them. While a simple filter for that would certainly work, MathJax allows configuring delimiters other than $...$ for inline math and $$...$$ for display math, e.g., from the article https://math.andrej.com/2007/09/28/seemingly-impossible-functional-programs/:

<script>
window.MathJax = {
  tex: {
    tags: "ams",
    inlineMath: [ ['$','$'], ['\\(', '\\)'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
  },
  options: {
    skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code']
  },
  loader: {
    load: ['[tex]/amscd']
  }
};
</script>

This would necessitate parsing JavaScript, though ...

HTML Input

some formula: $\lambda$

Generated Markdown

some formula: $\\lambda$

Expected Markdown

some formula: $\lambda$

Additional context This filter (or "unfilter") could be activated only if MathJax is detected and disabled otherwise. Further, as mentioned earlier, more sophisticated parsing of the HTML could be used to detect the precise math delimiters in use, or they could at least be made configurable.
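
To make the proposed "unfilter" concrete, here is a minimal Go sketch of the idea: split the text into math and non-math segments using a configurable list of delimiter pairs, and escape only the non-math parts. Everything here is hypothetical and not part of html-to-markdown: the `Delim` and `Segment` types, `SplitMath`, `EscapeKeepingMath`, and the tiny `escaper` that stands in for the library's real escaping logic.

```go
package main

import (
	"fmt"
	"strings"
)

// Delim is a hypothetical open/close pair, mirroring the shape of
// MathJax's inlineMath/displayMath configuration entries.
type Delim struct{ Open, Close string }

// Segment is a piece of the input; Math is true for a math span
// (delimiters included).
type Segment struct {
	Text string
	Math bool
}

// SplitMath scans s left to right and splits it into math and plain segments.
// Delimiters are tried in the given order, so list "$$" before "$".
// An opener without a closer, or with an empty body, is kept as plain text.
func SplitMath(s string, delims []Delim) []Segment {
	var segs []Segment
	var plain strings.Builder
	flush := func() {
		if plain.Len() > 0 {
			segs = append(segs, Segment{Text: plain.String()})
			plain.Reset()
		}
	}
	for i := 0; i < len(s); {
		matched := false
		for _, d := range delims {
			if !strings.HasPrefix(s[i:], d.Open) {
				continue
			}
			bodyStart := i + len(d.Open)
			end := strings.Index(s[bodyStart:], d.Close)
			if end <= 0 { // no closer found, or empty body
				continue
			}
			stop := bodyStart + end + len(d.Close)
			flush()
			segs = append(segs, Segment{Text: s[i:stop], Math: true})
			i = stop
			matched = true
			break
		}
		if !matched {
			plain.WriteByte(s[i])
			i++
		}
	}
	flush()
	return segs
}

// escaper is an assumed stand-in for the converter's Markdown escaping.
var escaper = strings.NewReplacer(`\`, `\\`, `*`, `\*`, `_`, `\_`)

// EscapeKeepingMath escapes plain text but passes math spans through untouched.
func EscapeKeepingMath(s string, delims []Delim) string {
	var b strings.Builder
	for _, seg := range SplitMath(s, delims) {
		if seg.Math {
			b.WriteString(seg.Text)
		} else {
			b.WriteString(escaper.Replace(seg.Text))
		}
	}
	return b.String()
}

func main() {
	delims := []Delim{{"$$", "$$"}, {"$", "$"}, {`\(`, `\)`}}
	fmt.Println(EscapeKeepingMath(`some formula: $\lambda$ and a_b`, delims))
	// prints: some formula: $\lambda$ and a\_b
}
```

Note that this naive scan would also treat text such as "$5 and $10" as a math span, which is one reason the filter would probably need to be opt-in or tied to detecting MathJax on the page.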

JohannesKaufmann commented 2 years ago

I don't think getting the content between the $ signs will always work, as the math can also be server-side-rendered. Luckily, it seems like both MathJax and KaTeX (also) support the <math> tag.

So a math plugin would need to support both methods:

it will typically have a $\lambda$-expression as argument.
<mjx-assistive-mml unselectable="on" display="inline">
  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <mi>λ</mi>
  </math>
</mjx-assistive-mml>
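
For the second (rendered) form, a plugin would first have to find the `<math>` elements. As a rough sketch only, the following uses Go's standard `encoding/xml` tokenizer to collect the text inside each `<math>` element of a well-formed fragment. The function name `MathText` is hypothetical, real-world HTML would need a proper HTML parser instead (e.g. `golang.org/x/net/html`), and turning MathML back into TeX is out of scope here.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// MathText returns, for each <math> element in an XML-well-formed fragment,
// the concatenated (whitespace-trimmed) character data found inside it.
func MathText(fragment string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(fragment))
	var out []string
	var cur strings.Builder
	depth := 0 // nesting depth inside the current <math>; 0 means outside
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if depth > 0 || t.Name.Local == "math" {
				depth++
			}
		case xml.EndElement:
			if depth > 0 {
				depth--
				if depth == 0 { // the <math> element just closed
					out = append(out, cur.String())
					cur.Reset()
				}
			}
		case xml.CharData:
			if depth > 0 {
				cur.WriteString(strings.TrimSpace(string(t)))
			}
		}
	}
}

func main() {
	fragment := `<mjx-assistive-mml unselectable="on" display="inline">
  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <mi>λ</mi>
  </math>
</mjx-assistive-mml>`
	texts, err := MathText(fragment)
	fmt.Println(texts, err)
}
```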

I won't add this plugin anytime soon, as it would be a lot of work. But this plugin should exist! Ideally maintained by someone better at math than me 😅

I'm planning a v2 of the library. Maybe I will add it then...


You could already help by collecting various snippets from websites you encounter. This should cover a variety of uses (e.g. client-side-rendering, server-side-rendering, different libraries, content that looks like math but is NOT, ...)

See this file as an example. It follows this pattern:

<!-- https://example.com/page1 -->
<div>snippet 1</div>

<hr />

<!-- https://example.com/page1 -->
<p>snippet 2</p>

<hr />

...
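
A file in this pattern is easy to consume from a test. The sketch below splits such a file on `<hr />` and pulls the source URL out of the leading comment. The `Snippet` type and `ParseSnippets` function are hypothetical names for illustration, not existing helpers in the repository, and a snippet that itself contains an `<hr>` would be split incorrectly by this simple approach.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Snippet pairs the source URL from the leading HTML comment
// with the HTML that follows it.
type Snippet struct {
	Source string
	HTML   string
}

var (
	// hrRe matches the separator between snippets (<hr>, <hr/>, <hr />).
	hrRe = regexp.MustCompile(`(?i)<hr\s*/?>`)
	// commentRe captures the content of a comment at the start of a snippet.
	commentRe = regexp.MustCompile(`(?s)^\s*<!--\s*(.*?)\s*-->`)
)

// ParseSnippets splits a snippet file into its individual entries.
func ParseSnippets(file string) []Snippet {
	var out []Snippet
	for _, part := range hrRe.Split(file, -1) {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		var s Snippet
		if m := commentRe.FindStringSubmatch(part); m != nil {
			s.Source = m[1]
			part = strings.TrimSpace(part[len(m[0]):])
		}
		s.HTML = part
		out = append(out, s)
	}
	return out
}

func main() {
	file := "<!-- https://example.com/page1 -->\n<div>snippet 1</div>\n\n<hr />\n\n" +
		"<!-- https://example.com/page1 -->\n<p>snippet 2</p>\n"
	for _, s := range ParseSnippets(file) {
		fmt.Printf("%s => %s\n", s.Source, s.HTML)
	}
}
```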

ljrk0 commented 2 years ago

Thanks for implementing #49 so quickly!

Yeah, MathJax supports LaTeX-style input, MathML, and AsciiMath. Converting MathML to Markdown, however, is probably quite a lot of work. Simply "passing through" dollar signs, if so configured in the scripts, may work well enough for most use cases though?

I've just noticed that pandoc can do just the thing:

pandoc --from=html+tex_math_dollars+tex_math_single_backslash+tex_math_double_backslash \
       --to=markdown \
       --output=foo.md \
       input.html

You can also choose --to=html to convert e.g., $\lambda. \dots$ to:

<span class="math inline"><em>λ</em><em>i</em>.…</span>

This works well enough for my use cases for now. Adding real $ support is quite tricky, especially when it comes to finding the closing delimiter etc.

Regardless, I will collect examples I stumble upon :)