Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

Transform <math> tag to math block in markdown (with $ symbol) #336

Open winlp4ever opened 3 years ago

winlp4ever commented 3 years ago

Good morning, I'm using html2text for my projects. One problem is that html2text doesn't wrap formulas in $$. Is there any way, or if not yet, any plan, to transform <math> blocks into $...$ blocks in Markdown? Thank you.

jeremydouglass commented 3 years ago

Can you provide an example of an input and a desired output? Are you talking about HTML5 MathML?

https://www.tutorialspoint.com/html5/html5_mathml.htm

winlp4ever commented 3 years ago

Yes, here is an example. Input:

<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="{\displaystyle y=\mathbf {w} \cdot \mathbf {x} +b}">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>y</mi>
        <mo>=</mo>
        <mrow class="MJX-TeXAtom-ORD">
          <mi mathvariant="bold">w</mi>
        </mrow>
        <mo>&#x22C5;<!-- ⋅ --></mo>
        <mrow class="MJX-TeXAtom-ORD">
          <mi mathvariant="bold">x</mi>
        </mrow>
        <mo>+</mo>
        <mi>b</mi>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle y=\mathbf {w} \cdot \mathbf {x} +b}</annotation>
  </semantics>
</math>

Expected Output: $${\displaystyle y=\mathbf {w} \cdot \mathbf {x} +b}$$

So it's not just MathML. This example is taken from Wikipedia and, as you can see, there's LaTeX inside (I believe this is MathML + MathJax).

In general, I understand there are many ways to write math in HTML, but given that MathJax is becoming a standard for doing that nowadays, I wonder if it's possible to do that with html2text.

jeremydouglass commented 3 years ago

It sounds like perhaps you could do this as a preprocessing step? Pass it through your math pipeline of choice, then pass the result (HTML, but with $$ strings instead of <math> elements) into html2text, arriving at your desired output?

winlp4ever commented 3 years ago

That's what I do now with regex. Thank you.
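
For anyone landing here later, a regex preprocessing step of this kind might look roughly like the sketch below. It is not the code used in this thread: the function name math_to_dollars and both patterns are made up for illustration, and it only handles <math> elements that carry an <annotation encoding="application/x-tex"> child, as in the Wikipedia example above. Regexes are fragile on arbitrary HTML, so treat it as a starting point.

```python
import re

# Matches a whole <math>...</math> element (non-greedy, across newlines).
MATH_RE = re.compile(r"<math\b[^>]*>.*?</math>", re.DOTALL | re.IGNORECASE)

# Matches the TeX annotation that Wikipedia-style markup carries.
TEX_RE = re.compile(
    r'<annotation\b[^>]*encoding="application/x-tex"[^>]*>(.*?)</annotation>',
    re.DOTALL | re.IGNORECASE,
)


def math_to_dollars(source: str) -> str:
    """Replace each <math> element with $$TeX$$ when a TeX annotation exists.

    Hypothetical helper, not part of html2text. The result is still HTML,
    so character entities inside the annotation are left untouched.
    """

    def replace(match: re.Match) -> str:
        annotation = TEX_RE.search(match.group(0))
        if annotation is None:
            # No TeX available: leave the element as it was.
            return match.group(0)
        return "$$" + annotation.group(1).strip() + "$$"

    # A function replacement avoids backslash-escape handling in the TeX.
    return MATH_RE.sub(replace, source)
```

The returned string can then be handed to html2text.html2text(...) as usual. html2text may still escape or wrap some characters inside the formula, so the final Markdown is worth checking by eye.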

jeffkaufman commented 10 months ago

I think it would be great if html2text could handle <math>, since this project is upstream from a lot of others and the main browsers all support MathML now. For example, rss2email can't handle <math> (https://github.com/rss2email/rss2email/issues/242#issuecomment-1734572257) but could if html2text did.

Instead of extracting LaTeX from an <annotation encoding="application/x-tex"> block, which isn't guaranteed to be there, I think it would be better to directly convert the MathML. I think this would look like taking a dependency on py_asciimath and then adding special handling for <math>.

Would you be open to a PR doing this?
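
To make the idea of converting the MathML directly a bit more concrete, here is a toy sketch that walks the element tree with the standard library only. It does not use py_asciimath and says nothing about how html2text would integrate it. The names to_tex, mathml_to_markdown and OPERATORS are hypothetical, only a handful of elements are covered (mi, mn, mo, mfrac, msup, msub plus plain containers), and it assumes the <math> fragment is well-formed XML, which HTML input does not guarantee (named entities such as &sdot; would fail to parse).

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately tiny mapping; real MathML has many more operators.
OPERATORS = {"\u22c5": r"\cdot ", "\u00d7": r"\times "}


def local_name(tag: str) -> str:
    """Strip an XML namespace such as {http://www.w3.org/1998/Math/MathML}."""
    return tag.rsplit("}", 1)[-1]


def to_tex(node: ET.Element) -> str:
    """Convert one MathML element (and its subtree) to a TeX string."""
    name = local_name(node.tag)
    # Skip <annotation>: the point is to not rely on embedded TeX.
    children = [
        to_tex(child) for child in node if local_name(child.tag) != "annotation"
    ]
    text = (node.text or "").strip()
    if name == "mi":
        if node.get("mathvariant") == "bold":
            return r"\mathbf{" + text + "}"
        return text
    if name == "mo":
        return OPERATORS.get(text, text)
    if name == "mfrac" and len(children) == 2:
        return r"\frac{" + children[0] + "}{" + children[1] + "}"
    if name == "msup" and len(children) == 2:
        return children[0] + "^{" + children[1] + "}"
    if name == "msub" and len(children) == 2:
        return children[0] + "_{" + children[1] + "}"
    # math, semantics, mrow, mstyle and unknown containers: join the children;
    # leaf elements such as mn fall back to their text.
    return "".join(children) if children else text


def mathml_to_markdown(mathml: str) -> str:
    """Wrap the converted formula in $$ delimiters."""
    return "$$" + to_tex(ET.fromstring(mathml)) + "$$"
```

Run on the example from earlier in this thread (with the annotation ignored), this produces $$y=\mathbf{w}\cdot \mathbf{x}+b$$. A real implementation would need far broader element and operator coverage, which is the argument for delegating to an existing converter.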