Open winlp4ever opened 3 years ago
Can you provide an example of an input and a desired output? Are you talking about HTML5 MathML?
Yes, here is an example. Input:
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="{\displaystyle y=\mathbf {w} \cdot \mathbf {x} +b}">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>y</mi>
        <mo>=</mo>
        <mrow class="MJX-TeXAtom-ORD">
          <mi mathvariant="bold">w</mi>
        </mrow>
        <mo>⋅<!-- ⋅ --></mo>
        <mrow class="MJX-TeXAtom-ORD">
          <mi mathvariant="bold">x</mi>
        </mrow>
        <mo>+</mo>
        <mi>b</mi>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle y=\mathbf {w} \cdot \mathbf {x} +b}</annotation>
  </semantics>
</math>
Expected Output: $${\displaystyle y=\mathbf {w} \cdot \mathbf {x} +b}$$
So it's not just MathML. This example is taken from Wikipedia and, as you can see, there's LaTeX inside (I believe this is MathML + MathJax).
In general, I understand there are a lot of ways to write math in HTML, but given that MathJax is becoming a standard for this nowadays, I wonder if it's possible to do that with html2text.
It sounds like perhaps you could do this as a preprocessing step? Pass it through your math pipeline of choice, then pass the result -- HTML, but with $$ strings instead of <math> elements -- to html2text.
That's what I do now with regex. Thank you.
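For anyone landing here, the regex preprocessing step discussed above might look roughly like the sketch below. It is not the poster's actual code: the names `MATH_RE`, `ANNOTATION_RE`, and `math_to_dollars` are made up for illustration, and it assumes each `<math>` element carries an `<annotation encoding="application/x-tex">` child, as in the Wikipedia example. Elements without one are left untouched.

```python
import html
import re

# Matches a whole <math>...</math> element, non-greedily, across lines.
MATH_RE = re.compile(r"<math\b[^>]*>.*?</math>", re.DOTALL | re.IGNORECASE)

# Captures the TeX source that Wikipedia-style markup embeds in the element.
ANNOTATION_RE = re.compile(
    r'<annotation\b[^>]*encoding="application/x-tex"[^>]*>(.*?)</annotation>',
    re.DOTALL | re.IGNORECASE,
)


def math_to_dollars(page: str) -> str:
    """Replace each <math> element that has a TeX annotation with $$...$$."""

    def replace(match: re.Match) -> str:
        annotation = ANNOTATION_RE.search(match.group(0))
        if annotation is None:
            # No TeX available: leave the element as it was.
            return match.group(0)
        # Undo entity escaping such as &lt; or &amp; inside the annotation.
        return "$$" + html.unescape(annotation.group(1).strip()) + "$$"

    # A function replacement is inserted literally, so backslashes in the
    # LaTeX are not interpreted as regex escapes.
    return MATH_RE.sub(replace, page)
```

The result can then be handed to `html2text.html2text(...)` as usual. html2text may still escape or re-wrap characters inside the formula, so the output is worth checking on real pages.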
I think it would be great if html2text could handle <math>, since this project is upstream from a lot of others and the main browsers all support MathML now. For example, rss2email can't handle <math> (https://github.com/rss2email/rss2email/issues/242#issuecomment-1734572257) but could if html2text did.
Instead of extracting LaTeX from an <annotation encoding="application/x-tex"> block, which isn't guaranteed to be there, I think it would be better to directly convert the MathML. I think this would look like taking a dependency on py_asciimath and then adding special handling for <math>.
Would you be open to a PR doing this?
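To make "directly convert the MathML" concrete, here is a very small sketch that walks the element tree with the standard library only. It does not use py_asciimath and says nothing about that library's API. The names `OPERATORS`, `local_name`, and `to_latex` are hypothetical, and only a handful of elements (`mi`, `mn`, `mo`, `mfrac`, `msup`, `msub`, plus grouping elements) are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical operator table for this sketch; a real converter needs far more.
OPERATORS = {"\u22c5": r"\cdot ", "\u00d7": r"\times ", "\u2212": "-"}

# Annotations carry alternate encodings, not presentation content, so skip them.
SKIPPED = ("annotation", "annotation-xml")


def local_name(element):
    """Strip the XML namespace, e.g. '{http://...}mi' -> 'mi'."""
    return element.tag.rsplit("}", 1)[-1]


def to_latex(element):
    """Convert a small subset of presentation MathML to a LaTeX string."""
    name = local_name(element)
    children = [
        to_latex(child) for child in element if local_name(child) not in SKIPPED
    ]
    text = (element.text or "").strip()

    if name == "mi":
        if element.get("mathvariant") == "bold":
            return r"\mathbf{" + text + "}"
        return text
    if name == "mn":
        return text
    if name == "mo":
        return OPERATORS.get(text, text)
    if name == "mfrac" and len(children) == 2:
        return r"\frac{" + children[0] + "}{" + children[1] + "}"
    if name == "msup" and len(children) == 2:
        return children[0] + "^{" + children[1] + "}"
    if name == "msub" and len(children) == 2:
        return children[0] + "_{" + children[1] + "}"
    # math, semantics, mrow, mstyle and anything unknown: join the children.
    return "".join(children)
```

Calling `to_latex(ET.fromstring(mathml_string))` on a well-formed `<math>` element returns the LaTeX body, which would still need wrapping in `$$`. Real pages are often not well-formed XML, so an actual html2text integration would more likely hook into its existing HTML parser than use ElementTree.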
Good morning, I'm using html2text for my projects. There's one problem, which is that html2text doesn't 'case'