ShayHill / docx2python

Extract docx headers, footers, (formatted) text, footnotes, endnotes, properties, and images.
https://docx2python.readthedocs.io/en/latest/
MIT License
163 stars, 34 forks

Latex equation output incorrect when html=True #30

Closed: usr3 closed this issue 2 years ago

usr3 commented 2 years ago

When < or > is present inside an equation, it is converted to an HTML entity, which is incorrect inside the <latex> tags.

The attached file, equations.docx, gives the following output: [[[['Professional Format', '<latex>CO32-&lt;CO2&lt;CO</latex>', 'Linear Format', '<latex>CO_3^{\left(2-\right)}&lt;CO_2&lt;CO</latex>']]]]

This is due to commit be6a7892bbd61750bbabb0391a234473af2a6eea in https://github.com/ShayHill/docx2python/blob/be6a7892bbd61750bbabb0391a234473af2a6eea/docx2python/docx_text.py#L152
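To make the report concrete, here is a minimal sketch of the behaviour described above. It only uses the standard library's `html.escape`; the helper `escape_text` and its `inside_equation` flag are hypothetical names for illustration and are not taken from docx2python's code.

```python
import html


def escape_text(text: str, inside_equation: bool) -> str:
    """Escape HTML-special characters, except inside equation text.

    Hypothetical helper for illustration; not docx2python's actual code.
    """
    if inside_equation:
        # Leave LaTeX untouched so tools that read the raw string see < and >.
        return text
    return html.escape(text, quote=False)


equation = r"CO_3^{\left(2-\right)}<CO_2<CO"

# Escaping everything reproduces the reported output (&lt; inside the tags).
print(f"<latex>{html.escape(equation, quote=False)}</latex>")

# Skipping the escape for equation text keeps the LaTeX intact.
print(f"<latex>{escape_text(equation, inside_equation=True)}</latex>")
```

The idea is simply that the escaping step needs to know whether it is looking at ordinary text or at equation text; how the library tracks that is up to the maintainer.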

ShayHill commented 2 years ago

Addressed. I'll have to take your word for it, because I'm not an expert on html. From my tests, it seems &lt; and < render the same in Edge and Firefox. Thank you.

Professional Format<br>
<code><latex>01x</latex></code><br>
<latex>01x</latex><br><br>

Linear Format<br>
<code><latex>\\int_{0}^{1}x</latex></code><br>
<latex>\\int_{0}^{1}x</latex><br><br>

Linear Format with <code>&lt;</code><br>
<code><latex>\\int0&lt;1x&lt;5</latex></code><br>
<latex>\\int0&lt;1x&lt;5</latex><br><br>

Linear Format with &lt;<br>
<code><latex>\\int0<1x<5</latex></code><br>
<latex>\\int0<1x<5</latex><br><br>

Linear Format with <code>&lt;&gt;</code><br>
<code><latex>\\int0&lt;&gt;1x&lt;&gt;5</latex></code><br>
<latex>\\int0&lt;&gt;1x&lt;&gt;5</latex><br><br>

Linear Format with &lt;&gt;<br>
<code><latex>\\int0<>1x<>5</latex></code><br>
<latex>\\int0<>1x<>5</latex><br><br>

Linear Format with <code>&lt;1x&gt;</code><br>
<code><latex>\\int0&lt;1x&gt;5</latex></code><br>
<latex>\\int0&lt;1x&gt;5</latex><br><br>

Linear Format with &lt;1x&gt;<br>
<code><latex>\\int0<1x>5</latex></code><br>
<latex>\\int0<1x>5</latex><br><br>
usr3 commented 2 years ago

> Addressed. I'll have to take your word for it, because I'm not an expert on html. From my tests, it seems &lt; and < render the same in Edge and Firefox. Thank you.

Thanks. Yes, you're correct: &lt; will render as < in the browser because it is a valid HTML entity. The problem is that, inside the <latex> tags, the entity prevents the equation from being parsed by libraries such as MathJax.
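For anyone stuck on a version that still escapes equation text, a possible workaround is to unescape only the content of the <latex> tags after extraction. This is a rough sketch using the standard `re` and `html` modules; `unescape_latex_spans` is a made-up name, and it assumes equations never contain a literal `</latex>`.

```python
import html
import re

# Matches one <latex>...</latex> span; non-greedy so adjacent spans stay separate.
LATEX_SPAN = re.compile(r"(<latex>)(.*?)(</latex>)", re.DOTALL)


def unescape_latex_spans(text: str) -> str:
    """Turn HTML entities back into characters, but only inside <latex> tags.

    Hypothetical post-processing helper; not part of docx2python.
    """
    return LATEX_SPAN.sub(
        lambda m: m.group(1) + html.unescape(m.group(2)) + m.group(3),
        text,
    )


sample = "<latex>\\int0&lt;1x&gt;5</latex> and a&lt;b"
print(unescape_latex_spans(sample))
```

Text outside the tags keeps its entities, so the rest of the HTML output stays valid.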

ShayHill commented 2 years ago

Good to know. Thank you.

From: usr3 Sent: Thursday, December 30, 2021 2:00 PM To: ShayHill/docx2python Subject: Re: [ShayHill/docx2python] Latex equation output incorrect when html=True (Issue #30)