ShayHill / docx2python

Extract docx headers, footers, (formatted) text, footnotes, endnotes, properties, and images.
https://docx2python.readthedocs.io/en/latest/
MIT License
164 stars 35 forks

docx2python can't read the mathematical equation #15

Closed sreeroopnaidu closed 3 years ago

sreeroopnaidu commented 3 years ago

It returns an empty array: []

ShayHill commented 3 years ago

Docx2Python v2 now recognizes <m:t> elements, so it will capture some information from equations. For equations in Linear format, Docx2Python v2 will export valid LaTeX.

Equations in "Professional" format will not return anything useful. A simple integral from 0 to 1 would return "01x". It might be straightforward to write a parser to replace

Someone might come along behind me and do it, but there's little return in going down that road (it is pretty much impossible to test comprehensively), since Word will easily convert all equations in a document to "Linear" format. These now export nicely from Docx2Python v2. That same integral in Linear format will export as:

'\\int_{0}^{1}x'

Here's a peek at the XML for a summation in Professional format. The information is there if anyone wants to extend this module with a parser. I'd advise against it, given the testing issues mentioned above.

<m:nary>
    <m:naryPr>
        <m:chr m:val="∑"/>
        <m:limLoc m:val="subSup"/>
        <m:ctrlPr>
            <w:rPr>
                <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/>
                <w:i/>
            </w:rPr>
        </m:ctrlPr>
    </m:naryPr>
    <m:sub>
        <m:r>
            <w:rPr>
                <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/>
            </w:rPr>
            <m:t>
                0
            </m:t>
        </m:r>
    </m:sub>
    <m:sup>
        <m:r>
            <w:rPr>
                <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/>
            </w:rPr>
            <m:t>
                1
            </m:t>
        </m:r>
    </m:sup>
    <m:e>
        <m:r>
            <w:rPr>
                <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/>
            </w:rPr>
            <m:t>
                x
            </m:t>
        </m:r>
    </m:e>
</m:nary>
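For anyone curious what such a parser might look like, here is a minimal sketch using only the standard library. It is not part of docx2python: `nary_to_latex`, `run_text`, and the `NARY_COMMANDS` table are hypothetical names made up for this illustration, and the code handles only a flat `<m:nary>` element like the one above. Nested constructs, fractions, matrices, and everything else in Professional format are ignored, which is exactly where the testing burden would come from.

```python
import xml.etree.ElementTree as ET

# Office Math Markup Language namespace used by the m: prefix.
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
NS = {"m": M_NS}

# Hypothetical lookup table; only a few operators, for illustration.
NARY_COMMANDS = {"∑": r"\sum", "∫": r"\int", "∏": r"\prod"}


def run_text(parent):
    """Join the stripped text of every m:t element under parent."""
    if parent is None:
        return ""
    return "".join((t.text or "").strip() for t in parent.iterfind(".//m:t", NS))


def nary_to_latex(nary):
    """Convert one flat m:nary element to LaTeX (hypothetical helper)."""
    chr_elem = nary.find("m:naryPr/m:chr", NS)
    # Assumption: when m:chr is absent, the operator is an integral sign.
    symbol = chr_elem.get(f"{{{M_NS}}}val") if chr_elem is not None else "∫"
    command = NARY_COMMANDS.get(symbol, symbol)
    lower = run_text(nary.find("m:sub", NS))
    upper = run_text(nary.find("m:sup", NS))
    body = run_text(nary.find("m:e", NS))
    return f"{command}_{{{lower}}}^{{{upper}}}{body}"


# The summation from above, trimmed to the parts the sketch reads.
SAMPLE = """
<m:nary xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">
    <m:naryPr><m:chr m:val="∑"/><m:limLoc m:val="subSup"/></m:naryPr>
    <m:sub><m:r><m:t> 0 </m:t></m:r></m:sub>
    <m:sup><m:r><m:t> 1 </m:t></m:r></m:sup>
    <m:e><m:r><m:t> x </m:t></m:r></m:e>
</m:nary>
"""

print(nary_to_latex(ET.fromstring(SAMPLE.strip())))  # \sum_{0}^{1}x
```

The sketch flattens everything under `<m:sub>`, `<m:sup>`, and `<m:e>` into plain text, so a nested equation would lose its structure; a real parser would need to recurse over each child element type.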

Thank you, sreeroopnaidu.

usr3 commented 3 years ago

@ShayHill Is it possible to add delimiters around the exported LaTeX so that it can be identified as an equation? Something like what this library does: https://github.com/hrushikeshrv/docxlatex#usage

ShayHill commented 3 years ago

What delimiter do you suggest?


usr3 commented 3 years ago

What delimiter do you suggest?

We could use a delimiter similar to the one used for images, footnotes, etc. It works really well with regex.

----latex e = mc^2---- or

----equation x = {-b \pm \sqrt{b^2-4ac} \over 2a}----
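To illustrate why such a delimiter is convenient with regex, here is a small sketch. The `----latex ...----` and `----equation ...----` forms are only the proposal from this comment, not a confirmed docx2python output format, and `EQUATION_PATTERN` is a name invented for the example.

```python
import re

# Hypothetical delimiter format taken from the suggestion above; it is not
# confirmed as docx2python's actual output.
EQUATION_PATTERN = re.compile(r"----(?:latex|equation) (.+?)----")

text = (
    "Mass and energy: ----latex e = mc^2---- "
    r"Roots: ----equation x = {-b \pm \sqrt{b^2-4ac} \over 2a}----"
)

# findall returns the captured group, i.e. the LaTeX between the delimiters.
equations = EQUATION_PATTERN.findall(text)
for equation in equations:
    print(equation)
```

The non-greedy `.+?` stops at the first closing `----`, so single hyphens inside an equation are fine, but an equation that itself contains four consecutive hyphens would be cut short.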

ShayHill commented 3 years ago

I am going to upload v2.0 to PyPI by the end of November. It will include a delimiter for equations.

Still deciding between what you suggest and .


ShayHill commented 3 years ago

Thank you very much. I will look into this.

usr3 wrote on Saturday, November 6, 2021 3:47 AM:

Just to report, the LaTeX being returned contains an extra \ for every backslash, which breaks the equation. For instance, B=\left[\begin{matrix}-1&0\\0&-1\\\end{matrix}\right] becomes B=\\left[\\begin{matrix}-1&0\\\\0&-1\\\\\\end{matrix}\\right]
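One thing worth ruling out with a report like this (an assumption; the thread does not confirm the cause) is whether the doubled backslashes are in the string itself or only in how Python displays it. The quoted form '\\int_{0}^{1}x' shown earlier is what Python's `repr` produces for a string holding a single backslash. A quick check, using a plain string literal rather than real docx2python output:

```python
# Stand-in for an exported equation; a raw string holding ONE backslash.
latex = r"\int_{0}^{1}x"

print(repr(latex))        # '\\int_{0}^{1}x'  (repr escapes the backslash)
print(latex)              # \int_{0}^{1}x     (the actual characters)
print(latex.count("\\"))  # 1
```

If `print` on the extracted text shows single backslashes, the string is fine and the doubling is only display escaping, as seen when echoing a value in the REPL or printing a list that contains it.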


usr3 commented 2 years ago

@ShayHill Have you decided on a delimiter for equations? I can send a PR with the delimiter you suggest.

ShayHill commented 2 years ago

I like with the entire construction as a separate run, the way links work now. Not sure if there’s an xml container object around mt elements though, so it might not be straightforward to implement this.

A PR would be great.


usr3 commented 2 years ago

@ShayHill Not sure if it's the right approach, but I sent PR #28, which uses the parent to get the LaTeX. insert_text_as_new_run should also work.