Conal-Tuohy / VMCP-upconversion

Ferdinand von Mueller's correspondence upconversion from MS Word to TEI XML
Apache License 2.0

Tables in footnotes are flattened #47

Closed Conal-Tuohy closed 1 year ago

Conal-Tuohy commented 3 years ago

The tabular layout is apparently discarded by the OpenOffice converter when converting the Word document to OpenDocument format.

Reported by @LucasHorseshoeBend

This is not one for which I can think of a suitable workaround, but if it's not possible to preserve these layouts in notes, then I will have to explore other ways of doing it.


Conal-Tuohy commented 3 years ago

One reasonable work-around would be to encode the footnote not as a Word footnote, but simply as a sequence of paragraphs styled with a paragraph style called "note", and marking the note with a "bookmark". At the point in the text where the note should be anchored, you would insert a cross-reference to the note, using its bookmark identifier.

Re-encoding in this way any footnotes which contained tables should be fairly easy, so long as you could find them; unfortunately, there's no indication in the converted files that a table has been flattened in this way, so detecting the cases where re-encoding is necessary might need to be a manual process!

Conal-Tuohy commented 3 years ago

This batch converter (a Windows app) is supposed to be able to convert Word DOC files into DOCX, which could then be scanned to produce a list of documents which contain tables inside footnotes. http://www.multidoc-converter.com/en/index.html

LucasHorseshoeBend commented 3 years ago

Thanks for this link. I'll explore, but I don't yet know how to search a batch of files to find those that have a specific attribute. Searching inside a single file is easy, but selecting files by a specified attribute is something I have never been able to solve. Your comment has inspired me to try again. If I can solve that, I can probably use the same approach to identify files that include Times font, which would let me separate the files with printed text from the manuscript parts.

I will create a small set in .docx with a couple of files with such notes, and then play with them.

Best wishes Arthur


Conal-Tuohy commented 3 years ago

I converted the corpus of Word files to Word XML (DOCX) using the aforementioned multidoc-converter app. The example document discussed above is attached here in that XML rendition, zipped so that GitHub will accept it as an attachment: 59-04-01-final.zip (https://github.com/Conal-Tuohy/VMCP-upconversion/files/5715926/59-04-01-final.zip). This XML file appears to be a faithful rendition of the original, and it can be queried with an XPath expression to check whether it contains a table in a footnote. This means I will be able to easily generate a list of the files which have this feature, and we'll have at least a handle on the scale of the issue and what remedies will be practicable.

LucasHorseshoeBend commented 3 years ago

Dear Conal,

Clever. The unzipped .xml opens as such in Word and looks OK to me.

I hope that it will be easy to find what files are concerned: beyond my skill.

Best wishes Arthur


nielsklazenga commented 3 years ago

@Conal-Tuohy , .docx is the current Word format. Does your pipeline only work with .doc files, or is there another reason why we cannot just keep all the files in .docx format?

Conal-Tuohy commented 3 years ago

@nielsklazenga we could keep the files in Word's current format, but that wouldn't in itself solve the issue here.

The first step in the existing pipeline is to convert the Word documents from Word97 format into OpenDocument format (ODF), using the OpenOffice command-line tool. The remainder of the pipeline is designed to process those ODF files. ODF is a better standard than Microsoft's "Office Open XML" (DOCX) format, which is why I chose to build the pipeline on it. At that time, I wasn't aware of this issue with tables in footnotes, which has been discovered only recently. If I'd known, I might have opted to base the pipeline on DOCX, even though it's otherwise a less convenient format to work with.

Conal-Tuohy commented 3 years ago

I have identified the files which contain tables in footnotes.

It seems to me that if we can find a workable alternative way to encode such footnotes (i.e. without using Word's "footnote" feature), then this is a small enough number to make it a practicable task to re-encode those notes.

Conal-Tuohy commented 3 years ago

For my records, in case I have to do something similar again, with the corpus of Word XML files located in the folder /media/sf_VMCP, I used the following:

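# xmllint is part of the libxml2-utils package
apt install libxml2-utils
# xmllint exits non-zero when the XPath matches nothing, so -print runs only for files with a table in a footnote
find /media/sf_VMCP/ -type f -exec xmllint --xpath "//*[local-name()='footnote']//*[local-name()='tbl']" '{}' > /dev/null \; -print  >> /tmp/footnotes-in-tables.txt
# the output file also receives xmllint's own output, so keep only the file-name lines
grep "media" /tmp/footnotes-in-tables.txt
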
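
For anyone without xmllint to hand, the same check can be sketched in Python using only the standard library. This is an illustration rather than what was actually run: the function names are invented for the example, and it assumes (as the XPath above does) that matching on the local names `footnote` and `tbl` is enough to spot a table inside a footnote.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def has_table_in_footnote(root):
    """True if any element named 'tbl' sits inside an element named 'footnote'."""
    for element in root.iter():
        if local_name(element.tag) == "footnote":
            if any(local_name(d.tag) == "tbl" for d in element.iter()):
                return True
    return False


def files_with_footnote_tables(folder):
    """Yield paths of well-formed .xml files under folder that contain a match."""
    for path in sorted(Path(folder).rglob("*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        if has_table_in_footnote(root):
            yield path
```

Looping over `files_with_footnote_tables("/media/sf_VMCP")` and printing each path should give a list comparable to the one from the find command, though this sketch only looks at files whose names end in .xml.
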
LucasHorseshoeBend commented 3 years ago

Thanks for the list, Conal. I will try out your earlier suggestion about styling them on a couple of test pieces soon, but not in the next day or so. Best wishes Arthur

On 21 Dec 2020, at 05:52, Conal Tuohy wrote:

I have identified the files which contain tables in footnotes.

1850-9/1855/55-06-23-final.xml
1850-9/1859/59-04-01-final.xml
1860-9/1866/66-10-15b.xml
1860-9/1868/68-11-03-draft.xml
1870-9/1870/70-05-11a.xml
1870-9/1871/71-08-29.xml
1870-9/1872/72-01-24.xml
1870-9/1872/72-10-24.xml
1880-9/1886/86-08-05a-final.xml
1890-6/1892/92-01-29.xml
Mentions/1870-9/77-06-12Berry-AgentGeneral.xml
Mentions/1890-9/97-03-09Potter-Turner.xml

Conal-Tuohy commented 3 years ago

Last night I edited the above-mentioned document in Word, on a borrowed computer, and put a copy into the "Quarantine" folder, just to see how close it would turn out, and gauge how much programming would be needed to get it to 100% of what we need.

The result is, in Word, something that looks and even works quite similarly to the original footnote, though not quite identically.

The existing Word-to-TEI transformations produced this TEI http://vmcp.conaltuohy.com/tei/Mueller%20letters/Quarantine%20folder%20for%20problem%20files/59-04-01-final-footnote-table-test.xml

It's not quite there, but not far off:

So I would still want to take a quick look at the Word-to-ODF result just to check that the "remittance" bookmark is making the transition OK, but assuming it is, there are probably 3 or 4 more hours of programming work to get the transformations to convert the new "bookmark + hyperlink" note into the same TEI idiom as the "Word footnotes" notes.

Conal-Tuohy commented 3 years ago

Finally got back to this, and checking the OpenDocument XML I see the following snippet which includes the bookmark information needed (the <text:bookmark-start> and <text:bookmark-end> elements).

<text:p text:style-name="Footnote">
    <text:bookmark-start text:name="remittance"/>Documents relating to McCrae&apos;s remittance are filed with this letter, as follows:</text:p>
<text:p text:style-name="P5">Exploration Fund | Mr McCrae presents his compliments to Dr McAdam &amp; begs to enclose Subscription List for the fund shewing a receipt of One Pound 15/— for which Mr McCrae now begs to enclose a Postoffice order. | Court House | Kilmore | March 23 1859</text:p>
<text:p text:style-name="P6"/>
<text:p text:style-name="P5">Mr Jamieson, requested to allow the prefixed Subscription List &amp; letter to lie on the counter of the Colonial Bank of Australasia &amp; to call the attention of customers to the same | Andrew McCrae PM [Police Magistrate] | Kilmore | Dec 16 1858</text:p>
<text:p text:style-name="P6"/>
<text:p text:style-name="P7">EXPLORATION FUND COMMITTEE.</text:p>
<text:p text:style-name="P6">
    <text:span text:style-name="T4">His Honor </text:span>
    <text:span text:style-name="T6">Sir William F. Stawell</text:span>
    <text:span text:style-name="T4">, Chief Justice, Chairman.</text:span>
</text:p>
<text:p text:style-name="P6">
    <text:soft-page-break/>SUBSCRIPTION LIST.</text:p>
<table:table table:name="Table1" table:style-name="Table1">
    <table:table-column table:style-name="Table1.A"/>
    <table:table-column table:style-name="Table1.B"/>
    <table:table-column table:style-name="Table1.C" table:number-columns-repeated="2"/>
    <table:table-row table:style-name="Table1.1">
        <table:table-cell table:style-name="Table1.A1" office:value-type="string">
            <text:p text:style-name="P6">
                <text:span text:style-name="T5">SUBSCRIBER&apos;S NAME AND ADDRESS.</text:span>
            </text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.B1" table:number-columns-spanned="3" office:value-type="string">
            <text:p text:style-name="P8">AMOUNT.</text:p>
        </table:table-cell>
        <table:covered-table-cell/>
        <table:covered-table-cell/>
    </table:table-row>
    <table:table-row table:style-name="Table1.1">
        <table:table-cell table:style-name="Table1.A2" office:value-type="string">
            <text:p text:style-name="P2"/>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.B2" office:value-type="string">
            <text:p text:style-name="P3">£</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.C2" office:value-type="string">
            <text:p text:style-name="P3">s.</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.C2" office:value-type="string">
            <text:p text:style-name="P3">d.</text:p>
        </table:table-cell>
    </table:table-row>
    <table:table-row table:style-name="Table1.1">
        <table:table-cell table:style-name="Table1.A2" office:value-type="string">
            <text:p text:style-name="P6">Andrew McCrae PM. Kilmore</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.B2" office:value-type="string">
            <text:p text:style-name="Footnote">1</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.C2" office:value-type="string">
            <text:p text:style-name="P4">—</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.C2" office:value-type="string">
            <text:p text:style-name="P4">—</text:p>
        </table:table-cell>
    </table:table-row>
    <table:table-row table:style-name="Table1.1">
        <table:table-cell table:style-name="Table1.A2" office:value-type="string">
            <text:p text:style-name="P6">J. P. Jamieson</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.B2" office:value-type="string">
            <text:p text:style-name="Footnote">0</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.C2" office:value-type="string">
            <text:p text:style-name="Footnote">10</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.C2" office:value-type="string">
            <text:p text:style-name="Footnote">0</text:p>
        </table:table-cell>
    </table:table-row>
    <table:table-row table:style-name="Table1.1">
        <table:table-cell table:style-name="Table1.A2" office:value-type="string">
            <text:p text:style-name="P6">J. McPherson</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.B2" office:value-type="string">
            <text:p text:style-name="P2"/>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.C2" office:value-type="string">
            <text:p text:style-name="Footnote">
                <text:span text:style-name="T1">
                    <text:s text:c="2"/>
                </text:span>5</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.C2" office:value-type="string">
            <text:p text:style-name="P2"/>
        </table:table-cell>
    </table:table-row>
    <table:table-row table:style-name="Table1.1">
        <table:table-cell table:style-name="Table1.A2" office:value-type="string">
            <text:p text:style-name="P9">£</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.B6" office:value-type="string">
            <text:p text:style-name="Footnote">1</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.C6" office:value-type="string">
            <text:p text:style-name="Footnote">15</text:p>
        </table:table-cell>
        <table:table-cell table:style-name="Table1.C6" office:value-type="string">
            <text:p text:style-name="P1"/>
        </table:table-cell>
    </table:table-row>
</table:table>
<text:p text:style-name="P6">
    <text:span text:style-name="T4">This List to be forwarded, with the remittance, to the Hon. Treasurer, Dr. </text:span>
    <text:span text:style-name="T6">Wilkie</text:span>
    <text:span text:style-name="T4">, Collins Street, Melbourne.</text:span>
</text:p>
<text:p text:style-name="P6">
    <text:span text:style-name="T4">JOHN MACADAM, M.D., </text:span>
    <text:span text:style-name="T6">Hon. Secretary.</text:span>
</text:p>
<text:p text:style-name="extra_20_space">
    <text:bookmark-end text:name="remittance"/>
</text:p>

Also the hyperlink which points to the note is as you'd expect:

<text:a xlink:type="simple" xlink:href="#remittance" text:style-name="Internet_20_link" text:visited-style-name="Visited_20_Internet_20_Link">
    <text:span text:style-name="Internet_20_link">
        <text:span text:style-name="T3">5</text:span>
    </text:span>
</text:a>
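
As a rough cross-check that each such hyperlink really has a bookmark to land on, something like the following Python sketch could be run over the ODF content. It is not part of the existing pipeline; the function names are invented for the example, and it assumes that every internal link of the form `#name` in a document is meant as a note marker.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def bookmark_names(root):
    """Names of all <text:bookmark-start> elements in the tree."""
    return {
        el.get(f"{{{TEXT_NS}}}name")
        for el in root.iter(f"{{{TEXT_NS}}}bookmark-start")
    }


def note_links(root):
    """(target name, link text) for each internal <text:a> hyperlink."""
    links = []
    for a in root.iter(f"{{{TEXT_NS}}}a"):
        href = a.get(f"{{{XLINK_NS}}}href", "")
        if href.startswith("#"):
            links.append((href[1:], "".join(a.itertext()).strip()))
    return links


def resolve_note_links(root):
    """Split internal links into those that hit a bookmark and those that do not."""
    names = bookmark_names(root)
    links = note_links(root)
    resolved = [link for link in links if link[0] in names]
    dangling = [link for link in links if link[0] not in names]
    return resolved, dangling
```

Any entry in the second list would point to a link whose bookmark did not survive the Word-to-ODF conversion, which is the case worth checking by hand.
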
LucasHorseshoeBend commented 3 years ago

Thanks Conal

Before we go any further on this line let me consult the other editors. If the outcome is to have the text outside the footnotes there are easier ways to do it within ordinary editing, producing an effectively similar outcome.

I will look at the identified files as well so that we have a good idea of the consequences of a decision. I am not sure how quickly we will get a decision but I hope pretty quickly

Best wishes Arthur


Conal-Tuohy commented 3 years ago

The intended outcome is not to have the text outside the footnotes, no. The intention is, rather, to produce the exact same result (i.e. a TEI <note>) both from MS Word footnotes, as we currently do, and from this new bookmark-based style (which would be used only where it's necessary in order to include a table in the note).

LucasHorseshoeBend commented 3 years ago

OK, let me have a look at it again, and I will show you what I mean, or thought I meant!! I'll look at it tomorrow and discuss my interpretations with you before I consult.

Off to bed now.

Best wishes Arthur


Conal-Tuohy commented 3 years ago

OK. To be explicit, what happens with Word footnotes is that they are converted into TEI <note> elements; the <note> element appears at the point where the footnote marker appears in the Word document. e.g.

I had yesterday the pleasure to receive your letter of the 30 Ultimo
<note xml:id="ftn2" type="footnote" n="2">
<p rend="Footnote">Letter not found.</p>
</note>

My suggestion is to make some small tweaks to the transformation pipeline so that it also produces that exact form of TEI markup from footnotes which are represented in Word in a different manner; where the content of the footnote is indicated by it being labelled with a Word bookmark, and the footnote marker is just a hyperlink pointing to that bookmark. At the moment, the transformation pipeline doesn't do this (in fact it doesn't deal with MS Word bookmarks at all), but I think that'd be an easy tweak to make, and also easy to edit the 12 Word documents to match.
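
To give an idea of the first step such a tweak would need, here is a Python sketch that gathers the paragraphs and tables lying between a named bookmark's start and end markers in the ODF. It is only an illustration, not the pipeline's code: the function names are invented, and it assumes the note's blocks are consecutive children of one parent element, as they are in the "remittance" snippet above.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NAME = f"{{{TEXT_NS}}}name"


def contains_marker(block, marker, name):
    """True if block holds a <text:bookmark-start> or <text:bookmark-end> with this name."""
    return any(el.get(NAME) == name for el in block.iter(f"{{{TEXT_NS}}}{marker}"))


def bookmarked_blocks(body, name):
    """Return the consecutive children of body spanned by the named bookmark."""
    blocks = list(body)
    start = end = None
    for i, block in enumerate(blocks):
        if start is None and contains_marker(block, "bookmark-start", name):
            start = i
        if contains_marker(block, "bookmark-end", name):
            end = i
    if start is None or end is None or end < start:
        return []  # bookmark missing or malformed
    return blocks[start:end + 1]
```

The blocks returned for "remittance" would then be moved into a `<note>` placed where the hyperlink to `#remittance` occurs; that second step is left out of the sketch.
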

LucasHorseshoeBend commented 3 years ago

Thanks for the clarification; I have now looked at all of the detected files; no false positives. In reality, two of the files identified are most unlikely to survive the final edit for the edition, being so-called "mention" letters where I expect the relevant data to be buried in notes to other files, where I think it would not be necessary to retain the tabular format. At least one other file does not need a table, as the table created by the transcriber is a complicated way of setting out what can be displayed more easily without it.

My concern is that, based on the sample in the quarantine folder, the footnotes inserted in this way will not be numbered as part of the sequence, and indeed if the model can't be further tweaked it looks as if this example will contain two fn 5 numbers in the text, one superscripted in the ordinary way, the other inserted by the conversion/editing script. This will cause some confusion for users of the files. The current inserted "5" is an active link, that opens another copy of the whole file.

I hope this concern will be overcome, potentially editorially. Most of the notes concerned are keyed to the end of the file, or could be without too much distortion. And there is only one note per file containing a table. There are a couple, like the sample 59-04-01, where it makes more sense if the note is keyed to the point of the letter being glossed, when an unnumbered or evidently duplicated footnote will look odd, at least. I might be able to overcome that editorially by writing a different note at that point saying something like "see the un-numbered note below/ at the end of the letter."

However, the main problem will come if anyone wants to cite that note specifically in articles or other studies, so I still have residual concerns. If this could be overcome by devising a way that the notes containing tables are numbered sequentially it would be ideal, and I would not need to consult the other editors. If that is incompatible with the structure of the files and your solution, I do need to consult, as it is not appropriate to act unilaterally. I would say we must manage these files to use a table in a footnote, and commend a method of drawing attention to the existence of the relevant note by inserting a regular note where it needs to be. I would illustrate by tweaking and then using the test file.

Best wishes Arthur


Conal-Tuohy commented 3 years ago

@LucasHorseshoeBend, regarding your point about sequential numbering: yes, the note numbers as given in the Word file should, I think, just be discarded by the transformation pipeline, which can easily renumber the entire sequence of notes (of both kinds) automatically.
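
A minimal Python sketch of that renumbering, applied to the TEI output, might look like this. It is an illustration, not the pipeline's own code: the function name is invented, it assumes the notes are in the TEI namespace and carry type="footnote" as in the example earlier in this thread, and it assumes nothing else in the file points at the old xml:id values.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def renumber_footnotes(root):
    """Number every footnote-type <note> in document order, starting at 1."""
    count = 0
    for note in root.iter(f"{{{TEI_NS}}}note"):
        if note.get("type") != "footnote":
            continue  # leave other kinds of note untouched
        count += 1
        note.set("n", str(count))
        note.set(XML_ID, f"ftn{count}")
    return count
```

Because the notes are visited in document order, a note that came from a bookmark and one that came from a Word footnote would simply take the next number in the same sequence.
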

The value of note numbers in the published web pages for supporting citation is a good point to keep in mind, too, thank you.

LucasHorseshoeBend commented 2 years ago

Thanks for the reminder via issue 54.

I had been busy doing other things and had forgotten that I was going to look at this using the extracted files. I'll add it to my list of things to do soon so I don't bypass it again. I will need to get my head around the Word bookmark feature which I have never used.

LucasHorseshoeBend commented 1 year ago

Handled editorially. Closed.