Conal-Tuohy / VMCP-upconversion

Ferdinand von Mueller's correspondence upconversion from MS Word to TEI XML
Apache License 2.0

Oddity in "title"; false positives for German #40

Closed LucasHorseshoeBend closed 3 years ago

LucasHorseshoeBend commented 7 years ago

I have been looking at "final" files, selecting by "German". I then chose "1858" in date and found 10 files (as of today; this might in principle change). Scanning down the set of files that show up in the right-hand pane shows some with no title other than the standard suffix " ... [tei symbol]". None of these, for example http://vmcp.conaltuohy.com/xtf/view?docId=tei/1850-9/1858/58-10-07a-final.xml, contains any non-English text. Any idea what is causing this false positive for German? Why no name? Are the two issues related?

Unless this is simple, leave it until more pressing display problems (like underlines, and the alternate font indicating printed rather than manuscript components of a text) are solved.

LucasHorseshoeBend commented 3 years ago

The "title" issue and the false positive for German are probably distinct issues. The title that displays in http://vmcp.conaltuohy.com/xtf/view?docId=tei/Mueller%20letters/1850-9/1858/58-10-07a-final.xml is disconcerting, as it is counterintuitive to have to click upon [...] to open the file. There are two others with the same feature in the 1858 files: http://vmcp.conaltuohy.com/xtf/view?docId=tei/Mueller%20letters/1850-9/1858/58-11-00-final.xml and http://vmcp.conaltuohy.com/xtf/view?docId=tei/Mueller%20letters/1850-9/1858/58-12-00a-final.xml

I can't pick up a common feature of those files in Word. There are eight 1858 files coded German that do have German text and behave as expected. The three identified above are the only files in 1858 where this issue arises.

I did spot checks: 1840s folders, no such cases when German selected, nor when English selected

1868 folder: one case of the problem when faceted as English, and the file is in English (http://vmcp.conaltuohy.com/xtf/view?docId=tei/Mueller%20letters/1860-9/1868/68-12-06-draft.xml), BUT the same file appears with the same "title" when I select the German facet!!

So there is something odd about the recognition of languages also.

LucasHorseshoeBend commented 3 years ago

Resaving did not eliminate the "title" appearing as [...]. The four files all report as being both English and German; I do not understand this behaviour.

Conal-Tuohy commented 3 years ago

Thanks for your detective work here, @LucasHorseshoeBend; based on your analysis I have got to the bottom of the missing title and the language mis-classification, and they are indeed related, as you surmised.

The immediate cause of the problem in 58-10-07a-final is that the letter contains an empty trailing paragraph which is styled with the style t-letter.

The chain of events leading from that empty paragraph to the text being classified as German and also having a missing title is this:

  1. The paragraphs whose style names begin with t- are assumed to be translations (into English), and the remainder of the letter is, by elimination, assumed to be in German. So that single blank t-letter para is what causes all the text of the letter to be tagged as German.
  2. Because XTF irritatingly requires every TEI document to have a title, which these letters don't generally have, my pipeline has a step which adds a heading which it derives from the opening lines of the letter (an "incipit") suffixed with an ellipsis. In order to give the documents English language titles, the incipit includes only text which has been tagged as English, but in this case the only English language paragraph is that empty t-letter paragraph, so the incipit also ends up empty.
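The two steps above can be mimicked in a few lines. This is only an illustrative sketch in Python, not the pipeline's actual code: the `Para` type, the function names, the eight-word incipit length, the "[...]" suffix, the sample wording, and the rule that a letter with no t- paragraphs at all is treated as English are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Para:
    style: str  # Word paragraph style name, e.g. "letter" or "t-letter"
    text: str


def tag_languages(paras):
    """Tag each paragraph 'en' or 'de' using the rule described in step 1."""
    # Any t- styled paragraph, even an empty one, marks the letter as translated.
    has_translation = any(p.style.startswith("t-") for p in paras)
    tagged = []
    for p in paras:
        if p.style.startswith("t-"):
            lang = "en"  # translation paragraphs are English
        elif has_translation:
            lang = "de"  # everything else is German by elimination
        else:
            lang = "en"  # assumption: no translation means an English letter
        tagged.append((lang, p))
    return tagged


def incipit_title(tagged, max_words=8):
    """Build a title from the opening English words, as described in step 2."""
    words = " ".join(p.text for lang, p in tagged if lang == "en").split()
    return (" ".join(words[:max_words]) + " [...]").strip()


# Hypothetical English letter with a stray empty t-letter paragraph at the end.
letter = [
    Para("letter", "Dear Sir, I have the honour to inform you"),
    Para("t-letter", ""),
]
tagged = tag_languages(letter)
print(sorted({lang for lang, _ in tagged}))  # both languages are reported
print(incipit_title(tagged))                 # only the suffix is left
```

Under these assumptions the letter is reported as both English and German, and its title collapses to the bare suffix, which matches the behaviour seen in the facets.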

I think that I can fix the whole thing by a small tweak to the translation-recognition step so it ignores paragraphs which are styled as a translation if they are empty.
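A rough sketch of what such a tweak might look like, written in Python purely for illustration (the real step is part of the conversion pipeline and may look quite different). The `Para` type, the function names, the sample text, and the assumption that a letter without any translation paragraphs is English are all hypothetical.

```python
from typing import NamedTuple


class Para(NamedTuple):
    style: str  # Word paragraph style name
    text: str


def is_translation(p):
    """A paragraph counts as a translation only if it is t- styled AND has text."""
    return p.style.startswith("t-") and p.text.strip() != ""


def languages(paras):
    """Return one language code per paragraph."""
    if not any(is_translation(p) for p in paras):
        # No real translation present: assume the whole letter is English.
        return ["en"] * len(paras)
    return ["en" if is_translation(p) else "de" for p in paras]


# An English letter with a stray empty t-letter paragraph is no longer German.
print(languages([Para("letter", "Dear Sir"), Para("t-letter", "  ")]))

# A German letter with a genuine translation is still split as before.
print(languages([Para("letter", "Geehrter Herr"), Para("t-letter", "Dear Sir")]))
```

With the emptiness check in place, the incipit step would again find English text to draw on, so the title problem should disappear along with the false German tag.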

LucasHorseshoeBend commented 3 years ago

Thanks Conal

A tweak, if easy to do, would be a welcome safety net.

Now that I know what to look for I will be able to correct the issues when found. Interestingly the empty trailing t-letter paragraph in 58-10-07a did not show up in the Word style mapping tool I use to pick up departures from the agreed styles.
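For hunting down remaining cases, something like the following could list empty t- styled paragraphs directly from a file. It is a sketch only, and it assumes the letters are saved as .docx (not the older .doc format) and that the style ID stored in the file matches the style name (Word can alter IDs, for example by dropping spaces). The function names are made up for the example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def empty_translation_paragraphs(document_xml, prefix="t-"):
    """Return (index, style_id) for each textless paragraph whose style ID starts with prefix."""
    root = ET.fromstring(document_xml)
    hits = []
    for i, p in enumerate(root.iter(f"{{{W}}}p")):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val", "") if style is not None else ""
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if style_id.startswith(prefix) and not text.strip():
            hits.append((i, style_id))
    return hits


def scan_docx(path):
    """Open a .docx file (a zip archive) and scan its main document part."""
    with zipfile.ZipFile(path) as z:
        return empty_translation_paragraphs(z.read("word/document.xml"))
```

Calling `scan_docx` on a file would return the zero-based positions of the suspect paragraphs. Note that it would also flag empty t-letter paragraphs that were used deliberately as spacing.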

I have put a copy of 58-10-07a into the quarantine folder, and named it 58-10-07a-Title Prob for you to use as a test file.

I have looked to see whether I can pick up issues where the language designation is falsely "German" but without the "title" problem. I have not yet tried to exhaust the set, but I have found one quickly, a copy of which I have also put into quarantine with the same empty t-letter paragraph fault left intact: 85-09-21e-False German, but I have corrected it in the main set.

A second case was more interesting: an empty t-letter paragraph had been used to insert space below headings, rather than the standard extra space style we created for that purpose. The file is correctly detected in the facets as German, and has a proper title. I have now replaced those empty t-letter paragraphs with paragraphs coded as extra space. A copy of that file before correction is also in the quarantine folder, as 88-02-07 incorrect t-letter usage.

Because I would like to have the files as clean as possible, I have removed the offending empty paragraphs in my example cases:

58-10-07a-final
58-11-00-final
58-12-00a-final
68-12-06-draft

85-09-21e
88-02-07

Best wishes Arthur


Conal-Tuohy commented 3 years ago

Cheers @LucasHorseshoeBend; since it was a trivial tweak I just went ahead and did it.

LucasHorseshoeBend commented 3 years ago

Thanks

Best wishes Arthur
