Conal-Tuohy / VMCP-upconversion

Ferdinand von Mueller's correspondence upconversion from MS Word to TEI XML
Apache License 2.0

"Standard" style should not be present #20

Closed: LucasHorseshoeBend closed this issue 7 years ago

LucasHorseshoeBend commented 7 years ago

In your e-mail of 13 Dec, 14.27, you wrote:

Having excluded the empty paragraphs from the count, I've reduced the number of documents with "standard"/"normal" paragraphs by 50%, which, although a definite improvement, is still (I would think) too large a number to be manually corrected. However, I think some kind of correction will be needed, since a quick random survey of the "standard" texts shows almost all of them have paragraphs which are clearly intended to have some special significance but are still styled "standard". So I am tending towards the idea that it will indeed be necessary to automatically replace the "standard" style of those paragraphs, based on their direct formatting and content.

It does look like it will be necessary. If that fails, can you further segment by selecting the XML files with -final as the suffix of the file name? If there is a smallish number of those, then we can tackle those manually and keep a good watch on the others as we set them at final. There are at least 26 of them at Final, as we have done all up to 1859; Mentions is a mess as yet, with mostly first drafts, so I suspect that most of the 118 that include standard style will need working on in any case.

Conal-Tuohy commented 7 years ago

I have added yet another facet, "Status", with values "final" and "not final" based on the file name suffix. I'm running the conversion now as I head off to bed and will check it in the morning, but assuming I didn't make a typo you should be able to filter documents by their status (and combine that selection with the other facets, of course).
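For illustration only, the file-name test behind such a facet might look like the Python sketch below. It is not the code used in this repository; the function name `status_facet`, the example file names, and the case-insensitive match are all assumptions.

```python
from pathlib import PurePath


def status_facet(filename: str) -> str:
    """Classify a document as "final" or "not final" from its file name.

    Assumption: the suffix is tested on the name with its extension
    removed, and case is ignored so that "-Final" also counts.
    """
    stem = PurePath(filename).stem  # e.g. "letter-final.xml" -> "letter-final"
    return "final" if stem.lower().endswith("-final") else "not final"


# Hypothetical file names, not taken from the real corpus.
for name in ["59-01-01-final.xml", "59-01-02.xml"]:
    print(name, "->", status_facet(name))
```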

Conal-Tuohy commented 7 years ago

I will need to create a step that assigns styles by looking for paragraphs that have a particular combination of formatting: font family, weight and obliquity, point size, text alignment, indent, etc.

Do this first and then if necessary refine it further to look for patterns in the text itself or in the relative placement of the paragraphs in the text (correspondent for instance is always near the top, and plant names apparently always at or near the end).
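As a rough sketch of what the first stage could look like, here is a small rule table in Python that maps a combination of direct formatting to a style name. Everything in it is an assumption made for illustration: the `Formatting` fields, the two rules, and the style names "heading" and "location and date" are hypothetical, and the real step need not be written in Python at all.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Formatting:
    """Direct formatting of one paragraph; None means "not specified"."""
    font_family: Optional[str] = None
    bold: Optional[bool] = None
    italic: Optional[bool] = None
    point_size: Optional[float] = None
    alignment: Optional[str] = None
    indent_cm: Optional[float] = None


# Hypothetical rules: (formatting pattern, style to assign).
# A None field in a pattern acts as a wildcard.
RULES = [
    (Formatting(bold=True, alignment="center"), "heading"),
    (Formatting(italic=True, alignment="right"), "location and date"),
]


def matches(pattern: Formatting, actual: Formatting) -> bool:
    """True if every field the pattern specifies equals the actual value."""
    return all(
        want is None or want == getattr(actual, name)
        for name, want in asdict(pattern).items()
    )


def assign_style(current_style: str, actual: Formatting) -> str:
    """Replace a "standard"/"normal" style using the first matching rule."""
    if current_style.lower() not in {"standard", "normal"}:
        return current_style  # an explicit style is never overridden
    for pattern, style in RULES:
        if matches(pattern, actual):
            return style
    return current_style  # no rule matched; leave for manual review
```

Paragraphs that no rule matches keep their "standard" style, so they would still show up in the facet and could be handled by the later, content-based or position-based refinement.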

LucasHorseshoeBend commented 7 years ago

The final facet works a treat, and the number of positives is about what I would expect it to be, given that as of today we had 4,893 so named, and you are working on an edition a few weeks old. But unfortunately the total of 214 final items with standard is a bit too high for manual edits.

LucasHorseshoeBend commented 7 years ago

I have done these edits manually in the "final" files, and now know what accounts for the majority of them: our style template did not force the paragraphs following a valediction into the letter style that should have been used for such post-script paragraphs, and it was often overlooked.
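For the files not yet at final, the pattern described above could in principle be caught by a positional rule. The Python sketch below is only an assumption about how that might be done: the style names "valediction" and "postscript" are hypothetical placeholders for whatever the template actually calls them, and paragraphs are modelled as simple (style, text) pairs.

```python
def restyle_after_valediction(paragraphs, valediction="valediction",
                              target="postscript"):
    """Return a copy of `paragraphs` in which every "standard"/"normal"
    paragraph that comes after a valediction is given the `target` style.

    `paragraphs` is a list of (style, text) tuples in document order.
    """
    result = []
    seen_valediction = False
    for style, text in paragraphs:
        if style == valediction:
            seen_valediction = True
        elif seen_valediction and style.lower() in {"standard", "normal"}:
            style = target
        result.append((style, text))
    return result
```

Paragraphs before the valediction, and paragraphs that already carry an explicit style, are left untouched.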

The issue can, I think, be closed.

LucasHorseshoeBend commented 7 years ago

Do you really want this issue to remain open? I can deal with the other cases as we set the files concerned to final.