Closed codykingham closed 4 years ago
This issue can be solved by making a separate configuration dict for this file in convert.py.
The result will be that we lose any potential emphasis in this file, but there does not appear to be any valid emphasis in it anyway.
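A minimal sketch of what such a per-file configuration could look like. The names `DEFAULT_CONFIG`, `FILE_OVERRIDES`, `config_for`, and the option keys are all hypothetical and are not taken from convert.py; the only thing assumed from the discussion above is that emphasis handling can be switched off for one file.

```python
# Hypothetical sketch: none of these names come from convert.py.

# Defaults applied to every source file.
DEFAULT_CONFIG = {
    "italic_is_default": True,  # italics are the normal NENA text style
    "mark_emphasis": True,      # non-italic runs are tagged as emphasis
}

# Per-file overrides, keyed by source file name.
FILE_OVERRIDES = {
    "bar text A37-A40.html": {"mark_emphasis": False},
}


def config_for(filename):
    """Return the defaults merged with any overrides for this file."""
    config = dict(DEFAULT_CONFIG)  # copy so the defaults are never mutated
    config.update(FILE_OVERRIDES.get(filename, {}))
    return config
```

With this layout, `config_for("bar text A37-A40.html")` disables emphasis for the problem file only, while every other file keeps the defaults.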
Done. bar text A37-A40 has been fixed by ignoring emphasis in that document. There do not appear to be any foreign terms marked in the document, so this is not a problem.
The source file bar text A37-A40.html has several paragraphs that are incorrectly emphasized in the html to .nena conversion:

For example, in The Tale of Nasimo, lines 1–3 become emphasized, even though they are not emphasized in the original .doc file:

This seems to be caused by the export from
.doc to .html, which has removed the default italic formatting from these lines. Most of the NENA .doc files use italics as the default formatting, with regular font indicating "emphasis" (e.g. non-NENA words like proper nouns). This is also true of bar text A37-A40.html. Opening the file in MS Word, I have confirmed that lines 1–3 are indeed formatted with italics. But on export, MS Word strips out this formatting for these lines, causing them to be read as unformatted, and hence as "emphasis". I have also reproduced this behavior with a fresh html export.

One possible cause may be that lines 1–3 are marked as US English:
whereas the rest of the document is encoded as en-GB. However, the same is true of other files (e.g. bar text a26.html), which do not suffer from this problem. So this may be caused by a more obscure formatting setting in this particular file.

I am not yet sure where else this problem may occur.
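One way to look for other candidate files is to count the language markings in each html export. The sketch below assumes that the export records language in `lang` attributes on the html tags; the issue only says that the lines are "marked as US English", so that is an assumption to check against an actual export. `LangCounter` and `count_langs` are hypothetical helper names, not part of the project.

```python
from collections import Counter
from html.parser import HTMLParser


class LangCounter(HTMLParser):
    """Count the values of `lang` attributes on all start tags."""

    def __init__(self):
        super().__init__()
        self.langs = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names; values are normalized here.
        for name, value in attrs:
            if name == "lang" and value:
                self.langs[value.lower()] += 1


def count_langs(html_text):
    """Return a Counter mapping each lang value to its number of occurrences."""
    parser = LangCounter()
    parser.feed(html_text)
    parser.close()
    return parser.langs
```

Running `count_langs` over the text of each export and listing the files with more than one language value would give a shortlist to inspect by hand. Since mixed languages also occur in files that convert correctly, this can only narrow the search and cannot confirm the cause.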