CambridgeSemiticsLab / nena_corpus

The NENA corpus in plain-text markup
Creative Commons Attribution 4.0 International

Multi-Line Emphasis in bar text A37-A40 #1

Closed codykingham closed 4 years ago

codykingham commented 4 years ago

The source file bar text A37-A40.html has several paragraphs that are incorrectly emphasized by the HTML to .nena conversion.

For example, in The Tale of Nasimo, lines 1–3 become emphasized, even though they are not emphasized in the original .doc file:

(1) *ʾíθwa xa-bráta šə́mma Nasìmo.ˈ Nasìmoˈ wíðɛwa gu-ʾúmra qurbàna,ˈ qa-ṱ-azáwa
qarwàwa.ˈ yomáθa la-palṭàwa,ˈ ma-ṱ-áwa xámθa šapìrta,ˈ káwsa làxxa ṣaléwa.ˈ
ʾiθwála šáwwa xonăwàθa.ˈ mə́ra lá-palṭa xáθən mən-bɛ̀θa.ˈ ʾàxni nablə́xla.ˈ
(2) ʾáni zìlla.ˈ lá-θela jàldeˈ murqə̀lla.ˈ mə́ra xéna ʾána qɛ́mən ʾázən ʾùmra.ˈ
ʾánna xonăwáθa là-θela.ˈ ʾazáwa ʾúmra qa-t-qarwàwa,ˈ dášta ṃaḷyáwa rakáwe
ʾarabàye.ˈ ṃḷíθa dášta ʾarabàye,ˈ ʾə́θyela jlíwəlla muṣə̀lyəlla.ˈ
(3) ʾə́θyɛle xóna díya ʾo-gòṛa,ˈ lɛ́le xə́zyəlla xàθe,ˈ dax-xànum ʾay-tíwta
gu-bɛ́θa.ˈ lá qəm-xazèla.ˈ mə̀re*:ˈ

This seems to be caused by the export from .doc to .html, which has removed the default italic formatting from these lines. Most of the NENA .doc files use italics as the default formatting, with regular font indicating "emphasis" (e.g. non-NENA words like proper nouns). This is also true of bar text A37-A40.html. Opening the file in MS Word, I have confirmed that lines 1–3 are indeed formatted with italics. But on export, MS Word strips out this formatting for these lines, causing them to be read as unformatted, and hence as "emphasis". I have also reproduced this behavior with a fresh HTML export.
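To illustrate why stripped italics turn into emphasis, here is a minimal sketch of the inverted convention. It is not the actual convert.py logic: the function name `mark_emphasis`, the `(text, is_italic)` run representation, and the placeholder strings are all assumptions made for illustration.

```python
def mark_emphasis(runs):
    """Wrap every non-italic run in *...*, leaving italic (default) runs plain.

    `runs` is a hypothetical list of (text, is_italic) pairs.
    """
    return "".join(
        text if is_italic else f"*{text}*"
        for text, is_italic in runs
    )


# Italics intact: only the regular-font word is marked as emphasis.
intact = mark_emphasis([("default text ", True), ("Name", False)])

# Italics stripped on export: the whole run is read as emphasis.
stripped = mark_emphasis([("default text Name", False)])
```

Under this assumption, a paragraph that loses its italic formatting on export is indistinguishable from one that was deliberately set in regular font, which matches the output shown above.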

One possible cause may be that lines 1–3 are marked as US English:

<p lang="en-US"

whereas the rest of the document is encoded as en-GB. However, this is true of other files (e.g. bar text a26.html), which do not suffer from this problem. So this may be caused by a more obscure formatting setting in this particular file.
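One way to check this hypothesis across files would be to count the `lang` values on `<p>` tags in each exported HTML file. The following is a rough sketch using only the standard library; `LangCounter` and `count_paragraph_langs` are hypothetical names and are not part of the repository.

```python
from collections import Counter
from html.parser import HTMLParser


class LangCounter(HTMLParser):
    """Count the lang attribute values found on <p> start tags."""

    def __init__(self):
        super().__init__()
        self.langs = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values keep their case.
        if tag == "p":
            self.langs[dict(attrs).get("lang")] += 1


def count_paragraph_langs(html_text):
    """Return a Counter mapping lang value (or None) to paragraph count."""
    parser = LangCounter()
    parser.feed(html_text)
    parser.close()
    return parser.langs
```

Running this over the affected file and over an unaffected one (e.g. bar text a26.html) would show whether en-US paragraphs line up with the wrongly emphasized lines or appear in both.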

I am not yet sure where else this problem may occur.

codykingham commented 4 years ago

This issue can be solved by making a separate configuration dict for this file in convert.py.
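A minimal sketch of what such a per-file override could look like. The names `DEFAULT_CONFIG`, `FILE_OVERRIDES`, `parse_emphasis`, and `config_for` are hypothetical; the actual keys and structure in convert.py may differ.

```python
from pathlib import Path

# Hypothetical default settings applied to every source file.
DEFAULT_CONFIG = {"parse_emphasis": True}

# Hypothetical per-file overrides, keyed by source file name.
FILE_OVERRIDES = {
    "bar text A37-A40.html": {"parse_emphasis": False},
}


def config_for(path):
    """Return the default config merged with any override for this file."""
    name = Path(path).name
    return {**DEFAULT_CONFIG, **FILE_OVERRIDES.get(name, {})}
```

Keeping the override in a separate dict leaves the defaults untouched for all other files.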

codykingham commented 4 years ago

The result will be that we lose any potential emphasis in this file. But there do not appear to be any valid instances of emphasis anyway.

codykingham commented 4 years ago

Done. bar text A37-A40 has been fixed by ignoring emphasis in that document. There do not appear to be any foreign terms marked in the document, so this is not a problem.