Errors from earlier word segmentation runs that need updating

Brown-University-Library / iip-texts

IIP inscriptions encoded in Epidoc XML and supporting files


Errors from earlier word segmentation runs that need updating #195

Open cmroughan opened 1 year ago

cmroughan commented 1 year ago

Several XML files appear to have gone through earlier passes of the word segmentation workflow and still contain transcription_segmented divs that are not encoded to the current standard.

For example, in jeru0183: <orig xml:id="jeru0183-7" xml:lang="arc"><foreign xml:lang="grc"><unclear>Κ</unclear></foreign><g ref="interpunct">·</g><foreign xml:lang="grc"><unclear>Ν</unclear>ΙΦ</foreign></orig>

The foreign tag should be removed and its xml:lang attribute moved to replace the xml:lang attribute in the enclosing orig tag.
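A minimal sketch of that fix in Python, using only the standard library. It is not the project's actual workflow code. The helper names (`local_name`, `unwrap_child`, `hoist_foreign_lang`) are hypothetical, and the rule of rewriting only when every direct foreign child carries the same xml:lang is an assumption, so mixed-language cases are left untouched for manual review.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local_name(tag):
    """Strip any namespace so the sketch works with or without the TEI namespace."""
    return tag.rsplit("}", 1)[-1]


def unwrap_child(parent, index):
    """Replace parent[index] with its own contents, keeping text and tail in order."""
    child = parent[index]
    grandchildren = list(child)

    def append_before(text):
        # Attach text to whatever immediately precedes the child's position.
        if index == 0:
            parent.text = (parent.text or "") + text
        else:
            prev = parent[index - 1]
            prev.tail = (prev.tail or "") + text

    if child.text:
        append_before(child.text)
    if child.tail:
        if grandchildren:
            last = grandchildren[-1]
            last.tail = (last.tail or "") + child.tail
        else:
            append_before(child.tail)

    del parent[index]
    for offset, grandchild in enumerate(grandchildren):
        parent.insert(index + offset, grandchild)


def hoist_foreign_lang(orig):
    """Move a shared foreign/@xml:lang onto orig and drop the foreign wrappers.

    Returns True if the element was rewritten. Assumption: only rewrite when
    all direct foreign children agree on a single language.
    """
    foreign = [c for c in orig if local_name(c.tag) == "foreign"]
    langs = {c.get(XML_LANG) for c in foreign}
    if not foreign or len(langs) != 1 or None in langs:
        return False
    orig.set(XML_LANG, langs.pop())
    # Work backwards so indices of earlier children stay valid.
    for index in reversed(range(len(orig))):
        if local_name(orig[index].tag) == "foreign":
            unwrap_child(orig, index)
    return True


# The jeru0183 example from this issue.
snippet = (
    '<orig xml:id="jeru0183-7" xml:lang="arc">'
    '<foreign xml:lang="grc"><unclear>Κ</unclear></foreign>'
    '<g ref="interpunct">·</g>'
    '<foreign xml:lang="grc"><unclear>Ν</unclear>ΙΦ</foreign>'
    "</orig>"
)
orig = ET.fromstring(snippet)
changed = hoist_foreign_lang(orig)
print(ET.tostring(orig, encoding="unicode"))
```

For real files, the same function could be applied to every orig element inside each transcription_segmented div, with anything that returns False logged for a manual look.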

To do: check other transcription_segmented divs for issues like this and push an update for these.

cmroughan commented 1 year ago

Running through some validation of the existing output from past word segmentation workflows. Will add to this thread as issues that should be resolved come up.

Another error: the occurrences column in the parsed language wordlists references some XML files that no longer seem to exist:

caes0412.xml halu0001.xml jeru0237.xml hmti0003.xml seph0100.xml gers0001.xml dora0002.xml hmti0005.xml jeru0196.xml masa0038.xml masa0039.xml rehn0001.xml hmti0004.xml masa0037.xml jent0006.xml jeru0305.xml anri0001.xml knah0002.xml
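A check like this could be automated. The sketch below assumes the file names have already been extracted from the occurrences column (the wordlist format itself is not shown here) and that the inscription files sit in a single directory. `find_missing_files` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def find_missing_files(referenced, xml_dir):
    """Return the referenced file names that are not present in xml_dir, sorted."""
    existing = {p.name for p in Path(xml_dir).glob("*.xml")}
    return sorted(set(referenced) - existing)


# Small demonstration against a throwaway directory holding one file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "jeru0183.xml").touch()
    missing = find_missing_files(["jeru0183.xml", "caes0412.xml"], tmp)

print(missing)
```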

cmroughan commented 1 year ago

In multiple cases, the '.xml' file extension is also erroneously included in the wordID produced by this workflow. A non-exhaustive sample:

jeru0492.xml-1 jord0001.xml-229 masa0469.xml-2 jaff0054.xml-1 jord0001.xml-475 jord0001.xml-20 huqo0001.xml-10 jeru0492.xml-4 jeru0501.xml-1 beth0244.xml-5 beth0242.xml-3 masa0529.xml-1 qumr0001.xml-14 beth0243.xml-7 masa0416.xml-2 jord0001.xml-552 masa0493.xml-1 erra0001.xml-3 jeru0357.xml-2 qumr0001.xml-18 jord0001.xml-261
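One possible cleanup, sketched in Python. It assumes the intended wordID format is the file stem, a hyphen, and a number, as in the jeru0183-7 xml:id quoted earlier in this thread. `normalize_word_id` is a hypothetical helper, and IDs that do not match the faulty pattern are returned unchanged.

```python
import re

# Matches IDs such as "jord0001.xml-229": a file stem, ".xml", a hyphen, and a number.
_BAD_ID = re.compile(r"^(?P<file>[^.\s]+)\.xml-(?P<index>\d+)$")


def normalize_word_id(word_id):
    """Drop a stray '.xml' from a wordID; leave every other ID untouched."""
    match = _BAD_ID.match(word_id)
    if match is None:
        return word_id
    return f"{match['file']}-{match['index']}"


sample = ["jeru0492.xml-1", "jord0001.xml-229", "jeru0183-7"]
cleaned = [normalize_word_id(w) for w in sample]
print(cleaned)
```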

zeichman commented 1 year ago

> Running through some validation of the existing output from past word segmentation workflows. Will add to this thread as issues that should be resolved come up.
>
> Another error: the occurrences column in the parsed language wordlists reference some XML files that do not seem to exist (anymore):
>
> caes0412.xml halu0001.xml jeru0237.xml hmti0003.xml seph0100.xml gers0001.xml dora0002.xml hmti0005.xml jeru0196.xml masa0038.xml masa0039.xml rehn0001.xml hmti0004.xml masa0037.xml jent0006.xml jeru0305.xml anri0001.xml knah0002.xml

I believe most of these are missing because they were redundant files that were deleted or merged into another file, though I would need to check each case.