Open cmroughan opened 1 year ago
Running through some validation of the existing output from past word segmentation workflows. Will add to this thread as issues that should be resolved come up.
Another error: the occurrences column in the parsed language wordlists references some XML files that no longer seem to exist:
caes0412.xml halu0001.xml jeru0237.xml hmti0003.xml seph0100.xml gers0001.xml dora0002.xml hmti0005.xml jeru0196.xml masa0038.xml masa0039.xml rehn0001.xml hmti0004.xml masa0037.xml jent0006.xml jeru0305.xml anri0001.xml knah0002.xml
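A rough sketch of how this check could be automated, in case it is useful. It assumes each occurrences cell is a string that contains file names of the form four lowercase letters, four digits, then `.xml` (inferred from the names above, not from the workflow code). The helpers `referenced_files`, `missing_files` and `xml_names` are hypothetical, not part of the existing workflow.

```python
import re
from pathlib import Path

# Assumption: file names look like "caes0412.xml" (4 letters + 4 digits).
FILE_RE = re.compile(r"\b[a-z]{4}\d{4}\.xml\b")

def referenced_files(cells):
    """Collect every XML file name mentioned in the occurrences cells."""
    found = set()
    for cell in cells:
        found.update(FILE_RE.findall(cell))
    return found

def missing_files(cells, existing):
    """Return referenced file names that are not in `existing`, sorted."""
    return sorted(referenced_files(cells) - set(existing))

def xml_names(directory):
    """File names of all .xml files directly inside `directory`."""
    return {path.name for path in Path(directory).glob("*.xml")}
```

Passing the occurrences column and `xml_names(...)` for the inscription directory to `missing_files` should reproduce a list like the one above.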
There are also multiple cases where '.xml' is erroneously included in the wordID produced by this workflow. A non-exhaustive sample:
jeru0492.xml-1 jord0001.xml-229 masa0469.xml-2 jaff0054.xml-1 jord0001.xml-475 jord0001.xml-20 huqo0001.xml-10 jeru0492.xml-4 jeru0501.xml-1 beth0244.xml-5 beth0242.xml-3 masa0529.xml-1 qumr0001.xml-14 beth0243.xml-7 masa0416.xml-2 jord0001.xml-552 masa0493.xml-1 erra0001.xml-3 jeru0357.xml-2 qumr0001.xml-18 jord0001.xml-261
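One possible way to normalize these IDs after the fact, sketched under the assumption that a well-formed wordID is the file stem followed by `-N` (as in `jeru0183-7`). The function name `clean_word_id` is hypothetical. The cleaner fix is probably to strip the extension where the ID is generated.

```python
import re

# Matches IDs such as "jeru0492.xml-1": a stem, a stray ".xml", then "-<number>".
WORD_ID_RE = re.compile(r"^(?P<stem>.+?)\.xml-(?P<n>\d+)$")

def clean_word_id(word_id):
    """Drop a stray '.xml' before the numeric suffix; leave other IDs untouched."""
    match = WORD_ID_RE.match(word_id)
    if match is None:
        return word_id
    return f"{match['stem']}-{match['n']}"
```

IDs that already follow the expected pattern pass through unchanged, so the function can safely be applied to a whole column.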
The missing XML files listed above were, I believe, mostly redundant files that were deleted or merged into another file, though I would need to check each case.
It appears that several XML files went through the word segmentation workflow at earlier stages and still contain transcription_segmented divs that are not encoded to the current standard.
For example, jeru0183:
<orig xml:id="jeru0183-7" xml:lang="arc"><foreign xml:lang="grc"><unclear>Κ</unclear></foreign><g ref="interpunct">·</g><foreign xml:lang="grc"><unclear>Ν</unclear>ΙΦ</foreign></orig>
The foreign tag should be removed and its xml:lang attribute moved to replace the xml:lang attribute in the enclosing orig tag.
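A minimal sketch of that rewrite using only the standard library, not a tested migration script. It makes two assumptions that are not stated above: it only touches an orig whose direct foreign children all share one xml:lang, and it skips any orig that has text outside those children (where changing the language of the whole word could be wrong). The names `unwrap` and `flatten_foreign` are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def local(tag):
    """Tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]

def unwrap(parent, child):
    """Replace `child` with its own text and children inside `parent`."""
    siblings = list(parent)
    index = siblings.index(child)
    grandchildren = list(child)

    def add_before(text):
        # Text before position `index` lives in the previous sibling's
        # tail, or in parent.text when there is no previous sibling.
        if not text:
            return
        if index == 0:
            parent.text = (parent.text or "") + text
        else:
            prev = siblings[index - 1]
            prev.tail = (prev.tail or "") + text

    add_before(child.text)
    if grandchildren:
        if child.tail:
            last = grandchildren[-1]
            last.tail = (last.tail or "") + child.tail
    else:
        add_before(child.tail)
    parent.remove(child)
    for offset, node in enumerate(grandchildren):
        parent.insert(index + offset, node)

def has_direct_text(orig):
    """True if orig holds non-whitespace text outside its child elements."""
    if (orig.text or "").strip():
        return True
    return any((child.tail or "").strip() for child in orig)

def flatten_foreign(orig):
    """Move the shared xml:lang of foreign children onto orig and unwrap them.

    Returns True if orig was rewritten, False if it was left alone.
    """
    foreign = [child for child in orig if local(child.tag) == "foreign"]
    langs = {child.get(XML_LANG) for child in foreign}
    if not foreign or len(langs) != 1 or None in langs or has_direct_text(orig):
        return False
    orig.set(XML_LANG, langs.pop())
    for child in foreign:
        unwrap(orig, child)
    return True
```

Applied to the jeru0183-7 element above, this should leave an orig with xml:lang="grc" whose children are the two unclear elements and the interpunct g, with no foreign wrapper.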
To do: check other transcription_segmented divs for issues like this and push an update for these.
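For the checking part of this to-do, something like the following could list candidate words per file. This is a sketch under assumptions: a div counts as a match if any of its attribute values equals `transcription_segmented` (I have not confirmed which attribute carries that label), and an orig is flagged whenever it contains a foreign element at any depth. The function name `flagged_word_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def local(tag):
    """Tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]

def flagged_word_ids(xml_text, div_label="transcription_segmented"):
    """Return xml:id values of orig elements that still contain a foreign tag."""
    root = ET.fromstring(xml_text)
    flagged = []
    for div in root.iter():
        # Assumption: the label appears as the value of some div attribute.
        if local(div.tag) != "div" or div_label not in div.attrib.values():
            continue
        for orig in div.iter():
            if local(orig.tag) != "orig":
                continue
            if any(local(node.tag) == "foreign" for node in orig.iter()):
                flagged.append(orig.get(XML_ID))
    return flagged
```

Running this over each file's contents would give a list of wordIDs to review before pushing the update.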