Open emylonas opened 3 years ago
The current word segmenting script in https://github.com/atbradley/iip-texts/tree/atb-dev/scripts/word-segmentation does this with jeru0522:
<div type="edition" subtype="transcription_segmented"><p><w>Ἰοῦστος</w> <w>Χαλχιδηνός</w></p>
<p><w>Θεέννας</w></p>
</div>
Do we want to drop the <p> tags? Keep the <div type="textpart">s?
word_indexer.py currently drops the second <p>. I'm trying to work out why now.
The Python script that does word segmentation currently looks for //div[@subtype="transcription"]/p and applies the word segmentation rules to the text and element nodes inside that <p> element.
However, some inscriptions have multiple texts on them, or have texts on more than one part of the object. In this case, the structure of the transcription div is as follows: //div[@subtype="transcription"]/div[@type="textPart"]/p where there is more than one textPart. For example, caes0509.xml. Other examples: jeru0522.xml, mare0437
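To make the two layouts concrete, here is a rough sketch of a lookup that handles both. It is not the code in word_indexer.py: the function name `find_transcription_paragraphs` and the sample XML are made up for illustration, and the sample leaves out the TEI namespace that the real files presumably declare, so the paths would need a namespace map there.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample; not copied from any IIP file.
SAMPLE = """<body>
  <div type="edition" subtype="transcription">
    <div type="textPart" n="a"><p>first text</p></div>
    <div type="textPart" n="b"><p>second text</p></div>
  </div>
</body>"""


def find_transcription_paragraphs(root):
    """Return every <p> under the transcription div, in either layout."""
    paragraphs = []
    for div in root.findall(".//div[@subtype='transcription']"):
        # Layout 1: <p> sits directly under the transcription div.
        paragraphs.extend(div.findall("p"))
        # Layout 2: <p> sits inside one or more textPart divs.
        paragraphs.extend(div.findall("div[@type='textPart']/p"))
    return paragraphs


root = ET.fromstring(SAMPLE)
texts = [p.text for p in find_transcription_paragraphs(root)]
print(texts)
```

Collecting from both paths means a file with a single <p> and a file with several textParts go through the same code. If a file ever mixed the two layouts, this would list the direct <p> elements before the nested ones rather than in document order.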
The script currently locates and segments the contents of the <p> in the first textPart. It either converts or ignores any subsequent ones, but only writes out the first one in the segmented output. The script should convert and output each of the textPart divs.
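A possible shape for the fix, as a sketch only: loop over every textPart and write each one out. The names `segment_paragraph` and `segment_transcription` are hypothetical, the whitespace split is a stand-in for the script's real segmentation rules (it ignores element nodes), and the sample input is modeled on the jeru0522 output above rather than taken from the file. It also assumes no TEI namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical input modeled on the jeru0522 output shown above.
SAMPLE = """<body>
  <div type="edition" subtype="transcription">
    <div type="textPart"><p>Ἰοῦστος Χαλχιδηνός</p></div>
    <div type="textPart"><p>Θεέννας</p></div>
  </div>
</body>"""


def segment_paragraph(p):
    """Placeholder segmenter: wrap whitespace-separated tokens in <w>.

    Only looks at the paragraph's leading text; child elements are ignored.
    """
    new_p = ET.Element("p")
    previous = None
    for token in (p.text or "").split():
        w = ET.SubElement(new_p, "w")
        w.text = token
        if previous is not None:
            previous.tail = " "  # keep a space between adjacent <w> elements
        previous = w
    return new_p


def segment_transcription(root):
    """Build a transcription_segmented div covering every textPart."""
    source = root.find(".//div[@subtype='transcription']")
    out = ET.Element("div", {"type": "edition", "subtype": "transcription_segmented"})
    parts = source.findall("div[@type='textPart']")
    if parts:
        # One output div per textPart, so no part is silently dropped.
        for part in parts:
            new_part = ET.SubElement(out, "div", dict(part.attrib))
            for p in part.findall("p"):
                new_part.append(segment_paragraph(p))
    else:
        # Single-text inscriptions: <p> directly under the transcription div.
        for p in source.findall("p"):
            out.append(segment_paragraph(p))
    return out


segmented = segment_transcription(ET.fromstring(SAMPLE))
result = ET.tostring(segmented, encoding="unicode")
print(result)
```

This version keeps the textPart divs in the output. If we decide to drop them, the segmented <p> elements could be appended straight to the output div instead.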
Python script folder with output files
Will add example output - current and desired