Brown-University-Library / OLD-ARCHIVED_iip-production

Modify python word segmentation script so that it handles <div>s with @textParts #132

Open emylonas opened 3 years ago

emylonas commented 3 years ago

The python script that does word segmentation currently looks for //div[@subtype="transcription"]/p and applies the word segmentation rules to the text and element nodes inside that <p> element.

However, some inscriptions carry multiple texts, or have text on more than one part of the object. In these cases, the structure of the transcription div is as follows:

//div[@subtype="transcription"]/div[@type="textpart"]/p where there is more than one textpart div. For example, caes0509.xml:

    <div type="edition" subtype="transcription" ana="b1">
        <div type="textpart" subtype="obverse">
            <p>βονόσου</p>
        </div>
        <div type="textpart" subtype="reverse">
            <p><foreign xml:lang="lat">Bonosu</foreign></p>
        </div>
    </div>

Other examples: jeru0522.xml, mare0437.xml

The script currently locates and segments the contents of the <p> in the first textPart. It either converts or ignores any subsequent ones, but it writes out only the first one in the segmented output.

The script should convert and output each of the textPart divs.
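As a rough illustration of the desired behavior (not the existing script), here is a minimal sketch using only the standard library. It assumes namespace-free input like the snippet above; the real files are TEI, so tag names would probably need to be namespace-qualified. The function names `segment_paragraph` and `segment_transcription` are hypothetical, and the whitespace split is only a stand-in for the real segmentation rules (it flattens inline elements such as <foreign>).

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div type="edition" subtype="transcription" ana="b1">
    <div type="textpart" subtype="obverse">
        <p>βονόσου</p>
    </div>
    <div type="textpart" subtype="reverse">
        <p><foreign xml:lang="lat">Bonosu</foreign></p>
    </div>
</div>"""


def segment_paragraph(p):
    """Stand-in segmenter: wrap each whitespace-separated token in <w>.

    Assumption: the real rules are more involved; this version flattens
    inline markup and only looks at the text content of the <p>.
    """
    new_p = ET.Element("p")
    for token in "".join(p.itertext()).split():
        w = ET.SubElement(new_p, "w")
        w.text = token
    return new_p


def segment_transcription(transcription):
    """Segment every <p>, whether it sits directly under the transcription
    div or inside a <div type="textpart">, keeping the textpart wrappers."""
    out = ET.Element(
        "div", {"type": "edition", "subtype": "transcription_segmented"}
    )
    for child in transcription:
        if child.tag == "p":
            # Simple layout: <p> directly under the transcription div.
            out.append(segment_paragraph(child))
        elif child.tag == "div" and child.get("type") == "textpart":
            # Multi-part layout: copy the wrapper's attributes, then
            # segment each <p> it contains (not just the first).
            part = ET.SubElement(out, "div", dict(child.attrib))
            for p in child.findall("p"):
                part.append(segment_paragraph(p))
    return out


if __name__ == "__main__":
    result = segment_transcription(ET.fromstring(SAMPLE))
    print(ET.tostring(result, encoding="unicode"))
```

Keeping the textpart wrappers in the output is one option; dropping them and emitting the <p> elements side by side would only need the `elif` branch to append to `out` directly.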

Python script folder with output files

Will add example output - current and desired

atbradley commented 3 years ago

The current word segmenting script in https://github.com/atbradley/iip-texts/tree/atb-dev/scripts/word-segmentation does this with jeru0522:

<div type="edition" subtype="transcription_segmented"><p><w>Ἰοῦστος</w> <w>Χαλχιδηνός</w></p>
<p><w>Θεέννας</w></p>
</div>

Do we want to drop the <p> tags? Keep the <div type="textpart">s?

word_indexer.py currently drops the second <p>--I'm trying to work out why now.