Open emylonas opened 3 years ago
There are now three separate scripts in my fork of the iip-texts repository at https://github.com/atbradley/iip-texts/tree/atb-dev/scripts/word-segmentation. I've reorganized some code to make it easier to reuse these scripts as components of a larger tool.
@emylonas needs to check this and then close.
As part of the word segmentation process, we will have to correct or add some `<w>` markup by hand in inscriptions with complicated features. Also, in the future, inscriptions may be amended or corrected, so the segmented `<div>` will change to mirror the changes in the transcription div. It would be very useful to be able to run a separate script to (re)generate the `@xml:id` attributes.

Files and Folders

The original word segmentation script that does this is here: https://github.com/lukehollis/iip-word-lists/blob/master/word_segmentation/word_segmentation.py. l. 216

Folder that has files with word segmentation

I'm not sure this is worth copying, but this is how it's done now.

Input and Results

This new script should read in an inscription that has a `<div type="edition" subtype="transcription_segmented">`. It should take the content of the `<div type="edition" subtype="transcription_segmented">` and add an `@xml:id` to each element in the div. These elements are likely to be `<w>`, `<num>`, `<orig>` and `<g>`.

The `@xml:id` should be in the form `@xml:id="IIPID-001"`, where IIPID is the IIP number of the file, e.g. beth0345 (don't include the `.xml` extension), followed by the number of the element in sequence in the div. Ex: `<w xml:id="beth0010-004">` would be the 4th element in the div for inscription beth0010.xml.

Note that most inscriptions have names like caes0002.xml, but they can also appear in the form idum0003a.xml.

If this script is written in XSLT, it will be easier to run in Oxygen. If, however, it is written in Python, it can become part of a pipeline that is run on the command line.
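
To make the requirements above concrete, here is a minimal Python sketch of the ID (re)generation step. It is not the existing script, only an illustration under some assumptions: the files use the TEI namespace, only `<w>`, `<num>`, `<orig>` and `<g>` are numbered (in document order, at any depth inside the segmented div), the counter is zero-padded to three digits to match the `IIPID-001` form, and any existing `@xml:id` on those elements is overwritten. The function names `iip_id_from_path`, `assign_ids` and `process_file` are hypothetical.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

TEI_NS = "http://www.tei-c.org/ns/1.0"  # assumption: files are namespaced TEI
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Assumption: only the element types named in the issue receive an ID.
ID_TAGS = {"w", "num", "orig", "g"}


def iip_id_from_path(path):
    """Return the file name without its .xml extension, e.g. 'idum0003a'."""
    return Path(path).stem


def assign_ids(root, iip_id, width=3):
    """(Re)number matching elements in the segmented div; return the count."""
    count = 0
    for div in root.iter(f"{{{TEI_NS}}}div"):
        if (div.get("type"), div.get("subtype")) != ("edition", "transcription_segmented"):
            continue
        for el in div.iter():
            # Skip the div itself and any non-element nodes.
            if el is div or not isinstance(el.tag, str):
                continue
            # Compare on the local name, ignoring the namespace part.
            if el.tag.rsplit("}", 1)[-1] in ID_TAGS:
                count += 1
                el.set(XML_ID, f"{iip_id}-{count:0{width}d}")
    return count


def process_file(path):
    """Rewrite one inscription file in place and return the number of IDs set."""
    ET.register_namespace("", TEI_NS)  # keep TEI as the default namespace on output
    tree = ET.parse(path)
    n = assign_ids(tree.getroot(), iip_id_from_path(path))
    tree.write(path, encoding="utf-8", xml_declaration=True)
    return n


if __name__ == "__main__":
    # Usage: python add_ids.py caes0002.xml idum0003a.xml ...
    for p in sys.argv[1:]:
        print(p, process_file(p))
```

Because the numbering is recomputed from scratch on every run, the same script covers both the first pass and later regeneration after hand corrections. One caveat: the standard-library parser drops comments and processing instructions when it reads a file, so a real tool would probably want a parser that preserves them before writing files back in place.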