Brown-University-Library / OLD-ARCHIVED_iip-production


Create separate script to add IDs to the elements in the divs that have word segmentation #131

Open emylonas opened 3 years ago

emylonas commented 3 years ago

As part of the word segmentation process, we will have to correct or add some <w> markup by hand to inscriptions with complicated features. Also, in the future, inscriptions may be amended or corrected, so that the segmented <div> will change in order to mirror the changes in the transcription div. It would be very useful to be able to run a separate script to (re)generate the @xml:id attributes.

Files and Folders

The original word segmentation script that does this is here: https://github.com/lukehollis/iip-word-lists/blob/master/word_segmentation/word_segmentation.py (l. 216).

Folder that has files with word segmentation I'm not sure this is worth copying, but this is how it's done now.

Input and Results

This new script should read in an inscription that has a <div type="edition" subtype="transcription_segmented">. It should take the content of that <div> and add an @xml:id to each element in it. These elements are likely to be <w>, <num>, <orig> and <g>.

The @xml:id should be in the form @xml:id="IIPID-001", where IIPID is the IIP number of the file (for example, beth0345; don't include the .xml extension), followed by the number of the element in sequence in the div.

Ex: <w xml:id="beth0100-04"> would be the 4th element in the div, for inscription beth0100.xml

Note that most inscriptions have names like this: caes0002.xml, but they can also appear in the form idum0003a.xml

If this script is written in XSLT, it will be easier to run in Oxygen. If, however, it is written in Python, it can become part of a pipeline that is run on the command line.
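For illustration, here is a minimal Python sketch of the behaviour requested above. It is not the script referenced elsewhere in this thread. Several things are assumptions: the files use the TEI namespace; every <w>, <num>, <orig> and <g> descendant of the segmented div is numbered in document order; the counter is zero-padded to three digits, following the IIPID-001 pattern; and the function names `iip_id_from_filename` and `add_segment_ids` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumption: IIP files are TEI documents in the standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"
# Fully qualified name of the xml:id attribute as ElementTree stores it.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Elements the issue says are likely to need IDs.
SEGMENT_TAGS = {"w", "num", "orig", "g"}


def iip_id_from_filename(filename):
    """Drop the directory and the .xml extension: 'idum0003a.xml' -> 'idum0003a'."""
    return Path(filename).stem


def add_segment_ids(root, iip_id, width=3):
    """(Re)number segment elements in the segmented div, in place.

    Existing xml:id values on those elements are overwritten, so the
    function can be rerun after hand corrections. Returns the number of
    elements that received an ID.
    """
    count = 0
    for div in root.iter(f"{{{TEI_NS}}}div"):
        if (div.get("type") != "edition"
                or div.get("subtype") != "transcription_segmented"):
            continue
        for el in div.iter():
            # Skip the div itself and any non-element nodes.
            if el is div or not isinstance(el.tag, str):
                continue
            local_name = el.tag.rsplit("}", 1)[-1]
            if local_name in SEGMENT_TAGS:
                count += 1
                el.set(XML_ID, f"{iip_id}-{count:0{width}d}")
    return count
```

To use it on a file, one could parse with `ET.parse(path)`, call `add_segment_ids(tree.getroot(), iip_id_from_filename(path))`, and write the tree back out. Note that ElementTree's default parser drops comments and does not preserve the original formatting, so a pipeline that needs to keep files byte-stable would probably want a different XML library.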

atbradley commented 3 years ago

There's a script here that does this. There's sample output here.

My thinking at this point is this can be part of a single command-line tool that handles all the NLP tasks.

atbradley commented 3 years ago

There are now three separate scripts in my fork of the iip-texts at https://github.com/atbradley/iip-texts/tree/atb-dev/scripts/word-segmentation. I've reorganized some code to make it easier to reuse these scripts as components of a larger tool.

emylonas commented 3 years ago

@emylonas needs to check this and then close.