Divergent-Discourses / TibNorm

Normalising Tibetan Text

Paragraph identification #12

Open ykyogoku opened 9 months ago

ykyogoku commented 9 months ago

"Since we would like to keep paragraphs but get rid of line breaks, we need to find a way of identifying paragraphs which is not possible from the text itself. One approach was to annotate paragraphs during the layout analysis. Which can probably be achieved easily.

However, the txt export does not contain any layout annotations, and both line breaks and paragraph breaks are indicated by the same mark.

One solution, it appears, would be to extract the text of each paragraph from the XML output. Transkribus offers both PAGE XML and ALTO XML. In ALTO, all text regions are marked with the element “TextBlock”. This includes real paragraphs but also page numbers, captions, headings, etc. However, we would probably like the text in each of these regions to be grouped into one paragraph without line breaks.

In PAGE, all text regions carry the same generic tag, but each region is further defined as “paragraph”, “page number”, etc., the way we tagged it."
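A minimal sketch of the PAGE-based idea, not the project's actual script: it reads a PAGE XML string, joins the lines of each text region into a single string, and returns the region type alongside it. The function names `region_type` and `page_regions`, the `joiner` parameter, and the inline `SAMPLE` document are hypothetical. It also assumes that the region type sits either in the `type` attribute of `TextRegion` or in a `custom` attribute of the form `structure {type:paragraph;}`, which should be checked against a real Transkribus export. Joining lines with an empty string is likewise an assumption about how Tibetan lines should be merged.

```python
import re
import xml.etree.ElementTree as ET

# Assumed form of the structure tag inside the `custom` attribute,
# e.g. custom="readingOrder {index:0;} structure {type:paragraph;}"
_STRUCT_TYPE = re.compile(r"structure\s*\{[^}]*?type:([^;}]+)")


def region_type(region):
    """Return the region type from `type`, else from `custom`, else 'unknown'."""
    explicit = region.get("type")
    if explicit:
        return explicit
    match = _STRUCT_TYPE.search(region.get("custom", ""))
    return match.group(1).strip() if match else "unknown"


def page_regions(xml_text, joiner=""):
    """Return a list of (region_type, text) with line breaks removed per region."""
    root = ET.fromstring(xml_text)
    regions = []
    # "{*}" matches any namespace (Python 3.8+), so the PAGE schema version
    # does not need to be hard-coded.
    for region in root.findall(".//{*}TextRegion"):
        lines = []
        for line in region.findall("{*}TextLine"):
            unicode_el = line.find("{*}TextEquiv/{*}Unicode")
            if unicode_el is not None and unicode_el.text:
                lines.append(unicode_el.text.strip())
        if lines:
            regions.append((region_type(region), joiner.join(lines)))
    return regions


# Hypothetical two-region page used only for illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r1" custom="readingOrder {index:0;} structure {type:paragraph;}">
      <TextLine><TextEquiv><Unicode>བཀྲ་ཤིས་</Unicode></TextEquiv></TextLine>
      <TextLine><TextEquiv><Unicode>བདེ་ལེགས།</Unicode></TextEquiv></TextLine>
    </TextRegion>
    <TextRegion id="r2" type="page-number">
      <TextLine><TextEquiv><Unicode>༡༢</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    for kind, text in page_regions(SAMPLE):
        print(kind, text)
```

To keep only real paragraphs, the result could be filtered on the first tuple element (for example `kind == "paragraph"`) and each remaining text written on its own line, so that a line break in the txt output always means a paragraph break.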

ykyogoku commented 8 months ago

Integrate James's script into this project.