Collect multi-sign spans of damage markers.

rillian commented 5 years ago

In #5 I added code to convert ATF damage marks like šu2# to ⸢šu2⸣, using Unicode half-brackets to match the typical rendering.

@willismonroe pointed out two problems with this.

1. There should be a single set of half-brackets for a sequence of damaged signs, i.e. ⸢šu-ru⸣-ub-ti, not ⸢šu⸣-⸢ru⸣-ub-ti as the current code produces.
2. It would be better to use TEI markup to describe the damage. That means the viewer needs rendering specific to cuneiform, but it makes clear to others what the markup intends.
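
For the first point, a rough sketch of the grouping might look like the following. This is not the code from #5: the function name `bracket_damage` is made up for illustration, and it assumes signs within a word are joined by hyphens and that `#` is the only flag present.

```python
from itertools import groupby


def bracket_damage(word: str) -> str:
    """Wrap each run of consecutive '#'-flagged signs in one pair of half-brackets.

    Hypothetical helper: assumes hyphen-separated signs and no flags other than '#'.
    """
    signs = word.split('-')
    parts = []
    # groupby yields maximal runs of signs sharing the same damaged/undamaged state.
    for damaged, run in groupby(signs, key=lambda s: s.endswith('#')):
        text = '-'.join(s.rstrip('#') for s in run)
        parts.append(f'⸢{text}⸣' if damaged else text)
    return '-'.join(parts)


print(bracket_damage('šu#-ru#-ub-ti'))  # ⸢šu-ru⸣-ub-ti
print(bracket_damage('šu2#'))           # ⸢šu2⸣
```

Grouping on the damage state first and stripping the flag afterwards keeps the hyphen between two damaged signs inside the brackets.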

So we should accumulate damaged sequences, just as we need to do with logograms, and use <damage> or <damageSpan> elements to describe them in the converted output. Looking at the TEI documentation for this, it seems like <damage><unclear> is what we want to represent half-brackets, at least where it's possible to use a hierarchical tag. Similarly, we could use <damage><supplied> for full-bracket restorations, and <damage><gap> for x or ... full-bracket sections.
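
As a sketch of what the hierarchical form could look like, here is the same grouping emitting `<damage><unclear>` with the standard library's `xml.etree.ElementTree`. The function name `word_to_tei` and the choice of `<w>` as the wrapping element are assumptions for illustration, not what the converter currently does, and it again handles only the `#` flag.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def _append_text(parent: ET.Element, text: str) -> None:
    """Append text after the last child of parent, or to parent.text if it has none."""
    if len(parent):
        parent[-1].tail = (parent[-1].tail or '') + text
    else:
        parent.text = (parent.text or '') + text


def word_to_tei(word: str) -> ET.Element:
    """Hypothetical helper: one <damage><unclear> per run of '#'-flagged signs."""
    w = ET.Element('w')  # assumed wrapper element for a single word
    runs = groupby(word.split('-'), key=lambda s: s.endswith('#'))
    for i, (damaged, run) in enumerate(runs):
        text = '-'.join(s.rstrip('#') for s in run)
        if i > 0:
            # The hyphen between runs stays outside the damage element.
            _append_text(w, '-')
        if damaged:
            damage = ET.SubElement(w, 'damage')
            ET.SubElement(damage, 'unclear').text = text
        else:
            _append_text(w, text)
    return w


print(ET.tostring(word_to_tei('šu#-ru#-ub-ti'), encoding='unicode'))
# <w><damage><unclear>šu-ru</unclear></damage>-ub-ti</w>
```

A span that crosses word or line boundaries would not fit this nesting, which is presumably where the milestone-style `<damageSpan>` comes in.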

Oracc uses <damageSpan> for half-brackets and <anchor type="breakStart"/>...<anchor type="breakEnd"/> for full-bracket sections.

rillian commented 5 years ago

We also need to handle interaction with other decorations, like ina#?. The current code misses this.
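
One possible approach is to split all trailing flags off a sign before deciding whether it is damaged, instead of checking only the last character. The sketch below is an assumption about how that could work, not existing code: `split_flags` and `Sign` are hypothetical names, and the flag set `# ? ! *` is my understanding of the ATF flags, so it should be checked against the format documentation.

```python
import re
from typing import NamedTuple

# Sign text followed by any run of trailing flag characters (assumed set: # ? ! *).
FLAGGED = re.compile(r'(?P<sign>.*?)(?P<flags>[#?!*]*)')


class Sign(NamedTuple):
    text: str
    damaged: bool
    uncertain: bool


def split_flags(token: str) -> Sign:
    """Hypothetical helper: separate a sign from its trailing flags, in any order."""
    match = FLAGGED.fullmatch(token)
    if match is None:
        raise ValueError(f'cannot parse token: {token!r}')
    flags = match.group('flags')
    return Sign(match.group('sign'), '#' in flags, '?' in flags)


print(split_flags('ina#?'))  # Sign(text='ina', damaged=True, uncertain=True)
print(split_flags('šu2#'))   # Sign(text='šu2', damaged=True, uncertain=False)
```

Run grouping could then key on `damaged` rather than on a trailing `#`, so that ina#? still joins a neighbouring damaged span.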

willismonroe commented 5 years ago

What does "#?" refer to? Is it a hypothetical restoration of a damaged sign? I guess that would be the only real option; it's not questioning whether or not the sign is in fact damaged.

rillian commented 5 years ago

Damaged and uncertain reading, I guess? It's quite common in the corpus though.

cdli-gh / atf2tei

Collect multi-sign spans of damage markers. #6