Split akṣaras, placeholders - use markup instead?

I'm having second thoughts about our system of using the left and right ceiling characters ⌈ and ⌉ as placeholders for split-off parts of akṣaras before or after an interruption (old EGD §4.1.4). My main problem with it is that it is idiosyncratic and complicated, especially when unclearness comes into play on one half or the other. It is a bit difficult even for me, who designed it, and it will not be at all transparent to end users. We can of course hope to establish a new convention with it, but is there really a need for that? What I propose instead is the following.

On the transliteration level, we ignore the phenomenon. If the intervening feature is present in the transliteration (e.g. line break or a _ for space), then we transliterate the entire akṣara on the side where the original has the consonant body (or, in case of a split consonant body, pick a side arbitrarily).
On the encoding level, we wrap the transliterated akṣara in <seg type="aksara" part="I"> if it's transliterated before the break, and <seg type="aksara" part="F"> if it's transliterated after the break. This would signify that only the initial/final part of the akṣara is at the place where the transliterated characters have been placed. By implication, the other part is on the other side of the interruption, but nothing is explicitly encoded there.
- If unclearness (unclarity??) is involved, we simply put <unclear> around whichever transliterated characters are affected. We do not need the complicated present system where the unclear tags must in some cases be added to the placeholder ceilings too.
On the display level, we either do nothing (the rationale being that any display would be quite opaque to the end user), or we come up with an ad hoc solution, e.g. coloured text and a tooltip, or auto-display a ceiling character (plus tooltip) for <seg type="aksara"> with @part (if part="F", then a left ceiling ⌈ to the segment's left; if part="I", then a right ceiling ⌉ to the segment's right).
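To make the encoding and display levels more concrete, here is a minimal Python sketch of the auto-display option. It is an illustration only, not our actual display pipeline. Assumptions: the TEI namespace is omitted for brevity, the sample word and the function names `render` and `render_child` are made up, "left ceiling" is taken to mean U+2308 and "right ceiling" U+2309, and showing <unclear> as parentheses is just a stand-in for whatever the real display does.

```python
import xml.etree.ElementTree as ET

LEFT_CEILING = "\u2308"   # Unicode LEFT CEILING
RIGHT_CEILING = "\u2309"  # Unicode RIGHT CEILING


def render(elem):
    """Flatten an element to display text, marking split aksaras."""
    out = [elem.text or ""]
    for child in elem:
        out.append(render_child(child))
        out.append(child.tail or "")
    return "".join(out)


def render_child(child):
    """Render one child element (hypothetical display rules)."""
    if child.tag == "lb":
        return "\n"
    inner = render(child)
    if child.tag == "unclear":
        # Assumed display convention: unclear text in parentheses.
        return "(" + inner + ")"
    if child.tag == "seg" and child.get("type") == "aksara":
        part = child.get("part")
        if part == "F":
            # Only the final part is here; the rest precedes the break.
            return LEFT_CEILING + inner
        if part == "I":
            # Only the initial part is here; the rest follows the break.
            return inner + RIGHT_CEILING
    return inner


# Hypothetical sample: the aksara "de" is split across a line break and
# transliterated after it, with the whole aksara marked as unclear.
SAMPLE = (
    '<ab>mahā<lb n="2" break="no"/>'
    '<seg type="aksara" part="F"><unclear>de</unclear></seg>va</ab>'
)

print(render(ET.fromstring(SAMPLE)))
```

Note that <unclear> simply nests inside the <seg> like any other content; nothing has to be encoded on the other side of the break.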

Switching to this solution would add some extra code to our texts where unclear is not involved, but it would not increase the amount of code (and would reduce its complexity) when unclear is mixed in. In addition to the slight gain in overall code simplicity, it would be a solution that harmonises better with the rest of our encoding: while the ceilings are completely idiosyncratic, we do already have uses for <seg type="aksara"> and for @part. As an added advantage, we could then generalise this solution to the kind of case covered in #277 (an akṣara randomly interrupted by a gridlike milestone such as a crack, a situation similar, but not identical, to akṣaras deliberately split up into parts by the scribe), and perhaps also to situations where part of an akṣara is lost in a lacuna, where our recommendations for tagging akṣara components (which I've always found too complex to be really useful) could be replaced by part = I / M / F. (This would need further thought, but I think it is workable, with a noticeable gain in simplicity and very little loss in the accuracy of our encoding.)

I don't think there is a very large number of already encoded texts in our corpus with these ceiling characters present. So, if @arlogriffiths and @manufrancis approve of the new solution, then @michaelnmmeyer could generate for me a list of files that contain either of these special characters, and I could change the code to the new system - contacting the file's original encoder if and only if I cannot make sense of how s/he used the ceiling characters.
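For what it's worth, the list of affected files could be produced with something as simple as the sketch below. This is only a suggestion, not necessarily how the list would actually be generated; the function name `files_with_ceilings` and the `*.xml` pattern are assumptions.

```python
from pathlib import Path

# The two placeholder characters: LEFT CEILING and RIGHT CEILING.
CEILINGS = ("\u2308", "\u2309")


def files_with_ceilings(root, pattern="*.xml"):
    """Return (path, counts) pairs for files containing a ceiling character."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        counts = {ch: text.count(ch) for ch in CEILINGS}
        if any(counts.values()):
            hits.append((path, counts))
    return hits
```

Calling `files_with_ceilings(...)` on a checkout of the texts returns each matching file together with per-character counts. One limitation: this only finds the literal characters, so a ceiling entered as a numeric character reference would be missed.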

erc-dharma / project-documentation, issue #336