erc-dharma / project-documentation

DHARMA Project Documentation
Creative Commons Attribution 4.0 International

Split akṣaras, placeholders - use markup instead? #336

Open danbalogh opened 2 weeks ago

danbalogh commented 2 weeks ago

I'm having second thoughts about our system of using the left and right ceiling characters ⌈ and ⌉ as placeholders for split-off parts of akṣaras before or after an interruption (old EGD §4.1.4). My main problem with it is that it is idiosyncratic and complicated, especially when unclear readings come into play on one half or the other. It is a bit difficult even for me, who designed it, and it will not be at all transparent to end users. We can of course hope to establish a new convention with it, but is there really a need for that? What I propose instead is the following.

Switching to this solution would add some extra code to our texts where unclear is not involved, but it would not increase the amount of code (and would reduce its complexity) where unclear is mixed in. In addition to the slight gain in overall code simplicity, it would be a solution that harmonises better with the rest of our encoding: while the ceilings are completely idiosyncratic, we do already have uses for <seg type="aksara"> and for @part. As an added advantage, we could then generalise this solution to the kind of case covered in #277 (an akṣara randomly interrupted by a gridlike milestone such as a crack, a situation similar, but not identical, to akṣaras deliberately split up into parts by the scribe). We could perhaps also extend it to situations where part of an akṣara is lost in a lacuna, where our recommendations for tagging akṣara components (which I've always found too complex to be really useful) could be replaced by @part with the values I / M / F. (This would need further thought, but I think it is workable, with a noticeable gain in simplicity and very little loss in the accuracy of our encoding.)

I don't think there is a very large number of already encoded texts in our corpus with these ceiling characters present. So, if @arlogriffiths and @manufrancis approve of the new solution, then @michaelnmmeyer could generate for me a list of files that contain either of these special characters, and I could change the code to the new system - contacting the file's original encoder if and only if I cannot make sense of how s/he used the ceiling characters.
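For the file list, something along these lines might be enough. This is only a minimal sketch: the function name `files_with_ceilings`, the `*.xml` glob and the assumption that the files are UTF-8 are mine, not a description of any existing DHARMA tooling.

```python
from pathlib import Path

LEFT_CEILING = "\u2308"   # ⌈
RIGHT_CEILING = "\u2309"  # ⌉


def files_with_ceilings(root, pattern="*.xml"):
    """Return (path, left_count, right_count) for each file under `root`
    that contains at least one ceiling character.

    `pattern` is an assumed file-name glob; adjust it to the repository layout.
    """
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        # Assumes UTF-8; undecodable bytes are replaced rather than raising.
        text = path.read_text(encoding="utf-8", errors="replace")
        left = text.count(LEFT_CEILING)
        right = text.count(RIGHT_CEILING)
        if left or right:
            hits.append((path, left, right))
    return hits
```

Calling `files_with_ceilings("path/to/texts")` on a local checkout would return one tuple per affected file, and the two counts give a rough idea of how much recoding each file needs.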