erc-dharma / tfc-nusantara-epigraphy

DHARMA project task force C, Nusantara epigraphic corpus
https://dharma.hypotheses.org/
Creative Commons Attribution 4.0 International

supplying low-level punctuation #69

Open arlogriffiths opened 1 month ago

arlogriffiths commented 1 month ago

I believe we have lots of files with <supplied reason="omitted">,</supplied> to insert low-level punctuation signs into our editions where we think the textual structure requires them though they are not engraved.

Having looked at some of @danbalogh's encodings, I think what we mean actually ought to be represented with <supplied reason="subaudible">,</supplied>.

@danbalogh : can you confirm? (Never mind our use of a typed comma which you may find non-compliant with EGD or TG rules.) @michaelnmmeyer : if Dan confirms, can you batch replace all our cases of <supplied reason="omitted">,</supplied>?

danbalogh commented 1 month ago

The general rule (EGD §6.3.6) is indeed that editorial punctuation should be inserted using <supplied reason="subaudible">. BUT the only permitted contents of this is . full stop, so I can't comply with "never mind". The comma is not a legitimate character in our transliteration scheme, but shorthand for a particular kind of grapheme encoded with <g>. Supplied punctuation is interpretive, thus you cannot supply graphemes. As for levels of punctuation, after much discussion a long time ago we have agreed not to distinguish them, and this applies both to the encoding of symbols with <g> and to the supplying of editorial punctuation. So, if you want a batch replace, it should be to <supplied reason="subaudible">.</supplied>.
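For whoever ends up running the batch replace, here is a minimal Python sketch of the substitution described above. The function name and the sample string are hypothetical, and it assumes the legacy pattern always appears in exactly this attribute form (double quotes, no extra attributes); real files may need a more careful XML-aware pass.

```python
import re

# Legacy encoding to be replaced; whitespace around the comma is tolerated.
OMITTED_COMMA = re.compile(r'<supplied\s+reason="omitted">\s*,\s*</supplied>')
# Target encoding named in the comment above.
SUBAUDIBLE_STOP = '<supplied reason="subaudible">.</supplied>'


def replace_supplied_commas(xml_text: str) -> tuple[str, int]:
    """Return the rewritten text and the number of replacements made."""
    return OMITTED_COMMA.subn(SUBAUDIBLE_STOP, xml_text)


# Hypothetical sample: one supplied comma and one unrelated supplied syllable.
sample = (
    'svasti<supplied reason="omitted">,</supplied> śrī '
    '<supplied reason="omitted">ka</supplied>'
)
fixed, count = replace_supplied_commas(sample)
print(count)
print(fixed)
```

Only supplied elements whose sole content is a comma are touched, so other uses of <supplied reason="omitted"> stay as they are.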

arlogriffiths commented 1 month ago

The kinds of cases we're dealing with are inscriptions with dozens or hundreds of low-level punctuation signs that look like a median dot or apostrophe, depending on the inscription. There is no doubt that if a scribe would have thought of using a punctuation sign where we feel it helps to read one, then the same symbol would have been used. Nevertheless I feel it is incorrect to claim that the scribe omitted it, because consistency on such points was not a value in his culture as it is in ours. Hence I don't understand why <supplied reason="subaudible">,</supplied> as shorthand for <supplied reason="subaudible"><g type="comma">.</g></supplied> should not be allowed. I don't understand the logic behind your sentence: "Supplied punctuation is interpretive, thus you cannot supply graphemes". I really cannot accept seeing ⟨.⟩ in our editions as I myself don't understand your logic and cannot expect any of our readers will do so more than I do myself. In a text richly punctuated with <g type="comma">.</g> displayed as , any reader will be more confused than helped if supplied punctuation of the same level is not shown as ⟨,⟩.

danbalogh commented 1 month ago

I'll try to put this more clearly, which means I'll have to be more verbose. Our current guidelines, which you and I must have agreed on in or before August 2020, say that "although many earlier editors supply two levels of punctuation (daṇḍa and double daṇḍa), our practice shall be to use only one kind of supplied punctuation". This one kind is <supplied reason="subaudible">.</supplied>. The use of a comma in this context is not permitted by the current EGD. I am unable to find any record of a discussion between you and me on this, but the section "Revisiting Punctuation Encoding (09-03-2020)" in the Leftovers has some background from before we revised symbol encoding in general, and supplied punctuation has many ties to that.

Anyway, let me try and explain.

In our editions (as representations/models of the real world), symbols feature in three ways: as TEI encoding, as shorthand for the TEI encoding, and as a display visualisation. The encoding is primary; the shorthand is a human-friendly substitute for encoding that is meant to be converted to encoding, while the display is a rendering of the encoding according to transformation rules, which are subject to change. In our texts (the real world), symbols have a physical shape and (presumably) a function. The shape is what I referred to above as a grapheme, which was inaccurate. "Graph" may perhaps be the correct term, or we can stick to glyph. In many subcorpora, there is a fairly consistent association between shapes and functions (e.g. "punctuation signs that look like a median dot or apostrophe" always mean low-level punctuation, and low-level punctuation is always denoted by such signs), but this is not necessarily the case within all subcorpora, and definitely not the case across the entirety of the DHARMA corpus. It is thus necessary to distinguish shape from function.

When rendering the real texts into an edition, our present approach to symbols is to encode the shape of every symbol using attributes on the <g> element (since the shape is a physical characteristic of the actual inscribed glyph), and to encode the function of every symbol using the content of that element. The function may, in our classification ("ontology"), be punctuation (encoded by putting a full stop in the <g> tag), space-filling (encoded by putting one or more § signs in the <g> tag) or "other" (encoded by an empty <g> tag). A finer classification could have been devised, but we decided not to do so, to a large part because any finer classification will come with a greatly increased number of fuzzy cases where it is difficult or impossible to decide on the classification of any particular real-world instance. Thus, we only have a "punctuation" function and no separate "low-level punctuation" and "high-level punctuation" categories. (Actually, we do distinguish two levels of punctuation if and only if a digital edition is encoded from a printed edition without access to visual material, and that printed edition uses two levels of punctuation such as single and double daṇḍas. This is at the end of EGD §4.2.4, and I only mention this for the sake of completeness; it need not complicate the present discussion.)
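The shape/function split described above can be illustrated with a small Python sketch. It assumes the shape lives in the type attribute (the only attribute shown in this thread), ignores the printed-edition exception mentioned at the end of the paragraph, and uses a made-up type value "exampleShape" purely for illustration; the function name is hypothetical too.

```python
import xml.etree.ElementTree as ET


def describe_symbol(g_xml: str) -> tuple[str, str]:
    """Return (shape, function) for a single <g> element given as a string."""
    g = ET.fromstring(g_xml)
    # Shape: read from the attribute (assumption: only @type is consulted).
    shape = g.get("type", "unspecified")
    # Function: read from the element content.
    content = (g.text or "").strip()
    if content == ".":
        function = "punctuation"
    elif content and set(content) == {"§"}:
        function = "space-filler"
    elif content == "":
        function = "other"
    else:
        function = "unrecognised"
    return shape, function


print(describe_symbol('<g type="comma">.</g>'))
print(describe_symbol('<g type="comma"/>'))
print(describe_symbol('<g type="exampleShape">§§</g>'))  # hypothetical type value
```

The same shape ("comma") comes back with two different functions in the first two calls, which is the point of keeping the two dimensions apart.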

What this means is that when you encode an actual punctuation mark in your texts using a comma, then

If that is clear so far, then we can now proceed to supplied punctuation. When I said "supplied punctuation is interpretive", I meant that the supplying of punctuation is an act of interpretation: what I as editor express by adding a punctuation mark at some point is that "I interpret the text in such a way that there is a semantic break here", and not that "I believe the scribe should have inscribed a particular symbol here, but forgot to". This is what we express by using subaudible rather than omitted. I think you agree with me on this, since you express similar thoughts above. But this means that you are supplying an abstract idea, that of a semantic break, and not a concrete glyph. Since we use <g> to encode concrete glyphs, <g> (regardless of its attributes and content) should never be the content of <supplied reason="subaudible">. When you are encoding received symbols, there is only one kind of abstract punctuation in our ontology, which can be expressed by any glyph. It would therefore be inconsistent to distinguish two (or more) kinds of supplied punctuation. What would be even worse, however, is to do what you seem to be arguing for: essentially to hack the system, exploiting a) the incidental resemblance of your particular glyphs to a modern comma; and b) the likewise incidental display implementation which renders glyphs of a particular shape as commas - to achieve something that is intuitively transparent for your target audience.
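The constraint stated above, that <g> should never sit inside <supplied reason="subaudible">, could be checked mechanically. The following is only a sketch under assumptions: the function names are hypothetical, the input is a well-formed XML fragment, and namespaces are handled by comparing local names only.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def glyphs_in_subaudible(xml_text: str) -> list[str]:
    """List the type of every <g> found inside <supplied reason="subaudible">."""
    root = ET.fromstring(xml_text)
    found = []
    for el in root.iter():
        if local_name(el.tag) == "supplied" and el.get("reason") == "subaudible":
            for child in el.iter():
                if local_name(child.tag) == "g":
                    found.append(child.get("type", "unspecified"))
    return found


# Hypothetical fragment: the second supplied element violates the rule,
# the third does not because its reason is "omitted".
sample = (
    '<ab>a<supplied reason="subaudible">.</supplied> b'
    '<supplied reason="subaudible"><g type="comma">.</g></supplied> c'
    '<supplied reason="omitted"><g type="comma">.</g></supplied></ab>'
)
violations = glyphs_in_subaudible(sample)
print(violations)
```

An empty list would mean the fragment respects the rule as currently formulated.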

And that, I think, is an important point here: you are approaching this in terms of presentation and intuitive readability, whereas I am approaching it in terms of modelling, ontology and encoding. I don't mean to say that presentation and intuitive readability are unimportant, but tweaking our encoding practice to achieve a certain look is definitely not a good idea. It is of course possible to revise the encoding practice to achieve the desired results, but that is a different matter, and it comes with its own set of complications. For this, it would be useful if you thought in broader terms about what you want to achieve here. Do you want to permit only <supplied reason="subaudible"><g type="comma">.</g></supplied> but no other kinds of <g> in supplied subaudible? Would this be optional throughout DHARMA or mandatory in some cases? What would be the rationale for calling <g> (a graph/glyph) "subaudible" (i.e. supplied for editorial interpretation)? How would you define the scope of texts for which it is recommended or mandatory? Would it be based on the graphic appearance of the symbols? Some inscriptions use symbols best encoded as . for sentence-level punctuation. Would the encoders of such inscriptions then have to use <supplied reason="subaudible"><g type="comma">.</g></supplied> instead of <supplied reason="subaudible">.</supplied>, and to recheck already encoded inscriptions and change them to this? Or would it be based on punctuation level? In that case we'd be confounding symbol appearance (<g type="comma">) with function. Or a combination of shape and punctuation level? That looks even more like special pleading, the application of which may be straightforward in your texts, but not project-wide. Any of these options would be what I see as hacking the system. Or do you want to revise the entire system of symbol encoding from the bottom up, removing the rigorous distinction we now have between appearance and function? I really hope not... 
Or do you want to expand the system by introducing two levels of punctuation? We could perhaps do that (either by using the TEI tag <pc>, which we considered and rejected, or by returning to the use of "." and ".." for lower- and higher-level punctuation, which we entertained for a long time but finally retained only for the above-mentioned case of working from a printed edition). But whatever we do to introduce punctuation levels into the scheme, I'm sure it would both require rechecking many already encoded texts and involve many fuzzy cases where an arbitrary decision will have to be made. So I would not be happy to do so at this stage.

Now to come back to your "I really cannot accept seeing ⟨.⟩ in our editions as I myself don't understand your logic and cannot expect any of our readers will do so more than I do myself. In a text richly punctuated with <g type="comma">.</g> displayed as , any reader will be more confused than helped if supplied punctuation of the same level is not shown as ⟨,⟩."

I hope that you do understand my logic now. I would be happier if you did not call it "my logic", because it's the inherent logic of an encoding scheme devised with your contribution and approval. But more importantly, I see a couple of ways out. For one thing, the confusion - in your understanding too, as expressed here - arises from the similarities between display implementation, original glyph shape and the function of the modern comma. But as I stressed above, these similarities are purely incidental.

So one way out could be to change the display of <g type="comma"/> (regardless of whether it contains a . or is empty, i.e. whether the editor interprets it as punctuation or not) to something else (never mind what; I'm sure we could find something; let's call it symbol X for now). That way, it would be clear to readers that symbol X in our displayed editions represents a glyph of a particular shape, while ⟨.⟩ in the same displays represents editorial punctuation, regardless of punctuation level and without implying any particular glyph. I assume you would not be happy to do so, because using a comma is established practice in SEA studies (I assume), but I do think you should seriously consider this solution, because it is the one that best fits the overall picture and requires the least change in current practice, even if it produces a display that is not traditional in your neck of the woods.

Another way out would be to revert to <supplied reason="omitted"> in such cases. I know that I did this in Siddham, and it may have been written up in some earlier version of the EGD that we have since then discarded: if a particular inscription employs particular punctuation marks with fair consistency (judged at the editor's discretion), then and only then the absence of a punctuation mark at a point where the editor would expect one is, after all, to be interpreted as scribal omission and encoded as such. This would permit any encoder to use any kind of <g> (any type and any content) in <supplied reason="omitted"> (NB: not subaudible).

There is also a third option, which is to introduce - as an option - multiple levels of editorial punctuation. That is to say, in addition to <supplied reason="subaudible">.</supplied>, we could allow commas (and once we are there, perhaps even question marks, exclamation marks, colons, semicolons and quotation marks) as the contents of this tag. All of these would be displayed in angle brackets (and I presume hidden in physical display), so they would remain clearly editorial. These would strictly not be tagged with <g>, because we aren't using them to encode a graph, but to encode an editorial interpretation. The sole problem I see with this is that we display <g type="comma"/> as commas, and the intuitive implication would then be that the editor is supplying a particular graph when in fact they are supplying a particular kind of semantic break. But this would be more of an advantage than a problem in your texts, and I do not foresee much confusion arising from it elsewhere.

Finally, I see an alternative way to implement either the second or the third of the above solutions. We could decide to discard <supplied reason="subaudible"> for punctuation, and instead use <reg>. At the moment, we never use <reg> outside of <choice> with <orig>, but as far as I understand, the use of <reg> is perfectly sanctioned by TEI for supplying punctuation. In fact, it seems to me to be a better choice than <supplied reason="subaudible">, which we have adopted from EpiDoc usage, but which was in my opinion meant in TEI for transcriptions of audio, and not for written texts. We could do the same for supplied avagrahas, i.e. change from supplied subaudible to reg. The display of reg already involves angle brackets (plus blue colour), and we could keep it the same. So, if you like this, we could go for reg instead of supplied omitted in the second option, or instead of supplied subaudible in the third. All in all, I think I like this last best. We would then get rid of the cheesy <supplied reason="subaudible"> in our encoding, permit many optional kinds of editorial punctuation in our files, display them in a way we both like, and get all this with a fairly simple revision of the EGD and a straightforward automated replace of existing code in our files. Let me know what you think.
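To give an idea of what the "straightforward automated replace" for the <reg> variant of the third option might look like, here is a rough Python sketch. Nothing in it is decided policy: the set of permitted characters is an assumption based on the list above (quotation marks are left out for simplicity), the function name is hypothetical, and supplied avagrahas or any other subaudible content are deliberately left untouched.

```python
import re

# Assumed set of editorial punctuation characters (not an agreed list).
EDITORIAL_PUNCTUATION = ".,?!:;"

# A subaudible supplied element whose only content is one such character.
SUBAUDIBLE_PUNCT = re.compile(
    r'<supplied\s+reason="subaudible">\s*(['
    + re.escape(EDITORIAL_PUNCTUATION)
    + r'])\s*</supplied>'
)


def supplied_to_reg(xml_text: str) -> str:
    """Rewrite supplied subaudible punctuation as a bare <reg> element."""
    return SUBAUDIBLE_PUNCT.sub(r"<reg>\1</reg>", xml_text)


# Hypothetical sample: two punctuation cases and one non-punctuation case.
sample = (
    'a<supplied reason="subaudible">.</supplied> '
    'b<supplied reason="subaudible">,</supplied> '
    'c<supplied reason="subaudible">a</supplied>'
)
converted = supplied_to_reg(sample)
print(converted)
```

A regular expression is enough here only because the pattern is so uniform; anything with nested markup would be better handled with a proper XML parser.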

danbalogh commented 1 month ago

I seem to have killed this discussion with my long reply. That was not my intention. At any rate, if we pick it up again, I want to note that regardless of which (if any) of the above solutions we finally adopt, I think it would make sense to change the display of <g type="comma"/> to something other than a comma. I have revised a Vākāṭaka plate from Siddham to DHARMA standards (for the article on Prinsep and modern epigraphic methods in JRAS). In this text, punctuation marks generally follow a scheme where double full-length daṇḍas mark the ends of major sections, single half-length daṇḍas are sort of sentence-level, though not quite consistent, and short dashes are for separations below the sentence level. I find it very annoying in display that the (typically sentence-level) half-daṇḍas appear as commas. So it's a concrete illustration of what I mentioned on a purely theoretical level above.