marking words for lexicographical harvesting

arlogriffiths commented 3 years ago

Dear Axelle and Daniel,

While encoding Old Javanese inscriptions (or other texts), several of us have the habit of using * as a prefix before words on which we are able to make some new lexicographical observations (whether the word is entirely unrecorded in the standard dictionary, or the specific derived form is unrecorded, or the dictionary definition of the meaning needs to be extended/revised). Just now Kunthea has asked me what to do when she encounters such cases in the Khmer corpus, and I advised her to use the asterisk.

But perhaps we can use proper TEI encoding to identify such cases? What about a <w> tag with @type="lexical_novelty" or something like that? Although sandhi is basically absent in Old Khmer, and not extremely common in Old Javanese, there may still be cases where word boundaries are rendered invisible due to sandhi and where use of <w> would require an arbitrary decision on the part of the encoder. Here is a fictive example:

<w type="lexical_novelty">kaparǝkanyā</w>taḥ deniṅ āgama, ya hika deśadr̥ṣṭa ṅaranya

where kaparǝkanyātaḥ is the result of sandhi between kaparǝkanya and ataḥ. Maybe we could opt for empty <w> tags and encode such a case like this?

<w type="lexical_novelty" lemma="kaparǝkanya"/>kaparǝkanyātaḥ deniṅ āgama, ya hika deśadr̥ṣṭa ṅaranya

I guess @manufrancis and @AnneSchmiedchen may also have use for such a convention in the corpora they are working with, in preparation for our future Epigraphical Glossary and for other lexicographical harvesting of our data.

What do you think? Whatever decision is made should presumably be added to both encoding guides.

Best wishes,

Arlo

danbalogh commented 3 years ago

I have not devoted much thought to word tagging, but indeed we will have to deal with it sooner or later. Let me break up your question into two parts: one on how to deal with word tagging versus sandhi, and another on what approach to use for lexical novelties.

On the first issue, I have no particular objection to your proposal of empty <w> tags (or was it I who once proposed that?), but now that I think about it, it doesn't really solve the problem for us. What if you want to tag the second of two sandhi-fused words or, horribile dictu, both of them? (Keeping in mind that we or someone else may want word tagging for purposes other than lexical novelties.) In that case putting an empty tag at the beginning doesn't help, since you still have to decide whether the "beginning" of a word includes the end of the previous word or drops the actual beginning of a word. What I mean is: do we choose A) <w lemma="kaparǝkanya"/>kaparǝkany<w lemma="ataḥ"/>ātaḥ OR B) <w lemma="kaparǝkanya"/>kaparǝkanyā<w lemma="ataḥ"/>taḥ ? My opinion at the moment is that since we have to make this choice for both empty and non-empty <w> tags, we may as well stick to non-empty ones. Next, as for the choice of A or B above, my preference is for A, i.e. always truncating the former of two words fused in sandhi: <w lemma="kaparǝkanya"/>kaparǝkany<w lemma="ataḥ"/>ātaḥ. Two reasons for this choice spring to mind. 1: it's a fairly well established convention for Indian epigraphic texts, as in e.g. asy=opari, s-odaka, c=ānupālitā, etc. 2: if we ever implement automatic linking e.g. from a glossary to the encoded texts, then I think if you want to jump from the glossary to "ataḥ", it is preferable for it to take your cursor to |ātaḥ in the example rather than to ā|taḥ.
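
For concreteness, option A written with non-empty <w> tags might look roughly like the sketch below. The @lemma values come from the fictive example above; the rest of the sentence is left untagged, and nothing here is an agreed convention.

```xml
<!-- Option A, non-empty tags: the fused vowel ā is assigned to the second word -->
<w lemma="kaparǝkanya">kaparǝkany</w><w lemma="ataḥ">ātaḥ</w> deniṅ āgama, ya hika deśadr̥ṣṭa ṅaranya
```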

On the second issue, that specifically concerning lexical novelties, perhaps @ajaniak can suggest best practice for the actual markup elements and attributes to use. My intuition says that using <w> is OK, but @type is not the ideal attribute because it is too generic. Could we perhaps use @ana for this purpose, choosing a token for such cases, e.g. @ana="novel", and perhaps declaring the token in the header using <interp>?
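
A minimal sketch of how that could look, assuming the token is declared once as an <interp> with xml:id="novel". The identifier, the wording of the gloss, and where exactly the declaration lives in the file are all assumptions, not a decided convention. Since @ana is a pointer in TEI, the value would need a leading #:

```xml
<!-- Hypothetical declaration; its placement in the file is yet to be decided -->
<interp xml:id="novel">word or form of lexicographical interest (unrecorded, or requiring a revised definition)</interp>

<!-- Use in the edition text, combined with option A for sandhi -->
<w ana="#novel" lemma="kaparǝkanya">kaparǝkany</w><w lemma="ataḥ">ātaḥ</w> deniṅ āgama
```

With something along these lines, harvesting for the glossary would amount to collecting every <w> whose @ana points to #novel, while plain <w> tagging remains available for other purposes.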

arlogriffiths commented 5 months ago

I think this unfinished discussion may have some relevance when the epigraphical glossary starts to be developed, so I am reopening it.

erc-dharma / project-documentation

marking words for lexicographical harvesting #131