TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

Dictionaries: Homograph number is a property of the headword #2273

Open chr-emil opened 2 years ago

chr-emil commented 2 years ago

In the current dictionary module, a homograph number (<hom> element) can only be a direct child of the elements <entry>, <entryFree> and <dictScrap>. This is too limited. The entries below are taken from the Bokmålsordboka (Norwegian monolingual). The dictionary is digitally edited and stored in a relational database.

Branch: I gren m, f; el. II grein m, f (norrønt grein) ...

Lair: II gren n ...

Shape, stuff: I grein noun

To encode this in accordance with TEI, the <hom> element should also be allowed as a direct child of <form>. That is, on page 1264, the list "dictionaries: dictScrap entry entryFree" should become "dictionaries: dictScrap entry entryFree form". It would also be beneficial if the element <form> could be extended with a hom attribute. The example used on page 1264 represents the COBUILD dictionary tradition, which is not a universal tradition.

sydb commented 2 years ago

@chr-emil (and @laurentromary) —

Is this a request for <hom> to be a child of <form> (seems reasonable, even if I don’t do things that way) or for a new @hom of <form>, or both?

bansp commented 2 years ago

Dear Christian-Emil, <hom> isn't meant to hold a label -- it's a grouping element for entry-sized objects.

https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-hom.html

chr-emil commented 2 years ago

@sydb @laurentromary @bansp I am aware that the Guidelines suggest using <hom> to group, inside a single dictionary entry, the information for each homograph. To me this seems to be inspired by the Collins COBUILD dictionary format. In the COBUILD format there was, for example, one entry covering both comment (verb) and comment (noun). It is fine to use <hom> as the Guidelines suggest as long as there is one headword per entry. In that case the entries will follow in a nice alphabetical order.

Homographs are, to my understanding, two or more words (lexical items) with identical spelling in the selected base form (e.g. singular nominative for nouns) and identical part of speech. The latter may not hold in all definitions of homography. In some languages where the written standard allows for various spellings, for example Norwegian, the concept of a lexical item comprises the different orthographic variants. For example, 'branch' can be spelled 'gren' or 'grein' in Norwegian. Both belong to the same lexical item. There is another lexical item 'gren'. To differentiate the two 'gren' we use homograph numbers, 'I gren' and 'II gren'. Since 'gren' and 'grein' are not at the same place in alphabetical order, that solution will not work. So one cannot use the current recommendation to encode many Norwegian printed (and electronic) dictionaries. I question the usefulness of the current <hom> element.

We need a way to encode the 'I gren', e.g. <entry><form><hom>I</hom><orth>gren</orth>....</form> or maybe <entry><form><orth hom="I">gren</orth>....</form>, although the latter changes the original text.

This is why I would like to have a <hom> element as a direct child of <form>.
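The point that the attribute variant "changes the original text" can be illustrated with a minimal sketch. It assumes two hypothetical fragments without the TEI namespace: <hom> inside <form> and a hom attribute on <orth> are the proposals from this thread, not valid TEI P5, and the space between <hom> and <orth> is added here to mirror the printed 'I gren'. Stripping all tags keeps the numeral in the first variant and loses it in the second.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments (no TEI namespace). Neither <hom> inside <form>
# nor @hom on <orth> is valid TEI P5; both are proposals from the thread.
ELEMENT_VARIANT = "<entry><form><hom>I</hom> <orth>gren</orth></form></entry>"
ATTRIBUTE_VARIANT = '<entry><form><orth hom="I">gren</orth></form></entry>'


def plain_text(xml_fragment: str) -> str:
    """Return the character data that remains after removing all tags."""
    root = ET.fromstring(xml_fragment)
    return "".join(root.itertext())


print(plain_text(ELEMENT_VARIANT))    # I gren
print(plain_text(ATTRIBUTE_VARIANT))  # gren
```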

bansp commented 2 years ago

I understand that <form n="I"> is not satisfactory, in this case? It does change the text, on your assumption that attributes do not show up (which, since P5, expresses rather a tendency than a principle, @sydb correct me please if I'm wrong).

Would a solution involving <lbl> help? lbl can take a @type, so one could imagine something like

<entry><form><lbl type="hom">I</lbl><orth>gren</orth>....</form>

-- and I'm not taking a position on whether the label is part of the form (ontologically, I'd hesitate, but I can imagine an argument on that point), because you could also have the label outside of <form>, as a child of <entry>.
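As a rough sketch of why the placement question need not block processing: assuming un-namespaced fragments and the <lbl type="hom"> convention suggested above (the helper name `headword_with_hom` is made up for illustration), the same lookup can find the label whether it sits inside <form> or directly under <entry>.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Hypothetical fragments (no TEI namespace) showing both placements of the label.
LBL_IN_FORM = '<entry><form><lbl type="hom">I</lbl><orth>gren</orth></form></entry>'
LBL_IN_ENTRY = '<entry><lbl type="hom">I</lbl><form><orth>gren</orth></form></entry>'


def headword_with_hom(entry_xml: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (homograph label, headword) for a single entry fragment."""
    entry = ET.fromstring(entry_xml)
    # ".//" searches all descendants, so both placements are matched.
    label = entry.find(".//lbl[@type='hom']")
    orth = entry.findtext("./form/orth")
    return (label.text if label is not None else None, orth)


print(headword_with_hom(LBL_IN_FORM))   # ('I', 'gren')
print(headword_with_hom(LBL_IN_ENTRY))  # ('I', 'gren')
```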

In general, @chr-emil, you probably recall the work done on TEI Lex-0 (referenced above, even if indirectly) -- I wonder if you'd agree with that approach to <hom> (which, essentially, eliminates it in favour of other constructs).

chr-emil commented 2 years ago

@bansp @sydb It is of course possible to use the n attribute as indicated in the example on page 298. Originally TEI was led by the philosophy that removing the tags would leave the plain text untouched. This strategy has been abandoned. That is fine. However, if one wants to mark up a dictionary text as it stands in the printed version, it would be nice to have an element for the homograph number instead of relying on extensive use of XSL. <hom> can serve that purpose.

I remember TEI Lex-0 and have rechecked it. It is clearly more disciplined than the dictionary module. In many cases the old element name goes into the type attribute. This is fine and corresponds to suggestions in the person, places module. The snag is that the type attribute is then used for TEI-internal structural purposes. Anyhow, I will check my encoding against TEI Lex-0.

My current task is to give a 12-volume Norwegian monolingual dictionary TEI-conformant markup. The dictionary describes the written standard 'Nynorsk' and dialectal forms. It is complex. The last 8 volumes are expressed in an internal XML markup reflecting the structure of the editorial database. My group was responsible for the development of this database. It should not be too difficult to transform this into TEI. The first 2 volumes were edited in 1950-1978, and the electronic text is based on manual keying of the printed edition. The text of volumes 3 and 4 was born digital in a simple form/line-based format, sometimes shoehorning the text into the format. If I manage to do this, it will be a crash test of TEI and/or TEI Lex-0.

sydb commented 1 year ago

@chr-emil: Do you still want <hom> as a child of <form>, or have you decided to use some other method? If you still need it, we (Council) think we should put <hom> into model.formPart.

bansp commented 1 year ago

@sydb but does the Council recall that <hom> stores large objects (up to entry-sized) rather than individual numbers, and it is the latter that Christian-Emil needed a container for? :-)

ttasovac commented 1 year ago

As @bansp mentioned, TEI Lex-0 is really not a fan of <hom> :)

But I wonder, @chr-emil, whether you really need <hom> for <entry><form><hom>I</hom><orth>gren</orth>....</form> when you can do <lbl type="hom">I</lbl>. After all, homograph numbers are just a particular type of dictionary labels.

Alternatively you could also use <num type="hom">I</num>.

chr-emil commented 1 year ago

I now see that there was a discussion about this in September, @bansp and @sydb. I have not found a solution, since I have been working on other things. I repeat my issue below. I can always use 'n', but it is a little unsatisfactory. The '<hom>' element as defined in TEI P5 will never be used in Norwegian dictionaries, nor in any other dictionary I have encountered. So my view is that it can go, and we can define it as a sub-element of '<form>' and maybe add a new hom attribute.

The fundamental question is: to what degree should a TEI version of a dictionary text encode according to a lexicographical/lexicological model and tradition? For example, what is a homograph, and how is it represented in the dictionary text?

In the (original COBUILD) format a verb and a noun are defined in the same entry in different numbered senses, e.g. 'comment' (see https://www.collinsdictionary.com/dictionary/english/comment for the digital version).

In Norwegian the word 'gauk' ('cuckoo') can denote the bird (most common) or a moonshiner (almost out of use). To complicate matters slightly, in the Bokmål written standard 'gjøk' (from Danish) and 'gauk' are synonyms, but only the latter can have the meaning moonshiner. In the Bokmål dictionary there are two entries:

1) gjøk m1; el. II gauk m1 (norrønt gaukr, påvirket av dansk) .... defined as the bird or a funny person

2) I gauk m1 (samme opprinnelse som gjøk) defined as a moonshiner

In the other written standard, Nynorsk, there is one entry for gauk:

gauk m1 (norrønt gaukr, lydord)

Def 1: the bird; Def 2: things similar to a cuckoo; Def 3: a top part of the gable of a log building; Def 4: a funny person; Def 5: moonshiner

In Bokmål we find the homographs I gauk and II gauk. In Nynorsk there is only one lexical item, and the editing is closer to the COBUILD idea. However, a noun and a verb are never defined in the same entry. Here one will usually indicate homographs. For example, the verb 'tømme' (empty) and the noun 'tømme' (rein):

II tømme verb (norrønt tǿma; av II tom) ...

I tømme m1 (norrønt taumr; samme opprinnelse som I tom) ...

Some months ago I posted an issue on GitHub about the best encoding of the Roman numerals indicating homographs. As you see from the examples, a pair (homograph number, word form) can be used as a cross-reference. So the Roman numerals should somehow be linked to the headword form, perhaps as an attribute. But what about the Roman numerals in the text: which element should be used? I got some suggestions from @bansp and @sydb.
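How a (homograph number, word form) pair could work as a cross-reference key can be sketched as follows. This is only an illustration under stated assumptions: the fragment is un-namespaced, the homograph number is carried in the n attribute of <form> (one of the options discussed above), and the `id` values and the function name `build_headword_index` are made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Tuple

# Hypothetical encoding of the two Bokmål entries; ids are invented.
SAMPLE = """
<body>
  <entry id="e1">
    <form><orth>gjøk</orth></form>
    <form n="II"><orth>gauk</orth></form>
  </entry>
  <entry id="e2">
    <form n="I"><orth>gauk</orth></form>
  </entry>
</body>
""".strip()

Key = Tuple[Optional[str], str]  # (homograph number or None, headword)


def build_headword_index(xml_text: str) -> Dict[Key, Optional[str]]:
    """Map each (homograph number, headword) pair to the id of its entry."""
    index: Dict[Key, Optional[str]] = {}
    for entry in ET.fromstring(xml_text).iter("entry"):
        for form in entry.findall("form"):
            orth = form.findtext("orth")
            if orth is not None:
                index[(form.get("n"), orth)] = entry.get("id")
    return index


index = build_headword_index(SAMPLE)
print(index[("II", "gauk")])  # e1
print(index[("I", "gauk")])   # e2
```

With such an index, a reference like 'II gauk' resolves to the gjøk entry even though the two headwords sort far apart alphabetically.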

