DARIAH-ERIC / lexicalresources

Data space of the DARIAH Lexical Resources Working Group
https://dariah-eric.github.io/lexicalresources/
BSD 2-Clause "Simplified" License

Modelling related entries in case of homographs with different POS #48

Closed MedKhem closed 4 years ago

MedKhem commented 5 years ago

In the example below, modelling the entry with 2 senses, differentiated by POS, will lead us to the same issue as in #43, where we need an \<entry> inside \<sense>, which is not a valid TEI option and looks a bit weird to my eyes (and Toma's).

arrest

A possible modelling would be to enable nested entries and treat the combination of \<gramGrp> and \<sense> as an \<entry>:

<entry xml:lang="en" xml:id="arrest">
            <form>arrest /ə'rest/</form> 
            <entry xml:lang="en" xml:id="arrestVerb">
               <gramGrp>
                  <gram type="pos">verb</gram>
               </gramGrp>
               <sense>(of the police) to catch and hold someone who has broken the law The
                  police arrested two men and took them to the police station. He ended up
                  getting arrested as he tried to leave the country. She was arrested for
                  stealing</sense></entry>
            <entry xml:lang="en" xml:id="arrestNoun">
               <gramGrp>
                  <gram type="pos">noun</gram>
               </gramGrp> <sense>the act of holding someone for breaking the law
                  The police made several arrests at the demonstration</sense>
               <entry type="mwe" xml:id="underArrest" xml:lang="en">under arrest held
                  by the police After the fight, three people were under arrest</entry>
            </entry>
</entry>

This way, both of the homographs are treated equally and entries and senses (in dictionaries where lexicographers represent homographs as separate articles) could be easily mapped to constructs from the same category.

What do you think about this?

ttasovac commented 5 years ago

I think this is the way to go. I always thought it was strange to treat homographs with different parts of speech as different senses of a prototypical pos-less headword. Even though in some dictionaries they may appear like senses (i.e. they could be numbered etc.) I think this is a much cleaner way to model these.

I would go as far as saying that whenever one entry contains multiple parts of speech (à la arrest-v. and arrest-n.), we should recommend treating them as nested entries and not as senses.

bansp commented 5 years ago

I think it would be a bit arrogant to try and cut off part of the European lexicographic tradition due to some technical difficulties in a format which need not survive the coming ten years, so I trust that this is not the gist of the proposal here. After all, string identity is a pretty strong measure, and a justifiable choice for where POS identification has to allow a smaller or larger degree of arbitrariness (what's the POS of "near", please?). It is one thing to recommend choices of vocabulary (and their mapping to the local element/attribute choices) to lexicographers facing the task of retrodigitization, and something totally different to force them into micro- and macrostructural choices that they may not wish to make (or that they may not have the right to make). Cheers!

xlhrld commented 5 years ago

That's also related to #14 where the discussion is based on the Dutch achter example.

@bansp We shouldn't forget that our aim with Lex0 is primarily to provide a somewhat general baseline encoding that caters for the vast majority of lexical models. There will always remain cases where Lex0 will not suffice. In the case at hand, Lex0 still allows entry/sense/gramGrp for the entry provided by @MedKhem. We're not going to »cut« that »off«. However, with recursive entry, modelling this as entry/entry/gramGrp also becomes possible. The question rather is whether to recommend the latter. To me too, arrest, noun and arrest, verb seem (and smell and taste) like individual entries.

And then again … »What's in a TEI name?«, anyway, remember? ;) If it looks like an entry, why not call it an entry?
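For comparison with the nested encoding in the opening post, here is a minimal sketch of the entry/sense/gramGrp alternative that remains allowed. The `xml:id` values and the split into \<orth>, \<pron> and \<def> are assumptions made for illustration; the definitions are shortened from @MedKhem's example.

```xml
<entry xml:lang="en" xml:id="arrest">
   <form type="lemma">
      <orth>arrest</orth>
      <pron>ə'rest</pron>
   </form>
   <!-- each part of speech is modelled as a sense of one POS-less headword -->
   <sense xml:id="arrest.verb">
      <gramGrp>
         <gram type="pos">verb</gram>
      </gramGrp>
      <def>(of the police) to catch and hold someone who has broken the law</def>
   </sense>
   <sense xml:id="arrest.noun">
      <gramGrp>
         <gram type="pos">noun</gram>
      </gramGrp>
      <def>the act of holding someone for breaking the law</def>
   </sense>
</entry>
```

With entry/entry/gramGrp, each of the two \<sense> wrappers above would instead be an \<entry> that carries its own \<gramGrp> and \<sense>.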

ttasovac commented 5 years ago

@bansp who's "cutting off" parts of the European lexicographic tradition? I have no idea what you're talking about.

In #43 we made entry a member of sense.Part so that we can have entry wherever re used to be. And here we're talking about recommending entry/entry/gramGrp for homographic entries with different parts of speech, as in arrest noun and arrest verb (as opposed to entry/sense/gramGrp).

Anywho. This issue will remain open until we finalize the stuff we started talking about in our last meeting in Berlin (collocs, MWEs, typology of entries) and then we'll see how all these mutually related issues work together.

WGBS2 commented 5 years ago

Hi,

Not sure how I found myself on this list except for my interest in TEI Lex0. Not sure why it landed on this address either as it is not my lexicographical one.

I cannot but agree with Piotr. I have worked on dictionaries in many languages and from many different centuries, and nowhere in any lexicographical tradition has entry been a child of sense. The example given is the sort of thing that English learners’ dictionaries might play with, and it would be easily handled as a pair of related entries under entry, in the same way as I handle part-of-speech changes in French when a past participle comes in as a modifier.

I cannot see how a change made for surely technical reasons can be anything but detrimental, and one that the lexicographical community, including retrodigitisers, would find reprehensible.

But I may be speaking out of turn.

Geoffrey

On 14 Feb 2019 at 19:04, Piotr Banski notifications@github.com wrote:

I think it would be a bit arrogant to try and cut off the part of European lexicographic tradition due to some technical difficulties in a format which need not survive the coming ten years, so I trust that this is not the gist of the proposal here. After all, string identity is a pretty strong measure, and a justifiable choice for where POS identification has to allow a smaller or larger degree of arbitrariness (what's the POS of "near", please?). It is one thing to recommend choices of vocabulary (and their mapping to the local element/attribute choices) to lexicographers facing the task of retrodigitization, and something totally different to force them into micro- and macrostructural choices that they may not wish to make (or that they may not have the right to make). Cheers!


anacastrosalgado commented 5 years ago

In the printed edition of the Portuguese Academy Dictionary (2001), grammatical homonyms are separated into different entries (see below). In the first encoding, we used \<group> for that purpose.

<group>
    <gramGrp>adj.</gramGrp>
    <sense revisto="14/03/2017" novo="05/07/2016">
        <def>Relativo ou pertencente a Albânia (país da Europa).</def>
    </sense>
</group>
<group>
    <gramGrp>n. m., f.</gramGrp>
    <sense revisto="14/03/2017" novo="14/03/2017">
        <def>Natural, habitante ou cidadão da Albânia.</def>
    </sense>
</group>
<group>
    <gramGrp>n. m.</gramGrp>
    <sense revisto="14/03/2017">
        <def>Língua indo-europeia falada principalmente na Albânia.</def>
    </sense>
</group>

Now, the cases of homonymy are encoded as follows:

capital :1 kɐpitˈał
adj. m. e f. P. us. Que é relativo a cabeça. Que está relacionado com a condenação à morte. Os jurados pronunciaram-se a favor de uma sentença capital para o criminoso.
execução capital
pecado capital
ou execução
pena capital
Que é de primeira ou grande importância. essencial fundamental principal Preocupação capital. Assunto de interesse capital.
letra capital
pecado capital
Do la.
capitālis
        <entry xml:id="DACL.CAPITAL:2" xml:lang="pt">
           <form type="lemma">
              <orth>capital</orth>
              <lbl>:2</lbl>
              <pron>kɐpitˈał</pron>
           </form>
           <gramGrp>
              <pos>n.</pos>
              <gen>f.</gen>
           </gramGrp>
           <sense xml:id="DACL.CAPITAL.6" n="1">
              <def>Cidade onde está situada a sede administrativa de um país, província, região... </def>
              <cit type="example">
                 <quote>Duarte, um mês depois, era preso, interrogado, e remetido para a capital, onde a identidade da pessoa foi de muitos reconhecida.</quote>
                 <bibl><author>CAMILO</author>, <title>As Três Irmãs</title>, <citedRange>151</citedRange></bibl>
              </cit>

              <entry xml:id="DACL.CAPITAL.8." xml:lang="pt">
                 <form><orth>+ de distrito.</orth></form>
              </entry>
           </sense>
           <sense xml:id="DACL.CAPITAL.9" n="2">
              <def>Cidade que constitui o centro de uma actividade.</def>
              <cit type="example">
                 <quote>Diz-se que Paços de Ferreira é a capital do móvel.</quote>
              </cit>
           </sense>
           <sense xml:id="DACL.CAPITAL.10" n="3">
              <def>Letra maiúscula; letra maiúscula que inicia um capítulo.</def>
           </sense>
           <etym>
              <seg type="desc">Do</seg>
              <cit type="etymon"> <lang>la.</lang> <form><orth xml:lang="la">capitālis</orth></form>
              </cit>
           </etym>
        </entry>

capital.pdf

bansp commented 5 years ago

My goodness, I have managed to overlook e-mails with the replies, and from all the reactions I infer that I was entirely wrong thinking that I made my stance clear. Apologies for what must have seemed an incoherent message followed by silence. I'll say even more, and I do that cringing: upon re-reading Mohamed's message, and the entries that followed, I now fully understand that I was the sole cause of the misunderstanding, and I am now triply embarrassed. I am not even sure if I should "elucidate" what I meant, given that what I meant was based on a misconception for which I alone am to blame. Heartfelt apologies to all involved (and a promise to myself to stop thinking that I can procrastinate one job by "making a quick stab" at another). (OK, so just a quickie: I thought, wrongly, that you guys were pondering recommending an actual change of the macrostructure based on, let's call it, "lemmatization strategy" of the original author. Obviously, no one did that and I should have read the first two messages much more carefully, rather than focus on a single passage cut out of the whole.)

bansp commented 5 years ago

@WGBS2 Dear Geoffrey, I apologise for leading you astray with my message that I only now identified as fully incoherent (and therefore open to various interpretations, of which you chose one).

You point out an important thing that we were also alerted to by Katrien, repeatedly, namely that in our more or less innocent modelling strategies, we should pay very close attention to the feelings of born and bred lexicographers, maybe not necessarily bending some modelling decisions, but certainly very clearly explaining the difference between well-established lexicographic vocabulary on the one hand, and the vocabulary of a very restricted set of TEI-XML modelling choices on the other. What is well established as a concept in one realm (e.g. "entry") need not correspond one-to-one to the name-of-an-element in the other realm. In other words, the TEI XML "entry" is an element name that only in a subset of cases corresponds directly to the lexicographic concept of the entry. Thank you for reminding us of the need to be very careful here.

As for the e-mail address that this goes to, it must be the address associated with your GitHub account, and adding your account to this group was Toma's only way of including you in the GitHub environment. You might want to either change the address associated with the "WGBS2" account, or create another account with a different e-mail address, and then let Toma know about it. (The latter might be a suboptimal choice, however, and cumbersome in the long run.) Or you might want to filter messages from GitHub based on the "notifications@github.com" sender.

MedKhem commented 5 years ago

thank you @bansp for clarifying the misunderstanding.

@anacastrosalgado thank you for sharing your examples. The first one seems to further illustrate the issue raised here, where \<group> has been used to play the role of \<entry>, as \<entry> had not yet been made recursive.
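As a rough sketch only, the \<group> example could be recast with nested entries along the following lines. The `xml:id` values are hypothetical, the headword \<form> is left out because it is not shown in the excerpt, and the project-specific attributes on \<sense> are dropped for brevity; the \<gramGrp> content is copied as given.

```xml
<entry xml:id="homographGroup" xml:lang="pt">
   <!-- headword form omitted: it is not shown in the excerpt above -->
   <entry xml:id="homographGroup.adj" xml:lang="pt">
      <gramGrp>adj.</gramGrp>
      <sense xml:id="homographGroup.adj.1">
         <def>Relativo ou pertencente a Albânia (país da Europa).</def>
      </sense>
   </entry>
   <entry xml:id="homographGroup.nmf" xml:lang="pt">
      <gramGrp>n. m., f.</gramGrp>
      <sense xml:id="homographGroup.nmf.1">
         <def>Natural, habitante ou cidadão da Albânia.</def>
      </sense>
   </entry>
   <entry xml:id="homographGroup.nm" xml:lang="pt">
      <gramGrp>n. m.</gramGrp>
      <sense xml:id="homographGroup.nm.1">
         <def>Língua indo-europeia falada principalmente na Albânia.</def>
      </sense>
   </entry>
</entry>
```

Each former \<group> becomes an \<entry> with its own \<gramGrp> and \<sense>, which is the same pattern as in the arrest example.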

For the second example, I'm not sure if it's related to the case of homographs we are discussing here. Maybe you could elaborate on this? :)

ttasovac commented 4 years ago

This is now well-documented. See https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html#nested-entries-vs-multiple-senses