Valency information inside quotes

DARIAH-ERIC / lexicalresources

Data space of the DARIAH Lexical Resources Working Group

https://dariah-eric.github.io/lexicalresources/

BSD 2-Clause "Simplified" License

Valency information inside quotes #145

Open ttasovac opened 3 years ago

ttasovac commented 3 years ago

Hi everybody.

A while ago, we agreed to treat valency information in dictionaries as <gram type="colloc">[+ conj.]</gram>. See https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html#collocates.

That's all fine and dandy as long as <gramGrp> is allowed there where we want to put the valency information — it's no problem at the entry level, it's no problem at the sense level etc.

Many dictionaries — including the Latin-Bulgarian dictionary that I just spoke to my Bulgarian colleagues about — put valency information inside examples.

<cit type="example">
  <quote>assum alicuius rebus </quote>         
  <cit type="translation" xml:lang="bg">
        <quote>помагам някому с нещо</quote>
   </cit>
</cit>

assum alicuius rebus = to help somebody with something, where somebody is in the genitive (or whatever the case that is) and something is in the ablative or dative or whatever it is.

My colleagues want, and rightfully so, to be able to indicate the grammatical information inside the quote, something like:

<cit type="example">
  <quote>assum <gramGrp><gram type="collocate" value="genitive">alicuius</gram></gramGrp> rebus</quote>         
  <cit type="translation" xml:lang="bg">
        <quote>помагам някому с нещо</quote>
   </cit>
</cit>

Needless to say, gramGrp is not allowed within quote.

This is somewhat similar to the case we had where <xr> was not allowed within <def> and <quote> and we saw a clear need for it (see #24). We solved it by making <xr> a member of emphLike, so we can now have both def/xr and quote/xr...

I have no doubt that the need for grammatical info inside quotes is a fairly common phenomenon, especially in bilingual and learners' dictionaries. If we agree that that is the case, we need to make TEI Lex-0 capable of expressing this. Please share with me your thoughts on the subject — including if you can think of some other way of doing it than the way I sketched above.

bansp commented 3 years ago

Dear Toma,

An auxiliary question first: how do you expect the markup below to be actually rendered in the dictionary?

assum alicuius rebus

Thanks and best,

Piotr

On 30/04/2021 11:03, Toma Tasovac wrote:

Hi everybody.

A while ago, we agreed to treat valency information in dictionaries as <gram type="colloc">[+ conj.]</gram>. See https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html#collocates.

That's all fine and dandy as long as <gramGrp> is allowed there where we want to put the valency information: it's no problem at the entry level, it's no problem at the sense level etc.

Many dictionaries — including the Latin-Bulgarian dictionary that I just spoke to my Bulgarian colleagues about — put valency information inside examples.
assum alicuius rebus помагам някому с нещо
assum alicuius rebus = to help somebody with something, where somebody is in the dative and something is in the ablative case

My colleagues want to, and rightfully so, to be able to indicate inside the quote the grammatical information, something like:
assum alicuius rebus помагам някому с нещо
Needless to say, gramGrp is not allowed within quote.

This is somewhat similar to the case we had where <xr> was not allowed within <def> and <quote> and we saw a clear need for it (see #24 https://github.com/DARIAH-ERIC/lexicalresources/issues/24). We solved it by making <xr> a member of emphLike, so we can now have both def/xr and quote/xr...

I have no doubt that the need for grammatical info inside quotes is a fairly common phenomenon, especially in bilingual and learners' dictionaries. If we agree that that is the case, we need to make TEI Lex-0 capable of expressing this. Please share with me your thoughts on the subject — including if you can think of some other way of doing it than the way I sketched above.

daliboris commented 3 years ago

TEI Lex-0 recommends the use of @type="collocate", not @type="colloc".

ttasovac commented 3 years ago

hi @daliboris you're absolutely right, I will fix this above.

@bansp What they want to show the user is something like: "I am helping somebody (genitive) with something (ablative)"...

iljackb commented 3 years ago

Hi Toma, all

I have seen this in numerous legacy dictionaries I've had to encode, and I think that it occurs frequently enough that it needs to be allowed in <quote>.

If we don't make the best candidate for encoding grammatical information (i.e. <gramGrp>) usable in the contexts where it commonly occurs, people will have to invent different solutions (I've seen <seg> in some data before, actually), or customize to allow it to occur.

So I think we might as well add it to Lex-0 so this issue doesn't come up again (which it surely will).

Best, Jack

bansp commented 3 years ago

I think Jack has summed it up pretty well.

As far as my question goes (and abstracting from the "genitive" vs. "dative" in the two versions of the example above), you haven't really answered it yet, Toma.

What they want to show the user is something like: "I am helping somebody (genitive) with something (ablative)".

I asked about the rendering, because they are surely not going to use English, but rather Latin, and then, I am after the shape of the grammatical information. I presume it's going to be in Bulgarian. My concern is with the content of gramGrp (you've put a fragment of the example inside it), and with the role that the (new?) @value attribute is supposed to play. I think that part should be rethought.

ttasovac commented 3 years ago

Piotr, I don't see why rendering matters that much: we are talking here about how to encode an example which contains, as part of the quote, additional grammatical information. Since you asked about the rendering, I thought I gave it to you, albeit in English, which I thought would make things easier, but now I realize it just added another layer of confusion. Sorry about that. So let's say they want to render: assum alicuius [gen.] rebus

@value is not a new attribute in either TEI Lex-0 or TEI Lex; it's meant to show a value that is not explicitly represented in the source. So my reading here is that alicuius is a stand-in for "genitive": the example says in Latin "I am helping somebody with something", but for pedagogical reasons it's important to the dictionary authors to stress that alicuius in this case is in the genitive. After all, it's also a stand-in for any other pronoun or noun in the genitive.

In any case, I am totally open to suggestions for improvement or alternative encodings... I just don't quite understand your objection yet. How would you rather encode this:

<cit type="example">
  <quote>assum <gramGrp><gram type="collocate" value="genitive">alicuius</gram></gramGrp> rebus</quote>         
  <cit type="translation" xml:lang="bg">
        <quote>помагам някому с нещо</quote>
   </cit>
</cit>

All best, Toma

bansp commented 3 years ago

I asked about the rendering because I was wondering how the internals of gramGrp are expected to be handled, including language information. I thought that information about rendering would indirectly help me understand.

I am quite surprised by the inclusion of part of the example as the content of gramGrp/gram, which -- or so I understood it until today -- should contain grammatical information only, meaning [gen.] or gen, depending on how you choose to render.

I think I understand the instinct that made you wrap gram around alicuius -- you wanted to tie the annotation to the token (in corpus-speak). And it does seem to be a bit of a weakness, for now at least, or at least for my understanding, that the association of the token to the gramGrp seems to only rely on juxtaposition.

Heck, what Jack has mentioned about seg no longer seems strange to me... Let's please explore that option as well.

<quote>assum <seg pos="[gen.]" valueDatcat="[URI...]">alicuius</seg> rebus</quote>

Gosh, not bad. Or at least requiring a serious rebuttal, I think. Thing is, we're switching perspectives at the point when we look at quotes -- from lexical to corpus-oriented.

ttasovac commented 3 years ago

Sure, let's consider that. I'm about to get out of the office so I can't respond right away, but I will try to get some other examples over the weekend...

bansp commented 3 years ago

Note also that if we were looking for a point of connection between lexica and grammars, then that would seem pretty close to a prototypical case.

daliboris commented 3 years ago

I think we must start by investigating what alicuius in this example means: for me, this pronoun represents the semantic valency (an animate noun, not inanimate) and the formal valency (genitive) of the verb assum, adesse.

This doesn't mean that the verb co-occurs frequently with the word alicuius, but that it co-occurs with words which are in the genitive and represent animate objects. (The Old Czech Dictionary, for example, systematically uses common pronouns in the description of the meaning: dáti co (komu) [to give something to somebody].)

I think that both parts of this valency should be expressed, not only the case (GEN), but also the animacy.

I use an existing <ab> element just to group grammatical information with the formal representation (<w>, or sequence of <w>); it's like grouping <form> and <gramGrp> within <entry> element.

<gramGrp> is not allowed within <ab> element, and <ab> is too general to be used in this case (in the dictionary), but it's just for the kicking-off purpose.

<cit type="example">
 <quote>assum 
  <ab type="valency">
   <w>alicuius</w>
   <gramGrp>
    <gram type="case" value="genitive"/>
    <gram type="animacy" value="animate"/>
   </gramGrp>
  </ab>
  rebus</quote>
 <cit type="translation" xml:lang="bg">
  <quote>помагам 
   <ab type="valency">
    <w>някому</w>
    <gramGrp>
     <gram type="case" value="dative"/>
     <gram type="animacy" value="animate"/>
    </gramGrp>
   </ab>
   <ab type="valency">
    <w>с</w> 
    <gramGrp>
     <gram type="case" value="instrumental"/>
    </gramGrp>
   </ab>
   <ab type="valency">
    <w>нещо</w>
    <gramGrp>
     <gram type="animacy" value="inanimate"/>
    </gramGrp>
   </ab>
   <!-- an alternative approach -->
   <ab type="valency">
    с нещо
    <gramGrp>
     <gram type="case" value="instrumental"/>
     <gram type="animacy" value="inanimate"/>
    </gramGrp>
   </ab>
  </quote>
 </cit>
</cit>

bansp commented 3 years ago

Please forgive me for treating "gen" as pos, above -- too little thinking does that to me, sometimes. One more take below.

My assumptions:

quotes are guests from the (widely understood) corpus realm inside the dictionary domain
we want to render as much as the original dictionary did
seg is less theory-laden than w, but w would be nearly just as good (and usable out-of-the-box)

(again, I have no precise info on the intended rendering, so to that extent, it's a guess)

<quote>assum <seg type="colloc" msd="gen">alicuius</seg> rebus</quote>

<quote>assum <w type="colloc" msd="gen">alicuius</w> rebus</quote>

I would assume that the dictionary rendering layer, upon encountering seg/type='colloc' (or w/type='colloc', for the believers), would beautify the information -- in this case, by adding square brackets (and maybe the dot) to "gen".
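
A minimal sketch of such a rendering step, assuming un-namespaced markup as in the snippets above (real TEI files would carry the TEI namespace) and assuming the label is shown as the @msd value in square brackets with a trailing dot. The function name `render_quote` is hypothetical.

```python
import xml.etree.ElementTree as ET


def render_quote(quote_xml: str) -> str:
    """Flatten a <quote>, adding a bracketed label after each marked collocate."""
    quote = ET.fromstring(quote_xml)
    parts = [quote.text or ""]
    for child in quote:
        text = "".join(child.itertext())
        # Assumption: collocates are <seg> or <w> with type="colloc" and an @msd value.
        if child.tag in ("seg", "w") and child.get("type") == "colloc" and child.get("msd"):
            text += f" [{child.get('msd')}.]"
        parts.append(text)
        parts.append(child.tail or "")
    # Collapse runs of whitespace left over from the source layout.
    return " ".join("".join(parts).split())


rendered = render_quote(
    '<quote>assum <seg type="colloc" msd="gen">alicuius</seg> rebus</quote>'
)
print(rendered)
```

For the snippet above this yields the rendering Toma gave earlier, "assum alicuius [gen.] rebus"; the same call works unchanged on the w-based variant.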

I don't think anymore that gramGrp belongs inside quote, because I take it as a dictionary-level mechanism that mirrors the seg- (or w-)internal corpus-level mechanism for encoding grammatical properties.

I am not sure how to react to Boris's proposal, because it clearly goes beyond the assumption that we only want to mark up what is there in the original. However, I think he does have a point that some dictionaries may provide more information about the collocate than just the case. (Here, the personal pronoun implies some semantic selection, but Boris's extended point stands anyway.)

Depending on the dictionary-encoders' needs, the above snippet can be slightly redone as follows:

<quote>assum <seg type="colloc" msd="gen,anim">alicuius</seg> rebus</quote>

<!-- UD-style -->
<quote>assum <seg type="colloc" msd="Case:Gen,Animacy:Anim">alicuius</seg> rebus</quote>

Note that there is nothing strange in adorning the segs a bit on their way from the (hypothetical) corpora: a single dictionary may use various corpora (or a single corpus at different stages), so a mapping would be a natural assumption. One little perk in a wider perspective is that the mapping could target a format agreed on by a grammar that co-exists with the dictionary.
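
To illustrate what such a mapping might look like, here is a small sketch that normalises both @msd notations from the snippets above into the same feature/value pairs. The table `SHORT_LABELS` and the function `normalise_msd` are invented for illustration; the colon-and-comma syntax is the one used in this thread, whereas Universal Dependencies itself writes features as `Case=Gen|Animacy=Anim`.

```python
# Hypothetical mapping from a dictionary's short labels to feature/value pairs.
SHORT_LABELS = {
    "gen": ("Case", "Gen"),
    "anim": ("Animacy", "Anim"),
}


def normalise_msd(msd: str) -> dict:
    """Turn 'gen,anim' or 'Case:Gen,Animacy:Anim' into a feature dictionary."""
    features = {}
    for item in msd.split(","):
        item = item.strip()
        if ":" in item:
            # Already in Feature:Value form.
            name, value = item.split(":", 1)
        else:
            # Short label: look it up; an unknown label raises KeyError.
            name, value = SHORT_LABELS[item]
        features[name] = value
    return features


print(normalise_msd("gen,anim"))
print(normalise_msd("Case:Gen,Animacy:Anim"))
```

Both calls produce the same dictionary, so downstream code (rendering, or alignment with a grammar) would not need to care which notation a given dictionary chose.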

daliboris commented 3 years ago

Hello Piotr,

I think that the benefit of using <gramGrp> and <gram> elements inside <quote> is the ability to use identical elements within query (XPath) when processing dictionary programmatically.
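
As a rough illustration of that benefit, the sketch below runs one path expression over an entry that has a collocate both at entry level and inside a quote. The entry is invented for illustration and un-namespaced, and the gramGrp-inside-quote part follows Toma's proposal, which TEI Lex-0 does not currently allow.

```python
import xml.etree.ElementTree as ET

# Invented, un-namespaced entry; gramGrp inside quote is the proposed encoding.
ENTRY = """
<entry>
  <gramGrp><gram type="collocate">[+ conj.]</gram></gramGrp>
  <cit type="example">
    <quote>assum <gramGrp><gram type="collocate" value="genitive">alicuius</gram></gramGrp> rebus</quote>
  </cit>
</entry>
"""

entry = ET.fromstring(ENTRY)

# The same expression finds collocate information wherever <gram> occurs.
collocates = entry.findall(".//gram[@type='collocate']")
values = [(g.text, g.get("value")) for g in collocates]
print(values)
```

With the seg/@msd encoding, the in-quote collocates would need a second, different expression (for example `.//seg[@type='colloc']`).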

On the other hand, <gramGrp> is defined as an element that groups morpho-syntactic information about a lexical item, not about a token. Consistent usage of the @msd attribute for tokens within quoted text sounds good to me.

The question is whether the quoted text from the Latin-Bulgarian dictionary comes from a concrete witness or is an abstraction over several quotes from different sources. In the second case, I think using the <gramGrp> element can be justified.

ttasovac commented 3 years ago

Guys — just a quick note, I have a deadline today that I need to meet, that's why I haven't been able to participate over the last few days. I'll get back to you as soon (or rather: if :) I survive this deadline... Many thanks.

daliboris commented 3 years ago

I have proposition number two: if we (or encoders) want to render as much as the original dictionary did, we can put <app><note> elements with the explication (genitive) at the right place, where the responsibility attribute will refer to the author of the note.

<entry>
  <cit type="example">
   <quote>assum <w xml:id="w.1">alicuius</w><app resp="#encoder" from="#w.1"><note>gen.</note></app> rebus</quote>
   <cit type="translation" xml:lang="bg">
    <quote>помагам някому с нещо</quote>
   </cit>
  </cit>
</entry>

<entry>
  <cit type="example">
   <quote>assum <w xml:id="w.1">alicuius</w> rebus</quote>
   <cit type="translation" xml:lang="bg">
    <quote>помагам някому с нещо</quote>
   </cit>
  </cit>
</entry>
...
<listApp>
  <app resp="#encoder" corresp="#w.1 #w.312"><note>gen.</note></app>
</listApp>

This approach is less strict than the ones above, which use more specific elements and attributes. But it may sometimes be helpful, for example when the information within the entry can't be identified with 100% certainty (i.e. unknown acronyms).
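
A processing layer would then have to resolve the pointers of the stand-off variant. The sketch below shows one way this could be done, assuming un-namespaced elements and a made-up <body> wrapper around the fragment; pointers whose targets are not in the fragment (here #w.312) are simply skipped.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

DOC = """
<body>
  <entry>
    <cit type="example">
      <quote>assum <w xml:id="w.1">alicuius</w> rebus</quote>
    </cit>
  </entry>
  <listApp>
    <app resp="#encoder" corresp="#w.1 #w.312"><note>gen.</note></app>
  </listApp>
</body>
"""

root = ET.fromstring(DOC)

# Index every element that carries an xml:id.
by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}

notes = {}
for app in root.iter("app"):
    note = app.findtext("note")
    for pointer in app.get("corresp", "").split():
        target = by_id.get(pointer.lstrip("#"))
        if target is not None:  # unresolved pointers are ignored
            notes[target.text] = note

print(notes)
```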