DARIAH-ERIC / lexicalresources

Data space of the DARIAH Lexical Resources Working Group
https://dariah-eric.github.io/lexicalresources/
BSD 2-Clause "Simplified" License

how to encode stuff like "+ acc.", prepositions that go with verbs etc. #40

Closed ttasovac closed 4 years ago

ttasovac commented 5 years ago

issue raised by Charly's talk during #lexMC

ttasovac commented 5 years ago

In TEI proper this would probably have been <usg type='gram'> — but we don't have this option anymore because we have reduced the scope of <usg>... so we need to figure out what to do here...

iljackb commented 5 years ago

Crazy idea, but I say let's bring back <usg type="gram">!!! It's too frequently needed to have removed it. It will make TEI-Lex0 conformance arbitrary for all the projects who use it to then have to make their data more vague.

bansp commented 5 years ago

That is just a grammatical abstraction of a collocate... @iljackb Your second sentence is so complex that I am thinking you must have been really tired in the evening :-) "arbitrary" and "more vague" are pretty significant there and serve well as arguments against the idea in the first sentence ;-)

Back to the topic, it would be nice to at some point introduce and then follow a principled distinction between a collocation, which according to the generalised use of <cit> may qualify as cit/quote (or <form>, or whatever the trend is, this year), and a collocate, which is (often an abstraction of) what forms a collocation with the headword. Reintroducing <usg type="gram"> for such cases would mean unravelling whatever got built in Lex0. Please let's keep this as the margin that we don't want to reach.

xlhrld commented 5 years ago

+1 for not bringing usg/@type="gram" back to life again. usg should be used for all this socio- and para-linguistic stuff (the real world settings for the speech production), not for collocations and certainly not for syntactic descriptions (the internals of language as a system) such as the one that actually started this thread. Like @bansp, I'd see roughly this division of labour: cit for linguistic instantiations of (abstract) syntactic constructions and maybe colloc for the actual collocate (in the case of collocations).

Considering more closely the »+ acc.« part of the initial question: more context would be nice. Still, to me this seems like meaning something along the lines of »$lemma can be used with an adjunct in the accusative case«, so maybe gram/@type="hasAccusativeAdjunct".
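Sketched as markup, that suggestion might look roughly like the fragment below. This is only an illustration: the value "hasAccusativeAdjunct" is the ad hoc proposal from this comment, not an established TEI Lex-0 value, and wrapping the element in <gramGrp> is an assumption.

```xml
<!-- Hypothetical sketch: "hasAccusativeAdjunct" is the value proposed
     in this comment, not one defined by TEI Lex-0 -->
<gramGrp>
  <gram type="hasAccusativeAdjunct">+ acc.</gram>
</gramGrp>
```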

bansp commented 5 years ago

Hah, I was just thinking about this this morning and was going to suggest a similar approach to Axel's as regards collocations, which occupy the area roughly fenced by <cit> (they could be examples, equivalents, and I guess also heads of related entries, whatever we do about those now).

I would then use <colloc> for collocates (faithfully to the original TEI definition, I think), although I wouldn't venture as far as naming grammatical functions, so I would suggest not to say "adjunct" (especially where some linguists would shout "object!", and some others "complement!", etc.). I think what <colloc> may be missing is a @pattern attribute which would state the relationship of the collocate to the headword. So, for example (and @charlymo, would you give us your example, please?) I would say something like this:

<colloc pattern="$ _">+acc.</colloc>

(assuming arbitrarily that '$' stands for the headword and '_' for the element content; there already exist @match and @matchPattern, so maybe the "pattern" here could be one of those, with appropriate modifications)

To be sure, the content of @pattern and the "+" may be seen as redundant here, but I am rather sure that we would find examples where the "+" means just "with" rather than "followed by".

In the example cited by Jesse elsewhere (for colloc inside def), I would say the first pass digitization could indeed encompass all the relevant string, but then, upon refinement, I think I would like to see a sequence of <colloc>s there.

kdepuydt commented 5 years ago

Either acc, possibly with multiple values, or Piotr's <colloc pattern=" "> seems fine.

ttasovac commented 4 years ago

I have spoken to several people about this after presenting it at the collocation workshop at eLex, and the general consensus is that when we have something like "a não ser que [+conj.]", we do:

<gramGrp>
  <gram type="colloc">
     [+conj.]
  </gram>
</gramGrp>

This is consistent with our simplification of gramGrp to use only typed gram elements, and not give special treatment to pos or colloc or anything like that...
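For context, here is a minimal sketch of how that fragment could sit inside an entry. The xml:id value and the bare entry structure are assumptions made for illustration; this is not a complete or validated TEI Lex-0 entry, and the Guidelines remain the reference for the actual content model.

```xml
<!-- Illustrative only: the xml:id is hypothetical and the entry is
     stripped down to the headword plus the collocate information -->
<entry xml:id="a_nao_ser_que" xml:lang="pt">
  <form type="lemma">
    <orth>a não ser que</orth>
  </form>
  <gramGrp>
    <!-- type="colloc" marks a property of what accompanies the
         headword, not a property of the headword itself -->
    <gram type="colloc">[+conj.]</gram>
  </gramGrp>
</entry>
```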

bansp commented 4 years ago

Reading the above makes me think that +conj is a grammatical property of the phrase a não ser que rather than the property of what (usually or always) follows it. Is my reading the intended one?

Edit: back from a short walk down memory lane... now I recall what the reasoning was for <gram>. And the answer to my question above (which I won't delete, because it's already in your mailboxes) is "no, because type='colloc' switches the focus to what's around". Cheers.

ttasovac commented 4 years ago

exactly, @bansp! we went back and forth on this one, but @type="colloc" should be enough to indicate the switch of perspective here.

changing status to "documentation" as a reminder to @ttasovac to add examples to the guidelines

ttasovac commented 4 years ago

We have a section in the Guidelines about this now (Section 2.3.3, Collocates, in Chapter 2: Entries).

I will try to add more examples on that imaginary day when I suddenly realize I have lots of free time...