DARIAH-ERIC / lexicalresources

Data space of the DARIAH Lexical Resources Working Group
https://dariah-eric.github.io/lexicalresources/
BSD 2-Clause "Simplified" License

simplifying gramGrp #31

Closed ttasovac closed 4 years ago

ttasovac commented 5 years ago

Hi.

In Piotr's first go at the TEI Lex-0 customization, he did something which I very much approve of: he got rid of case, gen, iType, mood, number, per, subc and tns.

Now, for the time being, I left gen because we have examples such as:

<form type="lemma">
  <orth expand="Hausandacht">-andacht</orth>
  <pc>,</pc>
  <gramGrp>
    <gen value="feminine">die</gen>
  </gramGrp>
</form>

and

<cit type="translationEquivalent">
  <form>
    <orth>assistance</orth>
    <pc>,</pc>
    <gramGrp>
      <gen>f.</gen>
    </gramGrp>
  </form>
</cit>

but what I would really like to do is streamline this even further, and allow only gram.

We currently use a lot of pos in our examples (and I personally use pos exclusively in our dictionaries), but I think that for our purposes gramGrp/gram would be a better fit, precisely because we may need to encode things like pluralia tantum, which do not really describe a part of speech.

So I just need some feedback on my proposal to allow only gram within gramGrp (and then work on the use of @type, @value etc. for it).

Keep in mind that this would mean that we can no longer have an example like this:

<gramGrp>
  <pos>vt</pos>
  <subc>VP2A</subc>
</gramGrp>

and that, as I said, we would have to recommend a typology for gram, which is always hard, but I don't think it's tenable (especially for our generic tools down the line) to allow so many different elements within gramGrp.

laurentromary commented 5 years ago

This is something I would leave for discussion for the time being. We may have a communication issue with the wider community if we are too fierce on this. I myself would have difficulty using my own examples :-(

ttasovac commented 5 years ago

I love it when @laurentromary is in Japan, he now responds literally in the middle of the night...

I do think this is very important, for several reasons:

  1. Processing-wise, both in ELEXIS and in terms of generic tools, there are too many elements allowed and no clear way of knowing which will be used and when. This is messy.
  2. Grammatical information in print dictionaries is always computationally totally imprecise: f. in <gen>f.</gen> implies more than gender. It actually means feminine noun, and on its own usually also singular. My vision here would be to say: everything is <gram>, but you should try to use a morphosyntactic annotation system (if you have one and if you care) to spell things out, so <gen>f.</gen> would become <gram norm="Ncfsn">f.</gram> (MULTEXT notation off the top of my head, might not be correct, but it stands for Noun, common, feminine, singular, nominative).
  3. I actually don't think that too many real-life dictionaries use things like <tns>, <mood> etc.
  4. I do think it's very much in the spirit of TEI Lex-0 to go in this direction: We got rid of multiple elements for , I really, really don't see any reason why we shouldn't trim down grammatical description.

In any case, let's get some more feedback and see if we can find common ground.

ttasovac commented 5 years ago

Ok, now I'm obsessing. But another thing I just noticed in the spec for <gram>: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-gram.html

Check out the proposed values for @type: pos, gen, num, animate, proper. Leaving aside the fact that the suggested list mixes categories, which is a perennial problem for TEI attribute values, the gram/@type combo is, in vanilla TEI itself, already presented as a mechanism that can accomplish what elements like <pos> or <gen> do... So we wouldn't actually be doing anything super-revolutionary, but simply choosing one existing TEI mechanism over alternative mechanisms.

xlhrld commented 5 years ago

The Python zen says (among a couple of other things): »There should be one – and preferably only one – obvious way to do it.« Having both //pos and //gram[@type="pos"] is something that has bothered me for quite some time as well. A good typology for gram would be a prerequisite before getting rid of gen, pos and the like, of course. Those elements were introduced purely for convenience in the first place, I suspect (@laurentromary?).

Hopefully we can come up with a consensus on values for gram/@type to express, e.g.:

<gramGrp>
  <lbl>f.</lbl> <!-- due to its multiple functions, it may rather be encoded as a label -->
  <gram type="pos" value="noun"/>
  <gram type="gender" value="feminine"/>
  <gram type="number" value="singular"/>
</gramGrp>

(in fact I used empty gram as supplement for pos in our dictionaries for exactly the reason @ttasovac gives: there's more to the »f.« than just part-of-speech.)

The discussion will take time, so we should probably aim at 0.8.0 or even later, not 0.7.0, if we want to go in that direction (we should!).
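
To illustrate the cost of having both //pos and //gram[@type="pos"], here is a minimal Python sketch of what a generic tool has to do to collect part-of-speech values while both encodings are allowed. The function name parts_of_speech is hypothetical, and preferring @value over the element text is an assumption based on the empty <gram> elements in the example above.

```python
import xml.etree.ElementTree as ET


def parts_of_speech(root):
    """Collect part-of-speech values from both encodings discussed in this thread."""
    # Old dialect: dedicated <pos> elements ({*} matches any or no namespace, Python 3.8+).
    values = [pos.text for pos in root.findall(".//{*}pos")]
    # New dialect: <gram type="pos">, with the value either in @value or as text.
    for gram in root.findall(".//{*}gram[@type='pos']"):
        values.append(gram.get("value") or gram.text)
    return values
```

With only typed gram allowed, the first query and the merging step disappear.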

kdepuydt commented 5 years ago

We would then suggest strongly recommending that the set of grammatical features and values used be declared in the header.

ttasovac commented 5 years ago

This is an old thread and we have already made the decision to stick with <gram type="x">, but I think we need to do more cleaning up in the schema. A smart student of mine said yesterday, after I told them to use gram instead of pos: but, wait, why is pos still allowed! I felt like this was part of @laurentromary's secret plot to embarrass me in public, just for fun. 😄

I think we need three things here:

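1. Elements to be axed

case, gen, iType, mood, number, per and tns should be axed from model.morphLike. pos and subc should be axed from model.lexicalRefinement.

2. Create a list for gram types.

In the beginning, I think we should go for an open-ended list. Eventually, if we have a serious typology, we may consider closing it, but I think that would take a lot of work. So the initial list must include

pos, gender, inflectionType, mood, number, person, tense

because we got rid of them as elements. (With the exception of "pos", which is generally recognizable, I prefer full names rather than abbreviations such as per or tns. 'iType' is especially obscure.) I'd stop there for now until somebody embarks on a gram typology project.

3. Default value for gram type?

We have three options:

  - make pos the default value: we are talking here about dictionaries and this will be the most common use of gram.
  - think of a default, catch-all type (I really can't think of one which is not absurd, like gram type="gram").
  - leave the typing for gram optional, to allow dictionaries with more shallow encoding to do <gram>f.</gram>.

Any thoughts, comments?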
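
To make the "open-ended list" and "optional typing" ideas concrete, here is a small Python sketch that reviews gram elements against the initial type list named in this thread (pos, gender, inflectionType, mood, number, person, tense). It only reports; nothing is rejected, which is one possible reading of "open-ended". The function name review_gram_types and the report format are hypothetical.

```python
import xml.etree.ElementTree as ET

# Initial gram types proposed in this thread; the list is meant to stay open.
RECOMMENDED_TYPES = {
    "pos", "gender", "inflectionType", "mood", "number", "person", "tense",
}


def review_gram_types(root):
    """Return (untyped, unlisted): gram elements without @type, and @type values off the list."""
    untyped, unlisted = [], []
    # {*} matches gram in any namespace or in none (Python 3.8+).
    for gram in root.findall(".//{*}gram"):
        gram_type = gram.get("type")
        if gram_type is None:
            untyped.append(gram)        # shallow encoding such as <gram>f.</gram>
        elif gram_type not in RECOMMENDED_TYPES:
            unlisted.append(gram_type)  # allowed, but worth documenting
    return untyped, unlisted
```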

laurentromary commented 5 years ago

I remain convinced that we should keep the two dialects here, so that we can scope both the editing use case and the integrating one. Existing practices do make the named elements still quite relevant. I am sure we could have something done with <alternate>.

ttasovac commented 5 years ago

Alas, I disagree with @laurentromary on this one. And judging by the reactions in the masterclass yesterday, it will be very confusing for the users if we keep both. Encoders want easy and singular options in the autosuggest dropdown in oXygen.

For those who are used to using the old elements (I for one have been using pos exclusively until now), we can provide very clear instructions, even a simple XSLT stylesheet, to let them convert the old gramGrp dialect to the new one.
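
As an illustration of how mechanical such a conversion could be, here is a minimal sketch, written in Python rather than XSLT and not an official TEI Lex-0 tool. The function name to_typed_gram is hypothetical. The mapping follows the type names proposed in this thread; the target "case" is an assumption, since no type name for case is proposed here.

```python
import xml.etree.ElementTree as ET

# Old element name -> gram/@type value.
ELEMENT_TO_TYPE = {
    "pos": "pos",
    "gen": "gender",
    "iType": "inflectionType",
    "mood": "mood",
    "number": "number",
    "per": "person",
    "tns": "tense",
    "subc": "subcategory",
    "case": "case",  # assumption: no type name for case is given in the thread
}


def split_tag(tag):
    """Split an ElementTree tag into (namespace, local name)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def to_typed_gram(root):
    """Rename old-style children of every gramGrp to typed gram elements, in place."""
    for parent in root.iter():
        if not isinstance(parent.tag, str) or split_tag(parent.tag)[1] != "gramGrp":
            continue
        for child in parent:
            if not isinstance(child.tag, str):
                continue  # skip comments and processing instructions
            namespace, local = split_tag(child.tag)
            gram_type = ELEMENT_TO_TYPE.get(local)
            if gram_type is None:
                continue  # already gram, or something this sketch does not cover
            child.tag = f"{{{namespace}}}gram" if namespace else "gram"
            child.set("type", gram_type)  # existing attributes and text are kept
    return root
```

Applied to the <pos>vt</pos> / <subc>VP2A</subc> example from the top of this thread, this yields <gram type="pos">vt</gram> and <gram type="subcategory">VP2A</gram>.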

I know that @laurentromary feels strongly about this, so we'll have to fight it out, but I also know that Laurent needs to go on his vacation in two days, and I want us to part on good terms 😄. So we can finalize this in August.

In the meantime, it would be nice to hear from the others.

ttasovac commented 5 years ago

Dear @ttasovac, if it looks good to you, I can create the list of gram types. I already have all this data systematized for the Portuguese, Spanish and French dictionaries of the Academy. I can contribute an initial starting list.

Thanks @anacastrosalgado. Let's wait a little — we need to resolve our different visions about the elements first, and we can handle the types after that...

WGBS2 commented 5 years ago

Can I plead for subc? We need it in our 17th-century dictionaries.

On 4 Jul 2019, at 11:15, Toma Tasovac notifications@github.com wrote:

This is an old thread and we have already made the decision to stick with <gram type="x">, but I think we need to more cleaning up in the schema. A smart student of mine said yesterday, after I told them to use gram instead of pos: but, wait, why is pos still allowed ! I felt like this was part of @laurentromary's secret plot to embarrass me in public, just for fun. 😄

I think we need three things here:

  1. Elements to be axed

case, gen, iType, mood, number, per and tns should be axed from model.morphLike. pos and subc should be axed from model.lexicalRefinement.

  2. Create a list for gram types.

In the beginning, I think we should go for an open-ended list. Eventually, if we have a serious typology, we may consider closing it, but I think that would take a lot of work. So the initial list must include

pos, gender, inflectionType, mood, number, person, tense

because we got rid of them as elements. (With the exception of "pos" which is generally recognizable, I prefer full names rather than abbreviations as per or tns. 'iType' is especially obscure). I'd stop there for now until somebody embarks on a gram typology project.

  3. Default value for gram type?

We have three options:

  - make pos the default value: we are talking here about dictionaries and this will be the most common use of gram.
  - think of a default, catch-all type (I really can't think of one which is not absurd like gram type="gram").
  - leave the typing for gram to be optional to allow for dictionaries with more shallow encoding to do <gram>f.</gram>.

Any thoughts, comments?


ttasovac commented 5 years ago

@WGBS2, if my proposal goes through, you would be free to use <gram type="subcategory"> instead of <subc>. But my general opposition to subc is that it's simply not semantically expressive enough.

If you do <gram type="pos">noun</gram> <gram type="gender">f.</gram> you will know in each case the exact nature of the grammatical information that is being recorded. A subcategory doesn't say what the thing is or what category it's subsumed under.

WGBS2 commented 5 years ago

It is however a clean and clear means of managing information that our lexicographers provide. I am not keen on overuse of type attributes when a good element exists. Dictionaries are complex and oversimplification renders their description too simplistic.


ttasovac commented 5 years ago

@WGBS2 I think you're missing the point of both TEI Lex-0 and this whole thread. Otherwise I don't think you would be jumping to the conclusion that we are here in the business of oversimplifying or not fully understanding the complexity of dictionaries.

The issue we're discussing is quite different. It stems from the fact that TEI is overcomplicating things by allowing both <pos> and <gram type="pos"> to describe exactly the same thing. We're trying to streamline that.

xlhrld commented 5 years ago

I'm very much on the same side of the »conflict« as @ttasovac. For most applications, <pos> and <gram type="pos"> and all the other elements threatened to be axed are mutually equivalent. There is no clear direction for users as to which encoding to choose. This is completely against the gist of Lex0. To restate my comment in https://github.com/DARIAH-ERIC/lexicalresources/issues/31#issuecomment-420573222: »There should be one – and preferably only one – obvious way to do it.«

I see only one notable difference between the two encodings, though. In principle, you could type <pos> but for obvious reasons you cannot type <gram type="pos">. Alas, typing <pos> and friends is not allowed in vanilla TEI and would need customization anyway.

ttasovac commented 4 years ago

@laurentromary and I discussed this today and I am happy to report that this long-standing ticket can now be closed. We've agreed on the simplified mechanism and the use of typed <gram> elements instead of the more granular but non-exhaustive elements like <pos>, <tns> etc.

I have documented this in Section 2.3 on Grammatical Properties, within Chapter 2: Entries.

I will close this ticket now, but if you spot anything that you feel needs further explanation, feel free to open a new ticket.