Closed ttasovac closed 4 years ago
This is something I would leave for discussion for the time being. We may have a communication issue with the wider community if we are too fierce on this. I myself would have difficulty using my own examples :-(
I love it when @laurentromary is in Japan, he now responds literally in the middle of the night...
I do think this is very important, for several reasons:
f. in <gen>f.</gen> implies more than gender: it actually means feminine noun, and on its own usually also singular.
My vision here would be to say: everything is <gram>. <gen>f.</gen> would become <gram norm="Ncfsn">f.</gram> (MULTEXT notation off the top of my head, might not be correct, but it stands for Noun, common, feminine, singular, nominative).
The same goes for <tns>, <mood> etc.
In any case, let's get some more feedback and see if we can find common ground.
Ok, now I'm obsessing. But another thing I just noticed in the spec for <gram>: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-gram.html
Check out the proposed values for @type: pos, gen, num, animate, proper. Leaving aside the fact that the suggested list mixes categories, which is a perennial problem for TEI attribute values, the gram/@type combo is, in vanilla TEI itself, already presented as a mechanism that can accomplish what elements like <pos> or <gen> do... So we wouldn't actually be doing anything super-revolutionary, but simply choosing one existing TEI mechanism over alternative mechanisms.
The Python zen says (among a couple of other things): »There should be one – and preferably only one – obvious way to do it.« Having both //pos and //gram[@type="pos"] is something that has bothered me for quite some time as well. A good typology for gram would be a prerequisite before getting rid of gen, pos and the like, of course. Those elements were introduced purely for convenience in the first place, I suspect (@laurentromary?).
Hopefully we can come up with a consensus on values for gram/@type to express, e.g.:
<gramGrp>
<lbl>f.</lbl> <!-- due to its multiple functions, it may rather be encoded as a label -->
<gram type="pos" value="noun"/>
<gram type="gender" value="feminine"/>
<gram type="number" value="singular"/>
</gramGrp>
(in fact I used empty gram as supplement for pos in our dictionaries for exactly the reason @ttasovac gives: there's more to the »f.« than just part-of-speech.)
The discussion will take time, so we should probably aim at 0.8.0 or even later, not 0.7.0, if we want to go in that direction (we should!).
We should then strongly recommend declaring the set of grammatical features and values used in the header.
This is an old thread and we have already made the decision to stick with <gram type="x">, but I think we need to do more cleaning up in the schema. A smart student of mine said yesterday, after I told them to use gram instead of pos: but, wait, why is pos still allowed! I felt like this was part of @laurentromary's secret plot to embarrass me in public, just for fun. 😄
I think we need three things here:
1. Elements to be axed
- case, gen, iType, mood, number, per and tns should be axed from model.morphLike
- pos and subc should be axed from model.lexicalRefinement
2. Create a list for gram types.
In the beginning, I think we should go for an open-ended list. Eventually, if we have a serious typology, we may consider closing it, but I think that would take a lot of work. So the initial list must include
- pos
- gender
- inflectionType
- mood
- number
- person
- tense
because we got rid of them as elements. (With the exception of "pos", which is generally recognizable, I prefer full names rather than abbreviations such as per or tns. 'iType' is especially obscure.) I'd stop there for now until somebody embarks on a gram typology project.
3. Default value for gram type?
We have three options:
- make <gram type="pos"> the default value (we are talking here about dictionaries and this will be the most common use of gram)
- think of a default, catch-all type (I really can't think of one which is not absurd like gram type="gram")
- leave the typing for gram optional, to allow dictionaries with shallower encoding to do <gramGrp><gram>f.</gram></gramGrp>
Any thoughts, comments?
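As a rough illustration of points 1 and 2, an ODD customization along the following lines could drop the named elements and suggest values for gram/@type. This is only a sketch under stated assumptions, not the actual TEI Lex-0 ODD: the ident gramOnlySketch is made up, the fragment would still need to sit inside the body of a complete ODD document, and the value list simply reuses the type names proposed in this thread.

```xml
<!-- Hypothetical ODD fragment; it belongs inside the body of a TEI ODD document. -->
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="gramOnlySketch" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <!-- Point 1: load the dictionaries module without the named elements. -->
  <moduleRef key="dictionaries"
             except="case gen iType mood number per tns pos subc"/>
  <!-- Point 2: suggest values for gram/@type; type="semi" keeps the list open-ended. -->
  <elementSpec ident="gram" module="dictionaries" mode="change">
    <attList>
      <attDef ident="type" mode="change">
        <valList type="semi" mode="replace">
          <valItem ident="pos"/>
          <valItem ident="gender"/>
          <valItem ident="inflectionType"/>
          <valItem ident="mood"/>
          <valItem ident="number"/>
          <valItem ident="person"/>
          <valItem ident="tense"/>
        </valList>
      </attDef>
    </attList>
  </elementSpec>
</schemaSpec>
```

Whether @type should also be made required (for instance with usage="req" on the attDef) depends on how point 3 is settled, so the sketch leaves it untouched.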
I remain convinced that we should keep the two dialects here, so that we can scope both the editing use case and the integrating one. Existing practices do make the named elements still quite relevant. I am sure we could have something done with <alternate>.
Alas, I disagree with @laurentromary on this one. And judging by the reactions in the masterclass yesterday, it will be very confusing for the users if we keep both. Encoders want easy and singular options in the autosuggest dropdown in oXygen.
For those who are used to using the old elements (I for one have been using pos exclusively until now), we can provide very clear instructions, even a simple XSLT stylesheet, to let them convert the old gramGrp dialect to the new one.
I know that @laurentromary feels strongly about this, so we'll have to fight it out, but I also know that Laurent needs to go on his vacation in two days, and I want us to part on good terms 😄. So we can finalize this in August.
In the meantime, it would be nice to hear from the others.
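For reference, the conversion from the old gramGrp dialect to typed gram could be sketched in XSLT 1.0 roughly as follows. This is an untested illustration, not an official TEI Lex-0 stylesheet. The @type values follow the names proposed in this thread; "case" is an assumption, since no type name was proposed for it, and only elements directly inside gramGrp are converted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:tei="http://www.tei-c.org/ns/1.0"
                xmlns="http://www.tei-c.org/ns/1.0"
                exclude-result-prefixes="tei">

  <!-- Identity template: copy everything not handled below unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Named grammatical elements inside gramGrp become typed gram elements. -->
  <xsl:template match="tei:gramGrp/tei:pos | tei:gramGrp/tei:gen | tei:gramGrp/tei:iType
                       | tei:gramGrp/tei:mood | tei:gramGrp/tei:number | tei:gramGrp/tei:per
                       | tei:gramGrp/tei:tns | tei:gramGrp/tei:case | tei:gramGrp/tei:subc">
    <gram>
      <xsl:attribute name="type">
        <xsl:choose>
          <xsl:when test="self::tei:gen">gender</xsl:when>
          <xsl:when test="self::tei:iType">inflectionType</xsl:when>
          <xsl:when test="self::tei:per">person</xsl:when>
          <xsl:when test="self::tei:tns">tense</xsl:when>
          <xsl:when test="self::tei:subc">subcategory</xsl:when>
          <!-- pos, mood, number and case keep their names (case is an assumption). -->
          <xsl:otherwise><xsl:value-of select="local-name()"/></xsl:otherwise>
        </xsl:choose>
      </xsl:attribute>
      <!-- Carry over any other attributes, then the content. -->
      <xsl:apply-templates select="@*[name() != 'type'] | node()"/>
    </gram>
  </xsl:template>

</xsl:stylesheet>
```

With this sketch, <gramGrp><pos>noun</pos><gen>f.</gen></gramGrp> would come out as <gramGrp><gram type="pos">noun</gram><gram type="gender">f.</gram></gramGrp>. It can be run with any XSLT 1.0 processor, for example xsltproc convert.xsl dictionary.xml (both file names are hypothetical).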
Dear @ttasovac, if it looks good to you, I can create the list of gram types. I already have all this data systematized for the Portuguese, Spanish and French dictionaries of the Academy. I can contribute it as an initial starting list.
Thanks @anacastrosalgado. Let's wait a little — we need to resolve our different visions about the elements first, and we can handle the types after that...
Can I plead for subc? We need it in our 17th-century dictionaries.
@WGBS2, if my proposal goes through, you would be free to use <gram type="subcategory"> instead of <subc>. But my general opposition to subc is that it's simply not semantically expressive enough.
If you do <gram type="pos">noun</gram> <gram type="gender">f.</gram> you will know in each case the exact nature of the grammatical information that is being recorded. A subcategory doesn't say what the thing is or what category it's subsumed under.
It is however a clean and clear means of managing information that our lexicographers provide. I am not keen on overuse of type attributes when a good element exists. Dictionaries are complex and oversimplification renders their description too simplistic.
@WGBS2 I think you're missing the point of both TEI Lex-0 and this whole thread; otherwise I don't think you would be jumping to the conclusion that we are here in the business of oversimplifying or not fully understanding the complexity of dictionaries.
The issue we're discussing is quite different. It stems from the fact that TEI is overcomplicating things by allowing both <pos> and <gram type="pos"> to describe exactly the same thing. We're trying to streamline that.
I'm very much on the same side of the »conflict« as @ttasovac. For most applications, <pos> and <gram type="pos"> and all the other elements threatened to be axed are mutually equivalent. There is no clear direction for users as to which encoding to choose. This is completely against the gist of Lex0. To restate my comment in https://github.com/DARIAH-ERIC/lexicalresources/issues/31#issuecomment-420573222: »There should be one – and preferably only one – obvious way to do it.«
I see only one notable difference between the two encodings, though. In principle, you could type <pos> but for obvious reasons you cannot type <gram type="pos">. Alas, typing <pos> and friends is not allowed in vanilla TEI and would need customization anyway.
@laurentromary and I discussed this today and I am happy to report that this long-standing ticket can now be closed. We've agreed on the simplified mechanism and the use of typed <gram> elements instead of the more granular but non-exhaustive elements like <pos>, <tns> etc.
I have documented this in Section 2.3 on Grammatical Properties, within Chapter 2: Entries.
I will close this ticket now, but if you spot anything that you feel needs further explanation, feel free to open a new ticket.
Hi.
In Piotr's first go at the TEI Lex-0 customization, he did something which I very much approve of, he got rid of
case
gen
iType
mood
number
per
subc
tns
Now, for the time being, I left gen because we have examples such as:
and
but what I would really like to do is streamline this even further, and allow only gram.
We currently use a lot of pos in our examples (and I personally use pos exclusively in our dictionaries), but I think that for our purposes gramGrp/gram would be a better fit, precisely because we may need to encode things like pluralia tantum or something like that which is really not describing a part of speech.
So I just need some feedback on my proposal to allow only gram within gramGrp (and then work on the use of @type, @value etc. for it).
Keep in mind that this would mean that we can no longer have an example like this:
and that, as I said, we would have to recommend a typology, which is always hard, but I don't think it's tenable (especially for our generic tools down the line) to allow so many different elements within gramGrp.