Open ttasovac opened 6 years ago
We haven't discussed this in great detail, but I need us to jumpstart this — also because my students in Lisbon need to encode some etymologies today in TEI Lex-0.
For the time being, I think we need:
We will definitely discuss this and what our final recommendation will be. This is just to start the process.
Thanks, @laurentromary. I'll take a look.
One more general question, for you or anybody: with cit type="etymon"/form, I think it would be more natural to put the xml:lang on the form rather than the cit. Would you be ok with that? el is Modern Greek, which wouldn't be appropriate in the following example. This is from Johnson's dictionary:
<etym type="borrowing"><pc>[</pc><cit type="etymon">
<form xml:lang="grc"><orth>λεξικὸν</orth></form>
</cit> and <cit type="etymon">
<form xml:lang="grc"><orth>γράφω</orth></form>
</cit>; <cit type="etymon">
<form xml:lang="fr">lexicographe</form>
<pc>,</pc>
<lang value="fr">Fr.</lang>
</cit><pc>]</pc>
</etym>
For xml:lang, we should refer to BCP 47 and not to ISO 639 directly (it sets rules on how to use parts 2 and 3, for instance). My bible is always the IANA language subtag registry: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
Shouldn't you put a <lbl> around "and" in your example?
For xml:lang, we should refer to BCP 47 and not to ISO 639 directly (it sets rules on how to use parts 2 and 3, for instance). My bible is always the IANA language subtag registry: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
Sure. I just don't like the fact that we have two-letter codes for modern languages and then a three-letter code for an ancient language, but I know that my 'liking' things is totally beside the point! :smiley:
Shouldn't you put a <lbl> around "and" in your example?
Yes, I was rushing... I think it will be a hard sell (I can imagine the questions starting with: "why is this a label"?) but yes, we don't like mixed content etc.
But, if I may ask again: are you ok with xml:lang on form and not on cit?
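As a rough sketch of the two points raised so far (a label around "and", and xml:lang on form), the Johnson example might be restated as follows. This is only an illustration, not an agreed recommendation. Wrapping "lexicographe" in <orth> is an assumption made for consistency with the other two etymons, and the attributes are copied from the example above without checking them against the schema.

```xml
<etym type="borrowing">
  <pc>[</pc>
  <cit type="etymon">
    <form xml:lang="grc"><orth>λεξικὸν</orth></form>
  </cit>
  <!-- "and" wrapped in <lbl> so that <etym> contains no bare text nodes -->
  <lbl>and</lbl>
  <cit type="etymon">
    <form xml:lang="grc"><orth>γράφω</orth></form>
  </cit>
  <pc>;</pc>
  <cit type="etymon">
    <!-- <orth> added here as an assumption; the original has text directly in <form> -->
    <form xml:lang="fr"><orth>lexicographe</orth></form>
    <pc>,</pc>
    <lang value="fr">Fr.</lang>
  </cit>
  <pc>]</pc>
</etym>
```

Under the alternative placement, the same language tags would simply move up from each <form> to its parent <cit type="etymon">.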
Do we need to take a decision on the fly now? My gut relates this to @xml:lang on <entry> (and not on entry/form).
We can't and don't need to make the final decision now. But I need to present something as a temporary solution for our exercises today (we start in an hour and a half). I can put the xml:lang back on cit for today, but I still think we need to think about it a little more...
Absolutely. One element is the notion of when xml:lang is used to indicate the object language (such as in entry)
For BasNum, I always use three-letter country codes to reduce ambiguity. I use xml:lang on entry, but I find it somewhat redundant: even if a word is of foreign origin, Furetière/Basnage considered it a French word. See aile (pronounced ale) for the English beer enjoyed by young Parisians at the end of the 17th century.
Geoffrey
On 3 Jul 2019 at 18:19, laurentromary wrote:
Absolutely. One element is the notion of when xml:lang is used to indicate the object language (such as in entry)
Two remarks:
1. text nodes
Should we remove textNode from the content model of etym? It would be nice to get rid of mixed content, but, on the other hand, we can't expect that all dictionaries will encode etymologies deeply. Some may simply mark up the etym section and leave everything inside as text.
My initial thought here is that yes, we should disallow textNodes, but recommend in the narrative guidelines that those who do not go granular simply add a <note> inside <etym>, i.e.
<etym>
<note>[λεξικὸν and γράφω; lexicographe, Fr.]</note>
</etym>
2. default type
We will need to discuss the typing. At the moment we put the types from Laurent's and Jack's paper, but those will need narrative explanations in the context of TEI Lex-0 because they may not be self-evident. We need to leave that longer conversation for later. (@anacastrosalgado and I will try to look at how our current typology works with the Portuguese Academy dictionary and will report back.)
But for the time being, with @type being required, I'm just wondering if we can come up with a default, catch-all type, which will be neither "borrowing" nor "inheritance" because those might simply be wrong in the given case.
Any thoughts @laurentromary, @iljackb?
I like the idea of the baseline provided with <note>. We should also signal a default way of marking up text nodes not identified as etymological components. Should we use <seg>, or be more prescriptive right away with specific elements (<pc>, <lbl>, etc.) or, like I suggested on another ticket, use <alternate> models depending on the nature of the source encoding?
I think we should preserve <pc> and <lbl> as specific elements for punctuation (when serving as delimiters between elements) and explicit labels. The text nodes not identified as etymological components should be placed in a different element.
Back in Berlin we were considering <desc>, which we currently do not allow in TL0. But <seg> may be better:
"<seg> (arbitrary segment) represents any segmentation of text below the ‘chunk’ level."
I don't know what a chunk is but I like that segs are arbitrary. Whereas:
"<desc> (description) contains a brief description of the object documented by its parent element, typically a documentation element or an entity."
implies a complete description, not fragments of it.
So, yes, I'd actually prefer <seg> to <desc>.
In our TEI Lex-0 Etym paper we (@iljackb, @laurentromary and I) propose seg[@type="desc"] for portions of text that cannot be marked up using any more specific element, yes. These things are typically not sound descriptions of anything but rather seem like arbitrary cut-offs from the running text (citing from the paper, e.g. »Others have proposed an etymology«, »with intervocalic«, »becoming«).
NB: To me, the whole business with avoiding mixed content feels a bit like over-engineering for prose-centered texts such as many etymologies. It doesn't provide much benefit to the modeling proper. Basically you just sort of confirm that yes, I didn't forget to mark this up as something more specific, it's just any <seg> of things I don't care about. It may be beneficial for certain parsers to avoid mixed content, though.
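To make the seg[@type="desc"] proposal concrete, here is a minimal sketch. The wording of the segment is one of the fragments quoted from the paper; the pairing with a Latin etymon and the "from" label is purely hypothetical and only shows where such a segment would sit next to more specific elements. The @type on <etym> is left out because its status is still under discussion in this thread.

```xml
<etym>
  <!-- running text that fits no more specific element -->
  <seg type="desc">Others have proposed an etymology</seg>
  <!-- hypothetical label and etymon, for illustration only -->
  <lbl>from</lbl>
  <cit type="etymon" xml:lang="la">
    <form><orth>dorsum</orth></form>
  </cit>
  <pc>.</pc>
</etym>
```

Every child of <etym> is an element here, so a processor never has to deal with bare text nodes between the etymological components.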
I just discovered that some of my Lex0 dictionaries (cf. https://gitlab.clarin.si/et/tei-lex0-sl) are no longer valid, because now etym/@type is required. I now found this issue and comment:
But for the time being, with @type being required, I'm just wondering if we can come up with a default, catch-all type, which will be neither "borrowing" nor "inheritance" because those might simply be wrong in the given case.
I think it is more "XML like" that if you don't know a value for some attribute, you don't write the attribute, i.e. why make it required and then have a "I don't know" value, rather than it being optional?
Note that the documentation is rife with examples of etym without @type, so right now it is pretty misleading what is ok and what not. I'd also bet (1 beer) that for most legacy dictionaries it won't be clear what kind of etymology an etym represents, or at least not simply machine-inferable, so the @type will be the exception rather than the rule.
I totally agree. Our
On 5 Jul 2019 at 20:19, Tomaž Erjavec wrote:
I just discovered that some of my Lex0 dictionaries (cf. https://gitlab.clarin.si/et/tei-lex0-sl) are no longer valid, because now etym/@type is required. I now found this issue and comment:
But for the time being, with @type being required, I'm just wondering if we can come up with a default, catch-all type, which will be neither "borrowing" nor "inheritance" because those might simply be wrong in the given case.
I think it is more "XML like" that if you don't know a value for some attribute, you don't write the attribute, i.e. why make it required and then have a "I don't know" value, rather than it being optional?
Note that the documentation is rife with examples of etym without @type, so right now it is pretty misleading what is ok and what not. I'd also bet (1 beer) that for most legacy dictionaries it won't be clear what kind of etymology an etym represents, or at least not simply machine-inferable, so the @type will be the exception rather than the rule.
Hi! In Portuguese dictionaries, when etymologists do not know the source of the materials they handle, "De origem obscura" [Of obscure origin] is the usual label. How would you recommend encoding this (@ttasovac, @laurentromary, @iljackb)? I would appreciate your help.
`<entry type="monolexicalWord" xml:lang="pt" xml:id="cota_b">
<etym type XXXX De origem obscura XXXX <sense xml:id="cota_1" n="1"> `
If it alternates with what would be an <etym>, maybe we should be going with one here as well, but typed undefined: <etym type="undefined">
So most simply I would do:
<etym>
<seg type="desc">De origem obscura</seg>
</etym>
If you want and/or think it would be useful, you could also put a value in etym/@type such as "unknown", "undefined", "obscure", etc. But you don't necessarily need that as the term in
On Wed, Sep 18, 2019 at 6:56 AM laurentromary notifications@github.com wrote:
If it alternates with what would be an <etym>, maybe we should be going with one here as well, but typed undefined.
I would imagine the term has variants, and relying on typing would make it possible to find the appropriate content unambiguously.
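Combining the two suggestions, a typed version of the "De origem obscura" case might look like the sketch below. The value "unknown" is just one of the candidates floated in this thread ("unknown", "undefined", "obscure"); no value has been agreed on.

```xml
<!-- the type value is a placeholder candidate, not an agreed TEI Lex-0 value -->
<etym type="unknown">
  <seg type="desc">De origem obscura</seg>
</etym>
```

A query on etym/@type would then find all such entries regardless of how a given dictionary words the label.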
I know this is kind of off-topic, but can I ask why this aversion to mixed content? That is one of the main reasons I use to sell XML instead of a serializing language like JSON for Digital Humanities.
hi @ambs,
I wouldn't call it an aversion. The only concern is that mixed content is sometimes more difficult to process. I know I've run into issues with white space in HTML that were really difficult to solve (and would differ between browsers etc.). But all in all I think everybody will agree with you that mixed content is sometimes a must, is often needed in humanistic texts (i.e. narratives, not tabular data), and yes, that's an argument in favor of XML over JSON, for sure.
I have one question concerning etymologies in TEILex-0:
In the paper by Bowers / Romary (Bowers / Romary), referencing with pRef and oRef in etymological information plays an important role.
However, in the schema of TEILex-0 both elements are excluded:
So, I am confused. What are the reasons for excluding pRef and oRef and for using ref instead?
Thank you for your answer.
Best wishes, Thomas
Etymology has not been officially added to TEI Lex-0 yet, for no other reason than a lack of time on the part of everybody involved. When etymology is finally added and documented properly, pRef and oRef are unlikely to make a comeback: we already reached a consensus that specific elements for orthographic references and pronunciation references are unnecessary from the point of view of TEI Lex-0, since we can use typed ref elements for that.
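A minimal sketch of what "typed ref elements" could look like in place of the two excluded elements. The type values "oRef" and "pRef" are an assumption here (the thread does not spell them out), and the ellipses are placeholders for actual content.

```xml
<!-- orthographic reference: instead of <oRef>…</oRef> -->
<ref type="oRef">…</ref>

<!-- pronunciation reference: instead of <pRef>…</pRef> -->
<ref type="pRef">…</ref>
```

The distinction between the two kinds of reference moves from the element name to the @type value, so only one element needs to be allowed in the schema.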
Thank you very much for your answer. So, I will use ref instead to meet the requirements of TEI Lex-0.
If you're not in a hurry: we need to finalise a paper on this by the end of the month. I could send you a stable draft by then. Laurent
On 11 Mar 2021 at 15:25, tklampfl wrote:
Thank you very much for your answer. So, I will use ref instead to meet the requirements of TEI Lex-0.
Hi Thomas,
Just to give a preview of how it is different in Lex0 Etym: if you are encoding a declaration of an etymon, cognate or derivative, the format is still within
<cit type="etymon" xml:lang="pt">
<form>
<orth>humano</orth>
</form>
</cit>
But if it is a cross reference (such as the type that might occur in
running text), that is when you would use (within
....<xr type="related" subtype="etymon" xml:id="etym-dorsum" xml:lang="la"
dorsum....
If this is a pronunciation form you can use @notation (as you can with
Jack, what's your GitHub user name? I'd like to assign this to you.