DARIAH-ERIC / lexicalresources

Data space of the DARIAH Lexical Resources Working Group
https://dariah-eric.github.io/lexicalresources/
BSD 2-Clause "Simplified" License

Add etymology section from Jack's and Laurent's Paper #26

Open ttasovac opened 6 years ago

ttasovac commented 6 years ago

Jack, what's your GitHub user name? I'd like to assign this to you.

ttasovac commented 5 years ago

We haven't discussed this in great detail, but I need us to jumpstart this — also because my students in Lisbon need to encode some etymologies today in TEI Lex-0.

For the time being, I think we need:

We will definitely discuss this and what our final recommendation will be. This is just to start the process.

ttasovac commented 5 years ago

Merci, @laurentromary . I'll take a look.

One more general question — for you or anybody:

This is from Johnson's dictionary:

<etym type="borrowing"><pc>[</pc><cit type="etymon">
        <form xml:lang="grc"><orth>λεξικὸν</orth></form>
    </cit> and <cit type="etymon">
        <form xml:lang="grc"><orth>γράφω</orth></form>
    </cit>; <cit type="etymon">
        <form xml:lang="fr">lexicographe</form>
        <pc>,</pc>
        <lang value="fr">Fr.</lang>
    </cit><pc>]</pc>
</etym>
laurentromary commented 5 years ago

For xml:lang, we should refer to BCP 47 and not to ISO 639 directly (it sets rules on how to use parts 2 and 3, for instance). My bible is always the IANA language subtag registry: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

laurentromary commented 5 years ago

Should not you put a <lbl> around "and" in your example?
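
A sketch of what that might look like when applied to the Johnson example above. Only the `<lbl>` around "and" comes from the question itself; wrapping the semicolon in `<pc>` is an assumption, extrapolated from how the brackets and the comma are already encoded.

```xml
<etym type="borrowing"><pc>[</pc><cit type="etymon">
        <form xml:lang="grc"><orth>λεξικὸν</orth></form>
    </cit> <lbl>and</lbl> <cit type="etymon">
        <form xml:lang="grc"><orth>γράφω</orth></form>
    </cit><pc>;</pc> <cit type="etymon">
        <form xml:lang="fr">lexicographe</form>
        <pc>,</pc>
        <lang value="fr">Fr.</lang>
    </cit><pc>]</pc>
</etym>
```

With these two changes, `<etym>` no longer has any bare text as a direct child.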

ttasovac commented 5 years ago

> For xml:lang, we should refer to BCP 47 and not to ISO 639 directly (it sets rules on how to use parts 2 and 3, for instance). My bible is always the IANA language subtag registry: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Sure. I just don't like the fact that we have two-letter codes for modern languages and then a three-letter code for an ancient language, but I know that my 'liking' things is totally beside the point! :smiley:

> Should not you put a <lbl> around "and" in your example?

Yes, I was rushing... I think it will be a hard sell (I can imagine the questions starting with: "why is this a label"?) but yes, we don't like mixed content etc.

But, if I may ask again: are you ok with xml:lang on form and not on cit?
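
For reference, the two placements in question could be sketched as follows. The labels "Option A" and "Option B" are only for illustration; the thread does not settle on either.

```xml
<!-- Option A: xml:lang on <form>, as in the Johnson example above -->
<cit type="etymon">
    <form xml:lang="grc"><orth>γράφω</orth></form>
</cit>

<!-- Option B: xml:lang on <cit>; <form> and <orth> inherit the value -->
<cit type="etymon" xml:lang="grc">
    <form><orth>γράφω</orth></form>
</cit>
```

Because xml:lang is inherited by descendant elements, both variants give `<orth>` the same language. The difference is whether anything else inside `<cit>` (punctuation, a `<lang>` label) is also declared to be in that language.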

laurentromary commented 5 years ago

Do we need to make a decision on the fly now? My gut relates this to @xml:lang on <entry> (and not on entry/form).

ttasovac commented 5 years ago

We can't and don't need to make the final decision now. But I need to present something — as a temporary solution for our exercises today (we start in an hour and a half). I can put the xml:lang back on cit for today, but I still think we need to think about it a little more...

laurentromary commented 5 years ago

Absolutely. One element is the notion of when xml:lang is used to indicate the object language (such as in entry)

WGBS2 commented 5 years ago

For BasNum, I always use three-letter country codes in order to reduce ambiguity. I use xml:lang on entry, but I find it somewhat redundant, because even if a word is of foreign origin, Furetière/Basnage considered it a French word; see aile (pronounced ale) for the English beer enjoyed by young Parisians at the end of the 17th century.

Geoffrey

On 3 July 2019, at 18:19, laurentromary wrote:

> Absolutely. One element is the notion of when xml:lang is used to indicate the object language (such as in entry)

ttasovac commented 5 years ago

Two remarks:

1. text nodes

Should we remove textNode from the content model of etym? It would be nice to get rid of mixed content, but, on the other hand, we can't expect that all dictionaries will encode etymologies deeply. Some may simply mark up the etym section and leave everything inside as text.

My initial thought here is that yes, we should disallow textNodes, but recommend in the narrative guidelines that those who do not go granular simply add a <note> inside <etym>, i.e.

<etym>
    <note>[λεξικὸν and γράφω; lexicographe, Fr.]</note>
</etym>

2. default type

We will need to discuss the typing. At the moment we put the types from Laurent's and Jack's paper, but those will need narrative explanations in the context of TEI Lex-0 because they may not be self-evident. We need to leave that longer conversation for later. (@anacastrosalgado and I will try to look at how our current typology works with the Portuguese Academy dictionary and will report back.)

But for the time being, with @type being required, I'm just wondering if we can come up with a default, catch-all type, which will be neither "borrowing" nor "inheritance" because those might simply be wrong in the given case.

Any thoughts @laurentromary, @iljackb?

laurentromary commented 5 years ago

I like the idea of the baseline provided with <note>. We should also signal a default way of marking up text nodes not identified as etymological components. Should we use <seg>, or be more prescriptive right away with specific elements (<pc>, <lbl>, etc.), or, as I suggested on another ticket, use <alternate> models depending on the nature of the source encoding?

ttasovac commented 5 years ago

I think we should preserve <pc> and <lbl> as specific elements for punctuation (when serving as delimiters between elements) and explicit labels. The text nodes not identified as etymological components should be placed in a different element.

Back in Berlin we were considering <desc> which we currently do not allow in TL0. But <seg> may be better:

> <seg> (arbitrary segment) represents any segmentation of text below the ‘chunk’ level.

I don't know what a chunk is but I like that segs are arbitrary. Whereas:

> <desc> (description) contains a brief description of the object documented by its parent element, typically a documentation element or an entity.

implies a complete description, not fragments of it.

So, yes, I'd actually prefer <seg> to <desc>.

xlhrld commented 5 years ago

In our TEI Lex-0 Etym paper we (@iljackb, @laurentromary and I) propose seg[@type="desc"] for portions of text that cannot be marked up using any more specific element, yes. These are typically not proper descriptions of anything; they look more like arbitrary cut-offs from the running text (citing from the paper, e.g. »Others have proposed an etymology«, »with intervocalic«, »becoming«).

NB: To me, the whole business of avoiding mixed content feels a bit like over-engineering for prose-centered texts such as many etymologies. It doesn't provide much benefit to the modeling proper. Basically you just confirm that yes, I didn't forget to mark this up as something more specific, it's just any <seg> of things I don't care about. It may be beneficial for certain parsers to avoid mixed content, though.
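
A minimal sketch of the seg[@type="desc"] proposal, using one of the fragments quoted from the paper. The surrounding structure is an assumption made for illustration, and @type on `<etym>` is left out here even though the schema discussed in this thread requires it.

```xml
<etym>
    <!-- Running text with no more specific role goes into a typed <seg> -->
    <seg type="desc">Others have proposed an etymology</seg>
    <!-- Etymons, labels and punctuation would follow here,
         each in a more specific element such as <cit>, <lbl> or <pc> -->
</etym>
```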

TomazErjavec commented 5 years ago

I just discovered that some of my Lex0 dictionaries (cf. https://gitlab.clarin.si/et/tei-lex0-sl) are no longer valid, because etym/@type is now required. I then found this issue and this comment:

> But for the time being, with @type being required, I'm just wondering if we can come up with a default, catch-all type, which will be neither "borrowing" nor "inheritance" because those might simply be wrong in the given case.

  1. I think it is more "XML like" that if you don't know a value for some attribute, you don't write the attribute, i.e. why make it required and then have a "I don't know" value, rather than it being optional?

  2. Note that the documentation is rife with examples of etym without @type, so right now it is pretty misleading about what is ok and what is not. I'd also bet (1 beer) that for most legacy dictionaries it won't be clear what kind of etymology an etym represents, or at least it won't be easily machine-inferable, so @type will be the exception rather than the rule.
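
As a sketch of the first point: if the requirement is expressed in the TEI Lex-0 ODD through attDef/@usage (an assumption; the actual specification may be organized differently), making the attribute optional again would be a one-value change.

```xml
<elementSpec ident="etym" module="dictionaries" mode="change">
    <attList>
        <!-- usage="opt" makes @type optional; usage="req" would make it required -->
        <attDef ident="type" mode="change" usage="opt"/>
    </attList>
</elementSpec>
```

With usage="opt", an `<etym>` without @type validates, and no "I don't know" value is needed.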

WGBS2 commented 5 years ago

I totally agree. Ours are word histories, and more story than history. I shall only try classifying, using type, once I have the full encoding and have talked with real etymologists. I must say, I am wondering whether I can even attempt to stay in TLex0, as it is simply too simplistic for heritage dictionaries.

On 5 July 2019, at 20:19, Tomaž Erjavec wrote:

> I just discovered that some of my Lex0 dictionaries (cf. https://gitlab.clarin.si/et/tei-lex0-sl) are no longer valid, because etym/@type is now required. I then found this issue and this comment:
>
> > But for the time being, with @type being required, I'm just wondering if we can come up with a default, catch-all type, which will be neither "borrowing" nor "inheritance" because those might simply be wrong in the given case.
>
> I think it is more "XML like" that if you don't know a value for some attribute, you don't write the attribute, i.e. why make it required and then have a "I don't know" value, rather than it being optional?
>
> Note that the documentation is rife with examples of etym without @type, so right now it is pretty misleading about what is ok and what is not. I'd also bet (1 beer) that for most legacy dictionaries it won't be clear what kind of etymology an etym represents, or at least it won't be easily machine-inferable, so @type will be the exception rather than the rule.

anacastrosalgado commented 4 years ago

Hi! In Portuguese dictionaries, when etymologists do not know the source of the materials they handle, "De origem obscura" [Of obscure origin] is the usual label. How do you recommend encoding this (@ttasovac, @laurentromary, @iljackb)? I would appreciate your help.

`<entry type="monolexicalWord" xml:lang="pt" xml:id="cota_b">

cota kˈɔtɐ :2
s. f.

<etym type XXXX De origem obscura XXXX <sense xml:id="cota_1" n="1"> `

laurentromary commented 4 years ago

If it alternates with what would be an <etym>, maybe we should be going with one here as well, but typed undefined. <etym type="undefined">

iljackb commented 4 years ago

So most simply I would do:

       <etym>
          <seg type="desc">De origem obscura</seg>
       </etym>

If you want and/or think it would be useful, you could also put a value in etym/@type such as "unknown", "undefined", "obscure", etc. But you don't necessarily need that, as the term in <seg> is enough to be able to search for where the etymology isn't known.

On Wed, Sep 18, 2019 at 6:56 AM laurentromary wrote:

> If it alternates with what would be an <etym>, maybe we should be going with one here as well, but typed undefined.

laurentromary commented 4 years ago

I would imagine the term has variants, and relying on typing would make it possible to find the relevant content unambiguously.
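
The two suggestions can be combined. The sketch below assumes "undefined" as the value; it is only one of the candidates mentioned in this thread ("unknown", "undefined", "obscure"), and no decision on it is recorded here.

```xml
<etym type="undefined">
    <seg type="desc">De origem obscura</seg>
</etym>
```

An XPath such as `//etym[@type='undefined']` would then find every such entry regardless of how the dictionary words the label, while the label itself stays searchable as text.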

ambs commented 3 years ago

I know this is kind of off-topic, but can I ask why this aversion to mixed content? That is one of the main reasons I use to sell XML over a serialization format like JSON for Digital Humanities.

ttasovac commented 3 years ago

hi @ambs,

i wouldn't call it an aversion. the only concern is that mixed content is sometimes more difficult to process. I know i've run into issues with white space in html that were really difficult to solve (and that differed between browsers etc.). but all in all I think everybody will agree with you that mixed content is sometimes a must, is often needed in humanistic texts (i.e. narratives, not tabular data), and yes, that's an argument in favor of XML over JSON, for sure.

tklampfl commented 3 years ago

I have one question concerning etymologies in TEI Lex-0: in the paper by Bowers / Romary, referencing with pRef and oRef in etymological information plays an important role. However, in the TEI Lex-0 schema both elements are excluded, so I am confused. What are the reasons for excluding pRef and oRef and for using ref instead?

Thank you for your answer.

Best wishes, Thomas

ttasovac commented 3 years ago

Etymology has not been officially added to TEI Lex-0 yet, for no other reason than a lack of time on the part of everybody involved. When etymology is finally added and documented properly, pRef and oRef are unlikely to make a comeback: we already reached a consensus that dedicated elements for orthographic and pronunciation references are unnecessary from the point of view of TEI Lex-0, since typed ref elements can do the same job.
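
A rough sketch of the substitution described above. The @type values "oRef" and "pRef" are assumptions chosen for illustration and should be checked against the TEI Lex-0 documentation; the forms reuse the headword "cota" from earlier in this thread.

```xml
<!-- Plain TEI: dedicated elements for orthographic and pronunciation references -->
<oRef>cota</oRef>
<pRef>kˈɔtɐ</pRef>

<!-- TEI Lex-0 approach: one generic element, distinguished by @type (values assumed) -->
<ref type="oRef">cota</ref>
<ref type="pRef">kˈɔtɐ</ref>
```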

tklampfl commented 3 years ago

Thank you very much for your answer. So, I will use ref instead to meet the requirements of TEI Lex-0.

laurentromary commented 3 years ago

If you’re not in a hurry: we need to finalise a paper on this by the end of the month. I could send you a stable draft by then. Laurent

On 11 March 2021, at 15:25, tklampfl wrote:

> Thank you very much for your answer. So, I will use ref instead to meet the requirements of TEI Lex-0.

iljackb commented 3 years ago

Hi Thomas,

Just to give a preview of how it differs in Lex0 Etym: if you are encoding a declaration of an etymon, cognate or derivative, the format is still within <cit> as in the first paper, but with <form> and <orth>/<pron>:

           <cit type="etymon" xml:lang="pt">
              <form>
                 <orth>humano</orth>
              </form>
           </cit>

But if it is a cross-reference (such as the type that might occur in running text), that is when you would use <ref> (within <xr>), e.g. as follows:

....<xr type="related" subtype="etymon" xml:id="etym-dorsum" xml:lang="la"

dorsum....

If this is a pronunciation form you can use @notation (as you can with […]); otherwise it is assumed to be orthographic or simply unspecified. So whether or not you should use <ref> according to our recommendations depends on the function of the form.

This is just to let you know how we are treating these differently in the new guidelines. But I see Laurent has responded, so the details will best be explained in the paper itself when you get it.

Best,
Jack

On Thu, Mar 11, 2021 at 3:25 PM tklampfl wrote:

> Thank you very much for your answer. So, I will use ref instead to meet the requirements of TEI Lex-0.