DARIAH-ERIC / lexicalresources

Data space of the DARIAH Lexical Resources Working Group
https://dariah-eric.github.io/lexicalresources/
BSD 2-Clause "Simplified" License

Provide encoding proposal for cantor, a n. m., f. #55

Closed laurentromary closed 4 years ago

laurentromary commented 5 years ago

Following an example from Ana Salgado. My suggestion would be to have two <form type="inflected"> elements in conjunction with one lemma form for cantor.

laurentromary commented 5 years ago

Following a discussion with Toma, we have the following thoughts:

<entry>
   <!-- cantor,a n. m., f. -->
   <form type="lemma">
      <orth>cantor</orth>
      <gramGrp>
         <pos>n</pos>
      </gramGrp>
   </form>
   <form type="inflected">
      <orth>cantor</orth>
      <gramGrp>
         <gen>m</gen>
      </gramGrp>
   </form>
   <form type="inflected">
      <orth>cantora</orth>
      <gramGrp>
         <gen>f</gen>
      </gramGrp>
   </form>
</entry>
<entry>
   <form type="lemma">
      <orth>cantor,a</orth>
      <gramGrp>
         <pos>n.</pos>
         <gen>m.</gen>
         <pc>,</pc>
         <gen>f.</gen>
      </gramGrp>
   </form>
</entry>
<entry>
   <form type="lemma">
      <orth>cantor</orth>
      <pc>,</pc>
      <form type="ending">
         <orth>a</orth>
      </form>
      <gramGrp>
         <pos>n.</pos>
         <gen>m.</gen>
         <pc>,</pc>
         <gen>f.</gen>
      </gramGrp>
   </form>
</entry>
ambs commented 5 years ago

Hopefully we can find one single way to encode things (semantically, never visually). Adding gramGrp to the forms, instead of placing it outside them as TEI proposes, looks interesting.

Nevertheless, I would not add a separate inflected form for the masculine. That one would be in the lemma, and there would be only one inflected form.

It is clear that, when inflecting, everything should be inherited from the lemma, except for whatever is specifically mentioned (in this case, the gender), which replaces the inherited value.
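
A minimal sketch of what this could look like, reusing the elements from the first proposal above. It assumes that the lemma form carries the masculine gender and that the inflected form inherits <pos> from the lemma; that inheritance rule is a convention of this proposal, not something TEI itself enforces.

```xml
<entry>
   <!-- cantor,a n. m., f. -->
   <form type="lemma">
      <orth>cantor</orth>
      <gramGrp>
         <pos>n</pos>
         <gen>m</gen>
      </gramGrp>
   </form>
   <!-- assumed to inherit <pos> from the lemma; only the gender is overridden -->
   <form type="inflected">
      <orth>cantora</orth>
      <gramGrp>
         <gen>f</gen>
      </gramGrp>
   </form>
</entry>
```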

ttasovac commented 5 years ago

Hi @ambs, lexical vs. editorial view is a big challenge for retrodigitized dictionaries. While I would also prefer that we always go for the lexical view, we cannot really enforce it 100% in TEI Lex-0 because some projects will insist on being able to represent accurately the dictionary as it appears in its print edition. Beauty is in the eyes of the encoder 😄.

If you can go with the lexical view with the DACL, I'd be all in favor of it, but that would also mean changing the order in which elements appear OR you would have to do some additional transformations to display things as they were in the original dictionary, but I'm not sure how sustainable that is.

P.S. We haven't met yet in person, but I know you through Ana and let me just say how delighted I am that you're helping out with the conversion of the DACL. I'm also a big fan of the eXist-based backend you built for the Academy!

ambs commented 5 years ago

Hi, @ttasovac. I hope we meet someday. Who knows, maybe at the eLex conference.

Regarding display vs content, I understand the struggle. I remember learning MathML and finding its display encoding awful (and for math, which is so formal, it makes even less sense than for a dictionary).

For DACL, as we are not aiming to display it exactly as it was printed, I prefer to go for the structural encoding, even if then I need to rework the XML to print it properly. :-)
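
As a rough illustration of that kind of rework, here is an XSLT 1.0 sketch that rebuilds a print-style headword from the structural encoding. It assumes un-namespaced elements and <gen>f</gen> values exactly as in the first cantor example of this thread; real TEI files would need the TEI namespace bound to a prefix, and the stylesheet is hypothetical, not part of any DACL tooling.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:output method="text"/>

   <!-- Assumption: no namespace, and gender values written as "f" -->
   <xsl:template match="entry">
      <xsl:variable name="lemma" select="form[@type='lemma']/orth"/>
      <xsl:variable name="fem"
                    select="form[@type='inflected'][gramGrp/gen='f']/orth"/>
      <!-- headword as printed -->
      <xsl:value-of select="$lemma"/>
      <!-- print only the ending when the feminine form extends the lemma -->
      <xsl:if test="$fem and starts-with($fem, $lemma) and $fem != $lemma">
         <xsl:text>, </xsl:text>
         <xsl:value-of select="substring-after($fem, $lemma)"/>
      </xsl:if>
   </xsl:template>
</xsl:stylesheet>
```

Applied to the first cantor entry, this should give "cantor, a". Feminine forms that do not simply extend the lemma would need their own rule.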

anacastrosalgado commented 5 years ago

My dears @laurentromary and @ttasovac

Must we use <gram type="pos">, or can we just use <pos>?

<entry type="derivativeWord">
   <form type="lemma">
      <orth>ensonado</orth>
      <gramGrp>
         <gram type="pos" ud:norm="NOUN">n.</gram>
         <gram type="gen">f.</gram>
      </gramGrp>
   </form>
   <form type="inflected">
      <orth>ensonado</orth>
      <gramGrp>
         <gram type="gen">m.</gram>
      </gramGrp>
   </form>
   <form type="inflected">
      <orth>ensonada</orth>
      <gramGrp>
         <gram type="gen">f.</gram>
      </gramGrp>
   </form>
</entry>

@ambs

laurentromary commented 5 years ago

Well... Since you ask. TEI Lex-0 recommends the <gram> version when transforming data into a single target format (e.g. for the Elexis use case). This is too disruptive in my view, and I would strongly advocate keeping <pos>. :-}
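
For comparison, a sketch of the same grammatical group written both ways. The type values follow the example above, and the two fragments are meant as alternatives, not to appear together.

```xml
<!-- typed <gram> elements, as in the example above -->
<gramGrp>
   <gram type="pos">n.</gram>
   <gram type="gen">m.</gram>
</gramGrp>

<!-- the same information with the dedicated TEI elements -->
<gramGrp>
   <pos>n.</pos>
   <gen>m.</gen>
</gramGrp>
```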

ambs commented 5 years ago

I would go for

<gram>
    <pos>n.</pos>
    <gen>m.</gen>
</gram>

and when there are distinct POS, use

<gramGrp>
   <gram>
      <pos>n.</pos>
      <gen>m.</gen>
   </gram>
   <gram>
      <pos>adj.</pos>
   </gram>
</gramGrp>

Looks good?

iljackb commented 5 years ago

You can’t have these elements inside <gram>.

When there are 2 or more POS for an entry, you can do one of two things:

1) put the grammar information inside <sense>

     <entry>

        ….
        <sense n="1">
           <gramGrp>
              <pos>n.</pos>
              <gen>m.</gen>
           </gramGrp>
           ….
        </sense>
        <sense n="2">
           <gramGrp>
              <pos>adj.</pos>
           </gramGrp>
           ….
        </sense>
     </entry>

2) you can just have multiple <gramGrp>’s in a row (the presence of more than one indicates that there are contrasting features between the two)

     <entry>

        ...
        <gramGrp>
              <pos>n.</pos>
              <gen>m.</gen>
        </gramGrp>   
        <!-- <lbl> here if you want -->
        <gramGrp>
              <pos>adj.</pos>
        </gramGrp>   
       ….
     </entry>

if you want, you could put a <lbl> between the two to have a function word like “or”, “and”, etc.

anacastrosalgado commented 5 years ago

I think I'd prefer the first option, @iljackb

ambs commented 5 years ago

OK, a sequence of gramGrps looks good. As for lbls, I am running away from them.

iljackb commented 5 years ago

@Ana Salgado, good choice: the first is the most conventional way to do it :-)


ambs commented 5 years ago

I suspect we will have both situations: some where the grammar information goes on different senses, and others where the same sense needs more than one gramGrp (in which case, use a sequence of gramGrp elements).
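
A sketch of how the two situations might sit side by side in one entry. The grammatical labels are purely illustrative and the form elements are left out.

```xml
<entry>
   <!-- form elements omitted -->

   <!-- grammar differs per sense: one <gramGrp> inside the <sense> -->
   <sense n="1">
      <gramGrp>
         <pos>adj.</pos>
      </gramGrp>
      <!-- definition goes here -->
   </sense>

   <!-- one sense needing more than one: a sequence of <gramGrp> elements -->
   <sense n="2">
      <gramGrp>
         <pos>n.</pos>
         <gen>m.</gen>
      </gramGrp>
      <gramGrp>
         <pos>adj.</pos>
      </gramGrp>
      <!-- definition goes here -->
   </sense>
</entry>
```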

@anacastrosalgado, go for it :)

ana commented 5 years ago

hi @iljackb, replying by mail seems to be somehow broken and you're notifying me (@ana) instead of @anacastrosalgado

anacastrosalgado commented 4 years ago

I think you can close this issue...

<entry type="derivativeWord" xml:lang="pt" xml:id="antepassado.1" n="1">
   <form type="lemma">
      <orth>antepassado</orth>
   </form>
   <form type="inflected">
      <orth>antepassado</orth>
      <pron>ɐ̃tɨpɐsˈadu</pron>
      <gramGrp>
         <gram type="gen">m.</gram>
      </gramGrp>
   </form>
   <form type="inflected">
      <orth>antepassada</orth>
      <gramGrp>
         <gram type="gen">f.</gram>
      </gramGrp>
      <pron>ɐ̃tɨpɐsˈadɐ</pron>
   </form>
   <gramGrp>
      <gram type="pos" norm="ADJ">adj.</gram>
   </gramGrp>
</entry>

This example shows that when a specific inflected form is featured in the entry, it should be encoded as an independent form carrying enough information to identify the inflection (in this case, that the item is a feminine form). For the grammatical information, TEI Lex-0 suggests using the gramGrp element.