The Text Encoding Initiative Guidelines
bansp commented 1 year ago

This is a request coming from the Lexical Resources Summit 2023 (DARIAH, Berlin), convened by @ttasovac and @laurentromary . The immediate context is the TEI Lex0 customisation but @JessedeDoes suggests that the request could also be helpful in the ParlaMint project.

Essence: it would be most useful to be able to use the datcat attributes for taxonomies of any sort (the datcat attributes are no longer only a grammatical device). For that purpose, the <taxonomy> and <category> elements should be members of att.datcat.

When discussing the initial bundle of elements for the re-written datcat, a few months ago, we decided to start small and expand when there is a need. Now, the need comes from two well-established projects, and there is a good chance that the addition is going to be useful elsewhere as well.

Also, we said that, since taxonomies may use <equiv>, we'd see if a genuine need for the attribute class exists. Neither the ODD for Lex-0 nor the ODD for ParlaMint uses the tagdocs module. I think that this may qualify as a genuine need, because requiring these carefully crafted ODDs to use the tagdocs module only for the sake of equiv/@uri as the mechanism for aligning with external taxonomies is definitely far-fetched.

We're going to add a PR to this ticket, with some examples added to (at least) the att.datcat spec.
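
Until the change lands in the Guidelines, a project could approximate it locally in its own ODD. The fragment below is only a sketch under assumptions: the schema identifier `tei_datcat_taxonomy` and the selection of modules are hypothetical, and it assumes a TEI release in which <taxonomy> and <category> are not yet members of att.datcat.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="tei_datcat_taxonomy" start="TEI">
  <!-- hypothetical minimal module selection; tagdocs is not included -->
  <moduleRef key="tei"/>
  <moduleRef key="core"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>

  <!-- add <taxonomy> to the att.datcat attribute class -->
  <elementSpec ident="taxonomy" module="header" mode="change">
    <classes mode="change">
      <memberOf key="att.datcat" mode="add"/>
    </classes>
  </elementSpec>

  <!-- add <category> to the att.datcat attribute class -->
  <elementSpec ident="category" module="header" mode="change">
    <classes mode="change">
      <memberOf key="att.datcat" mode="add"/>
    </classes>
  </elementSpec>
</schemaSpec>
```

Once the elements are members of the class upstream, the two elementSpec blocks would become redundant and could be dropped.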

ttasovac commented 1 year ago

Just to add a small detail to what @bansp was saying. We are working on an edition of a Portuguese monolingual dictionary from the 18th century, and we are using taxonomy in the header to organize a hierarchy of domain labels in the dictionary. Each usg label in the dictionary points to a category in the taxonomy, but each category in the taxonomy points to an externally hosted ontology.

At the moment, we are using @corresp on each category to point to the external ontology, but I do believe that having @datcat in this context would be semantically more precise and therefore a better encoding choice than the generic @corresp mechanism.
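
To make the contrast concrete, here is a hedged sketch of the two encodings side by side. The identifiers and the `https://example.org/ontology#Geometry` URI are placeholders, not taken from the actual dictionary, and the second form is only valid once <category> is a member of att.datcat.

```xml
<taxonomy xmlns="http://www.tei-c.org/ns/1.0" xml:id="domains.sketch">
  <!-- current practice: generic alignment via @corresp -->
  <category xml:id="sketch.geometry.corresp"
            corresp="https://example.org/ontology#Geometry">
    <catDesc xml:lang="en">Geometry</catDesc>
  </category>

  <!-- proposed: the same alignment via an att.datcat attribute
       (placeholder URI; requires <category> in att.datcat) -->
  <category xml:id="sketch.geometry.datcat"
            valueDatcat="https://example.org/ontology#Geometry">
    <catDesc xml:lang="en">Geometry</catDesc>
  </category>
</taxonomy>
```

Whether @datcat or @valueDatcat is the better fit on <category> is a modelling decision; the sketch uses @valueDatcat only for illustration.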

sydb commented 1 year ago

Sounds quite reasonable to this linguistic ignoramus.

  1. Could someone post (or point to) an example of @datcat on <category>?
  2. Roughly how long do you think it might be before the PR is ready?

bansp commented 1 year ago

The PR needs a bit more love: where I mention "Morais dictionary", we probably want a project reference or at least a few more words and a link.

The resulting att.datcat is here: https://jenkins-paderborn.tei-c.org/view/LingSIG/job/TEIP5-LingSIG-tests/lastSuccessfulBuild/artifact/P5/release/doc/tei-p5-doc/en/html/ref-att.datcat.html

bansp commented 1 year ago

@anacastrosalgado is going to help and provide the missing reference, she said.

anacastrosalgado commented 1 year ago

@bansp Morais Silva, A. M. (1789). Diccionario da lingua portugueza composto pelo padre D. Rafael Bluteau, reformado, e accrescentado por Antonio de Moraes Silva, natural do Rio de Janeiro (vols. 1–2). Officina 730 de Simão Thaddeo Ferreira. MORDigital project (https://mordigital.fcsh.unl.pt/en/homepage/). The digital edition will be available via TEI Lex-0 Publisher at the end of the project.   Example:

bansp commented 1 year ago

Thanks, Ana. I've added the reference. Are you OK with the example used there? I think it came from you directly, but that was something like a year ago, so maybe you'd rather see some changes there while the Council is still in the process.

anacastrosalgado commented 1 year ago

@bansp Please see if it can be like this (@ttasovac , @laurentromary also take a look, please). It is the example that we used yesterday in our presentation during the TEI conference.

The att.datcat attributes can be used for any sort of taxonomy. The example below illustrates their usefulness for describing usage domain labels in dictionaries, showing lexicographic articles from a Portuguese legacy dictionary, the Morais dictionary [Morais Silva, A. M., (1789). Diccionario da lingua portugueza composto pelo padre D. Rafael Bluteau, reformado, e accrescentado por Antonio de Moraes Silva, natural do Rio de Janeiro, vols. 1–2. Officina 730 de Simão Thaddeo Ferreira]. The digital edition will be available via TEI Lex-0 Publisher at the end of the MORDigital project (https://mordigital.fcsh.unl.pt/en/homepage/).

<!--  in the dictionary header    -->
      <taxonomy xml:id="domains">
         <category xml:id="domain.mathematical_sciences"
            valueDatcat="http://www.semanticweb.org/OntoDomLab-Math#MathematicalSciences http://vocabs.rossio.fcsh.unl.pt/morais_domains/0036">
            <catDesc xml:lang="en">
               <term>Mathematical Sciences</term>
               <gloss>Group of areas of study that includes, in addition to mathematics, those
                  academic disciplines that are primarily mathematical in nature but may not
                  be universally considered subfields of mathematics proper.</gloss>
            </catDesc>
            <catDesc xml:lang="pt">
               <term>Ciências Matemáticas</term>
            </catDesc>
            <category xml:id="domain.mathematics"
               valueDatcat="http://www.semanticweb.org/OntoDomLab-Math#Mathematics http://vocabs.rossio.fcsh.unl.pt/morais_domains/0024">
               <catDesc xml:lang="en"><!-- ... --></catDesc>
               <catDesc xml:lang="pt"><!-- ... --></catDesc>
               <category xml:id="domain.arithmetic"
                  valueDatcat="http://www.semanticweb.org/OntoDomLab-Math#Arithmetic http://vocabs.rossio.fcsh.unl.pt/morais_domains/0003">
                  <catDesc xml:lang="en"><!-- ... --></catDesc>
                  <catDesc xml:lang="pt"><!-- ... --></catDesc>
               </category>
               <category xml:id="domain.geometry"
                  valueDatcat="http://www.semanticweb.org/OntoDomLab-Math#Geometry http://vocabs.rossio.fcsh.unl.pt/morais_domains/0018">
                  <catDesc xml:lang="en"><!-- ... --></catDesc>
                  <catDesc xml:lang="pt"><!-- ... --></catDesc>
               </category>
            </category>
         </category>
      </taxonomy>
<!-- inside an <entry> element: -->
<usg type="domain" valueDatcat="#domain.mathematics">Mathem.</usg>
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="MORAIS.DLP.1.ORDENADA" type="mainEntry" xml:lang="pt">
   <form type="lemma">
      <!-- ... -->
   </form>
   <metamark function="lemmaDelimiter">,</metamark>
   <gramGrp>
      <gram type="pos" norm="NOUN">ſ.</gram>
      <gram type="gen">f.</gram>
   </gramGrp>
   <sense xml:id="MORAIS.DLP.1.ORDENADA.s.1">
      <usg type="domain" valueDatcat="#domain.mathematics">Mathem.</usg>
      <def>linha recta tirada perpendicularmente do ponto da curva a ſeu eixo</def>
   </sense>
   <metamark function="senseDelimiter">.</metamark>
</entry>
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="MORAIS.DLP.1.TRIGONOMETRIA" type="mainEntry" xml:lang="pt">
   <form type="lemma">
      <!-- ... -->
   </form>
   <metamark function="lemmaDelimiter">,</metamark>
   <gramGrp>
      <gram type="pos" norm="NOUN">ſ.</gram>
      <gram type="gen">f.</gram>
   </gramGrp>
   <sense xml:id="MORAIS.DLP.1.TRIGONOMETRIA.s.1">
      <!-- invisible domain -->
      <usg type="domain" valueDatcat="#domain.mathematics" resp="#Salgado"/>
      <def>parte da Mathematica , que enſina a reſolver os triangulos planos , e esfericos</def>
   </sense>
   <metamark function="senseDelimiter">.</metamark>
</entry>

In the Morais dictionary, the relevant domain labels are organised in the header and referenced from usg elements inside the dictionary entries. The vocabulary used for dictionary-internal labelling is in turn anchored in the MORDigital controlled vocabulary service of the NOVA University of Lisbon – School of Social Sciences and Humanities (NOVA FCSH).

bansp commented 1 year ago

I should have phrased my last comment differently :-) Like "is there something in the example right now that makes it utterly wrong (rather than not beautiful enough)" ;-) Because if the example and text are "not super" but nevertheless not blatantly lying, then I would dearly prefer not to edit anything there right now, because I simply don't have the time for it -- maybe in December, but maybe only in January, if I can help it. One very important thing to remember is that the use of Morais in that piece of documentation should be treated as accidental -- it is used only to illustrate one case where the <taxonomy> element uses DCR attributes. One short example, because the spec needs to be readable, rather than TL;DR-able.

The last Jenkins build shows some errors, but I'm not at all sure that the errors come from the newly added reference. I have now pushed a new commit and hope that it lights green, and the ticket/PR gets accepted for merging.

bansp commented 1 year ago

Update: the Jenkins build keeps failing, but it looks a bit like an incompatibility between some configuration item (path?) and some backwards-incompatible modification in a new release of the Guidelines.

I wonder if lines such as

[xslt] WARNING: file https://www.tei-c.org/Vault/P5/4.5.0/VERSION cannot be read, so links will probably be broken

can be taken as indicative of what's wrong (see the console output).

Will pester @peterstadler about this at some point, but only after he's had a bit of a breather after the conference...

bansp commented 1 year ago

... and, as usual, Peter has not failed. Thanks! I've put the fix in just so the build doesn't fail and we can finally see the result at https://jenkins-paderborn.tei-c.org/view/LingSIG/job/TEIP5-LingSIG-tests/lastSuccessfulBuild/artifact/P5/release/doc/tei-p5-doc/en/html/ref-att.datcat.html .

See issue TEIC/TEI#2472 for progress on the Council side.

bansp commented 1 year ago

That hasn't worked as planned, compare

old console output, 6 errors: https://jenkins-paderborn.tei-c.org/job/TEIP5-LingSIG-tests/13/parsed_console/

new output, 8 errors: https://jenkins-paderborn.tei-c.org/job/TEIP5-LingSIG-tests/14/parsed_console/

-- so let me just wait for a fix by the Council.

ebeshero commented 1 year ago

@bansp I took a very quick look at the bug report and saw this issue right away: "ERROR: Guidelines.epub: OPS/XHTML file OPS/ref-att.calendarSystem.html is missing"

It looks like the build is missing a crucial file for ref-att.calendarSystem ?

ebeshero commented 1 year ago

@bansp That's not your doing, of course--just a recognition that the build problem is likely to do with activity last week and this on a different PR (uh oh...): https://github.com/TEIC/TEI/pull/2435

@raffazizzi and @sydb should be able to help here! I'll look in later--I'm headed back to the university trenches for the next several hours.

bansp commented 1 year ago

@ebeshero Thanks for giving it a check :-) I've withdrawn the modification and will just wait for whatever you guys end up doing, and will update the fork then. There's no need to divert Raff's or Syd's attention.

ebeshero commented 1 year ago

@bansp I think the proverbial dust has settled from yesterday's activities on the other PR! It's probably safe to update your branch now. But I also think it may be safe to ask our Council reviewers to check things out too.

bansp commented 1 year ago

Thanks, Elisa. No hurry on this end. I'll do my best to react to potential comments by the reviewers.

bansp commented 1 year ago

OTOH, there's no movement yet, in the dev branch of either TEI or Stylesheets. I'll just check back in a day or two :-)

bansp commented 11 months ago

Updated the pull request with the content coming from issue TEIC/TEI#2480 but still no go, at least not in the Paderborn Jenkins. It's the first time I can remember that the build tree has been broken for so long. Feels weird. I understand it's because we're waiting for some upstream sanity but I'm not sure that that is sensible. They must know they've broken stuff and, since they haven't bothered to fix it, shouldn't we go around them, as Martin suggests in TEIC/TEI#2472 ?

ebeshero commented 11 months ago

@bansp Sorry for the long wait, but for us it is only the documentation build that breaks. The "upstream sanity" you refer to is a decision that we will make when Council meets next Friday, October 13. We need Council discussion and consensus on the best path forward to resolve #2473.

bansp commented 11 months ago

Thanks, Elisa. By upstream sanity, I was referring to what seems to be a happy-go-lucky move by the Debian team, if I understood Martin correctly, and to their leaving matters unchanged despite the hiccup.

I realise that, for the Council, waiting is a reasonable option, up to a limit, and the costs are arguably low in this case.

Maybe a different make flow is what can be done in our case. Will ask Peter about that.