Digital-Humanities-Quarterly / dhq-journal

DHQ is an open-access, peer-reviewed journal of digital humanities.
http://www.digitalhumanities.org/dhq/

natural language identification not handled that well #74

Open sydb opened 6 months ago

sydb commented 6 months ago

The dhq2html.xsl program (which I was poking at for some other reason), in the various templates for <note>, sets the language by looking at ancestor::tei:text/@xml:lang. Since @xml:lang can appear on any one of <dhq:abstract>, <div>, <dhq:example>, <floatingText>, <foreign>, <q>, <quote>, <dhq:teaser>, <term>, <text>, <title>, or <word>, all of which can have <note> as a descendant, this seems prone to error. That is, if an input document contained

<text xml:lang="es">
  <div>
    … un montón de cosas …
    <quote xml:lang="en">He will always put his own interests, and
      gratifying his own ego, ahead of everything else, including
      the country’s interest.<note>Bill Barr was no longer the
      Attorney General when he said this.</note></quote>
  </div>
</text>

the output <note> will be flagged as being in Spanish, not English. A counterargument asks: what about the following?

<text xml:lang="es">
  <div>
    … un montón de cosas …
    <quote xml:lang="en">He will always put his own interests, and
    gratifying his own ego, ahead of everything else, including
    the country’s interest.<note>Bill Barr ya no era el Fiscal
    General cuando dijo esto.</note></quote>
  </div>
</text>

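for which the <note> would “correctly” be flagged as being in Spanish. My response to this counterargument is that this passage is incorrectly encoded per the rules of XML (over which we have no control): an @xml:lang applies to everything inside its element unless overridden, so this <note>, sitting inside <quote xml:lang="en"> with no @xml:lang of its own, is declared to be in English. If we are going to use @xml:lang, we have to follow the spec.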

To be fair, this is not a current problem[1] and may well never happen. But it seems to me we should guard against it anyway.
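A minimal sketch of the lookup being argued for, written in Python rather than XSLT so it can be run standalone. It resolves a <note>'s language from the nearest ancestor-or-self that carries @xml:lang (in XPath terms, roughly `ancestor-or-self::*[@xml:lang][1]/@xml:lang`) instead of from the ancestor <text>. The sample document, the placeholder wording, the missing TEI namespace, and the helper name `effective_lang` are all assumptions made for illustration; none of this is taken from dhq2html.xsl.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical input, modelled on the first example in this issue
# (no TEI namespace, placeholder wording).
SAMPLE = """<text xml:lang="es">
  <div>
    <quote xml:lang="en">quoted English text<note>English
      annotation</note></quote>
  </div>
</text>"""


def effective_lang(element, parent_of):
    """Return the xml:lang of the nearest ancestor-or-self that declares one."""
    node = element
    while node is not None:
        lang = node.get(XML_LANG)
        if lang is not None:
            return lang
        node = parent_of.get(node)  # the root has no parent, so this ends the loop
    return None


root = ET.fromstring(SAMPLE)
# ElementTree has no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}
note = root.find(".//note")

print("nearest declaration:", effective_lang(note, parent_of))
print("ancestor <text>:", root.get(XML_LANG))
```

With this input the nearest declaration is the one on <quote>, while the ancestor <text> lookup yields the document-level language, which is the mismatch described above.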

Notes

[1] There is only one article that contains a case of //text[@xml:lang]//*[@xml:lang]//*[@xml:lang], and in that case the ultimate @xml:lang is superfluous. See articles/000251/000251.xml circa line 299. (I would correct this myself, but I am not sure whether it should be encoded with <quote xml:lang="grc"> or <quote><foreign xml:lang="grc">; there are lots of cases of each method.)

[2] I think this issue should be fixed before #14 is handled. (Although only <said> and <p> are left; the rest already have @xml:lang.)

[3] I suspect this situation is a hold-over from a previous era when @xml:lang pretty much only appeared on <text> and <foreign>, in which case the current method makes sense. If instead there is an editorial rule that the natural language of an annotation has to match the natural language of its ancestor <text>, then I think that should be schema-enforced. (I.e., require that if there is an ancestor element other than <text> that has an @xml:lang, a <note> must have an @xml:lang that matches the one on its ancestor <text>.) In that case this ticket could either be closed as won’t-fix or be fixed; the results would be the same.

[4] BTW, this is also error-prone because there are some 50 texts that have <text>s inside <text>, in which case ancestor::tei:text/@xml:lang returns 2 items, although one of them is the empty string. That could cause problems someday if some code expects to use whatever it returns for something other than a string.
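To make the concern in note [4] concrete, here is a small Python sketch of what an "every ancestor <text>" lookup hands back when <text> elements are nested. The nested sample (including where the empty @xml:lang sits), the missing TEI namespace, and the helper name `ancestor_text_langs` are hypothetical; the actual DHQ articles may nest differently.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical nesting: an outer <text> and an inner <text> whose
# xml:lang is the empty string.
SAMPLE = """<text xml:lang="en">
  <group>
    <text xml:lang="">
      <body><p>inner paragraph<note>annotation</note></p></body>
    </text>
  </group>
</text>"""


def ancestor_text_langs(element, parent_of):
    """Collect xml:lang from every ancestor <text> that has one, nearest first."""
    langs = []
    node = parent_of.get(element)
    while node is not None:
        if node.tag == "text" and XML_LANG in node.attrib:
            langs.append(node.get(XML_LANG))
        node = parent_of.get(node)
    return langs


root = ET.fromstring(SAMPLE)
parent_of = {child: parent for parent in root.iter() for child in parent}
note = root.find(".//note")

langs = ancestor_text_langs(note, parent_of)
print(langs)  # a sequence of values, not a single string
```

Any caller that assumes a single value has to decide explicitly which item it wants (for example, the nearest one) rather than relying on implicit string conversion.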

amclark42 commented 5 months ago

I just discovered that DHQ does not in fact use the @lang or @xml:lang attributes to indicate when language shifts in HTML article content. I have to stress that using a class is not enough. For accessibility, we have to use one of the designated attributes.

(Sorry to butt in on your issue, Syd. Both of our problems are tangled up in the HTML-producing side of things, so.)

sydb commented 5 months ago

I think that — in the static site, at least — we should use both @lang and @xml:lang, and maybe leave @class, as well. My logic is that the only downside is filesize, but given that we already stuff all the CSS, JS, and images in there anyway, adding a few attrs will not make any significant difference.
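As a rough illustration of the "use both, and maybe keep the class" idea, the Python sketch below renders one HTML span carrying @lang, @xml:lang, and a class together. The class name `lang-en`, the function names, and the choice of <span> are assumptions for the example, not DHQ's actual output; as far as I know, HTML expects @lang and @xml:lang to carry the same value when both are present, so the sketch derives both from one argument.

```python
from html import escape


def lang_attrs(lang, keep_class=True):
    """Build the language-related attributes for an HTML element."""
    attrs = {}
    if keep_class:
        attrs["class"] = f"lang-{lang}"  # hypothetical class naming scheme
    attrs["lang"] = lang       # read by browsers and assistive technology
    attrs["xml:lang"] = lang   # kept identical to @lang
    return attrs


def render_span(text, lang):
    """Render text as a <span> marked with its natural language."""
    attr_str = " ".join(
        f'{name}="{escape(value, quote=True)}"'
        for name, value in lang_attrs(lang).items()
    )
    return f"<span {attr_str}>{escape(text)}</span>"


print(render_span("quoted English text", "en"))
```

The extra cost per language shift is a couple of short attributes, which fits the point above that file size is the only real downside.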