TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

`@assertedValue` of `<certainty>` should also accept pointers #2067

Open michalkozak opened 3 years ago

michalkozak commented 3 years ago

According to the schema for the <certainty> element, its @assertedValue attribute does not accept whitespace, although it is used to

provide an alternative value for the aspect of the markup in question — an alternative generic identifier, transcription, or attribute value, or the identifier of an anchor element (to indicate an alternative starting or ending location)

Its type is defined as teidata.word.

If someone would like to annotate an expression as uncertain and provide an asserted value which, in their opinion, should replace the uncertain piece of text, they have two options:

The second option is better because we create only the one uncertainty annotation we wanted, rather than as many artificial annotations as there are words.

The second option has a problem of its own, however. According to the TEI Guidelines, only identifiers of <anchor> elements are allowed. So, for example, the following annotation, which refers to the proposed transcription, is not fully compliant with the Guidelines:

<certainty  locus="value" target="#seg" assertedValue="#val"/>
<val xml:id="val">Lorem ipsum dolor sit amet</val>

Therefore we propose to enrich the type of @assertedValue with a pointer:

 <alternate>
  <dataRef key="teidata.word"/>
  <dataRef key="teidata.pointer"/>
 </alternate>

Pointers are also necessary when certainty tags refer to alternative values of attributes that take pointers as values (such as @ref or @sameAs). In particular, someone might want to record uncertainty about whether two entities are the same:

<certainty locus="value" cert="high" target="#person-3" match="@sameAs" assertedValue="file-1#person-100"/>

I would also like to point out that it is not clear from the TEI Guidelines for @assertedValue whether the hash (#) has to be used for the identifier of an anchor element. In the 8th example at https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CE.html we find:

<certainty xml:id="cert3" target="#CE-p2" locus="start" assertedValue="#CE-a1" given="#cert1" degree="0.1"/>

However, in a review of our article submitted to the Journal of TEI, the reviewer wrote:

... your assertedValue if giving an @xml:id (e.g. of an anchor element) doesn't need the #. Your use pointing to an entity in this way is certainly non-canonical TEI but a possible feature request should be opened with the TEI-C to recommend the ability of pointing to any @xml:id with this in the note for assertedValue on the reference page.

sydb commented 3 years ago

I think OP (@michalkozak) has raised an important shortcoming: that there is no way to indicate an alternate proposed transcription in @assertedValue of <certainty>, even though the prose specifically licenses this. I think this is an important problem that needs attention.

But moreover, he raises two other important issues (perhaps without realizing it). First, that there is no mechanism, even though the prose suggests there is, to indicate which attribute one is uncertain about. The Guidelines do not provide any mechanism to ascertain whether a locus="name" is talking about the element name or an attribute name, and if an attribute name, which one. Interestingly, P2–P4 had such a mechanism (P1 did not have a <certainty> element); somehow it (the keyword for GI) was lost in the transition to P5.

Second, that there are examples of @assertedValue in the Guidelines that are clearly intended to be pointers, even though the datatype says they are just a token. (This touches upon the long-standing debate over the meaning of teidata.word. It was originally intended simply to be a datatype for strings that, because they would be processed by programs, would do better to avoid things like white space and control characters, etc. It has since morphed into a keyword of some kind, which was never intended. Furthermore, most of the syntactic constraints have been removed as well, rendering it nearly useless.)

OP also proposes a solution (attribute assertedValue { ( teidata.word | teidata.pointer ) }) which I think is an untenably bad idea as it stands, for the reasons set forth previously. (But to be absolutely clear: although I do not like that particular proposed solution, I definitely think we need to come up with a solution.) I do not think it would take much tweaking to make it a useful solution, though. (Basically a mechanism for automatic differentiation of the datatype being used. But might be better just to have a different mechanism, e.g. @assertedValueTarget, entirely.)

I am not sure, but I strongly suspect this problem arose as collateral damage from the “war on attributes”. (That is, the effort to eliminate transcriptional text in attribute values, both because there is no way in XML (as there had been in SGML) to encode a character outside of Unicode in an attribute value, and because of the difficulties in indicating the natural language of the passage transcribed into an attribute value.) So I think the solution some might consider obvious, “allow transcribed text as the value of @assertedValue”, is unacceptable.

Possible solutions include (with the last two being bad ideas, IMHO):

  1. New attribute, @assertedValueTarget or some such.
  2. Some syntactic flag on @locus to differentiate whether @assertedValue is using keywords or pointers (which is what P4 did).
  3. Allowing either teidata.word for keywords or teidata.pointer for pointers without some rule for automatic differentiation.
  4. Using transcribed text in @assertedValue.

Probably others I am now too tired to think of. :-)
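To make the objection to option 3 concrete, here is a minimal sketch. The xml:id values ("s1", "val") are hypothetical, and the reading attached to each value is an assumption. The point is only that both strings appear to satisfy teidata.word as well as teidata.pointer, so a bare alternation gives a processor no way to tell which was meant.

```xml
<!-- Hypothetical fragment; the ids "s1" and "val" are illustrative only. -->

<!-- Presumably a keyword (an element name), yet "emph" is also a
     syntactically valid relative URI reference. -->
<certainty target="#s1" locus="name" assertedValue="emph"/>

<!-- Presumably a pointer to an element whose xml:id is "val", yet "#val"
     contains no white space and so also passes as teidata.word. -->
<certainty target="#s1" locus="value" assertedValue="#val"/>
```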

P.S. @michalkozak: I do not understand the reviewer's comments. In TEI we do not use the ID/IDREF mechanism (with which "duck" would point to an element with an @xml:id of "duck"), but rather the URI/ID mechanism (with which "#duck" is used for the same). But the history of <certainty> represents sort of the opposite: in P4 values that started with ‘#’ were keywords, those that did not were pointers.
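A minimal sketch of the two pointing styles, reusing the "duck" identifier from the paragraph above. The <ptr> element is an assumption, used here only as a convenient carrier for a pointer-valued attribute, and the IDREF-style form appears only in a comment because it is not how TEI P5 points.

```xml
<anchor xml:id="duck"/>

<!-- URI/ID mechanism (TEI P5): the fragment identifier selects the element
     whose xml:id is "duck" in the current document. -->
<ptr target="#duck"/>

<!-- ID/IDREF mechanism (not used for TEI P5 pointers): the bare name would be
     the reference, e.g. target="duck". Read as a URI reference, "duck" is
     instead a relative path to some other resource named "duck". -->
```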

Anyway, thanks for raising this!

michalkozak commented 3 years ago

I think @sydb was very tired when writing:

But moreover, he raises two other important issues (perhaps without realizing it). First, that there is no mechanism, even though the prose suggests there is, to indicate which attribute one is uncertain about. The Guidelines do not provide any mechanism to ascertain whether a locus="name" is talking about the element name or an attribute name, and if an attribute name, which one. Interestingly, P2–P4 had such a mechanism (P1 did not have a <certainty> element); somehow it (the keyword for GI) was lost in the transition to P5.

In P5 we have att.scoping, i.e. the @target and @match attributes. So when we want to indicate that a certainty with locus="name" refers to an element name, we use only @target, and when we want to point to an attribute we additionally use @match. The same rules apply to locus="value". For the remaining values of @locus (start, end and location) only @target makes sense.
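A sketch of the three patterns just described. The xml:id "n1", the choice of <persName> and @ref, and the asserted and @cert values are assumptions made for illustration. @match is an XPath selection evaluated within the context identified by @target.

```xml
<persName xml:id="n1" ref="#person-7">Elizabeth</persName>

<!-- locus="name", element: @target alone; the asserted value is another
     element name. -->
<certainty target="#n1" locus="name" assertedValue="placeName" cert="low"/>

<!-- locus="value", attribute: @target selects the element and @match
     selects the attribute on it. -->
<certainty target="#n1" match="@ref" locus="value" cert="medium"/>

<!-- locus="start": only @target is meaningful. -->
<certainty target="#n1" locus="start" cert="medium"/>
```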

However, the problem with asserted value is indeed serious. We can analyse all the cases here:

  1. locus="name" - uncertainty concerns whether the name of the element or attribute used is correctly applied. So when we would like to add asserted value for the (un)certainty, teidata.word is sufficient (names of elements and attributes are single words).

  2. locus="start", locus="end" or locus="location" - in these cases we need to use the <anchor> element and point to this anchor in the asserted value. So the asserted value is in fact a pointer, although the design probably assumed that the pointer is local, i.e. that it starts with # and refers to an anchor in the same document. In reality, however, the <certainty> element with the @assertedValue attribute does not have to be in the same TEI document as the indicated uncertainty and the <anchor>. In general, therefore, the asserted value in these cases should be teidata.pointer.

  3. locus="value" - uncertainty concerns the content (for an element) or the value (for an attribute). We have two subcases:

In conclusion, I lean towards Syd's proposal to add an @assertedValueTarget attribute for pointing to more complex asserted values. For simple ones I would keep @assertedValue (as teidata.word, but I would forbid pointers there).

sydb commented 3 years ago

Ha! Indeed. I would use the excuse that I was drunk, but I don’t drink! Anyway, I have a pretty full day today, hope to take a more careful look tomorrow.

sydb commented 3 years ago

Yessir, <certainty> is a member of att.scoping, not of att.pointing, which I (obviously) somehow had in my head.

Note, @michalkozak, that the difference between use of @target alone and with @match is not so simplistic — one can use @match when referring to an element, too. (E.g., <certainty target="#chap2" match="./p[position() = (3,4,5)]/soCalled" assertedValue="emph"/>.)

In any case, we are both leaning towards my possible solution (1) as being the best solution to the actual problem. To summarize the proposed new world order, then:

locus="name"
locus="(start|end|location)"

Note: the semantics of a @target, @match, or @assertedValueTarget that matches multiple elements when @locus is "(start|end|location)" is not (yet?) defined by the Guidelines. Perhaps we should either think it through and define it, or say you should only point to one.

locus="value"

Note that in either case below it is quite difficult to provide validation of the asserted value, which should meet the constraints of the @target + @match node(s).

locus="value" and @target + @match select element node(s)

Either:

But not both. Because an attribute value cannot contain markup, the former is far less versatile than the latter, which is the preferred method. (I would be happy to just say “use @assertedValueTarget”, but that would require a deprecation of @assertedValue for this purpose.)
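A sketch of the two alternatives for the element case, on the assumption that the former is a plain string in @assertedValue and the latter is the proposed @assertedValueTarget. That attribute is not part of TEI P5; it is only the name suggested in this thread. The ids and the <seg> content are hypothetical.

```xml
<seg xml:id="seg1">ipsun</seg>
<seg xml:id="seg2">ipsun dolor</seg>

<!-- Former: a string in @assertedValue. teidata.word allows no white space
     and an attribute value can hold no markup, so only a single plain
     token fits. -->
<certainty target="#seg1" locus="value" assertedValue="ipsum" cert="low"/>

<!-- Latter: point at an element that holds the asserted content, which may
     then contain white space and markup. @assertedValueTarget is the
     proposed attribute, not an existing one. -->
<certainty target="#seg2" locus="value" assertedValueTarget="#alt1" cert="low"/>
<seg xml:id="alt1">ipsum <hi rend="italic">dolor</hi></seg>
```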

locus="value" and @target + @match select attribute node(s)

Either:

But not both. Because an attribute value cannot contain markup, the former is perfectly adequate, and is the preferred method. Not sure whether, if the latter were used, the content of the element pointed to should be limited to a string, or the string value (e.g., string(.)) of the element should be used. (Would it be OK to not allow @assertedValueTarget for this purpose?)
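For the attribute case, a comparable sketch under the same assumptions: the ids and the current @sameAs value are hypothetical, and @assertedValueTarget is the attribute proposed in this thread, not part of TEI P5. The first form reuses the @sameAs example from the original report.

```xml
<person xml:id="person-3" sameAs="file-1#person-99"/>

<!-- Former (preferred here): the asserted attribute value is given
     directly as a string. -->
<certainty target="#person-3" match="@sameAs" locus="value"
           assertedValue="file-1#person-100" cert="high"/>

<!-- Latter: the asserted value is taken from the element pointed to.
     Whether its content must be a plain string, or its string value
     (e.g. string(.)) is used, is the open question raised above. -->
<certainty target="#person-3" match="@sameAs" locus="value"
           assertedValueTarget="#alt2" cert="high"/>
<seg xml:id="alt2">file-1#person-100</seg>
```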

Have I got that all right? Or am I still too freakin’ tired?


Without being able to explain why at the moment, I think it would be bad practice to express uncertainty about both <foo> and the @foo of <silly foo="bar"> with the same <certainty>. That is, the @target + @match should select element(s) or attribute(s), but not both. It is (of course) in principle possible that they could select one or more PI or comment nodes instead, but I have never seen a use case in DH for expressing machine-readable uncertainty about a PI or comment.

ebeshero commented 3 years ago

@michalkozak What do you think of @sydb's ideas to re-evaluate the semantics of these attributes? Would this address the problem you posted about?

michalkozak commented 3 years ago

Yes, I agree with @sydb. But out of curiosity, what will the type of @assertedValue be?

ebeshero commented 2 years ago

Council agrees that @sydb will propose this, likely with a customization ODD first for Council to review.