djbpitt / repertorium

Repertorium of Old Bulgarian literature and letters

<supportDesc> redundancy: @material and <material> #7

Open djbpitt opened 2 years ago

djbpitt commented 2 years ago
  1. When you have
<supportDesc material="paper">
     <support>
         <material>Paper</material>
     ...
     </support>
</supportDesc>

and there is no further information in <material> than Paper or Parchment, I suggest removing the <material> element entirely.

Or we could remove material="paper" and have

<supportDesc>
     <support>
         <material>Paper</material>
     ...
     </support>
</supportDesc>

This information should be unified somehow; repeating the same data makes no sense.

If, however, we have

<supportDesc material="mixed">
     <support>
         <material>Paper</material>
     ...
     </support>
</supportDesc>

then we should, of course, write some more information. The same applies if we want to record more data about the paper or parchment. Then we will need:

<supportDesc material="paper">
     <support>
         <material>Paper is of low quality</material>
     ...
     </support>
</supportDesc>

This situation is more complicated because there are several possible variants, and I agree that the most important thing is for us to be consistent. Your examples above are of three types:

  1. Attribute and element are identical and simple, e.g., both say just "paper".
  2. Attribute is "mixed" and there are multiple elements.
  3. Attribute is simple (e.g., "paper") and element is more detailed (e.g., "paper is of low quality").

We might want to approach this question by asking how we want to use the values. Here is a proposal (for discussion; I don't mean to suggest that it is necessarily what we should do):

  1. The @material attribute on the <supportDesc> element is for structured search and retrieval. For that reason, it's a token list drawn from a fixed inventory of strings: "paper", "parchment", and whatever else might actually occur (stone? wax? birchbark?). Because it is a token list, we would not use a value like "mixed"; if a manuscript includes both parchment and paper, we would write <supportDesc material="parchment paper">. The order of the values in a token list is not informational, so "parchment paper" and "paper parchment" are equivalent. The attribute is required and the value must include at least one token from the allowed list.

A: I don't quite understand what is wrong with "mixed", but anyway I tried to write <supportDesc material="parchment paper">, and it immediately triggers an error. According to the TEI schema:

attribute material { "paper" | "parch" | "mixed" | teidata.enumerated }?,

https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-supportDesc.html

On teidata.enumerated (https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-teidata.enumerated.html), the Guidelines say: "Attributes using this datatype must contain a single ‘word’ which contains only letters, digits, punctuation characters, or symbols: thus it cannot include whitespace."

That means only one word is allowed here. So if we want a value other than paper or parch, we can only use mixed. The alternative is to change the schema for supportDesc. Should we do that?

  2. The <material> child of <support> is for human eyes, that is, it's what we render in the codicological description. It is optional, but whether we use it is governed by the following considerations:

a) Where there is a single material and no supplementary information (e.g., paper), we omit the <material> element. In the codicological description we'll upper-case the value of the attribute. This lets us avoid the duplication. We can validate this with Schematron: if the count of tokens in @material is not equal to 1, there must be at least one <material> element.

b) Where there are multiple materials, the @material attribute contains more than one token and a separate <material> element is required for each type of support. If, for example, the manuscript is on a combination of parchment and paper, there will be two tokens in the @material attribute value and at least two <material> elements. There might be more than two if, for example, there are multiple types of paper.

c) Even when there is one type of material, the <material> element can be used if the description presented to humans should be more detailed than what the attribute allows. For example, the @material value might be just "paper", while the content of the <material> element would read something like "Paper of poor quality".
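
The Schematron check mentioned in (a), together with the requirement in (b), might be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual rule: the pattern id and the messages are invented, @material is assumed to be a whitespace-separated token list, and <material> is assumed to occur somewhere inside <support>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <!-- Hypothetical pattern id; not taken from the project's Schematron file -->
  <sch:pattern id="material-attribute-vs-element">
    <sch:rule context="tei:supportDesc">
      <!-- normalize-space() trims and collapses whitespace, so tokenize() yields no empty tokens -->
      <sch:let name="tokens" value="tokenize(normalize-space(@material), ' ')"/>
      <!-- Assumption: <material> sits inside <support>, at any depth -->
      <sch:let name="materials" value="tei:support//tei:material"/>
      <!-- (a) as worded above: anything other than exactly one token requires a <material> -->
      <sch:assert test="count($tokens) eq 1 or exists($materials)">
        When @material does not contain exactly one token, at least one
        material element is required.
      </sch:assert>
      <!-- (b) several tokens: at least one <material> per token -->
      <sch:assert test="count($tokens) le 1 or count($materials) ge count($tokens)">
        @material lists <sch:value-of select="count($tokens)"/> materials, but only
        <sch:value-of select="count($materials)"/> material element(s) were found.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Case (c) needs no assertion of its own, since a single token with an optional, more detailed <material> passes both tests.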

A: David, you describe the possible situations very well. So I would suggest:

a) If you know that the material is paper or parchment but have no further information, encode this as:

<supportDesc material="Paper">
     <support>
     ...
     </support>
</supportDesc>

or

<supportDesc material="Parchment">
     <support>
     ...
     </support>
</supportDesc>

With an initial upper-case letter.

b) You have a mixture of paper and parchment. Here we should decide whether to change the model of supportDesc so that @material allows both words, as in material="Parchment Paper" (upper case), or to stick with the value "mixed". (If we decide to change supportDesc, I don't know what kind of attribute class would allow two words as an attribute value.) What do you think? Then, as you suggested, we will have two <material> elements (the element is repeatable).

c) You have some more information about paper, parchment, etc. Encode this as:

<supportDesc material="Paper">
     <support>
         <material>Paper is of low quality</material>
     ...
     </support>
</supportDesc>

So, in principle, we should decide whether to change the model for supportDesc or leave it as it is.

1) Current possibility:

<supportDesc material="mixed">
     <support>
         <material>Paper is of low quality</material>
         <material>Thin parchment. There is almost no distinction between the flesh and hair side ...</material>
         ...
     </support>
</supportDesc>

2) Changing @material:

<supportDesc material="Paper Parchment">
     <support>
         <material>Paper is of low quality</material>
         <material>Thin parchment. There is almost no distinction between the flesh and hair side ...</material>
         ...
     </support>
</supportDesc>

If we would like to retain both views, the description of the manuscript as a database record and the description as a text for users to read, maybe the second option is better. What do you think?

atoboy commented 2 years ago

I would suggest

<supportDesc material="Mixed">
     <support>
         <material>Paper is of low quality</material>
         <material>Thin parchment. There is almost no distinction between the flesh and hair side ...</material>
         ...
     </support>
</supportDesc>

@djbpitt

djbpitt commented 2 years ago

@atoboy I've reopened the issue just to clarify two details:

  1. Specifying "mixed" as the value of @material, as you propose, doesn’t tell us what the mixture is. In most cases it will be parchment and paper, but could it be anything else? If we think of the @material attribute as primarily for searching purposes, might it be more informative to use a token list, along the lines of "parchment paper"? If we do that, we can search for supportDesc[contains-token(@material, 'paper')] to find all @material values that include "paper", that is, it will find both "paper" by itself and "paper" when it is mixed with something else. If we specify "mixed", we would need supportDesc[@material = ('mixed', 'paper')], and that works only if "mixed" always includes paper as one of its implicit values. Either approach will work, but a token list is more informative because it names the components, while saying only "mixed" makes them only implicit.
  2. I carelessly used upper-case when I wrote material="Paper" earlier, but I think we should standardize on lower-case for single words, as we’ve done elsewhere. That is, instead of "Parchment" we would write "parchment", and similarly for all possible values of @material.
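
To make the comparison in point 1 concrete, the two queries could be run side by side with a small stylesheet such as the one below. It is a sketch under assumptions, not project code: it assumes the files are in the TEI namespace, and it relies on contains-token(), which is an XPath 3.1 function, so the processor must support XPath 3.1. The tokenize() variant is included as an XPath 2.0 equivalent.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0"
    version="3.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Token-list encoding (XPath 3.1): matches "paper" and "parchment paper" alike -->
    <xsl:variable name="byToken"
        select="//supportDesc[contains-token(@material, 'paper')]"/>
    <!-- XPath 2.0 equivalent, for processors without contains-token() -->
    <xsl:variable name="byTokenize"
        select="//supportDesc[tokenize(normalize-space(@material), ' ') = 'paper']"/>
    <!-- "mixed" encoding: correct only if "mixed" always implies paper -->
    <xsl:variable name="byMixed"
        select="//supportDesc[@material = ('mixed', 'paper')]"/>
    <!-- Report the three counts, separated by spaces -->
    <xsl:value-of
        select="count($byToken), count($byTokenize), count($byMixed)"
        separator=" "/>
  </xsl:template>
</xsl:stylesheet>
```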

Please let me know what you think.

atoboy commented 2 years ago

@djbpitt

  1. So you suggest changing the attribute class of @material to allow more than one word as the attribute value? The current definition of this attribute in the TEI Guidelines is:
    attribute material { "paper" | "parch" | "mixed" | teidata.enumerated}?

    TEI Guidelines -- teidata.enumerated.

However, I am not sure what type of TEI datatype we should use in this case: TEI Guidelines -- Appendix E Datatypes and Other Macros

  2. I think that lower-case will be more consistent with our attribute values elsewhere. It could easily be changed to upper-case for HTML publication. So I will revert the changes.

djbpitt commented 2 years ago

@atoboy teidata.enumerated requires a single word (https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-teidata.enumerated.html), and I'm suggesting that we want to require one or more items from a fixed list. The content model might look like:

attribute material {
  list { ( "paper" | "parchment" | "wax" | "stone" | "cloth" )+ }
}

I don't mean to suggest that we should include wax or stone or cloth; the point is that we should include all of the materials we actually find or can reasonably expect to find. The list structure in Relax NG allows a token list, so it will allow any combination of those items.

Unfortunately, the list structure also allows repetition, so it would allow a value like:

<supportDesc material="paper paper parchment"> … </supportDesc>

This type of repetition would be a mistake, so we would want to prevent it, and it isn’t possible to do that with Relax NG alone. When I use this sort of schema in other projects, I add a Schematron rule to prevent repetition. With respect to @material, if we agree to use a Relax NG list and if you can update the ODD and the XML documents, I can add the Schematron rule.
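
For illustration, a Schematron rule of this kind could look roughly like the sketch below. It is not the rule used in the other projects mentioned above: the pattern id and message are invented, and it assumes the xslt2 query binding so that tokenize() and distinct-values() are available.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <!-- Hypothetical pattern id -->
  <sch:pattern id="material-no-repetition">
    <sch:rule context="tei:supportDesc[@material]">
      <!-- Split the attribute value into its whitespace-separated tokens -->
      <sch:let name="tokens" value="tokenize(normalize-space(@material), ' ')"/>
      <!-- A repeated token makes the distinct count smaller than the total count -->
      <sch:assert test="count($tokens) eq count(distinct-values($tokens))">
        @material must not repeat a token; found
        "<sch:value-of select="@material"/>".
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

With this rule, material="paper paper parchment" is reported (three tokens, two distinct values), while material="paper parchment" and material="parchment paper" both pass.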


This is a separate but related issue:

You may already know that it is possible to integrate Schematron rules into an ODD, instead of using a separate Schematron schema. I use a separate Schematron schema in my other work for two reasons:

  1. I find it awkward to integrate the Schematron into the ODD. That's really just a comment about my own ignorance, though, so if you're comfortable integrating Schematron into ODD, that reservation isn't relevant.
  2. I modify Schematron pretty frequently as I discover new markup inconsistencies that I want to be able to catch. If our Schematron is part of our ODD, that means updating the ODD and regenerating the Relax NG more frequently. That isn't a burden, but it is more complicated than just updating the Schematron by itself.

In favor of integrating the Schematron into the ODD, though, I think doing so would also integrate those constraints into the documentation, since the documentation is generated from the ODD.
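
For comparison, an integrated rule would live in a <constraintSpec> inside the relevant <elementSpec> of the ODD. The fragment below is a sketch only: the ident value and message are invented, it assumes the tei: prefix is declared for the extracted Schematron, and the scheme value shown is the one used in recent TEI releases (older releases used "isoschematron", so the value should be checked against the release in use).

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:sch="http://purl.oclc.org/dsdl/schematron"
    ident="supportDesc" module="msdescription" mode="change">
  <!-- Hypothetical ident; the constraint is extracted when the ODD is processed -->
  <constraintSpec ident="material-no-repetition" scheme="schematron">
    <constraint>
      <sch:rule context="tei:supportDesc[@material]">
        <!-- Same test as a standalone rule: no token may occur twice -->
        <sch:let name="tokens" value="tokenize(normalize-space(@material), ' ')"/>
        <sch:assert test="count($tokens) eq count(distinct-values($tokens))">
          @material must not repeat a token.
        </sch:assert>
      </sch:rule>
    </constraint>
  </constraintSpec>
</elementSpec>
```

The rule body is the same in both approaches; what differs is where it is maintained and whether the Relax NG has to be regenerated when it changes.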

So: If you agree with the proposal that we use a list here, we should use Schematron to prevent repetition, and we should decide whether to do that in a separate Schematron schema (we already have one, so I would just add another rule to that) or whether we should integrate the Schematron rule into the ODD.

I would recommend that we defer that decision by continuing to use a separate Schematron file for now. Once we're satisfied with our changes to the ODD, we can then revisit the question of whether to integrate our Schematron validation into the ODD.