How to encode measurement

naoki-kokaze commented 6 years ago

I'm Naoki. I gave a poster presentation at TEI 2017 at UVic on how to mark up units that are not based on the metric system. I think it is important for us to discuss measurement in a broad sense, because measurement systems reflect cultural diversity. A Wikipedia article may help us understand why it matters to discuss cultural uniqueness through measurement.

Some TEI members have already discussed how useful a <unit> element would be. Please see https://github.com/TEIC/TEI/issues/1461

First of all, I would like to share a revised version of my poster, which omits the image of the historical source for licensing reasons: TEI_2017_poster_for_github.pdf

The problem is: how should the following be marked up?

銅二千五百十六斤十両二分四銖

which means 'copper weighing 2516 Kin, 10 Ryo, 2 Bu and 4 Shu'. This may look complicated, but it is comparable to British measurements such as 12 yd, 2 ft and 10 in, which are likewise not based on the metric system.

Based on the discussion at the conference, there are at least two possible solutions.

(i) Using only one <measure> element: <measure commodity="銅" n="2516/10/2/4" unit="斤/両/分/銖" />

In the current scheme, we can't use @quantity to store multiple values. Any other delimiter could be used in place of the slash, though.
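As a rough illustration of what option (i) asks of a processor, here is a minimal Python sketch that pairs each value with its unit by position. The function name `split_compound` is hypothetical, and the slash delimiter is only the one used in the example above.

```python
import xml.etree.ElementTree as ET

# The single-element encoding from option (i); the delimiter is an assumption.
SAMPLE = '<measure commodity="銅" n="2516/10/2/4" unit="斤/両/分/銖"/>'

def split_compound(measure, delimiter="/"):
    """Pair each value in @n with the unit at the same position in @unit."""
    values = measure.get("n").split(delimiter)
    units = measure.get("unit").split(delimiter)
    if len(values) != len(units):
        raise ValueError("@n and @unit must have the same number of parts")
    return [(int(value), unit) for value, unit in zip(values, units)]

pairs = split_compound(ET.fromstring(SAMPLE))
print(pairs)  # [(2516, '斤'), (10, '両'), (2, '分'), (4, '銖')]
```

The pairing relies entirely on position, so a missing part in either attribute can only be detected by comparing the lengths.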

(ii) Using a <measure> element that nests other <measure> elements. ...Sorry, I have to get on the plane. Please see the poster file for the second possible solution!

jamescummings commented 6 years ago

(Updated your comment to let GitHub provide a link to #1461.) The <unit> element would definitely help. While I agree some improvement is needed here, as a starting point we should separate the use of attributes (which we might see as providing regularised SI units in metric and similar systems) from the original text stored in the element content. So I'd encourage solutions where the value provided in the attributes is, as much as possible, stored in ISO standard measurements, etc.

That said, I've done similar things with pounds, shillings and pence in historical texts in the UK. While I'm not recommending it as a solution, I've used similarly embedded, broken-up structures using num or measure, much like you suggest. Things like:

<seg type="fee" rend="roman-numerals aligned-right">
  <num type="totalPence" value="1240">
    <!--orig: vli iijs iiijd -->
    <num type="poundsAsPence" value="1200">v<hi rend="superscript">li</hi></num>
    <num type="shillingsAsPence" value="36">iij<hi rend="superscript">s</hi></num>
    <num type="pence" value="4">iiij<hi rend="superscript">d</hi></num>
  </num>
</seg>

As you can see in: http://jtei.revues.org/926#tocto1n7 (which is really on a different topic). So, like you, I broke down the individual components and then wrapped those in another element which totalled the standardised amount.
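One benefit of storing both the components and the total is that they can be checked against each other. A minimal Python sketch, assuming the structure shown above (the helper name `components_match_total` is hypothetical, and the `rend` attribute on `<seg>` is omitted for brevity):

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<seg type="fee">
  <num type="totalPence" value="1240">
    <num type="poundsAsPence" value="1200">v<hi rend="superscript">li</hi></num>
    <num type="shillingsAsPence" value="36">iij<hi rend="superscript">s</hi></num>
    <num type="pence" value="4">iiij<hi rend="superscript">d</hi></num>
  </num>
</seg>
"""

def components_match_total(total_num):
    """Return True if the child <num> values add up to the wrapper's @value."""
    parts = [int(child.get("value")) for child in total_num.findall("num")]
    return sum(parts) == int(total_num.get("value"))

# 5 pounds = 1200d and 3 shillings = 36d, so 1200 + 36 + 4 = 1240.
total = ET.fromstring(SAMPLE.strip()).find("num")
print(components_match_total(total))  # True
```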

So I think the kinds of things you want to do can be done in the TEI, but we don't currently recommend a particular method. (But please do explain where I'm misunderstanding, since that will help the issue progress.)

dariok commented 6 years ago

I'd like to give a quick wrap-up of the possible solution we discussed yesterday. We came to the conclusion that several steps are necessary in most cases. This is because two different things have to be taken into account: 1) how to mark up the text and include a parseable rendition of the quantity; 2) how to describe how the different 'sub-measurements' actually relate to each other and, possibly, how these can be converted to other systems of measurement.

A possible solution might look like this: 1) in the text: <measure commodity="銅" quantity="2516 10 2 4" unit="斤両分銖">銅二千五百十六斤十両二分四銖</measure> or, to use a European example: <measure quantity="4 6 2" unit="ll ls ld">£4/6/2</measure>

2) in the header

    <unitDecl>
        <unit name="kin" xml:id="kin">
            <label>斤</label>
            <unit ref="#ryo" factor="16" />
        </unit>
        <unit name="Ryo" xml:id="ryo">
            <label>両</label>
            <unit ref="#bu" factor="4" />
        </unit>
        <unit name="Bu" xml:id="bu">
            <label>分</label>
            <unit ref="#shu" factor="6" />
        </unit>
        <unit name="Shu" xml:id="shu">
            <label>銖</label>
        </unit>
    </unitDecl>

for the example with 1 Kin = 16 Ryo, 1 Ryo = 4 Bu, 1 Bu = 6 Shu (apologies if I got something wrong here; please correct me if necessary), or

    <unitDecl>
        <unit name="Pounds Sterling" xml:id="ll">
            <label>£</label>
            <unit ref="#ls" factor="20" />
        </unit>
        <unit name="Shillings" xml:id="ls">
            <unit ref="#ld" factor="12" />
        </unit>
        <unit name="Pence" xml:id="ld"/>
    </unitDecl>

In the same manner, an equivalence to another system could be given:

    <unitDecl>
        <unit name="pounds" xml:id="lb">
            <unit ref="#kg" factor="0.453592" />
        </unit>
    </unitDecl>

This would require the data type of @quantity and @unit to be changed from teidata.numeric to 1–∞ occurrences of teidata.word separated by whitespace, and possibly the introduction of new unitDecl and unit elements for the teiHeader. We also discussed including this within a taxonomy; while the same result could be achieved, I think this is better as it is easier to parse and makes it possible to compute conversions into other systems.

The reason I prefer this over nested <measure> is that it does not rip apart parts of text that are considered to be one, and that it makes the relations between the measurements easily recognizable and conversion at least possible.
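To show what "easy to parse and convert" could mean in practice, here is a minimal Python sketch that reads the proposed declaration and reduces a compound quantity to the smallest unit. Note that `unitDecl` and the nested `unit` with `ref` and `factor` are only the proposal made in this comment, not existing TEI; the function names are hypothetical, and the unit list is given as a Python list rather than parsed from @unit.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The proposed (not yet standard) header declaration, compacted.
UNIT_DECL = """
<unitDecl>
  <unit name="Kin" xml:id="kin"><label>斤</label><unit ref="#ryo" factor="16"/></unit>
  <unit name="Ryo" xml:id="ryo"><label>両</label><unit ref="#bu" factor="4"/></unit>
  <unit name="Bu" xml:id="bu"><label>分</label><unit ref="#shu" factor="6"/></unit>
  <unit name="Shu" xml:id="shu"><label>銖</label></unit>
</unitDecl>
"""

def load_steps(decl):
    """Map each unit label to (label of the next smaller unit, factor), or None."""
    units = {unit.get(XML_ID): unit for unit in decl.findall("unit")}
    steps = {}
    for unit in units.values():
        link = unit.find("unit")
        if link is None:
            steps[unit.findtext("label")] = None
        else:
            smaller = units[link.get("ref").lstrip("#")]
            steps[unit.findtext("label")] = (smaller.findtext("label"),
                                             int(link.get("factor")))
    return steps

def to_smallest(quantity, label, steps):
    """Multiply down the chain of factors until the smallest unit is reached."""
    while steps[label] is not None:
        label, factor = steps[label]
        quantity *= factor
    return quantity

steps = load_steps(ET.fromstring(UNIT_DECL.strip()))
quantities = "2516 10 2 4".split()
labels = ["斤", "両", "分", "銖"]
total_shu = sum(to_smallest(int(q), u, steps) for q, u in zip(quantities, labels))
print(total_shu)  # 966400
```

Only the relations between the units are needed here; no conversion to SI is assumed.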

I'm sure some discussion and refinement is necessary but I hope I've given a sound reason and a good account of our discussion.

The reason I do not include ISO values in the attributes, as was suggested by @jamescummings, is that a conversion may not be possible when only the relations between the units are known. Additionally, the exact conversion may well differ (e.g. a 'Zoll' or 'Elle' can differ considerably between cities). In order to include the information nonetheless without putting too much into the attributes, I think it is a good idea to include this information in the header.

laurentromary commented 6 years ago

Hi Naoki. I hope you had a good flight. This is a very good candidate for <unit>. What about:

<measure>銅
    <measure>
        <num>二千五百十六</num>
        <unit>斤</unit>
    </measure>
    <measure><num>十</num><unit>両</unit></measure>
    <measure><num>二</num><unit>分</unit></measure>
    <measure><num>四</num><unit>銖</unit></measure>
</measure>

The second level of <measure> could be omitted if you don't need so much structure, but the information it marks up is useful.
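If numeric values are wanted for the <num> elements in such an encoding, the kanji numerals have to be converted at some point. The following Python sketch handles only simple numerals below 10,000 and does not validate malformed sequences; `kanji_to_int` is a hypothetical helper, not part of any TEI tooling.

```python
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
MULTIPLIERS = {"十": 10, "百": 100, "千": 1000}

def kanji_to_int(text):
    """Convert a kanji numeral below 10,000 to an integer."""
    total = 0
    digit = 0
    for char in text:
        if char in DIGITS:
            digit = DIGITS[char]
        elif char in MULTIPLIERS:
            # A bare multiplier such as 十 stands for 1 x 10.
            total += (digit or 1) * MULTIPLIERS[char]
            digit = 0
        else:
            raise ValueError(f"unsupported numeral character: {char!r}")
    return total + digit

values = [kanji_to_int(s) for s in ["二千五百十六", "十", "二", "四"]]
print(values)  # [2516, 10, 2, 4]
```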

duncdrum commented 6 years ago

@naoki-kokaze I think your example shows another good candidate for the addition of <unit>. As @dariok pointed out already, for historical measures a taxonomy component is required in the header. How much a catty weighed changed greatly over time and region. I like the idea of <unitDecl>.

it does not rip apart parts of text that are considered to be one

How would one determine that, and how is all markup not ripping things up, especially in texts that don't use sentence or word boundary markers? If you take the string value of measure[@commodity = "copper"] you get 銅二千五百十六斤十両二分四銖

I dislike combining multiple natural languages inside of elements, as in <measure type="weight" commodity="銅">. Depending on the period, this might help with suitable unit names:

<measure type="weight" commodity="copper" quantity="1501.9466" unit="kg">銅
  <measure quantity="2516" unit="catty">二千五百十六斤</measure>
  <measure quantity="10" unit="tael">十両</measure>
  <measure quantity="2" unit="mace">二分</measure>
  <measure quantity="4" unit="scruple">四銖</measure>
</measure>

with a taxonomy in the header:

<taxonomy>
 <category xml:id="catty">
   <catDesc>
     <measure type="weight" quantity="1" unit="catty">1 斤</measure> = <measure quantity="596.8" unit="g">16 両</measure></catDesc>
  </category>
  <category xml:id="tael">  
...
</taxonomy>
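The claim that nested markup does not destroy the running text can be checked mechanically. A minimal Python sketch, assuming the nested <measure> example above; removing all whitespace is acceptable for this Japanese string but would not be for a language that separates words with spaces.

```python
import xml.etree.ElementTree as ET

NESTED = """
<measure type="weight" commodity="copper" quantity="1501.9466" unit="kg">銅
  <measure quantity="2516" unit="catty">二千五百十六斤</measure>
  <measure quantity="10" unit="tael">十両</measure>
  <measure quantity="2" unit="mace">二分</measure>
  <measure quantity="4" unit="scruple">四銖</measure>
</measure>
"""

def string_value(element):
    """Concatenate all descendant text and drop the indentation whitespace."""
    return "".join("".join(element.itertext()).split())

text = string_value(ET.fromstring(NESTED.strip()))
print(text)  # 銅二千五百十六斤十両二分四銖
```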

jamescummings commented 6 years ago

I like the kind of solution shown by @duncdrum, where there are no multiple whitespace-separated attribute values. I'm biased, but these seem easier to process and read.

naoki-kokaze commented 6 years ago

I'm grateful for all of your comments! I'm sorry for the late response, due to the flight and the time difference. I'll first reply briefly to the solution by @duncdrum.

While I agree with nesting multiple measure elements for each of the units and using the <taxonomy> element to explain the semantics of measurement in the teiHeader, I would like to store the original form of commodity and unit in Japanese. For example, I prefer @commodity="銅", @unit="斤" or "両", rather than translating and converting to widely known English words like "copper", "kg", "catty".

How does everyone think about multilingualism when it comes to marking up texts written in native languages other than English? It seems to me challenging and unnatural to translate ancient Japanese words into our contemporary English vocabularies. The other members of my joint research project share this view. It may be a question of how valid normalization and standardization are. What do you think about that?

ebeshero commented 6 years ago

@naoki-kokaze I agree with you. Translation of values to English or another language is not necessary if this is not useful in your project. Explanation of unit equivalences (one issue we were discussing) would be relevant if you imagine this necessary for outreach and interchange with projects in other places working on related material. Explanation of equivalence and translation to other languages would make sense as metadata, if it's desired.

naoki-kokaze commented 6 years ago

@ebeshero Thank you for your reply; that makes sense. I understand the need for translation in cases where our project might be expanded to reach target audiences in other regions.

By the way, @dariok, thank you for wrapping up your discussion with @ebeshero at the conference. The <unitDecl> descriptions are really great! I'd like to describe the measurement semantics for other units of length and currency.

As @duncdrum mentioned, I have come to think that, in terms of parseability, we need not worry too much about over-nested information, which I had thought could be culturally unnatural. @dariok, what do you think now about the balance between human readability and over-nesting?

I'll respond to the rest of the comments later.

duncdrum commented 6 years ago

@naoki-kokaze

I would like to store the original form of commodity and unit in Japanese.

Your project can of course use TEI in Japanese to encode a Japanese text. There is even a translation project for TEI in Japanese.

It seems to me challenging and unnatural to translate ancient Japanese words into our contemporary English vocabularies.

Since TEI is an interchange format, I'd like to point out that the units in your example are not native to Japan and have been used in other languages in East Asia for a very long time indeed. Which is why I don't like using transliterations in dariok's encoding. Many texts and languages used 斤 as a weight measure; only in the context of modern Japanese does it make sense to transliterate that as kin with the appropriate @xml:lang="ja-Latn". My personal rule of thumb: if the historical contemporaries used the translation, it should be safe to use (as is the case with catty, scruple, …).

How does everyone think about multilingualism when it comes to marking up texts written in native languages other than English?

Somewhere in your TEI document you will need to specify, e.g., @xml:lang="ja" to signal that the text you encode is written in Japanese. Dariok has also omitted these in his unitDecl. My problem with the examples from your poster: <measure type="weight" quantity="10" unit="斤">銅二千五百十六</measure>斤 already mixes two different languages, English and Japanese. This, however, is not in line with the XML specs:

The language specified by xml:lang applies to the element where it is specified (including the values of its attributes), and to all elements in its content unless overridden with another instance of xml:lang.

As someone encoding ancient Chinese texts into TEI, I would argue that even 10 is not ancient Japanese. So <measure type="重" quantity="十" unit="斤"> could be proper XML, but I have my reservations about it being a useful TEI representation. It still requires that Japanese readers are familiar with English element and attribute names. Also, good luck convincing parsers that numerals don't have to be written in ASCII digits. You have two choices, expanding on dariok's example. Either liberal and consistent use of @xml:lang in your document:

<unit name="catty" xml:id="catty">
  <label xml:lang="ja">斤</label>
  <label xml:lang="ja-Latn">kin</label>
  <label xml:lang="en">catty</label>
  <unit ref="#tael" factor="16" />
</unit>

or modifying the language of your TEI document via ODD. If you don't want to use translations, at least provide alphabetic transliterations, so that users can transform your Japanese TEI document into English but know that attribute values are ja-Latn as opposed to e.g. zh-Hans.
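With labels tagged like this, choosing what to display for a given audience becomes a simple lookup. A minimal Python sketch, assuming the proposed <unit> with multiple <label> children shown above; `label_for` and its fallback behaviour are hypothetical choices for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

UNIT = """
<unit name="catty" xml:id="catty">
  <label xml:lang="ja">斤</label>
  <label xml:lang="ja-Latn">kin</label>
  <label xml:lang="en">catty</label>
  <unit ref="#tael" factor="16"/>
</unit>
"""

def label_for(unit, lang, fallback="en"):
    """Return the label tagged with `lang`, or the fallback language's label."""
    labels = {label.get(XML_LANG): label.text for label in unit.findall("label")}
    return labels.get(lang, labels.get(fallback))

unit = ET.fromstring(UNIT.strip())
print(label_for(unit, "ja"), label_for(unit, "ja-Latn"), label_for(unit, "zh-Hans"))
# 斤 kin catty
```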

naoki-kokaze commented 6 years ago

@duncdrum, thank you very much for your thought-provoking comments! Although I was vaguely aware of it, I am really impressed by the cultural similarity across East Asia that these examples of measurement reveal. @xml:lang is crucial for specifying that our text is written in Japanese. I'll try to encode measurement with <unitDecl>! It would be great if all of you could check it.

naoki-kokaze commented 6 years ago

I appreciate all of your contributions! I'll wrap up our discussion by offering some possible solutions. First of all, as the first step of our project, I would consistently use @xml:lang="ja" (and @xml:lang="ja-Latn") so as not to translate units into languages other than Japanese. But I would like to use basic English vocabulary to keep the markup readable, at least for TEI-capable Japanese users.

Please let me know what you think.

(i) Based on the contributions of @dariok, @laurentromary and @duncdrum: using <unitDecl> and <unit>. I like this solution because we can easily understand the semantics of measurement by referencing the <unitDecl> in the teiHeader. Also, the relationship between the <unitDecl> and <unit> elements is clear. However, it is still impossible to encode the whole amount of weight as long as we don't use widely known contemporary units like 'kg': that is the very point the third possible solution could resolve, as mentioned later.

in the teiHeader:

<unitDecl>
    <unit name="斤" xml:id="斤" type="weight">
        <label xml:lang="ja">斤</label>
        <label xml:lang="ja-Latn">kin</label>
        <unit ref="#両" factor="16" />
    </unit>
    <unit name="両" xml:id="両" type="weight">
        <label xml:lang="ja">両</label>
        <label xml:lang="ja-Latn">ryo</label>
        <unit ref="#分" factor="4" />
    </unit>
    <unit name="分" xml:id="分" type="weight">
        <label xml:lang="ja">分</label>
        <label xml:lang="ja-Latn">Bu</label>
        <unit ref="#銖" factor="6" />
    </unit>
    <unit name="銖" xml:id="銖" type="weight">
        <label xml:lang="ja">銖</label>
        <label xml:lang="ja-Latn">Shu</label>
    </unit>
</unitDecl>

in the body text:

<measure commodity="銅">銅
    <measure><num value="2516">二千五百十六</num><unit sameAs="#斤">斤</unit></measure>
    <measure><num value="10">十</num><unit sameAs="#両">両</unit></measure>
    <measure><num value="2">二</num><unit sameAs="#分">分</unit></measure>
    <measure><num value="4">四</num><unit sameAs="#銖">銖</unit></measure>
</measure>

(ii) Using the existing elements and attributes to express the same semantics as in (i). But I am unsure how to define that 1 Kin is equal to 16 Ryo. Can we do that as @duncdrum presented?

<measure type="weight" quantity="1" unit="catty">1 斤</measure> = <measure quantity="596.8" unit="g">16 両</measure></catDesc>

in the teiHeader:

<taxonomy>
    <category xml:id="斤" xml:lang="ja">
        <catDesc>
            <measure type="weight" quantity="1" unit="斤">一斤</measure> 
            = <measure type="weight" quantity="16" unit="#両">十六両</measure>
        </catDesc>
    </category>
    <category  xml:id="両" xml:lang="ja">
        <catDesc>
            <measure type="weight" quantity="1" unit="両">一両</measure> 
            = <measure type="weight" quantity="4" unit="#分">四分</measure>
        </catDesc>
    </category>
…
</taxonomy>

in the body text:

<measure commodity="銅">銅
    <measure type="weight" quantity="2516" unit="#斤">二千五百十六斤</measure>
    <measure type="weight" quantity="10" unit="#両">十両</measure>
    <measure type="weight" quantity="2" unit="#分">二分</measure>
    <measure type="weight" quantity="4" unit="#銖">四銖</measure>
</measure>

(iii) To borrow the idea of @jamescummings: using @type="totalPence", for cases where a total sum of values converted into the minimum unit is needed. I like his idea because we can easily decode both the hierarchical structure of the measurement and the whole amount of weight, thanks to the <num> element which contains the other <num> elements. But I am not sure whether we should always convert into the minimum unit, because there are cases where a unit other than the minimum one is important in its historical context, as the Engi-shiki in my poster shows.

the same description as (i) in the teiHeader:

in the body text:

<measure commodity="銅">銅
    <num type="totalShu" value="966400" unit="銖">
        <measure type="weight" quantity="2516" unit="#斤"><num type="KinAsShu" value="966144">二千五百十六</num>斤</measure>
        <measure type="weight" quantity="10" unit="#両"><num type="RyoAsShu" value="240">十</num>両</measure>
        <measure type="weight" quantity="2" unit="#分"><num type="BuAsShu" value="12">二</num>分</measure>
        <measure type="weight" quantity="4" unit="#銖"><num type="Shu" value="4">四</num>銖</measure>
    </num>
</measure>
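On the question of converting into the minimum unit: as long as the factors are known, a total in 銖 can be decomposed back into the original mixed units, so the total loses nothing numerically (which unit matters historically is a separate question). A minimal Python sketch using the factors from (i); `from_smallest` is a hypothetical helper.

```python
# Factors between adjacent units, taken from the unitDecl in (i):
# 1斤 = 16両, 1両 = 4分, 1分 = 6銖.
FACTORS = [("斤", 16), ("両", 4), ("分", 6)]
SMALLEST = "銖"

def from_smallest(total):
    """Split a total given in 銖 into 斤, 両, 分 and 銖 by repeated division."""
    amounts = []
    unit = SMALLEST
    # Work upward from the smallest unit: 銖 -> 分 -> 両 -> 斤.
    for larger_unit, factor in reversed(FACTORS):
        total, remainder = divmod(total, factor)
        amounts.append((unit, remainder))
        unit = larger_unit
    amounts.append((unit, total))
    return list(reversed(amounts))

print(from_smallest(966400))
# [('斤', 2516), ('両', 10), ('分', 2), ('銖', 4)]
```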

It seems to me that the first one would be the most practical method for our project, though I am not very familiar with customization. I have to learn more about it.

duncdrum commented 6 years ago

I think @ebeshero has a solution in mind that allows for your desired mix of English and Japanese: <measure type="weight" quantity="2516" unit="#斤">. I'm afraid I don't see it. The scope of @xml:lang according to the XML specs extends to attribute values. So maybe we should clarify that first.

ja	重量	斤
ja-Latn	juryo	kin
en	weight	catty

Depending on that, we can think about whether there might still be problems with measure or not.

ebeshero commented 6 years ago

F2F Victoria 2017 notes: Council agrees with concept of <unitDecl> for the teiHeader, and it should be in att.datable. The child element of <unitDecl> should be <unitDef>, not <unit>. We need to figure out what the best practices should be for representing values.

ebeshero commented 6 years ago

Green-lighted to develop a proposal.

jamescummings commented 6 years ago

I think we should also make clearer the difference between normalisation of values in attributes (e.g. to SI units) and regularisation of values (just because they are foreign concepts).

Council definitely likes the concept of <unitDecl> in the header (presumably a member of model.encodingDescPart). It might rename the child <unit> as <unitDef>, and then use <unit> underneath that. Perhaps something like:

<unitDecl>
    <unitDef name="斤" xml:id="斤" type="weight">
        <label xml:lang="ja">斤</label>
        <label xml:lang="ja-Latn">kin</label>
        <unit ref="#両" factor="16" />
        <desc>A description here </desc>
    </unitDef>
    <unitDef name="両" xml:id="両" type="weight">
        <label xml:lang="ja">両</label>
        <label xml:lang="ja-Latn">ryo</label>
        <unit ref="#分" factor="4" />
        <desc>A description here </desc>
    </unitDef>
    <unitDef name="分" xml:id="分" type="weight">
        <label xml:lang="ja">分</label>
        <label xml:lang="ja-Latn">Bu</label>
        <unit ref="#銖" factor="6" />
        <desc>A description here </desc>
    </unitDef>
    <unitDef name="銖" xml:id="銖" type="weight">
        <label xml:lang="ja">銖</label>
        <label xml:lang="ja-Latn">Shu</label>
        <unit factor="1" />
        <desc>A description here </desc>
    </unitDef>
</unitDecl>
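Because the definitions point at each other by @ref, a quick consistency check is cheap to write. The Python sketch below lists any reference that does not match a declared xml:id; `unitDecl`, `unitDef` and `unit/@ref` are the elements proposed in this thread rather than existing TEI, the labels and descriptions are omitted for brevity, and `dangling_refs` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# A compacted form of the proposed declaration.
UNIT_DECL = """
<unitDecl>
  <unitDef xml:id="斤" type="weight"><unit ref="#両" factor="16"/></unitDef>
  <unitDef xml:id="両" type="weight"><unit ref="#分" factor="4"/></unitDef>
  <unitDef xml:id="分" type="weight"><unit ref="#銖" factor="6"/></unitDef>
  <unitDef xml:id="銖" type="weight"><unit factor="1"/></unitDef>
</unitDecl>
"""

def dangling_refs(decl):
    """List every unit/@ref that does not match a declared unitDef/@xml:id."""
    declared = {definition.get(XML_ID) for definition in decl.findall("unitDef")}
    refs = [unit.get("ref") for unit in decl.iter("unit") if unit.get("ref")]
    return [ref for ref in refs if ref.lstrip("#") not in declared]

decl = ET.fromstring(UNIT_DECL.strip())
print(dangling_refs(decl))  # []
```

The smallest unit carries no @ref in the example, so units without one are simply skipped.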

naoki-kokaze commented 6 years ago

Thank you, everyone. I'll work on creating an ODD file through Roma (hopefully with the support of Kiyonori Nagasaki), referring to the definitions of the elements in model.encodingDescPart, and after finishing the customization, I'll ask you to check its validity.

Also, I think I understand the point @duncdrum presented. If I use @xml:lang="ja", then I had better consistently describe the attribute values in Japanese as well, right?

naoki-kokaze commented 6 years ago

How about this? I've tried to define the <unitDecl> element in Roma. roma_unitdecl

duncdrum commented 6 years ago

@jamescummings unless I'm missing something, in which case I'd very much appreciate a pointer: type='weight' = xml:lang="en" != xml:lang="ja"

As for <unitDecl>:

I think the guidelines should clarify its relation to <calendarDesc>. I'm thinking about an ancient Chinese cookbook. Where do the units that correspond to hours, minutes, and seconds go? The text never forms a full date, but constantly talks about smaller time units.
Normalisation: the guidelines should contain one example where normalisation to SI units is only approximate, using @precision.
Do we require normalisation of every unitDef, or is one per type sufficient?
calendarDesc and langUsage are in profileDesc; why would unitDef be in encodingDesc, and why not unitDesc?

naoki-kokaze commented 6 years ago

As a starting point, I have created an ODD file based mainly on the model @jamescummings presented. Of course there may be many points to improve (because this is my very first ODD file!), so I would appreciate it if all of you gave me feedback.

I defined three elements (<unit>, <unitDef> and <unitDecl>) and one attribute (@factor, as a member of att.typed so that it is available on the <unit> element). I'll comment on each of them.

(i) <unit> | I borrowed almost all of the definition of this element suggested by @laurentromary in #1461, but I added att.canonical to it and customized att.typed, as described in (iv).

<elementSpec ident="unit" ns="http://www.example.org/ns/nonTEI" mode="add">
    <desc>contains a symbol, a word or a phrase referring to a unit of measurement in any 
          kind of formal or informal system.</desc>
    <classes>
        <memberOf key="model.measureLike"/>
        <memberOf key="att.canonical"/>
        <memberOf key="att.global"/>
        <memberOf key="att.lexicographic"/>
        <memberOf key="att.typed"/>
    </classes>
    <content>
        <macroRef key="macro.paraContent"/>
    </content>
</elementSpec>

(ii) <unitDef> | I didn't know which of the model classes would be best, so I chose model.global.

<elementSpec ident="unitDef" ns="http://www.example.org/ns/nonTEI" mode="add">
    <desc>contains descriptive information related to certain unit.</desc>
    <classes>
        <memberOf key="model.global"/>
        <memberOf key="att.global"/>
    </classes>
    <content>
        <alternate maxOccurs="unbounded">
            <elementRef key="unit" minOccurs="1"/>
            <classRef key="model.labelLike"/>
        </alternate>
    </content>
</elementSpec>

(iii) <unitDecl> | For now, I chose model.encodingDescPart.

<elementSpec ident="unitDecl" ns="http://www.example.org/ns/nonTEI" mode="add">
    <desc>(unit declarations) provides information about units and measurement, basically
          those outside the SI (International System of Units).</desc>
    <classes>
        <memberOf key="model.encodingDescPart"/>
        <memberOf key="att.canonical"/>
        <memberOf key="att.datable"/>
        <memberOf key="att.global"/>
    </classes>
    <content>
        <elementRef key="unitDef" minOccurs="1" maxOccurs="unbounded"/>
    </content>
</elementSpec>

(iv) @factor | In order to define @factor on the <unit> element, I added it as a member of att.typed rather than att.canonical.

<classSpec ident="att.typed" type="atts" mode="change" module="tei">
    <attList>
        <attDef ident="factor" mode="add">
            <desc>shows factors of numerical values given in a referenced &lt;unit&gt;
                  element</desc>
            <datatype minOccurs="1" maxOccurs="1">
                <rng:ref name="data.numeric"/>
            </datatype>
        </attDef>
    </attList>
</classSpec>

Please let me know what you think, and feel free to add to them.

duncdrum commented 6 years ago

@naoki-kokaze I'm not sure why these:

<memberOf key="att.lexicographic"/>
<memberOf key="att.typed"/>

are on @ident="unit"

To match James' example in this thread you would need to add

<memberOf key="att.typed"/>

to @ident="unitDef"

naoki-kokaze commented 6 years ago

@duncdrum Thank you for your addition! I would like to revise it.

jamescummings commented 6 years ago

@duncdrum re: the @type value of "weight". Just to clarify, @type is a teidata.enumerated datatype, so projects could (and indeed I'd argue should) customise those values in their project ODDs if they want the values of type to be in Japanese, Korean, Malay, or Proto-Mayan. I was using 'weight' just based on naoki's earlier example. That said, the more they can normalize enumerated values to standard international terms the better; those terms may certainly be in Japanese, however!
re: calendarDesc ... you may have a point there. I suspect we'll find lots of minor alterations to these elements quite quickly.
re: normalisation ... you are right, that would be good. If you find a good example, let us know.
re: where to put type. Ideally it would be available on all three elements (since they are repeatable and classifiable). I can imagine mixed types, but generally it could also be put on the highest appropriate level.
re: location ... in discussing it we thought it was more about normalising the value of an attribute on measure (which I'm guessing should be something like unitRef rather than unit). Thus it fits in more with the encodingDesc documentation about how the encoding is being done (like prefixDef). I think the collection of them is a declaration of a set of unit definitions, thus unitDecl with unitDefs inside it.

@naoki-kokaze Some quick thoughts:

Yeah, I don't see what you get by being a member of att.lexicographic? Is there a particular attribute you get that you think is important here?
Yes, you need att.typed in unit and unitDef at least
Why does unitDecl need att.canonical? (I can see an argument for this on unitDef though to point to some external standard equivalent?)
I don't think unitDecl needs to be a member of att.datable but that unitDef does. (i.e. the entire set of things may not be datable, but any individual unit may exist only during a certain period.)
I think that @factor should just be created individually on unit (or in its own new class) rather than att.typed... lots of att.typed members would have no need of it.
I think the proposed unit attribute on measure should really be a unitRef if it is going to point to this declaration.

duncdrum commented 6 years ago

@jamescummings thanks for being so patient. I have the feeling that I haven't expressed my problem clearly enough, so I'll have one last try:

Multi-lingual XML

Extensible Markup Language (XML) 1.0 (Fifth Edition) > P5 Guidelines

If the current guidelines allow users to create valid TEI documents by combining suggested default attribute values in en with custom values in another language, the result will not match the XML 1.0 spec. To me, this seems to be a separate but rather important issue. Unless my reading of the spec is wrong, the guidelines and council encourage something they really shouldn't.

Measurement

With temporal measures, aka calendars, the default attribute values are normalised and regularised by an external standard, xs:dateTime, but @calendar and @date-custom allow for customisation, and thus for a greater degree of cultural neutrality. When it comes to measurements as a textual phenomenon, I see the same three levels at play that apply to calendars:

Normalising (1 catty = 596.5 g),
Regularising (1 catty = 16 tael),
use-in-document

Let's take a logbook from a Japanese steam ship company at the turn of the 19th century. These are often multilingual so 一斤 is always 1 catty. Yet, when loading 1斤 of bronze in Hong Kong the Chinese catty is 590g, when trading that catty later with Americans in Hawaii it is 596.5g. Trading with a French company in the same harbour on the same day it'll be 592g. In all cases the log will record 1 catty of bronze.

Encoders might well wish to incorporate such information, to prevent wild goose chases by grad students for stolen/smuggled/transubstantiated copper, because the numbers don't add up. The use-in-document is not just a question of normalisation.
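To make the discrepancy concrete, here is a small Python sketch using the three gram values quoted in the logbook scenario. The dictionary keys are illustrative labels only, and the cargo of 100 catties is a made-up figure.

```python
# Gram equivalents of one catty, as quoted in the scenario above.
# The keys are illustrative labels, not a proposed vocabulary.
GRAMS_PER_CATTY = {
    "Hong Kong (Chinese catty)": 590.0,
    "Hawaii (American trade)": 596.5,
    "Hawaii (French trade)": 592.0,
}

def normalise(catties, context):
    """Convert a quantity in catties to grams for one trading context."""
    return catties * GRAMS_PER_CATTY[context]

# A hypothetical cargo of 100 catties, loaded in Hong Kong and sold in Hawaii.
loaded = normalise(100, "Hong Kong (Chinese catty)")
sold = normalise(100, "Hawaii (American trade)")
print(sold - loaded)  # 650.0 grams that exist only on paper
```

The logbook records 100 catties in both places, so a single normalisation factor per unit cannot reproduce both masses.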

I'm all for normalising to SI units, which we can do via <taxonomy> and @unit='#mytaxo'. But the point of this whole discussion seems to be that TEI wants to capture measurements in "all texts, all languages, and any form" without imposing its own cultural preferences. For that, the <calendarDesc> strategy seems better suited.

Note we haven't even brought up fiction, where Cpt. Kirk going at warp 7 is significantly slower than Picard going at warp 7...

naoki-kokaze commented 6 years ago

I am interested in the example @duncdrum presented.

Let's take a logbook from a Japanese steam ship company at the turn of the 19th century. These are often multilingual so 一斤 is always 1 catty. Yet, when loading 1斤 of bronze in Hong Kong the Chinese catty is 590g, when trading that catty later with Americans in Hawaii it is 596.5g. Trading with a French company in the same harbour on the same day it'll be 592g. In all cases the log will record 1 catty of bronze.

But, for now, I would like to fix my ODD file. @jamescummings I really appreciate your comments on my first ODD! I have tried to redefine the elements as follows:

(i) <unit> element

I don't see what you get by being a member of att.lexicographic? Is there a particular attribute you get that you think is important here?

I defined this element by borrowing the model @laurentromary presented in #1461. There, @laurentromary argues that att.norm is necessary for clarifying the complete form of an abbreviated unit name (e.g. <unit norm="kilogram">kg</unit>). But if so, I have come to think that it might be better to define <unitDef> as a member of att.lexicographic, as shown later.

I think that @factor should just be created individually on unit (or in its own new class) rather than att.typed... lots of att.typed members would have no need of it.

Thank you; I now define @factor as a new attribute on this element.

<elementSpec ident="unit" ns="http://www.example.org/ns/nonTEI" mode="add">
    <desc>contains a symbol, a word or a phrase referring to a unit of
        measurement in any kind of formal or informal system.</desc>
    <classes>
        <memberOf key="model.measureLike"/>
        <memberOf key="att.canonical"/>
        <memberOf key="att.global"/>
        <memberOf key="att.typed"/>
    </classes>
    <content>
        <macroRef key="macro.paraContent"/>
    </content>
    <attList>
        <attDef ident="factor" mode="add">
            <desc>shows factors of numerical values given in a
                referenced &lt;unit&gt; element</desc>
            <datatype>
                <dataRef key="teidata.numeric"/>
            </datatype>
        </attDef>
    </attList>
</elementSpec>
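To make the spec above more concrete, here is a rough sketch of how the example from the start of this thread (銅二千五百十六斤十両二分四銖) might be encoded with it. It is only an illustration under assumptions: the `ex:` prefix is my own binding for the `http://www.example.org/ns/nonTEI` namespace used in the spec, the nesting of \<measure> elements follows option (ii) from the poster, and \@factor is left out because its exact semantics are still under discussion.

```xml
<!-- Sketch only: the ex: prefix and the nesting pattern are assumptions,
     not part of the TEI Guidelines. -->
<measure xmlns="http://www.tei-c.org/ns/1.0"
         xmlns:ex="http://www.example.org/ns/nonTEI"
         type="weight" commodity="銅">銅<measure
    quantity="2516">二千五百十六<ex:unit>斤</ex:unit></measure><measure
    quantity="10">十<ex:unit>両</ex:unit></measure><measure
    quantity="2">二<ex:unit>分</ex:unit></measure><measure
    quantity="4">四<ex:unit>銖</ex:unit></measure></measure>
```

The line breaks are placed inside the start tags so that no extra whitespace is introduced into the Japanese text.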

(ii) \<unitDef> element

you need att.typed in unit and unitDef at least

Yes, I fixed it.

I don't think unitDecl needs to be a member of att.datable but that unitDef does. (i.e. the entire set of things may not be datable, but any individual unit may exist only during a certain period.)

By this comment, I finally understood the reason why we need att.datable. I'm grateful.

<elementSpec ident="unitDef" ns="http://www.example.org/ns/nonTEI"
    mode="add">
    <desc>contains descriptive information related to a certain unit.</desc>
    <classes>
        <memberOf key="model.global"/>
        <memberOf key="att.global"/>
        <memberOf key="att.datable"/>
        <memberOf key="att.lexicographic"/>
        <memberOf key="att.canonical"/>
        <memberOf key="att.typed"/>
    </classes>
    <content>
        <alternate maxOccurs="unbounded">
            <elementRef key="unit" minOccurs="1"/>
            <classRef key="model.labelLike"/>
        </alternate>
    </content>
</elementSpec>
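As a rough illustration of why att.datable matters here, a single \<unitDef> under the spec above might look like the following. This is a sketch under assumptions: the `ex:` prefix and the `xml:id` value are hypothetical, and the dates are placeholders only, not historical claims about when this unit was in use.

```xml
<!-- Sketch only: xml:id and the date range are placeholders. -->
<ex:unitDef xmlns="http://www.tei-c.org/ns/1.0"
            xmlns:ex="http://www.example.org/ns/nonTEI"
            xml:id="kin" type="weight"
            notBefore="1800" notAfter="1900">
  <!-- label comes from model.labelLike; ex:unit holds the unit name -->
  <label>斤</label>
  <ex:unit>kin</ex:unit>
</ex:unitDef>
```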

(iii) \<unitDecl> element

Why does unitDecl need att.canonical? (I can see an argument for this on unitDef though to point to some external standard equivalent?)

I mistakenly inserted att.canonical in the definition of \<unitDecl>, so I have removed it.

<elementSpec ident="unitDecl" ns="http://www.example.org/ns/nonTEI"
    mode="add">
    <desc>(unit declarations) provides information about non-SI (the
        International System of Units) units and
        measurement.</desc>
    <classes>
        <memberOf key="model.encodingDescPart"/>
        <memberOf key="att.global"/>
    </classes>
    <content>
        <elementRef key="unitDef" minOccurs="1" maxOccurs="unbounded"/>
    </content>
</elementSpec>
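Putting the three specs together, a document using them might look roughly like this. It is a minimal sketch, assuming the `ex:` prefix for the non-TEI namespace, hypothetical `xml:id` values, and placeholder header content; the pointer from the text uses \@ref, which \<unit> gets from att.canonical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: ids, prefix and header text are placeholders. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:ex="http://www.example.org/ns/nonTEI">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Placeholder title</title></titleStmt>
      <publicationStmt><p>Unpublished sketch.</p></publicationStmt>
      <sourceDesc><p>Placeholder source description.</p></sourceDesc>
    </fileDesc>
    <encodingDesc>
      <!-- unitDecl is a member of model.encodingDescPart -->
      <ex:unitDecl>
        <ex:unitDef xml:id="kin" type="weight">
          <label>斤</label>
          <ex:unit>kin</ex:unit>
        </ex:unitDef>
        <ex:unitDef xml:id="ryo" type="weight">
          <label>両</label>
          <ex:unit>ryo</ex:unit>
        </ex:unitDef>
      </ex:unitDecl>
    </encodingDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <!-- each unit in the text points at its definition in the header -->
        <measure quantity="2516">二千五百十六<ex:unit ref="#kin">斤</ex:unit></measure>
        <measure quantity="10">十<ex:unit ref="#ryo">両</ex:unit></measure>
      </p>
    </body>
  </text>
</TEI>
```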

And, I agree with your point:

I think the proposed unit attribute on measure should really be a unitRef if it is going to point to this declaration.

because unitRef attribute would be better than \@ref to explain the semantics presented in the header.

What do you think about this redefined ODD file?

jamescummings commented 6 years ago

@duncdrum All interesting points. I agree entirely about trying to enable users of the TEI to capture data in non culturally-specific methods if they so desire. I also understand that relative measurements of any sort are different in different times and places (whether that be feet, catty, or warp speed). If this data is known then presumably the encoder could provide separate unitDef's for each of the different values of a catty? Then when pointing to them choosing the correct xml:id to point to? Obviously they'd not have to do that if they didn't want to. If not, does that mean that <precision> should be used to indicate that in this context a catty could be 594g +/- 4g or so? The proposed macro.paraContent I believe gets the model.certLike class as content and so it should be available inside <unit> as proposed.
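A sketch of what I mean by separate unitDef's, using the three catty values from the logbook example. The `ex:` prefix and the `xml:id` values are hypothetical, and recording the gram equivalent in a <desc> is just one possible convention, not something the proposed ODD prescribes.

```xml
<!-- Sketch only: ids are hypothetical; gram values are those quoted
     in the logbook example in this thread. -->
<ex:unitDecl xmlns="http://www.tei-c.org/ns/1.0"
             xmlns:ex="http://www.example.org/ns/nonTEI">
  <ex:unitDef xml:id="catty-hk" type="weight">
    <label>斤</label>
    <ex:unit>catty</ex:unit>
    <desc>Chinese catty as loaded in Hong Kong, 590 g</desc>
  </ex:unitDef>
  <ex:unitDef xml:id="catty-us" type="weight">
    <label>斤</label>
    <ex:unit>catty</ex:unit>
    <desc>Catty as traded with Americans in Hawaii, 596.5 g</desc>
  </ex:unitDef>
  <ex:unitDef xml:id="catty-fr" type="weight">
    <label>斤</label>
    <ex:unit>catty</ex:unit>
    <desc>Catty as traded with the French company, 592 g</desc>
  </ex:unitDef>
</ex:unitDecl>
```

Each entry in the log would then choose its definition by pointer, for example `<ex:unit ref="#catty-hk">斤</ex:unit>`, while the text itself still just says 1 catty.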

I am curious though about how you think the TEI may be recommending something contrary to the XML specification. I had a quick look but didn't see where the XML spec said that attributes shouldn't be defined to have values in different languages? While the attribute values in the TEI often appear to be in English, they are technically not English but tokens matching the datatypes the TEI provides (or maps to W3C and XSD datatypes). However, that isn't how they are viewed in practice, I know. But redefining attribute value lists so that they match a project-specific list of values is certainly something the TEI encourages (whatever the language of those values). Where it provides suggested values these are to encourage a convergence to some form of standardisation by those not customising. (Standardisation through the appeal to apathy, I suppose. "If you can't be bothered to come up with a list of values, here are some we suggest.") Are you suggesting that if I have a <name type="gaijin"> (or <name type="外人">) and a <name type="patronymic"> in the same document I'm somehow in violation of the XML spec? Maybe I'm just confused by what you meant.

jamescummings commented 6 years ago

Hi @naoki-kokaze, some more comments:

I define this element by borrowing the model that @laurentromary presented in #1461. There, @laurentromary argues that att.norm is necessary when it comes to clarifying the complete form of the abbreviated name of the unit (e.g. <unit @norm="kilogram">kg</unit>). But, if so, I have come to think that it might be better to define <unitDef> as a member of att.lexicographic, as shown later.

I had entirely forgotten the need for @norm, my apologies. That makes sense now.

unitRef attribute would be better than @ref to explain the semantics presented in the header.

This makes sense to me, though I can see how some might say @ref should do dual duty.
@laurentromary would that make sense to you as well? That is, @ref on <unit> would be about the semantics of the unit, and a new attribute @unitRef could be used to point to the <unitDef>? Or is this overkill, and should these both be handled by @ref?

What do you think about this redefined ODD file?

I think it is looking really good!

laurentromary commented 6 years ago

I would tend to see this as overkill. The only difference between the two is that you point to a definition in the header as opposed to some kind of external authority. In both cases @ref would indicate the reference definition for the unit. Is that what you intended @naoki-kokaze ?

ebeshero commented 6 years ago

@naoki-kokaze @jamescummings @laurentromary Just a quick peep from here since I haven't checked in on this thread since it was introduced--I'm impressed with how quickly this is coming together! I think you've maybe resolved this already, but I'd like to second the use of att.lexicographic on unitDef for reasons of definition really--the kind of work we're doing here is defining and relating, so categorically, having membership in lexicographic makes sense. Indeed, we might want to consider tweaking our description of the attribute class here: currently it seems a weird choice for this purpose because the class is defined simply as "provides a set of attributes common to all elements in the dictionary module". That seems a little too simple, and perhaps we ought to be describing these attributes as distinguishing among different options, and normalizing values...thoughts?

laurentromary commented 6 years ago

Yes, there is some work to be done on att.lexicographic with better definitions, examples etc. We need a new ticket for this.

ebeshero commented 6 years ago

I've opened https://github.com/TEIC/TEI/issues/1720 to help work out a thoughtful rewriting of the description of att.lexicographic.

naoki-kokaze commented 6 years ago

@jamescummings @laurentromary Thank you for your feedback, which is really encouraging for me. I also agree with both of you that a unitRef attribute seems to be overkill and that the ref attribute is already enough for referencing the information in the header. I appreciate your comments.

naoki-kokaze commented 6 years ago

Sorry for the belated response. I'll check the new ticket!

duncdrum commented 6 years ago

@jamescummings

I had a quick look but didn't see where the XML spec said that attributes shouldn't be defined to have values in different languages?

I am referring to this section about the scope of @xml:lang, which incidentally seems to use a TEI code listing. To me the examples and the explanation are very clear.

<sp who="Faust" desc='leise' xml:lang="de">
  <l>Habe nun, ach! Philosophie,</l>
</sp>

While the attribute values in the TEI often appear to be in English they are technically not but tokens matching the datatypes the TEI provides (or maps to W3C and XSD datatypes)

I don't think that this is true. Sticking with the examples from this thread, @type='weight' and @commodity='copper': these are not externally defined or XSD datatypes. The SI definition (in English) should be 'mass' not 'weight', and 'copper' would be 'Cu'. I think there are many more suggested attribute values in the P5 guidelines for which there are no external international definitions, let alone obvious ones, like 'monograph', 'patronymic', etc. Thus under the scope of @xml:lang="de" these would have to be @type='Gewicht', @type='Masse', and @commodity='Kupfer' to be in line with the xml specs. (note: type not Typ)

Beyond that, even if all attribute values were mapped to trans-linguistic authorities, the strings that appear in the xml (the stuff being mapped) are clearly 'en'. Just on a practical level, what human or machine parser will find 外人 in its en, or weight in its de dictionary?

Are you suggesting that if I have a <name type="gaijin"> (or <name type="外人">) and a <name type="patronymic"> in the same document I'm somehow in violation of the XML spec?

No, not as such, but if you have:

<name type="外人"  subtype="patronymic"/>

then there is no possible @xml:lang attribute to cover this. The same with:

…
<person xml:lang='ja'>
  <name type="外人">Kirk</name>
  <name type="patronymic">Tiberious</name>
</person>

which seems to be in line with the current odd customisations mechanism, which retains 'en' default attribute values, or with @ebeshero recommendation that projects are free to forgo consistent transliteration and/or translation in mixed language tei documents above. In those cases I do think that the resulting document violates the xml specs.

Obviously all this can be avoided by using a fully localised version of TEI, or by using multiple xml:lang attributes. But if you agree with my analysis, I think the guidelines should be much more explicit about this particular problem, to prevent valid tei that violates xml specs. If anything, this discussion shows that it is far from clear or obvious what goes and what doesn't.

jamescummings commented 6 years ago

@duncdrum I think you may have a point here. But if one considers it problematic then I'm not so sure the solution is as easy as you suggest. i.e. with the

<name type="外人"  subtype="patronymic"/>

or other examples you give where you note that no single @xml:lang could cover both of these attribute values (or an attribute value and text child nodes). This is sufficiently different from the present issue that you should open a new one. I suspect that the W3C XML working group was not envisioning this problem with multilingual XML documents and would not have wished to forbid such uses. (But maybe I'm wrong...) A new issue would enable us to point them at this query to weigh in.

To expand a bit on the datatypes: the value of @type is a teidata.enumerated datatype, which is used to expressly recommend that the values be documented (in the project's ODD). That datatype is made up of a single token of teidata.word, which is provided by the TEI as a pattern: teidata.word = token { pattern = "(\p{L}|\p{N}|\p{P}|\p{S})+" }. @commodity is defined, poorly in my view, as "1–∞ occurrences of teidata.word separated by whitespace". I'd rather commodities also be pointers to some metadata description, perhaps an <object> element that will be proposed soon. (See the long history of #327.) You are right that these are not external definitions; these are, as I said, ones that the TEI provides. Where the TEI provides suggested value lists I believe it tries to use terms that are in common usage, but you are right that these are almost always in English and not necessarily SI descriptions. (Where we do recommend SI is on the existing @unit attribute, but even there we recognise people will need to use ad hoc value lists.) In all cases it is my understanding that the TEI recommends that users should be customising the TEI by constraining the attribute values to those which make sense within their context. Indeed for @type and @commodity on <measure> it provides no suggested attribute values. 'weight' is indeed used in examples and maybe that should be improved. (Compare this to @type on <num> where it does suggest "cardinal; ordinal; fraction; percentage" as common types of numbers.) But you are right that, for example, in a German or Japanese context those users who are using the 'English' values may be making a rod for their own back (and for those who use their texts in the future). I suspect most of them are ignoring the language of the attribute values or viewing them as tokens in no specific language. The TEI should potentially clarify this more.

I'm sure the TEI Council would be more than happy to receive individual bug reports on suggested attribute values which are mis-named (since, as these are only suggestions, it can change them without affecting backwards compatibility). The same goes for closed value lists whose values are incorrect in some way (see for example http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.divLike.html where if the @org or @sample attributes are used they must come from a fixed value list). However, in that case the burden of the problem must overcome the resistance to breaking backwards compatibility.

lb42 commented 6 years ago

I think the multi-language attribute-values issue is not significant. One of the motivations for the long-forgotten war on attributes with text-like values was precisely the realisation that attribute values (like element identifiers) in the TEI had to be considered as not being constructs in any particular language. They just look like English words (some of them -- not all!): there's no reason why your ODD should not define appropriate tokens in Japanese for the values of an attribute (though you might consider being kinder to the westerner/gaiji by transliterating them) while leaving the values of others unchanged from the (apparently English) tokens proposed by the TEI.
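For illustration, an ODD customisation along those lines might look roughly like the sketch below. It is only a sketch, not a recommendation: the choice of <name> and of these two values simply reuses the examples from earlier in this thread, and the <desc> wording is my own. Both tokens match the teidata.word pattern quoted above.

```xml
<!-- Sketch only: a project-specific closed value list for @type on name,
     mixing a Japanese token with an (apparently English) one. -->
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             ident="name" module="core" mode="change">
  <attList>
    <attDef ident="type" mode="change">
      <valList type="closed" mode="replace">
        <valItem ident="外人">
          <desc>name of a person regarded as a foreigner in the source</desc>
        </valItem>
        <valItem ident="patronymic">
          <desc>name derived from the name of a father or ancestor</desc>
        </valItem>
      </valList>
    </attDef>
  </attList>
</elementSpec>
```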

duncdrum commented 6 years ago

@lb42 while I would love to hear about the long forgotten attribute war, that is not what the specs say. They explicitly mention attribute values twice.

@jamescummings done. I'm actually willing to bet one catty (or pint) of beer that this scenario is exactly what the W3C XML working group was envisioning. Collectible at our first F2F.

lb42 commented 6 years ago

One of the motivations for the long-forgotten war on attributes with text-like values was precisely the realisation that attribute values (like element identifiers) in the TEI had to be considered as not being constructs in any particular language. They just look like English words (some of them -- not all!): there's no reason why your ODD should not define appropriate tokens in Japanese for the values of an attribute (though you might consider being kinder to the westerner/gaiji by transliterating them) while leaving the values of others unchanged from the (apparently English) tokens proposed by the TEI.

Your argument would have us consider that e.g. <encodingDesc xml:lang="de"> was also incoherent. But I agree the Guidelines do not make this point very explicit.

lb42 commented 6 years ago

I note also that the XML spec says "Applications determine which of an element's attribute values and which parts of its character content, if any, are treated as language-dependent values described by xml:lang." So even W3C doesn't really believe that things are as cut and dried as you suggest.

duncdrum commented 6 years ago

@lb42 I'm all on board with @id="T0091wackawack" being in no language space, but I don't see how weight is not English; look at the example @desc="leise". Also, where do the Guidelines define whether <bibl type="monograph"> is considered English, or tei private use language?

lb42 commented 6 years ago

Well, as the passage I quoted above shows, it is up to an application to decide what it thinks. In the case of the TEI (and this is what we need to make more explicit) "weight" is NOT English. It is "TEIspeak". I think you're attaching too much weight to the @desc="leise" example. I might well (if I am doing a multilingual project) have an ODD in which "leise", "lit", "reads" etc. are all available as possible values for the @desc attribute on <stage>. That still doesn't make any of them belong to the xml:lang-defined language space.

jamescummings commented 6 years ago

@duncdrum: I would be happy to buy you a pint of beer (not a catty of beer since now I'm unsure how much that would be) as I'm sure there would be interesting conversation. If the W3C truly wanted to forbid truly multi-language documents then I think that was a poor decision. I can understand why though, much easier to process documents if they are all in the same language or use a simple @xml:lang mechanism. I feel one should be able to have one attribute with values in one language, another attribute with values in another language, and completely different language for the element content if your use-case makes this the best way to express the data. I suspect people do this in the wild. About weight, if we changed 'weight' to be 'w31ght' would that be better? ;-)

@lb42 while I agree with you that attribute values (suggested or otherwise) are non-language-specific tokens, in the same way that element names are, we all know that people think <name> and indeed <persName> are English or English-derived values. I don't mind this at all as someone who likes torturing the English language. ;-) I agree that TEI should be clearer on the language spaces in use and how that relates to the XML Spec.

However, let's move any further comments on this aspect to #1721 and keep this issue for @naoki-kokaze's improvement on units. (Which I think is ready for @ebeshero to progress by badgering council, making sure it works with #1461, and testing the suggested ODD.)

naoki-kokaze commented 6 years ago

@duncdrum @jamescummings @lb42 Thank you for your really interesting discussions on multi-lingual usage in TEI and our 'TEIspeak' ing. Also, I love "a catty of beer"!

Especially, @jamescummings I'm really grateful to you for managing our discussions and checking the ODD file at the same time. I am looking forward to improving the ODD through technical feedback from council members @ebeshero. Wishing you all the best.

jamescummings commented 6 years ago

@naoki-kokaze No problem, this has gone from poster to proposed ODD quite quickly, let's see how long it now takes to actually get into the Guidelines. ;-) [@duncdrum If using modern catty measurements and we round 1 catty to approx 605 grams, then since a proper pint is 568mL, depending on the density of the beer in question let's say 1.050, then 1 catty of beer is just a bit more than 1 pint of beer. Maybe slightly less if it is porter.]
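The back-of-envelope arithmetic behind that bracket, reading the density of 1.050 as g/mL (an assumption about the unit, since the comment gives only the number):

```latex
\begin{align*}
m_{\text{pint}} &= 568\,\text{mL} \times 1.050\,\text{g/mL} = 596.4\,\text{g} \\
V_{\text{catty}} &= \frac{605\,\text{g}}{1.050\,\text{g/mL}} \approx 576.2\,\text{mL} \\
\frac{V_{\text{catty}}}{V_{\text{pint}}} &= \frac{576.2\,\text{mL}}{568\,\text{mL}} \approx 1.014
\end{align*}
```

So a catty comes out at roughly 1.4% more than a pint at that density. The two would be equal at a density of about 605/568 ≈ 1.065 g/mL, and anything denser than that would put a catty below a pint.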

ebeshero commented 6 years ago

We're creating a subgroup to concentrate on this issue: @emylonas will convene all of us discussing this ticket in March.

ebeshero commented 6 years ago

@jamescummings @sydb @emylonas Checking in: where are we on this ticket, and what can we do on it ahead of the upcoming release?

ebeshero commented 6 years ago

Rephrasing my last: This is also effectively checking in on #1461 and #1720 , both connected to this ticket. I'm working on #1720 now, in case it will help us here.

jamescummings commented 6 years ago

I think the elementSpecs produced by @naoki-kokaze at https://github.com/TEIC/TEI/issues/1707#issuecomment-346844499 have consensus (much of the discussion after that is interesting but not really about this ticket). Barring a few tweaks here and there (and their relation to the other tickets), I'm still of the opinion that this is in an almost implementable state in its 24 November 2017 iteration.

ebeshero commented 6 years ago

+1 from me, but hoping we can work out the next steps: we approved the new element <unit>, so do we need to implement it first, and then set up the elementSpecs worked out here? I wonder if we want to do this work in the face-to-face together, or ahead of it... (Okay, or right now, but there's just one day to freeze before our next release.)

laurentromary commented 6 years ago

Well, if we could have <unit> on the fly before the release, that would be great!

ebeshero commented 6 years ago

@laurentromary I bet we can do that much, anyway... Stay tuned...
