TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

Suggestion for new uniHan element #1805

Closed duncdrum closed 2 years ago

duncdrum commented 6 years ago

The UAX #38 Unicode Han Database (Unihan) is an important resource for folks encoding ideographic (as the Unicode Standard uses the term) texts. It expresses, among other things, sorting rules, variant relations, and dictionary references; most importantly for the TEI, in my view, it contains normalized references to different national standards.

Currently encoders can use

<localName type='some-national-standard'>some-name</localName>
<value>some value</value>
<!-- or -->
<localName type='unihan'> kRSKangXi </localName>
<value>120.5</value> 
<!-- or -->
<localName type='uax38'> kRSKangXi </localName>
<value>120.5</value> 
...

Of the ~90 properties, only a handful apply to unicodeName proper, and only in the rarest of cases do Unihan properties apply to texts not using CJK(V). Since the use cases and audiences are fairly distinct, and since the UC treats them separately, I'd like to suggest that TEI follow the UC and add a new uniHan element to make these properties available in a structured fashion.

The alternative would be to allow @type on unicodeName and make it a closed list. But this seems a dangerous path: what next, @subtype? Similarly, adding all possible values as legal contents of unicodeName seems a bad idea: they don't apply equally to all users of unicodeName, who shouldn't be forced to dig through 200 values just to find the one they need. Crucially, this would weaken the close alignment between Unicode property names as the UC understands the term and as TEI would then implement it. Not all Unihan properties are part of the UCD RNC schema; see #1804.

The content model of the suggested element would closely mirror that of unicodeName; it would be part of the gaiji module and appear in charProp. In essence, the following should become valid TEI.

<charProp>
  <unicodeName>General_Category</unicodeName>
  <value>Lo</value>
</charProp>
<charProp>
  <uniHan>kRSKangXi</uniHan>
  <value>120.5</value>
</charProp>

Biggest challenge: UC defines legal values using RegEx features (grapheme matching) that no xml regex engine I know of currently supports. About 20 would need to be rewritten, to produce similar results.
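To illustrate the easier end of that problem, here is a minimal sketch of value validation for a property whose syntax needs no grapheme matching. The pattern for kRSKangXi is an assumption modelled on the "radical.strokes" syntax in UAX #38 and should be checked against the release you target; the function name is hypothetical. Python's standard `re` module is used purely for illustration and does not address the grapheme-matching properties mentioned above.

```python
import re

# Assumed pattern for kRSKangXi values ("radical.strokes"), modelled on the
# syntax given in UAX #38. Verify it against the Unihan release you target.
KRS_KANGXI = re.compile(r"[1-9][0-9]{0,2}\.-?[0-9]{1,2}")


def is_valid_krs_kangxi(value: str) -> bool:
    """Return True if the whole string matches the assumed kRSKangXi pattern."""
    return KRS_KANGXI.fullmatch(value) is not None


print(is_valid_krs_kangxi("120.5"))   # value used in the examples above
print(is_valid_krs_kangxi("120,5"))   # wrong separator
```

A schema would express the same constraint as a datatype pattern; the sketch only shows the kind of check that is meant.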

martindholmes commented 6 years ago

Wouldn't the example be:

<charProp>
  <unicodeName>General_Category</unicodeName>
  <value>Lo</value>
</charProp>
<charProp>
  <uniHan>kRSKangXi</uniHan>
  <value>120.5</value>
</charProp>

Would you imagine using it under both <char> and <glyph>, or only one of them?

duncdrum commented 6 years ago

Good catch, I've fixed the listing. Yes, I'd say both glyph and char, keeping things the way they work now.

martindholmes commented 6 years ago

@duncdrum could you provide a sort of "worked example" of how you would use this in a real-life situation, explained at a level the likes of me would understand?

Also, my first reading of UAX #38 leads me to believe that you could look up a character in the Unihan db based on its Unicode codepoint. I guess that what this means is that the same Unicode codepoint may have multiple UniHan entries, and therefore that the use of this element would be primarily to specify that in the document in question, the use of this particular Unicode character should be understood in the context of a given UniHan entry. Am I understanding correctly, or is this overly simplistic?

jamescummings commented 6 years ago

Is the list of unihan properties fixed or are they likely to change (and if so, how frequently)?

martindholmes commented 6 years ago

The Council open meeting suggests watching the panel presentations tomorrow and then putting the ticket on the agenda for the next Council meeting.

duncdrum commented 6 years ago

@jamescummings the list is not fixed. The Unihan database works like the Unicode Character Database, so individual character codepoints can have anywhere from 4 to 90 properties with defined values (I don't think there are any booleans in this one).

New properties can and will be added if, e.g., a national body embarks on a new character encoding design (currently underway for the PRC) or new standard dictionaries make them desirable. As you can guess, these are not frequent events; looking back, I'd say new properties get added every 3-4 years. However, there is nothing to prevent any number of properties getting added tomorrow. New value entries for existing properties arrive with every release.

Properties can be deprecated, but this only means that no new values will be entered into the database; existing entries remain unchanged. TEI encoders might sensibly wish to supply values for such properties.

The main point of using the element in TEI is to supply property-value pairs for existing codepoints in a structured fashion. This applies to a large number of only partially included authority references. Ideally, a valid TEI segment could greatly simplify submission requests for new values to be added to the database, which relies on user submissions. More importantly, the reason for giving it its own element is to make properties available for PUA characters inside TEI documents, which by definition have no Unicode or Unihan properties and therefore fall through all the cracks of normal text processing tasks. Validating normative Unihan properties and their associated value entries would ensure that PUA characters can be processed by any Unicode-compliant application.
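As a rough sketch of what "property-value pairs in a structured fashion" could buy a processing tool, the following reads the pairs for one character definition into a dictionary, using the element names from the proposal at the top of this ticket. The function name is hypothetical, and the TEI namespace is left out for brevity (an assumption; real documents are namespaced).

```python
import xml.etree.ElementTree as ET

# Element names that can carry the property name inside <charProp>;
# <uniHan> is the element proposed in this ticket, not an existing one.
NAME_TAGS = ("unicodeName", "uniHan", "localName")


def property_pairs(char: ET.Element) -> dict[str, str]:
    """Collect name/value pairs from the <charProp> children of a <char>."""
    pairs = {}
    for prop in char.findall("charProp"):
        value = prop.findtext("value")
        for tag in NAME_TAGS:
            name = prop.findtext(tag)
            if name is not None and value is not None:
                pairs[name.strip()] = value.strip()
    return pairs


sample = ET.fromstring(
    "<char>"
    "<charProp><unicodeName>General_Category</unicodeName><value>Lo</value></charProp>"
    "<charProp><uniHan>kRSKangXi</uniHan><value>120.5</value></charProp>"
    "</char>"
)
print(property_pairs(sample))
```

With such a lookup, an application could treat a PUA character much like an assigned codepoint for the properties the encoder has supplied.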

martindholmes commented 6 years ago

In the new uniHan spec, it looks like the content model is defined with RNG. I'm guessing you did this because pureODD is not yet processed correctly. But I see from the discussion above that it should be encoding property-value pairs, not just properties -- did I get that wrong?

martindholmes commented 5 years ago

Ping @duncdrum : In the new uniHan spec, it looks like the content model is defined with RNG. I'm guessing you did this because pureODD is not yet processed correctly. But I see from the discussion above that it should be encoding property-value pairs, not just properties -- did I get that wrong?

duncdrum commented 5 years ago

@martindholmes sorry for missing your question earlier. You are correct: the RNG was because of pureODD processing problems, and it should be key-value pairs.

sydb commented 5 years ago

@duncdrum & @martindholmes : To what RELAX NG schema are you two referring? I.e., where is the proposed UniHan spec? It may be just that I am extra tired this morning, but I don’t remember where it is, and don’t see any links here (or in #1804) other than to the UC’s schema.

martinascholger commented 5 years ago

@sydb: see [https://github.com/duncdrum/TEI/commit/51e8d1afbefc7f97988f8312dec64ce649e04791]

martindholmes commented 5 years ago

@sydb The spec for the unihan element doesn't use Pure ODD, unfortunately, so it couldn't be integrated as it is. @duncdrum suggests that when he wrote it in Pure ODD, there was a processing problem, so I think before we can do anything else we would need to raise a ticket for that (if there isn't one already) and fix it. We don't want to re-introduce RNG content models into the source.

martinascholger commented 5 years ago

This issue has been discussed in dedicated meetings.

Short: Our recommendation is that <charProp> is replaced by three separate elements: <unicodeProp>, <unihanProp>, and <localProp>. Each of these has a @name attribute. The values of unicodeProp/@name and unihanProp/@name could be constrained to a controlled vocabulary; the values of localProp/@name would obviously not be constrained to a list, but likely would be limited to xs:NCName or some such.

The subgroup came up with the following proposal to include unihan properties:

Subgroup decided to go with the proposed version with attributes instead of elements to express name-value pairs:

current encoding method:

<char>
  <charProp>
    <unicodeName>character-decomposition-mapping</unicodeName>
    <value>circle</value>
  </charProp>
</char>

new encoding method:

<char>
  <unicodeProp name="Decomposition_Mapping" value="circle"/>
</char>
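For existing documents, the move from the element-based to the attribute-based encoding could be scripted. The following is only a sketch under stated assumptions: the function name and the name-mapping table are hypothetical, the TEI namespace is omitted for brevity, and only the property types shown in this ticket are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from old-style names to normative UCD property names.
NAME_MAP = {"character-decomposition-mapping": "Decomposition_Mapping"}


def convert_char_props(char: ET.Element) -> None:
    """Replace each <charProp> child of `char` with an attribute-based element."""
    for index, child in enumerate(list(char)):
        if child.tag != "charProp":
            continue
        value = (child.findtext("value") or "").strip()
        unicode_name = child.findtext("unicodeName")
        local_name = child.findtext("localName")
        if unicode_name is not None:
            name = unicode_name.strip()
            attrs = {"name": NAME_MAP.get(name, name), "value": value}
            new = ET.Element("unicodeProp", attrs)
        elif local_name is not None:
            attrs = {"name": local_name.strip(), "value": value}
            new = ET.Element("localProp", attrs)
        else:
            continue  # leave unrecognised content untouched
        new.tail = child.tail  # preserve surrounding whitespace
        char[index] = new


old = ET.fromstring(
    "<char><charProp>"
    "<unicodeName>character-decomposition-mapping</unicodeName>"
    "<value>circle</value>"
    "</charProp></char>"
)
convert_char_props(old)
print(ET.tostring(old, encoding="unicode"))
```

A production conversion would more likely be an XSLT identity transform; Python is used here only to keep the sketch self-contained.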

Thus the example in the tagdoc for would become:

<char xml:id="circledU4EBA">
  <charName>CIRCLED IDEOGRAPH 4EBA</charName>
  <unicodeProp name="character-decomposition-mapping" value="circle"/>
  <localProp name="daikanwa" value="36"/>
  <mapping type="standard">人</mapping>
</char>

@duncdrum offered to start with Phase 0 and Phase 1 and will do a pull request.

duncdrum commented 5 years ago

that last example should be

<char xml:id="circledU4EBA">
  <charName>CIRCLED IDEOGRAPH 4EBA</charName>
  <unicodeProp name="Decomposition_Mapping" value="circle"/>
  <localProp name="daikanwa" value="36"/>
  <mapping type="standard">人</mapping>
</char>

duncdrum commented 5 years ago

looking for input: there are some UCD name-only values, which are either there or not. With the new elements we could allow e.g.

<unicodeProp name="blah"/>

to mirror this, and subsequently allow similar structures on <localProp>. However, I'm more inclined to make @value mandatory and require one of the following alternatives:

<unicodeProp name="blah" value="True"/>
<unicodeProp name="blah" value="blah"/>

But this is all gut feeling; does anybody have an argument for or against any of these options?
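If @value does become mandatory, a checker would need to flag the name-only form. A minimal sketch of such a check, assuming the three proposed element names and ignoring namespaces (the function name is hypothetical):

```python
import xml.etree.ElementTree as ET

PROP_TAGS = {"unicodeProp", "unihanProp", "localProp"}


def missing_values(char: ET.Element) -> list[str]:
    """Return the @name of every property element that lacks a @value."""
    return [
        el.get("name", "")
        for el in char.iter()
        if el.tag in PROP_TAGS and el.get("value") is None
    ]


sample = ET.fromstring(
    '<char>'
    '<unicodeProp name="blah"/>'
    '<unicodeProp name="Decomposition_Mapping" value="circle"/>'
    '</char>'
)
print(missing_values(sample))
```

In the schema itself this would simply be a required attribute; the sketch only makes the consequence of that choice concrete.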

duncdrum commented 5 years ago

I would actually like to move the prose and example changes in the Guidelines section into phase 2, once I'm clearer on the deprecation procedure.

martindholmes commented 5 years ago

@duncdrum Your second approach to name-only values is a little like the XHTML5 approach, which sets the attribute to a value identical with its name:

selected="selected"

So I think I like #2 best.

duncdrum commented 5 years ago

@martindholmes yup, that was my inspiration. However, I noticed that the old localName specs contain a recommendation for Entity; not sure where that came from, but I'll dig around a bit before I introduce a change.

duncdrum commented 5 years ago

so here is something we missed in our discussion: charName (and glyphName) is really syntactic sugar for the Unicode name property. Granted, we can also use it for localName characters, but it isn't very helpful there, as the Guidelines already acknowledge.

Since I'm changing this whole gig, wouldn't it make more sense to also deprecate charName and switch to:

<unicodeProp name="name" value="CIRCLED IDEOGRAPH 4EBA"/>
<!-- or for local -->
<localProp name="name" value="duncdrum reign symbol"/>

@martindholmes I ended up with True and False: UCD defines these name-only properties as boolean, so using XML boolean seems less confusing.

duncdrum commented 4 years ago

note to self unit-test for count of properties which should only ever increase
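A minimal sketch of what such a test could look like. The two sets are hypothetical excerpts, not the real value lists, and the function names are made up for illustration; in practice the lists would be extracted from two versions of the spec.

```python
def dropped_properties(previous: set[str], current: set[str]) -> set[str]:
    """Return names present in the previous list but missing from the current one."""
    return previous - current


def test_property_list_only_grows() -> None:
    # Hypothetical excerpts of two consecutive value lists.
    previous = {"kRSKangXi", "kTotalStrokes"}
    current = {"kRSKangXi", "kTotalStrokes", "kUnihanCore2020"}
    # If nothing was dropped, the count cannot have decreased.
    assert dropped_properties(previous, current) == set()
    assert len(current) >= len(previous)


test_property_list_only_grows()
```

Checking for dropped names rather than only comparing counts also catches the case where one property is removed and another added in the same change.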

martinascholger commented 4 years ago

The content of the two <desc> elements (one has type="deprecationInfo") in the deprecated specifications (charProp, value, charName, glyph) is output consecutively without a space, starting with a lower case letter. See https://jenkins.tei-c.org/job/TEIP5-dev/lastSuccessfulBuild/artifact/P5/release/doc/tei-p5-doc/en/html/ref-charProp.html How did we handle this so far?

sydb commented 4 years ago

I do not think this circumstance has ever occurred before. But it is a release blocker, as far as I’m concerned. My first thought is that some part of the processing is looking for tei:desc when it should be looking for tei:desc[not(@type eq 'deprecationInfo')]. I’m pretty busy on #1625 right now, but hope to look at this tomorrow.
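The selection described here can be illustrated outside the build. The actual processing is presumably XSLT; the following is only a Python sketch of the same filter, with a hypothetical function name and made-up <desc> text.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def plain_descs(spec: ET.Element) -> list[ET.Element]:
    """Return <desc> children whose @type is not 'deprecationInfo'.

    Like the XPath predicate, this keeps <desc> elements with no @type at all.
    """
    return [
        desc
        for desc in spec.findall(f"{{{TEI_NS}}}desc")
        if desc.get("type") != "deprecationInfo"
    ]


spec = ET.fromstring(
    f'<elementSpec xmlns="{TEI_NS}" ident="charProp">'
    "<desc>illustrative description text</desc>"
    '<desc type="deprecationInfo">illustrative deprecation note</desc>'
    "</elementSpec>"
)
print([desc.text for desc in plain_descs(spec)])
```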

duncdrum commented 4 years ago

first off 🎉

note to self about reverting the <glyph> deprecation in 85572ec62a4ac0f7df1962be744d8c7775147db9: @sydb raised the question of adding <char> to att.typed. My question, however, remains whether that is still necessary, since <g/> already is a member. Basically, if editors want to highlight the distinction between character and glyph variants, they should do so in the main text, and can thus make do with a single definition which in one context can be a glyph variant but in another a char.

This is a slightly adapted example from UTS #37 and valid as of 4.0.0:

<p>
  <choice>
    <sic><g ref="#ashi4" type="glyph"/></sic>
    <corr>芦</corr>
   </choice>田さんは<g ref="#ashi4" type="character"/>屋のお嬢様だ
</p>

There is more to this, based on the fixed context inherited from the @encoding of the XML prolog. Basically, whatever scholarly distinctions we imitate, that one provides a hard context for what is and what isn't a char inside any TEI XML document.

raffazizzi commented 4 years ago

@duncdrum the council is meeting (virtually) for our face-to-face and we'd like to ask you about your plan for the next phases (I believe it's phase 3 onward though you did some of 5 for the last release as well, right?)

duncdrum commented 4 years ago

@raffazizzi yes my plan is to have the changes (including phase5) ready for the next release. If all goes according to plan, they should be top of my TODO pile in about 2 weeks give or take.

raffazizzi commented 4 years ago

@duncdrum we're now working on the next release and I've acted too late to send you a timely reminder :( Either way, let us know if you have any updates!

duncdrum commented 4 years ago

@raffazizzi yes, this will not be done for the next release; other things have bumped up the priority pile, sorry. I'll get back to it as soon as things have settled down again, so it is not abandoned.

747 commented 3 years ago

Hi, the latest version of UAX#38 (Unihan), released with Unicode 13.0, has rather big changes in property fields.

  • Reissued for Unicode 13.0.0.
  • Updated regular expressions for kIRG_GSource, kIRG_HSource, kIRG_JSource, kIRG_KPSource, kIRG_KSource, kIRG_TSource, kIRG_USource, and kIRG_VSource.
  • Added kIRG_SSource, kIRG_UKSource, kTGHZ2013, kSpoofingVariant, and kUnihanCore2020 fields.
  • Removed kRSJapanese, kRSKanWa, and kRSKorean fields.
  • Revised format of tables in Sections 4.2, 4.3, and 4.4 for legibility.
  • Added a "Count" column to the tables in Section 4.4.
  • Moved kTotalStrokes from Unihan_DictionaryLikeData.txt to Unihan_IRGSources.txt.

The newly added field names currently seem unavailable in TEI; it would be very much appreciated if they were included in the next version. Removed ones don't have to be deleted, as that would affect backward compatibility.
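The update policy suggested here (add the new names, keep the removed ones) amounts to a set union. A small sketch, where the field names come from the change list quoted above and the starting list is a hypothetical excerpt rather than the real TEI value list:

```python
# Field names taken from the Unicode 13.0 change list quoted above.
ADDED_IN_13_0 = {
    "kIRG_SSource", "kIRG_UKSource", "kTGHZ2013",
    "kSpoofingVariant", "kUnihanCore2020",
}
REMOVED_IN_13_0 = {"kRSJapanese", "kRSKanWa", "kRSKorean"}


def updated_value_list(current: set[str]) -> set[str]:
    """Add new field names; removed ones are kept for backward compatibility."""
    return current | ADDED_IN_13_0


# Hypothetical excerpt of an existing value list.
existing = {"kRSKangXi", "kRSJapanese", "kTotalStrokes"}
updated = updated_value_list(existing)
print(sorted(updated & REMOVED_IN_13_0))  # removed upstream, still accepted
```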

747 commented 3 years ago

att.gaijiProp currently contains an attribute version, which is described as "specifies the version number of the Unicode Standard in which this property name is defined." However, the fact that <unicodeProp>, <unihanProp>, and <localProp> all share that attribute could make it counterintuitive. Unihan has its own versioning scheme, and any regional or specialized character set that can be cited with <localProp> can have its own versions unrelated to Unicode. Could the definition of version be changed to specify the version of what each element denotes? Or has there already been a discussion on this matter (if so, a pointer to it would be appreciated)?

duncdrum commented 3 years ago

@747 This attribute has been included unchanged from the previous way of dealing with gaiji to introduce minimal changes. I take your point about it with respect to localProp but the discussion of @version is in separate tickets, and only indirectly related to this issue.

raffazizzi commented 2 years ago

This looks merged and completed; closing