Closed duncdrum closed 2 years ago
Wouldn't the example be:
<charProp>
<unicodeName>General_Category</unicodeName>
<value>Lo</value>
</charProp>
<charProp>
<uniHan>kRSKangXi</uniHan>
<value>120.5</value>
</charProp>
Would you imagine using it under both <char> and <glyph>, or only one of them?
Good catch, fixed the listing. Yes, I'd say both glyph and char, keeping things the way they work now.
@duncdrum could you provide a sort of "worked example" of how you would use this in a real-life situation, explained at a level the likes of me would understand?
Also, my first reading of UAX #38 leads me to believe that you could look up a character in the Unihan db based on its Unicode codepoint. I guess that what this means is that the same Unicode codepoint may have multiple UniHan entries, and therefore that the use of this element would be primarily to specify that in the document in question, the use of this particular Unicode character should be understood in the context of a given UniHan entry. Am I understanding correctly, or is this overly simplistic?
Is the list of unihan properties fixed or are they likely to change (and if so, how frequently)?
Council open meeting suggests watching panel presentations tomorrow, and then put the ticket on the agenda for next Council meeting.
@jamescummings the list is not fixed. The Unihan database works like the Unicode Character Database, so individual character codepoints can have anywhere from 4 to 90 properties with defined values (I don't think there are any booleans in this one).
New properties can and will be added if, e.g., a national body embarks on a new character encoding design (currently underway for the PRC) or new standard dictionaries make them desirable. As you can guess, these are not frequent events. Looking back, I'd say new properties get added every 3-4 years; however, there is nothing to prevent any number of properties getting added tomorrow. New value entries for existing properties happen with every release.
Properties can be deprecated, but this only means that no new values will be entered into the database; existing entries remain unchanged. TEI encoders might sensibly wish to supply values for such properties.
The main point of using the element in TEI is to supply property-value pairs for existing codepoints in a structured fashion. This applies to a large number of only partially included authority references. Ideally, a valid TEI segment could greatly simplify the submission requests for new values to be added to the database, which relies on user submissions. More importantly, the reason for having its own element is to make properties available for PUA characters inside TEI documents, which by definition have no Unicode or Unihan properties, and therefore fall through all the cracks of normal text processing tasks. Validating normative Unihan properties and their associated value entries would ensure that PUA characters can be processed by any Unicode-compliant application.
In the new uniHan spec, it looks like the content model is defined with RNG. I'm guessing you did this because pureODD is not yet processed correctly. But I see from the discussion above that it should be encoding property-value pairs, not just properties -- did I get that wrong?
Ping @duncdrum : In the new uniHan spec, it looks like the content model is defined with RNG. I'm guessing you did this because pureODD is not yet processed correctly. But I see from the discussion above that it should be encoding property-value pairs, not just properties -- did I get that wrong?
@martindholmes sorry for missing your question earlier. You are correct: RNG was used because of pureODD processing problems, and it should be key-value pairs.
@duncdrum & @martindholmes : To what RELAX NG schema are you two referring? I.e., where is the proposed UniHan spec? It may be just that I am extra tired this morning, but I don’t remember where it is, and don’t see any links here (or in #1804) other than to the UC’s schema.
@sydb The spec for the unihan element doesn't use Pure ODD, unfortunately, so it couldn't be integrated as it is. @duncdrum suggests that when he wrote it in Pure ODD, there was a processing problem, so I think before we can do anything else we would need to raise a ticket for that (if there isn't one already) and fix it. We don't want to re-introduce RNG content models into the source.
This issue has been discussed in dedicated meetings.
Short: Our recommendation is that <charProp> is replaced by three separate elements: <unicodeProp>, <unihanProp>, and <localProp>.
Each of these has a @name attribute. The values of unicodeProp/@name and unihanProp/@name could be constrained to a controlled vocabulary; the values of localProp/@name would obviously not be constrained to a list, but likely would be limited to xs:NCName or some such.
The subgroup came up with the following proposal to include unihan properties:
- att.gaijiProperties with @name and @value, to be used by <unicodeProp>, <localProp>, and <unihanProp>.
- <unicodeProp>, <unihanProp>, and <localProp> elements with closed valList for unicodeProp/@name and unihanProp/@name, along with prose changes. (This will likely necessitate some work on Stylesheets so it looks OK.) This can be done now, because we don’t need to fix the element content model processing bug. Prose must be added too.
NOTE: The entry for charProp in the content models of <char> and <glyph> becomes instead ( charProp* | ( unicodeProp | unihanProp | localProp )* ), and a) <charProp> is deprecated (for years), and b) the three new elements are members of a new class that has a required @name (which used to be the content) and a maybe required @value (which used to be the sibling <value>).
- <value>, <property> and <charProp>
- @values based on @name and regexes from TR 38. If it’s not too onerous to run and to maintain, integrate it.
- unicodeName to deprecate the old mechanisms and recommend the new attributes.
- @name and @value.
Subgroup decided to go with the proposed version with attributes instead of elements to express name-value pairs:
current encoding method:
<char>
<charProp>
<unicodeName>character-decomposition-mapping</unicodeName>
<value>circle</value>
</charProp>
</char>
new encoding method:
<char>
<unicodeProp name="Decomposition_Mapping" value="circle"/>
</char>
Thus the example in the tagdoc for
<char xml:id="circledU4EBA">
<charName>CIRCLED IDEOGRAPH 4EBA</charName>
<unicodeProp name="character-decomposition-mapping" value="circle"/>
<localProp name="daikanwa" value="36"/>
<mapping type="standard">人</mapping>
</char>
@duncdrum offered to start with Phase 0 and Phase 1 and will do a pull request.
that last example should be
<char xml:id="circledU4EBA">
<charName>CIRCLED IDEOGRAPH 4EBA</charName>
<unicodeProp name="Decomposition_Mapping" value="circle"/>
<localProp name="daikanwa" value="36"/>
<mapping type="standard">人</mapping>
</char>
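The move from the element-based to the attribute-based encoding shown above can be sketched as a small migration script. This is only an illustration in Python, not part of the proposal: the function name `convert_char_props` is hypothetical, TEI namespaces are ignored for brevity, and it assumes every `<charProp>` holds one `<unicodeName>` or `<localName>` plus a sibling `<value>`.

```python
import xml.etree.ElementTree as ET

# Old name-bearing child of <charProp> -> new attribute-based element.
NEW_ELEMENT = {"unicodeName": "unicodeProp", "localName": "localProp"}


def convert_char_props(char: ET.Element) -> ET.Element:
    """Replace each <charProp> child of `char`, in place, with its attribute-based equivalent."""
    for index, child in enumerate(list(char)):
        if child.tag != "charProp":
            continue
        value = child.findtext("value", default="")
        for old_tag, new_tag in NEW_ELEMENT.items():
            name = child.findtext(old_tag)
            if name is not None:
                new = ET.Element(new_tag, {"name": name.strip(), "value": value.strip()})
                new.tail = child.tail  # keep surrounding whitespace
                char[index] = new      # one-for-one swap, so indexes stay stable
                break
    return char


# Assumption: un-namespaced input, as in the snippets in this thread.
old = ET.fromstring(
    "<char><charProp><unicodeName>Decomposition_Mapping</unicodeName>"
    "<value>circle</value></charProp></char>"
)
converted = convert_char_props(old)
print(ET.tostring(converted, encoding="unicode"))
```

A `<charProp>` that contains neither name element is left untouched, so unexpected content is not silently dropped.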
Looking for input: some UCD properties are name-only values; they are either there or not. With the new elements we could allow e.g.
<unicodeProp name="blah"/>
to mirror this, and subsequently allow similar structures on <localProp>. However, I'm more inclined to make @value mandatory and require one of the following alternatives:
<unicodeProp name="blah" value="True"/>
<unicodeProp name="blah" value="blah"/>
But this is all gut feeling. Does anybody have an argument for or against any of these options?
I would actually like to move the prose and example changes in the Guidelines section into phase 2, once I'm clearer on the deprecation procedure.
@duncdrum Your second approach to name-only values is a little like the XHTML5 approach, which sets the attribute to a value identical with its name:
selected="selected"
So I think I like #2 best.
@martindholmes yup, that was my inspiration. However, I noticed that the old localName specs contain a recommendation for Entity; not sure where that came from, but I'll dig around a bit before I introduce a change.
So here is something we missed in our discussion: charName (and glyphName) is really syntactic sugar for the Unicode name property. Granted, we can use it also for localName characters, but it isn't very helpful there, as the Guidelines already acknowledge.
Since I'm changing this whole gig, wouldn't it make more sense to also deprecate charName and switch to:
<unicodeProp name="name" value="CIRCLED IDEOGRAPH 4EBA"/>
<!-- or for local -->
<localProp name="name" value="duncdrum reign symbol"/>
@martindholmes I ended up with True, False. UCD defines these name-only properties as boolean, so using XML boolean seems less confusing.
Note to self: unit-test for the count of properties, which should only ever increase.
The contents of the two <desc> elements (one has type="deprecationInfo") in the deprecated specifications (charProp, value, charName, glyph) are output consecutively without a space between them, starting with a lower-case letter. See https://jenkins.tei-c.org/job/TEIP5-dev/lastSuccessfulBuild/artifact/P5/release/doc/tei-p5-doc/en/html/ref-charProp.html How did we handle this so far?
I do not think this circumstance has ever occurred before. But it is a release blocker, as far as I’m concerned.
My first thought is that some part of the processing is looking for tei:desc when it should be looking for tei:desc[not(@type eq 'deprecationInfo')].
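The filtering idea behind that XPath predicate can be sketched outside the Stylesheets as follows. This is an illustrative Python approximation, not the actual Stylesheets code: `split_descs` is a hypothetical helper, namespaces are omitted, and the two description strings are placeholders rather than the real spec text.

```python
import xml.etree.ElementTree as ET


def split_descs(spec: ET.Element):
    """Return (ordinary description texts, deprecation notice texts) for one spec element."""
    ordinary, deprecation = [], []
    for desc in spec.findall("desc"):
        text = (desc.text or "").strip()
        # Same test as tei:desc[not(@type eq 'deprecationInfo')], written in Python.
        if desc.get("type") == "deprecationInfo":
            deprecation.append(text)
        else:
            ordinary.append(text)
    return ordinary, deprecation


# Placeholder spec: the description strings below are invented for the example.
spec = ET.fromstring(
    '<elementSpec ident="charProp">'
    "<desc>placeholder description of the element.</desc>"
    '<desc type="deprecationInfo">placeholder deprecation notice.</desc>'
    "</elementSpec>"
)
ordinary, deprecation = split_descs(spec)
# Joining with an explicit separator avoids the run-together output described above.
print(" ".join(ordinary + deprecation))
```

Keeping the two kinds of `<desc>` in separate lists lets the caller decide how to render the deprecation notice instead of concatenating everything it finds.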
I’m pretty busy on #1625 right now, but hope to look at this tomorrow.
first off 🎉
Note to self about reverting the <glyph> deprecation in 85572ec62a4ac0f7df1962be744d8c7775147db9: @sydb raised the question of adding <char> to att.typed. My question, however, remains whether that is still necessary, since <g/> already is a member. Basically, if editors want to highlight the distinction between character and glyph variants, they should do so in the main text, thus being able to make do with a single definition which in one context can be a glyph variant, but in another a char.
This is a slightly adapted example from UTS #37, valid as of 4.0.0:
<p>
<choice>
<sic><g ref="#ashi4" type="glyph"/></sic>
<corr>芦</corr>
</choice>田さんは<g ref="#ashi4" type="character"/>屋のお嬢様だ
</p>
There is more to this, based on the fixed context inherited from the @encoding of the XML prolog. Basically, whatever scholarly distinctions we imitate, that one provides a hard context for what is and what isn't a char inside any TEI XML document.
@duncdrum the council is meeting (virtually) for our face-to-face and we'd like to ask you about your plan for the next phases (I believe it's phase 3 onward though you did some of 5 for the last release as well, right?)
@raffazizzi yes my plan is to have the changes (including phase5) ready for the next release. If all goes according to plan, they should be top of my TODO pile in about 2 weeks give or take.
@duncdrum we're now working on the next release and I've acted too late to send you a timely reminder :( Either way, let us know if you have any updates!
@raffazizzi yes, this will not be done for the next release; other things have bumped up the priority pile, sorry. I'll get back to it as soon as things have settled down again, so it is not abandoned.
Hi, the latest version of UAX#38 (Unihan), released with Unicode 13.0, has rather big changes in property fields.
- Reissued for Unicode 13.0.0.
- Updated regular expressions for kIRG_GSource, kIRG_HSource, kIRG_JSource, kIRG_KPSource, kIRG_KSource, kIRG_TSource, kIRG_USource, and kIRG_VSource.
- Added kIRG_SSource, kIRG_UKSource, kTGHZ2013, kSpoofingVariant, and kUnihanCore2020 fields.
- Removed kRSJapanese, kRSKanWa, and kRSKorean fields.
- Revised format of tables in Sections 4.2, 4.3, and 4.4 for legibility.
- Added a "Count" column to the tables in Section 4.4.
- Moved kTotalStrokes from Unihan_DictionaryLikeData.txt to Unihan_IRGSources.txt.
The newly added field names currently seem to be unavailable in TEI; it would be much appreciated if they were included in the next version. The removed ones don't have to be deleted, since deleting them would affect backward compatibility.
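The "add new fields, keep removed ones" policy requested here, together with the earlier note that the property count should only ever increase, could be checked with something like the following Python sketch. The added and removed field names are taken from the change list above; the `current` list is a hypothetical excerpt, not the real TEI value list, and both function names are invented for the example.

```python
# Field changes listed in UAX #38 for Unicode 13.0 (from the change list above).
ADDED_IN_13 = ["kIRG_SSource", "kIRG_UKSource", "kTGHZ2013", "kSpoofingVariant", "kUnihanCore2020"]
REMOVED_IN_13 = ["kRSJapanese", "kRSKanWa", "kRSKorean"]


def update_val_list(current, added):
    """Return the new list of permitted names: add new fields, never drop old ones."""
    # Fields removed upstream are deliberately not subtracted, so documents
    # that already use them stay valid.
    return sorted(set(current) | set(added))


def only_grows(before, after):
    """True if every name permitted before is still permitted after."""
    return set(before) <= set(after)


# Hypothetical excerpt of the existing list of unihanProp/@name values.
current = ["kRSJapanese", "kRSKanWa", "kRSKangXi", "kRSKorean", "kTotalStrokes"]
updated = update_val_list(current, ADDED_IN_13)
print(len(current), "->", len(updated))
```

A unit test along the lines of `assert only_grows(current, updated)` would then fail as soon as a release accidentally drops a name.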
att.gaijiProp currently contains an attribute version, which is given the description "specifies the version number of the Unicode Standard in which this property name is defined." However, the fact that <unicodeProp>, <unihanProp>, and <localProp> all share that attribute could make it counterintuitive. Unihan has its own versioning scheme, and any regional or specialized character set that can be cited with <localProp> can have its own versions, unrelated to Unicode. Could the definition of version be changed to specify the version of what each element denotes? Or has there already been a discussion on this matter (if so, a pointer to it would be appreciated)?
@747 This attribute has been included unchanged from the previous way of dealing with gaiji to introduce minimal changes. I take your point about it with respect to localProp, but the discussion of @version is in separate tickets, and only indirectly related to this issue.
This looks merged and completed; closing
The UAX #38 Unicode Han Database (Unihan) is an important resource for folks encoding ideographic (as the Unicode Standard uses the term) texts. It expresses, among other things, sorting rules, variant relations, and dictionary references, and, most importantly for the TEI in my view, it contains normalized references to different national standards.
Currently encoders can use
Of the ~90 properties only a handful apply to unicodeName proper; only in the rarest of cases do Unihan properties apply to texts not using CJK(V). Since the use cases and audiences are fairly distinct, and since the UC is treating them separately, I'd like to suggest that TEI follows the UC and adds a new uniHan element, to make these properties available in a structured fashion.
The alternative would be to allow @type on unicodeName and make it a closed list. But this seems a dangerous path, leading to "how about @subtype?". Similarly, adding all possible values as legal contents to unicodeName seems a bad idea, since they don't apply equally to all users of unicodeName, so they shouldn't be forced to dig through 200 values just to find the one they need. Crucially, this would weaken the close alignment between Unicode property names as the UC understands the term and as TEI would then implement it. Not all Unihan properties are part of the UCD RNC schema; see #1804.
The content model of the suggested element would closely mirror that of unicodeName, be part of the gaiji module, and appear in charProp. In essence, the following should become valid TEI.
Biggest challenge: the UC defines legal values using regex features (grapheme matching) that no XML regex engine I know of currently supports. About 20 would need to be rewritten to produce similar results.