tari closed this issue 1 year ago.
Interesting!
I was wondering, hasn't @jacobly0 also established some kind of similar mapping for his font? Trying some of your unicode characters seems to produce the expected glyph anyway :)
I did refer to Jacobly's TICELarge while developing this, or at least the code has a comment that refers to it as well as a few other sources (like the large font table from 83pa28d). They differ in some choices mostly because I was looking for the most semantically-appropriate character for less common symbols, which usually means choosing a codepoint in one of the mathematical symbol pages. Use of private-use characters also differs in that I chose to use U+F8300 (because the 83 in that number seems semantically appropriate).
I'd like to make the following proposal for incorporating all this data in a sensible way, based on earlier discussions on Discord.
- Rework the structure of the `<lang>` tags. We replace all `<name>` tags with either `<ascii>` or `<unicode>`, inserting Tari's names as derived by composing the symbols for each character. The tags should be ordered so that the first of either type is something of a "canonical" choice (though this could also be accomplished by an attribute on/in the tag). Additionally, the sequence of font bytes which give the name on-calc could be added.
```xml
<lang code="en">
  <ascii>>Dec</ascii>
  <unicode>►Dec</unicode>
  <chars>05446563</chars> <!-- exact format very negotiable -->
</lang>
```
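To make the intent concrete, here is a small sketch of how a consumer might read such an entry. The element names (`ascii`, `unicode`, `chars`) are the ones proposed above, not an existing schema, and the hex-pair format for `<chars>` is the one shown in the example:

```python
# Sketch: reading a proposed <lang> entry and decoding the <chars> hex
# string into on-calc font bytes. Element names follow the proposal above.
import xml.etree.ElementTree as ET

entry = ET.fromstring(
    '<lang code="en">'
    "<ascii>&gt;Dec</ascii>"
    "<unicode>\u25baDec</unicode>"
    "<chars>05446563</chars>"
    "</lang>"
)

ascii_name = entry.findtext("ascii")      # ">Dec"
unicode_name = entry.findtext("unicode")  # "►Dec"
# Each pair of hex digits is one font byte: 05 44 65 63.
font_bytes = bytes.fromhex(entry.findtext("chars"))

assert ascii_name == ">Dec"
assert font_bytes == b"\x05Dec"  # 0x44, 0x65, 0x63 are ASCII 'D', 'e', 'c'
```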
- Add `font.xml`, which contains just the font characters in a simplified format (i.e. no `<lang>` or `<since>` tags). This can be derived directly from Tari's file, though we should also add purely ASCII representations where possible. This file could also contain information such as on-calc display details, or a companion file for the small font (though I'm not sure how doable that is).
```xml
<byte value="$01">
  <ascii>n</ascii>
  <unicode>𝑛</unicode>
</byte>
```
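A font sheet in this shape is trivial to turn into a lookup table. In the sketch below, the `<font>` wrapper element is an assumption for illustration; the `$01` → 𝑛 mapping is the example above:

```python
# Sketch: turning the proposed font.xml into a byte -> Unicode lookup.
# The <font> wrapper is assumed for illustration.
import xml.etree.ElementTree as ET

FONT_XML = """
<font>
  <byte value="$01">
    <ascii>n</ascii>
    <unicode>\U0001d45b</unicode>
  </byte>
</font>
"""

font_map = {
    int(b.get("value").lstrip("$"), 16): b.findtext("unicode")  # "$01" -> 0x01
    for b in ET.fromstring(FONT_XML).iter("byte")
}

assert font_map[0x01] == "\N{MATHEMATICAL ITALIC SMALL N}"
```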
The changes to the token sheets would be beneficial if future applications need to be ASCII-only for whatever reason, while the new font sheet would be useful for direct font translation. Var names, for example, are (usually) given by the font rather than by tokens (and don't coincide in some important cases), so `tivars_lib_py` and others could leverage the sheet directly.
Converting the `<name>` tags is easy programmatically (just check the bytes), and Tari's file is a gimme to parse for the font XML, though the potential `<chars>` tag would take at least a bit of manual work (though equal work to deriving canonical Unicode names from the font).
> The tags should be ordered so that the first of either type is something of a "canonical" choice (though this could also be accomplished by an attribute on/in the tag).
I would prefer to use an attribute to mark the canonical version, because determining ordering post-hoc may be difficult for some consumers (imagine a library that deserializes elements into unordered sets).
> we should also add purely ASCII representations where possible
I don't think this is very useful because Unicode normalization (NFKD or NFKC) handles the obvious cases, and most of the other things are impossible to represent with pure ASCII. Normalization probably isn't a good idea either, as Unicode TR15 notes:
> some characters with compatibility decompositions are used in mathematical notation to represent a distinction of a semantic nature; replacing the use of distinct character codes by formatting in such contexts may cause problems
Providing a pure-ASCII version of each token seems like the correct option, simply because the semantics of a given character are often dependent on the token it's contained in and providing easy-to-type aliases is important to applications that want to tokenize source code.
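For what it's worth, the "obvious cases" behavior is easy to check: 𝑛 is U+1D45B MATHEMATICAL ITALIC SMALL N, which NFKC folds to plain `n`, while a symbol like ► has no compatibility decomposition and survives normalization unchanged:

```python
# NFKC folds compatibility characters like U+1D45B down to ASCII "n",
# but U+25BA (►) has no compatibility decomposition and is left alone.
import unicodedata

assert unicodedata.normalize("NFKC", "\U0001d45b") == "n"
assert unicodedata.normalize("NFKC", "\u25baDec") == "\u25baDec"
```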
> Additionally, the sequence of font bytes which give the name on-calc could be added. ... Add font.xml, which contains just the font characters in a simplified format
Only one of these should be used:
Providing all three would potentially allow the unicode version of a token to better capture its semantics even if the calculator character set does not, but also makes it difficult to keep them in sync if we wanted to modify the font mapping (every token using the changed character would also need to be updated).
Putting all these together, I suggest something like this:
(The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in RFC 2119.)
```xml
<token>
  <lang code="en">
    <canonical chars="05446563" />
    <alternate>>Dec</alternate>
  </lang>
</token>
```
Each language's entry for a token MUST contain a `canonical` element with a `chars` attribute specifying the calculator characters used to display it, and a canonical token MAY include text to specify a preferred Unicode representation distinct from the Unicode string formed by applying character set mapping to the value of the `chars` attribute (I expect such alternate preferred encodings to be rare).

Each token MAY have one or more `alternate`s, specifying accepted alternative Unicode representations in the element body. Let the `canonical` and `alternate` elements be defined as "textual representation"s of the corresponding token.
There SHOULD exist at least one textual representation for each token containing only printable ASCII characters (U+0020-U+007E). This ensures that most tokens can easily be typed by users without resorting to manual input of uncommon codepoints.
Any textual representation element MUST NOT have the same body text as any other textual representation with the same language code. This ensures that tokens are unambiguous for tokenization applications.
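The uniqueness requirement is what makes greedy tokenization well-defined. A minimal longest-match sketch follows; the name-to-byte table is invented for illustration (the `0x72` byte and single-letter tokens are not taken from the real sheets):

```python
# Greedy longest-match tokenizer over a name -> token-bytes table.
# The table is illustrative only, not real token-sheet data.
REPRESENTATIONS = {
    "\u25baDec": b"\x72",  # canonical (Unicode) form
    ">Dec": b"\x72",       # accepted alternate
    "D": b"\x44",
    "e": b"\x65",
    "c": b"\x63",
}

def tokenize(source: str) -> bytes:
    out = bytearray()
    i = 0
    max_len = max(map(len, REPRESENTATIONS))
    while i < len(source):
        # Try the longest representation first so ">Dec" wins over "D"+"e"+"c".
        for length in range(min(max_len, len(source) - i), 0, -1):
            candidate = source[i : i + length]
            if candidate in REPRESENTATIONS:
                out += REPRESENTATIONS[candidate]
                i += length
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return bytes(out)

assert tokenize(">Dec") == b"\x72"
assert tokenize("\u25baDec") == b"\x72"
assert tokenize("Dec") == b"\x44\x65\x63"
```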
> I would prefer to use an attribute to mark the canonical version, because determining ordering post-hoc may be difficult for some consumers (imagine a library that deserializes elements into unordered sets).
I didn't lead with the suggestion since a) the current sheets don't do this and b) I don't love the idea of us deciding what's canonical, though any (good) choice we make is unlikely to be controversial (particularly if the canonical choice is the Unicode approximation).
> I don't think this is very useful because Unicode normalization (NFKD or NFKC) handles the obvious cases
That's fair, I was mostly just wanting to make sure such an option does exist (but any library under the sun will be able to do this, so). Would we then simplify the font XML format even more?
```xml
<byte value="$01">𝑛</byte>
```
> Providing all three would potentially allow the unicode version of a token to better capture its semantics even if the calculator character set does not, but also makes it difficult to keep them in sync if we wanted to modify the font mapping (every token using the changed character would also need to be updated).
Something like a GitHub Action or other form of CI could accomplish this, as I don't like the idea of a user needing to include the font map if all they need are tokens. This could instead be provided on the user end, but the code should regardless be standard and accessible (not that it's difficult, but we might as well do the work ourselves).
> There SHOULD exist at least one textual representation for each token containing only printable ASCII characters (U+0020-U+007E). This ensures that most tokens can easily be typed by users without resorting to manual input of uncommon codepoints.
Should these be specially delineated? Some kind of tag or attribute would make the check easy for ASCII-only applications.
Also, does anyone with more XML knowhow have suggestions about the `chars` attribute? Just a string of hex digits in a string feels cryptic, so if there's some good way to do this I would want to do it.
> Would we then simplify the font XML format even more?
That seems reasonable to me.
> Something like a GitHub Action or other form of CI could accomplish this, as I don't like the idea of a user needing to include the font map if all they need are tokens.
Also reasonable; source data should avoid redundancy, but we can provide it in alternate forms that may be easier to consume.
> Should these be specially delineated? Some kind of tag or attribute would make the check easy for ASCII-only applications.
If a consumer needs to understand unicode to handle the input specification anyway, it's easy for them to check what codepoints are used and filter elements; tagging ASCII-only ones specially seems like another form of redundancy that we don't want. It would be reasonable to provide an ASCII-only version of the data similar to the fontmap-combined one, though.
> Also, does anyone with more XML knowhow have suggestions about the chars attribute? Just a string of hex digits in a string feels cryptic, so if there's some good way to do this I would want to do it.
Perhaps use a regular string and use numerical character references as needed? It might be less compact but seems more semantically correct:
```xml
<token>
  <lang code="en">
    <canonical chars="&#x05;Dec" />
    <alternate>>Dec</alternate>
  </lang>
</token>
```
and for regularity, the font map could do the same thing:

```xml
<byte value="&#x01;">𝑛</byte>
```
XML forbids null bytes even when encoded as a character reference, but null has the same semantics in the character set so that seems fine.
> Should these be specially delineated? Some kind of tag or attribute would make the check easy for ASCII-only applications.

> If a consumer needs to understand unicode to handle the input specification anyway, it's easy for them to check what codepoints are used and filter elements; tagging ASCII-only ones specially seems like another form of redundancy that we don't want. It would be reasonable to provide an ASCII-only version of the data similar to the fontmap-combined one, though.
I don't entirely agree on the point that we "don't want" the redundancy; if anything, we are in a prime position to offer convenient redundancy. Providing an entirely separate ASCII file is a solution I can get behind, but I also do not see the problem with being redundant given that providing the data in a convenient manner is exactly the point. To even further that end, I would like to eventually get CI setup to automatically generate other data formats from the XMLs, which would be plainly redundant but useful all the same.
> Rework the structure of the `<lang>` tags. We replace all `<name>` tags with either `<ascii>` or `<unicode>`,
It is also worth noting that the ASCII names are intended to be unique for the purposes of (de)tokenization in a text editor. Since we'd like to maintain the names used by SC and/or TokenIDE for reference, and we're already gonna make a `<canonical>` tag, it should have an ASCII counterpart (and be identified as such).
> Providing an entirely separate ASCII file is a solution I can get behind, but I also do not see the problem with being redundant given that providing the data in a convenient manner is exactly the point.
My concern about redundancy relates to difficulty of maintenance: if the same information is encoded two ways in our source data, it's more difficult to change it later. Beyond that I don't really care, which is why generating redundant data via some automatic transformation (CI-based releases, etc) seems like a fine approach.
> the ASCII names are intended to be unique for the purposes of (de)tokenization in a text editor. Since we'd like to maintain the names used by SC and/or TokenIDE for reference, and we're already gonna make a `<canonical>` tag, it should have an ASCII counterpart
Having an obvious ASCII-only transformation only seems like it matters for Unicode-incapable editors consuming detokenized source code; otherwise the limited character set of (near-)ASCII is primarily for the convenience of humans to type things, in which situation having a canonical representation is unimportant as long as any given string can unambiguously map to a sequence of tokens.
I think my point here is that detokenizers should always prefer the (Unicode) canonical detokenization of things, but we can attempt to ensure that Unicode-incapable applications are still able to interoperate with full-featured ones by asserting that every token have at least one pure-ASCII representation. Maintaining compatibility with existing applications only matters inasmuch as we maintain the strings accepted by those existing tools as supported variants.
I suppose you might be more concerned about detokenized source from a new application being tokenizable by an old one, in which case labelling a "legacy canonical" encoding or something might be useful -- but SourceCoder consumes TokenIDE-style XML files so it wouldn't be difficult to make both of those tools handle the new canonical encodings.
> I suppose you might be more concerned about detokenized source from a new application being tokenizable by an old one, in which case labelling a "legacy canonical" encoding or something might be useful -- but SourceCoder consumes TokenIDE-style XML files so it wouldn't be difficult to make both of those tools handle the new canonical encodings.
Sure, but I guess my point is that we don't "need" Unicode to be the standard for every use case. I always interpreted the approximations as being entirely for display purposes; ASCII token encodings, meanwhile, would remain the standard any time you need to type them in somewhere.
Having (de)tokenization run off Unicode by default makes things work much like TI-Connect CE, which has to have menus with copies of every symbol. SourceCoder and TokenIDE also have these menus, but they aren't strictly necessary once you know the encodings. Granted, you can also memorize the Unicode approximants, but ASCII is plainly easier to type.
The encodings used by SourceCoder and TokenIDE could nonetheless use some improvements. This has been discussed in very disparate places on Discord, and boils down to two main points: standardizing sigils for token categories (e.g. `$` for stats vars) and how to differentiate `RIGHT` (the token) from `RIGHT` (the five tokens) in a user-friendly way. Said discussions should probably be ported into a separate issue.
But on that note, I hesitate to refer to the ASCII encodings as "legacy", as they simply don't need replacing.
> Having (de)tokenization run off Unicode by default makes things work much like TI-Connect CE, which has to have menus with copies of every symbol.
No, because a user is free to use any version of a token to write it; I'm only proposing that detokenization prefer to use the Unicode representation, but there's nothing stopping a user from using ASCII versions of the same tokens when typing code.
If the issue is mostly around things being easy to type rather than strict limitation to ASCII, perhaps this is really a question of naming. How about an `accessible` attribute that can be applied to any child of a token (canonical or alternate), indicating "this token should be easy for humans to type," which provides a hint to applications that they can prefer that version of a token in contexts where they want to detokenize to easily-typed representations.
It also occurs to me that we probably want to make a source-code representation of a program fixed to a single language which implies reorganizing the XML somewhat:
```xml
<token>
  <canonical lang="en" chars="&#x05;Dec" />
  <alternate accessible="true">>Dec</alternate>
</token>
```
Moving the (required) language code onto the `canonical` element still allows the calculator representation for a given language to be known, but forces others to be language-agnostic (and tokenization would generally ignore non-English canonical representations). This is important because otherwise tokenizers would require a specified input language to select the correct language code to use. I don't believe anybody currently writes TI-BASIC source code in non-English languages, but I'm unaware of any other computer language that supports multilanguage syntax, so it seems best to exclude that as an option here as well.
> No, because a user is free to use any version of a token to write it; I'm only proposing that detokenization prefer to use the Unicode representation, but there's nothing stopping a user from using ASCII versions of the same tokens when typing code.
Detokenization into Unicode could look cryptic and misleading if you're attempting to learn the ropes by example. Making Unicode the canonical choice would make detokenizing from ASCII encodings a non-identity function. Given that I know of very few serious programming languages that use non-ASCII for syntax, this feels like just a bad move. If there weren't good reasons to play favorites on our end (mostly just having there be a standard), I would genuinely want neither to be the truly "canonical" encoding, and instead simply offer both and let the application decide. But, push coming to shove, ASCII should be preferred.
> If the issue is mostly around things being easy to type rather than strict limitation to ASCII, perhaps this is really a question of naming. How about an accessible attribute that can be applied to any child of a token (canonical or alternate) indicating "this token should be easy for humans to type," and provides a hint to applications that they can prefer that version of a token in contexts where they want to detokenize to easily-typed representations.
This feels extremely unnecessary and similarly cryptic. What, to us, is "accessible", if not ASCII or some other standard? It's just beating around the bush.
> Moving the (required) language code onto the canonical element still allows the calculator representation for a given language to be known, but forces others to be language-agnostic (and tokenization would generally ignore non-English canonical representations).
I can get behind moving at least one (of each) encoding into a language agnostic section, though this could also be resolved by simply assuming English to be the default.
This issue feels like it has fallen off the rails a bit; I'll attempt to refocus it by clarifying some things about the token sheets here.
My initial intention for the tokens database was that it shouldn't have any particularly strong opinion about a "canonical" detokenization; this is entirely up to the application using the sheets. It merely provides a list of reasonable options, ideally allowing you to request an option that suits your needs, i.e., a printable-ASCII name, or an as-close-as-you-can-get-with-unicode-to-what-you-see-on-calc name, or (as was suggested here, if I'm understanding things correctly) a sequence of TI-Font bytes. Of course, projects evolve past their creator's intentions, but I still believe this better reflects TI-Toolkit's broader goal of improving documentation and tooling than the alternative.
Obviously the current solution to this end is lazy and particularly uninspired, but I think this leans closer to what @kg583 was suggesting in https://github.com/TI-Toolkit/tokens/issues/12#issuecomment-1548930276.
EDIT: wait, this was a bad take on my part, but I'm falling asleep now; I'll fix it right when I wake up. EDIT 2: I do not remember my better take >.>
> I don't believe anybody currently writes TI-BASIC source code in non-English languages
French is definitely a popular TI-BASIC language since it's the default on all the French calcs (82A and 83PCE, for the recent models), which is why tivars_lib_cpp makes sure it can tok/detok in both English and French. It only cares about Unicode (with its own token file upstreamed here), but I'm following this issue as I may change this eventually...
Also, I'd really like us to end up agreeing on something here, even if it takes a bit more time to discuss all this, because it does seem like a big improvement if "all" recent/modern community tooling for all this ends up using a unique, centralized, and maintained tokens "database" covering all needs, rather than each using its own thing and users being confused as to why something works on one tool and not on another.
Personally, I believe Unicode should be used everywhere just because it makes reading so much easier. Being able to type things correctly then "just" (with big quotes) becomes an interface issue that needs to be solved with a great UX, otherwise the user will get frustrated. For the TI-Planet PB, I've been thinking of having a mix of what current editors do: both a catalog/categories pane where you can pick tokens to insert, and the ability to type tokens via Unicode directly (why not) or ASCII that gets automatically replaced by the correct Unicode match. This is still a thought-in-progress, though.
> French is definitely a popular ti-basic language used since it's the default on all the French calcs (82A and 83PCE, for the recent models)
Okay, so I guess we'd want to retain the existing lang->strings hierarchy in that case- doesn't seem like a big deal.
> having the ability to type tokens via Unicode directly (why not) or ascii that gets automatically replaced by the correct Unicode match
I had a thought like this too; a fancy autocomplete system that lets you type easily and converts to tokens on the fly seems ideal. The goal of the data should be to enable that sort of thing while still supporting less fancy tools and allowing them all to interoperate. This means providing a representation of each given token that closely matches what a calculator displays (the canonical one expressed in terms of the calculator charset) and zero or more aliases that may be easier to type.
For the benefit of applications that may want to prefer easily typed tokens, it seems reasonable to offer my proposed `accessible` version, but taking no particular position on what charset that actually is (which is why I'm against calling it `ascii`: this is about being accessible, not choosing a given character set). Although all accessible variants may end up being pure ASCII, that would require further discussion to arrive at.
Re: @kg583:
> This feels extremely unnecessary and similarly cryptic. What, to us, is "accessible", if not ASCII or some other standard? It's just beating around the bush.
TI-ASCII is a term that exists and is similar enough in name and form to standard ASCII to be easily conflated. Instead of "accessible", I suggest the admittedly somewhat clumsy "typeable" or "typable" (the latter of which autocorrect and the OED dislike but Wiktionary and MW list as an alternate spelling), which even more precisely captures what we want from these entries.
Re: @adriweb, on the topic of only having one Unicode translation: A further goal that should be considered is having mappings for all of the existing/old token sheets that were in use before this token sheet (obviously, within reason- though no totally unreasonable mappings exist to my knowledge). I think this would save lots of potential headaches when it comes to programs saved in old forum posts, etc. Multiple Unicode representations exist for every token; while we can (and in my opinion, should) select a favored one, this token sheet is for more than just editors.
We should decide on something actionable sooner rather than later so we can actually, y'know, use this thing.
I think each `lang` must have a required `canonical` name for the TI-Font representation, a required unique `typeable` or `accessible` (we need to decide on the name of the tag still) name for the common keyboard-input representation, a unique required `preferred` name for an as-close-as-possible Unicode interpretation, and any number of optional `variant` alternative Unicode names. Each token should have at least `en` and `fr` language support and optionally other languages too (following ISO 639-1 for the names).
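Concretely, a single token's entry under this scheme might look like the following sketch. The tag names are the ones proposed here; the `>Dec`/`►Dec` strings reuse the earlier example, and the `variant` body and French placeholder are purely illustrative:

```xml
<token>
  <lang code="en">
    <canonical>05446563</canonical>
    <accessible>>Dec</accessible>
    <preferred>►Dec</preferred>
    <variant>->Dec</variant>
  </lang>
  <lang code="fr">
    <!-- same four fields for the French names -->
  </lang>
</token>
```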
> I think each `lang` must have a required `canonical` name for the TI-Font representation, a required unique `typeable` or `accessible` (we need to decide on the name of the tag still) name for the common keyboard-input representation, a unique required `preferred` name for an as-close-as-possible Unicode interpretation, and any number of optional `variant` alternative Unicode names. Each token should have at least `en` and `fr` language support and optionally other languages too (following ISO 639-1 for the names).
I second this, though favor `accessible` over `typeable` for exactly the reasons you stated. I flip-flop about `canonical` as well, since it doesn't describe what it is very well (i.e. the list of bytes in the font map that the token uses). `preferred` is fine I suppose, with my reservations mostly the same as for `accessible` in that it's an "opinionated" adjective.
To try to sum up my position about the `accessible` tag at the moment:
My motivation for being so adamant about having ASCII-only names stems from typing TI-BASIC into token editors, something I've done for years. I would say that typeability is probably the biggest boon of ASCII representations, but this doesn't mean I support ASCII names solely for this reason, and the name of the tag should not reflect only that reason either.
ASCII is more typeable, more likely to be supported by a custom font, and more recognizable. I don't necessarily hold any of these reasons in higher regard than any other for the purposes of inclusion in this standard; I simply recognize that any of those are reasons, and thus we should be purely descriptive with the tag name.
`accessible` indicates that all we care about is some kind of accessibility; `typeable` is even more narrow. But the name `ascii` says what it is and nothing more. I don't care why you want ASCII. I don't care what you do with it. But I know some people would like to have it, so here it is. Calling it anything else masks the intention (or at least my own intention; there is obviously some divide on that).
I like "canonical", but I guess that implies there's only 1 canonical one (and everything else is alternatives). However if we're not going with a unique ascii or Unicode equivalent, then maybe we can just consider that the list is ordered, by preferred alternatives, descending (most preferred, implicitly "canonical", first)
I'm also fine with "ascii".
There is only one canonical TI-Font representation for any given token in any given translation?
Let me try to bring everything back together, because I think we are all converging on a reasonable solution.
There are four fields that should be provided per token, per language. These fields will remain unnamed in this proposition, as this is one of the main points of contention for the format. We seek to use these fields to define the standard representations of tokens for various applications.
There are plenty of other motivations for these fields, and I think everyone in the discussion so far can get behind some inclusion of all the data mentioned. In addition,
We still need to decide on the names of these fields. The current suggestions are

- `canonical`, `chars`, `value` (if used as a tag attribute)
- `preferred`, `unicode`
- `accessible`, `ascii`, `typable`, `typeable`
- `alternate(s)`, `variant(s)` (pluralized if the field is a list rather than spanning multiple tags)

My personal choices would be, reflecting my desire to be as purely descriptive as possible,

- `canonical`
- `unicode`
- `ascii`
Other miscellaneous points to be addressed:
This proposition is a hopeful summary of the discussion here and on Discord thus far that aims to reconcile the earlier disagreements and misunderstandings. If there is an important concern that still wasn't addressed here, please say as much. A functional format that is unanimously tolerable is the top priority.
I think I agree with that general structure. I think field B should be automatically generated from the font map and value of field A, simply because that's easy to do automatically and makes everything easier to maintain.
Just to bikeshed the namings:

- I don't like `value`, but `chars` is okay even if I prefer `canonical`.
- `unicode` seems like a bad name because it's too generic: every textual element is unicode so it seems wrong to pick out one field as a special kind of unicode. Alternate proposal: `display` to indicate it's how a calculator would display the token.
- As with `unicode`, `ascii` is a bad name because many tokens are ASCII even in their "unicode" (field B) form. I still think `accessible` is the best of these, but there might be an interesting question around policy for defining values of this field which could inform its name: do we assume a particular keyboard layout for a given language, such that the English tokens for example only contain characters that are present on an en-US keyboard, even though other languages would be able to use a different character (since a French keyboard for instance typically has an AltGr and dead keys that allow accented characters to be typed easily)?
- `variant`, but either seems okay.

> I think field B should be automatically generated from the font map and value of field A, simply because that's easy to do automatically and makes everything easier to maintain.
Yes, this is what I was implying with the second bullet point after the list of fields.
> I don't like `value`, but `chars` is okay even if I prefer `canonical`

`canonical` just feels so non-descriptive of what is actually in the field, and to that end I similarly prefer `chars` over `value`. Hell, the term `canonical` proved confusing just a few comments up in this thread.
> Alternate proposal: `display` to indicate it's how a calculator would display the token.

I'll second this (and @rpitasky third'd on Discord).
> As with `unicode`, `ascii` is a bad name because many tokens are ASCII even in their "unicode" (field B) form. I still think `accessible` is the best of these, but there might be an interesting question around policy for defining values of this field which could inform its name: do we assume a particular keyboard layout for a given language such that the English tokens for example only contain characters that are present on an en-US keyboard even though other languages would be able to use a different character (since a French keyboard for instance typically has an AltGr and dead keys that allow accented characters to be typed easily)?

I think I came off too strongly about the typeability point, even if it's maybe the best motivation for the field. It's really just a matter of how ASCII representations pervade the token editors, and they in turn inform how code gets shared over text, so we should absolutely keep them. I suppose you could call these "accessibility" reasons, but I again think that this misses the point. The most important fact about the field is that it is ASCII; any other name implies we could deviate if we wanted to. I furthermore have no idea why some `display` names being purely ASCII is a concern; there's no if-and-only-if here and I don't think anybody is expecting one. If fields B and C happen to agree for some tokens, so be it.
Yeah, I give `display` my metaphorical rubber stamp, as well as autogenerating things. How about `tifont` for the tifont bytes?
Regardless, the conversations we have had here and in Discord and in Matrix are barely documentation; whatever we choose must be documented clearly in at least the README (I recognize this probably goes without saying, but it makes debates over whether something captures all of the requisite meaning a little less important).
> How about `tifont` for the tifont bytes?

I dislike the use of the term "font" to refer to a series of encoded bytes, since a font defines the display of glyphs that are already encoded in a particular way, and this is completely unrelated to that. We should name the field after the encoding scheme that tifont and the encoded token names use. Of course, this is just kicking the can down the road a bit, as it's very unlikely there's an official name for it (the TI-83 SDK docs mention "extended ASCII," but it's in the glossary and it seems likely that this is referring to the part where they describe the input format for the assembler, rather than the actual character set used on the calculator) and we would need to come up with our own. "TI-ASCII" seems to be relatively common, but I dislike that term too, since it's not a superset of ASCII. We could probably just call it something boring like `ti-calc-encoding` or `ti83-encoding`.
If Commodore can have PETSCII, how about 8xSCII?
TISCII?
calc-encoding or calc-encoded?
ti-ascii; are we good with this here?
I'm fine with ti-ascii, though I feel now there should be a separation between encoding(s) and names in the format, so that we can better communicate the "data type" of the fields in-place and not worry about bad name clashes like ti-ascii vs. ascii, which refer to completely different types.
Thus cometh a proposition where I, kg583, endorse the use of accessible:
<lang code="en">
<encodings>
<ti-ascii>Field A</ti-ascii>
<!-- are there any other viable encodings?
cause if not this does seem maybe a tad silly -->
</encodings>
<names>
<display>Field B</display>
<accessible>Field C</accessible>
<variant>Field D</variant>
<!-- more variants as need be -->
</names>
</lang>
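For what it's worth, the nested form above is straightforward to consume with the Python standard library. A minimal sketch, assuming the tag names from the proposal (none of which are final):

```python
# Minimal sketch of parsing the proposed nested <lang> schema.
# Tag names follow the proposal above and are not final.
import xml.etree.ElementTree as ET

sample = """
<lang code="en">
  <encodings>
    <ti-ascii>Field A</ti-ascii>
  </encodings>
  <names>
    <display>Field B</display>
    <accessible>Field C</accessible>
    <variant>Field D</variant>
  </names>
</lang>
"""

lang = ET.fromstring(sample)

# One entry per encoding element
encodings = {el.tag: el.text for el in lang.find("encodings")}

# <variant> may repeat, so collect every name tag into a list
names = {}
for el in lang.find("names"):
    names.setdefault(el.tag, []).append(el.text)

print(encodings)  # {'ti-ascii': 'Field A'}
print(names)      # {'display': ['Field B'], 'accessible': ['Field C'], 'variant': ['Field D']}
```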
Also, it's worth noting that the "correct" way to specify the encoding, an XML string with explicit bytes where necessary, would actually motivate something like <ti-ascii value=""></ti-ascii>.
The above proposal (https://github.com/TI-Toolkit/tokens/issues/12#issuecomment-1586514583) seems reasonable in general, but there's a little bit of complexity that I think can be omitted.
I think if we're unable to identify any other relevant encodings, the value of Field A could just as easily be placed in an attribute of the lang element; additional encodings could always be added in a new schema version if a reason were found to include them.
<lang code="en" ti-ascii="Field A">
<display>Field B</display>
<accessible>Field C</accessible>
<variant>Field D</variant>
<!-- additional variants if applicable -->
</lang>
This doesn't change any of the names (though it does drop the word encoding) and reduces nesting, which makes the XML a little easier to parse.
I suppose we could take the attribute-ification even further and convert all of the non-repeatable elements into attributes of lang, which doesn't sacrifice any expressiveness and is even easier to parse.
Even if it's not entirely relevant, I wanted to mention that Field A has language-specific encodings(?), as some language apps change specific characters (tested on the CE). For example, in Swedish, the character at 090h, which is â in English, is changed to å.
Is that actually field A, or is it field B? I take your comment to mean that the character 90h is changed, which leaves field A unchanged but B becomes different.
Field B would be different than if it were generated purely from the English TI-ASCII encoding and Field A. I meant more that Field A would not necessarily be 1:1 with the other fields, which I wanted to mention in case it was assumed to be 1:1, given that the only documentation really exists in English.
Field A is the calculator characters, though, and I don't think the normal character set has an å at all; it seems like you're saying the Swedish localization changes the character set so that the glyph corresponding to character 0x90 is different but the character data is unchanged. That would mean Field A is unchanged in every case, but Field B differs between Swedish and other languages.
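To make the distinction concrete, here is a tiny illustration under assumed stand-in glyph tables (charset_en and charset_sv are toy dictionaries, not the real calculator character sets): the stored bytes (Field A) stay identical while the rendered name (Field B) depends on the active glyph table.

```python
# Illustration only: Field A (raw character data) is identical across
# localizations, while Field B (the rendered name) depends on the glyph
# table the active language app installs. These tables are tiny stand-ins.
charset_en = {0x90: "â"}
charset_sv = {0x90: "å"}  # Swedish app remaps the glyph at 0x90

field_a = bytes([0x90])   # stored bytes: the same everywhere

def render(data, charset):
    """Render calculator bytes through a glyph table (Field B)."""
    return "".join(charset[b] for b in data)

assert render(field_a, charset_en) == "â"
assert render(field_a, charset_sv) == "å"  # Field B differs; Field A didn't
```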
I've written a small tool to extract translations from the app files; I'll probably give it a home in this org somewhere once we decide where, but here's French, Spanish, Portuguese: https://gist.github.com/rpitasky/5daa6eb4090fb4b1c9360e0eb2404ce6
A "None" means the token inherits from the default (English) string.
I'd like to get moving on this; I'm willing to adopt #12 (comment).
Does anyone have the TI font encodings for the English tokens handy? Are we sure the translation table above (and particularly the TokenIDE sheets) are faithful?
It's pretty easy for me to generate the TI-ASCII encodings from the mapping I defined and an XML file, where my Tokens.xml is intended to provide this function but hasn't been checked for any kind of accuracy and may be missing some newer (CE) tokens.
The general consensus on discord was that a fresh extraction would be the best; I've gone ahead and done this (this comes from OS 5.3.0.0037):
Beware that they need to be reordered to be inserted into the tokens sheet, they are only somewhat in a reasonable order now.
In one of my projects (not currently published anywhere), I've wanted to define canonical token representations that use the full range of Unicode to display text as closely as possible to how it appears on a calculator while also preserving semantics.
I've been working with a tokens XML file which usefully specifies <alt> strings for tokens, so it's easy to treat the non-alternate string for each token as canonical, and I've developed this mapping from Unicode strings to the calculator character set based on the combination of token semantics and actual use of characters on monochrome calculators:

Although many of the Unicode versions are a single character, a few calculator characters have no direct equivalent in Unicode and so are represented with a sequence of characters (such as character 0x1d), and for a few symbols that are used mainly as graphical elements (not appearing in any tokens) I've opted to use a portion of a Unicode private use area at U+F8300..U+F83FF.
The corresponding token definition changes are fairly minor, mostly ensuring that the canonical version of each token uses the semantically-correct Unicode version, like replacing many lower-case w's with U+1D5D0. Here's the diff of my XML to adopt these mappings (though I'm not certain that alone is all the changes I made): tokens.patch.txt

In my application there's also validation that the canonical string for each token can be displayed entirely with the calculator character set, which is fairly simple validation but seems like the most important property of having a canonicalization.
I suggest that this token format should have a way to specify a string version of each token that matches a defined unicode-to-calculator correspondence, and propose the one I've developed as that correspondence.