Open StoneCypher opened 3 years ago
The conversion from codepoint to character string is done like this:
[code_point].pack('U*')
This will create the correct string representation for any Unicode codepoint. So as long as the entity consists of a single code point, this will work.
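For illustration, here is that conversion as a small runnable sketch. The helper name `code_point_to_string` is made up for this example and is not part of kramdown; only the `pack('U*')` call comes from the explanation above.

```ruby
# Hypothetical helper name; the pack call is the one quoted above.
def code_point_to_string(code_point)
  # 'U*' packs each integer as a UTF-8 encoded Unicode code point.
  [code_point].pack('U*')
end

puts code_point_to_string(214)  # Ö (U+00D6)
puts code_point_to_string(913)  # Α (U+0391, Greek capital alpha)
```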
Does that clear it up?
I apologize. That isn't what I meant.
ENTITY_TABLE = [
[913, 'Alpha'],
[914, 'Beta'],
[915, 'Gamma'],
...
[213, 'Otilde'],
[214, 'Ouml'],
[215, 'times'],
Please pretend for a moment that there was no dedicated capital-O umlaut character, Ö. There is, of course; it's U+00D6, represented here as decimal 214. But let's pretend there wasn't.
In Unicode, there is a dedicated combining diaeresis, and you can attach it to other characters to construct the character you need. As such, you could make the character with capital O (U+004F) followed by the combining diaeresis ◌̈ (U+0308). We prefer the precomposed Ö because fonts trying to typeset symbols above letters typically do a bad job, sorting is a nightmare, and so on, but you can actually have an umlaut over whatever, including the poop emoji, if you really want to.
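A quick way to see the difference between the precomposed and the combined form in Ruby is sketched below. The constant names are invented for this example; `String#unicode_normalize` is standard Ruby (2.2 and later).

```ruby
PRECOMPOSED = [0x00D6].pack('U*')          # Ö as one code point
COMBINED    = [0x004F, 0x0308].pack('U*')  # O followed by combining diaeresis

# The two strings render alike but are different code point sequences.
puts PRECOMPOSED == COMBINED                          # false
# NFC normalization composes the pair into the single code point.
puts PRECOMPOSED == COMBINED.unicode_normalize(:nfc)  # true
puts COMBINED.codepoints.length                       # 2
```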
So for a moment, please pretend that I want to rewrite your Ouml rule to emit two codepoints and construct the Ö instead of using the real one. In this case it's silly, but this is legitimately how quite a few entities (particularly in math) are written. For example, ⫅̸ (not subset-equal) is often written as U+2288, the dedicated math symbol, but really should be written as U+2AC5 (decimal 10949, subset-equal) followed by U+0338 (the negating slash, as in the logic symbol) instead.
And that's hard to think about, so we're lying, and talking about O umlaut.
If for some stupid reason I wanted to emit U+004F U+0308 for Ouml in this table, how would I do it?
I see. This is not possible with how the entities are implemented in kramdown, though it is easily doable by just doing [code_point1, code_point2].pack('U*').
As far as I can see, however, all the HTML5 entities are just single-codepoint entities? So this should not be a problem here.
Edit: Sorry, I just looked at the PR and not at the original issue - there you also listed entities with two codepoints. Supporting those entails revamping the entity implementation.
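As a rough sketch of what a multi-codepoint table could look like, assuming each entry stores an array of code points rather than a single integer: the table name, its layout, and the `entity_string` helper below are all hypothetical and do not reflect how kramdown actually stores or resolves entities.

```ruby
# Hypothetical table shape: each entry holds an array of code points.
# This is NOT kramdown's actual ENTITY_TABLE layout.
MULTI_ENTITY_TABLE = [
  [[0x00D6], 'Ouml'],
  [[0x2AC5, 0x0338], 'nsubE'],
  [[0x2A7D, 0x0338], 'nleqslant'],
].freeze

# Hypothetical lookup: entity name -> string, or nil if the name is unknown.
def entity_string(name)
  entry = MULTI_ENTITY_TABLE.find { |_code_points, entry_name| entry_name == name }
  # pack('U*') handles one or many code points the same way.
  entry && entry.first.pack('U*')
end

puts entity_string('nsubE').codepoints.length  # 2
```

Storing every entry as an array keeps single-codepoint and two-codepoint entities on the same code path, at the cost of touching every existing row.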
There are a few.
Name | Symbol | Codepoint |
---|---|---|
ncongdot | ⩭̸ | U+2A6D (10861), U+0338 (824) |
nleqslant, nles, NotLessSlantEqual | ⩽̸ | U+2A7D (10877), U+0338 (824) |
ngeqslant, nges, NotGreaterSlantEqual | ⩾̸ | U+2A7E (10878), U+0338 (824) |
There are 65 other than these three.
> Edit: Sorry, I just looked at the PR and not at the original issue - there you also listed entities with two codepoints. Supporting those entails revamping the entity implementation.
❤️ ❤️ ❤️
Thank you
Some HTML entities, such as nsubE, are represented as multiple Unicode characters (in this case U+2AC5 U+0338). This is particularly common in math symbols using the slash to strike through symbols. It is not immediately clear to me how to represent that in the kramdown entity list.
If you could tell me how to represent that one case please, I would happily extend it to the remainder.