gettalong / kramdown

kramdown is a fast, pure Ruby Markdown superset converter, using a strict syntax definition and supporting several common extensions.
http://kramdown.gettalong.org

It is not clear how to write a multipoint entity in your entity list #737

Open StoneCypher opened 3 years ago

StoneCypher commented 3 years ago

Some HTML entities, such as nsubE, are represented as multiple Unicode code points (in this case U+2AC5 U+0338). This is particularly common in math symbols that use a slash to strike through another symbol.

It is not immediately clear to me how to represent that in the kramdown entity list.

If you could tell me how to represent that one case please, I would happily extend it to the remainder.

gettalong commented 3 years ago

The conversion from codepoint to character string is done like this:

[code_point].pack('U*')

This will create the correct string representation for any Unicode codepoint. So as long as the entity consists of a single code point, this will work.
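As a minimal illustration of that call (the variable names here are only for the example and do not come from kramdown), `Array#pack` with the `'U*'` directive encodes every integer in the array as one UTF-8 character:

```ruby
# Array#pack('U*') turns each integer code point in the array into a
# UTF-8 encoded character and concatenates the results.
single = [214].pack('U*')              # 214 is the value stored for 'Ouml' (U+00D6)
pair   = [0x004F, 0x0308].pack('U*')   # O followed by a combining diaeresis

puts single.codepoints.inspect   # the one code point that went in
puts pair.codepoints.inspect     # both code points, in order
```

Because `'U*'` consumes the whole array, the same call handles one code point or several; the limitation is in how the table entries are shaped, not in `pack` itself.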

Does that clear it up?

StoneCypher commented 3 years ago

I apologize. That isn't what I meant.

      ENTITY_TABLE = [
        [913, 'Alpha'],
        [914, 'Beta'],
        [915, 'Gamma'],

...

        [213, 'Otilde'],
        [214, 'Ouml'],
        [215, 'times'],

Please pretend for a moment that there was no dedicated capital-O umlaut Ö character. There is, of course; it's U+00D6, represented here as decimal 214. But let's pretend there wasn't.

In Unicode, there is a dedicated combining diaeresis, and you can attach it to other characters to construct the character you need. As such, you could make the character with capital O (U+004F) followed by the combining diaeresis ◌̈ (U+0308). We prefer the precombined Ö because fonts trying to typeset symbols above letters typically do a bad job, sorting is a nightmare, and so on, but you can actually have an umlaut over whatever, including the poop emoji, if you really want to.
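To make the distinction concrete, here is a small Ruby sketch (variable names are illustrative only) comparing the precomposed character with the two-code-point sequence, using the standard `String#unicode_normalize`:

```ruby
precomposed = "\u00D6"     # Ö as a single code point
combined    = "O\u0308"    # O (U+004F) followed by combining diaeresis (U+0308)

# The two strings render alike but are different code point sequences.
puts precomposed.codepoints.length   # 1
puts combined.codepoints.length      # 2
puts precomposed == combined         # false

# NFC normalization composes the pair into the precomposed form.
puts combined.unicode_normalize(:nfc) == precomposed   # true
```

Not every combining sequence has a precomposed equivalent, which is why some entities can only be expressed as two code points.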

So for a moment, pretend please that I want to rewrite your Ouml rule to emit two codepoints, and construct the Ö instead of using the real one. In this case it's silly, but this is legitimately how quite a few entities (particularly in math) are written. For example, ⫅̸ (not subset-equal) is written as U+2288, the dedicated math symbol, but really should be written as U+2AC5 (decimal 10949, subset-equal) followed by U+0338, the negating slash (the logic symbol), instead.

And that's hard to think about, so we're lying, and talking about O umlaut.

If for some stupid reason I wanted to emit U+004F U+0308 for Ouml in this table, how would I do it?

gettalong commented 3 years ago

I see. This is not possible with how the entities are currently implemented in kramdown, although the string conversion itself is easily doable by just doing [code_point1, code_point2].pack('U*').

As far as I can see, however, all the HTML5 entities are just single-codepoint entities? So this should not be a problem here.

Edit: Sorry, I just looked at the PR and not at the original issue - there you also listed entities with two codepoints. Supporting those entails revamping the entity implementation.
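One way such a revamp might look, purely as a sketch: the table name, entry shape, and helper below are hypothetical and are not kramdown's actual implementation. The idea is to let an entry hold either one integer or an array of integers and to normalize both with `Array()` before packing:

```ruby
# HYPOTHETICAL table shape for illustration; not kramdown's real ENTITY_TABLE.
# An entry's first element is either one code point or an array of them.
SKETCH_ENTITY_TABLE = [
  [214, 'Ouml'],                 # single code point, as in the existing table
  [[0x2AC5, 0x0338], 'nsubE']    # two code points: subset-equal + combining slash
].freeze

# Hypothetical helper: look up an entity by name and build its string.
def sketch_entity_string(name)
  entry = SKETCH_ENTITY_TABLE.find { |_points, entity_name| entity_name == name }
  return nil unless entry

  # Array() wraps a bare integer and leaves an existing array untouched,
  # so both entry shapes go through the same pack call.
  Array(entry.first).pack('U*')
end

puts sketch_entity_string('Ouml').codepoints.inspect
puts sketch_entity_string('nsubE').codepoints.inspect
```

Existing single-code-point entries would not need to change under this shape; any code that assumes an entity maps to exactly one integer would still have to be reviewed.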

StoneCypher commented 3 years ago

There are a few.

| Name | Symbol | Codepoints |
| --- | --- | --- |
| ncongdot | ⩭̸ | U+2A6D (10861), U+0338 (824) |
| nleqslant, nles, NotLessSlantEqual | ⩽̸ | U+2A7D (10877), U+0338 (824) |
| ngeqslant, nges, NotGreaterSlantEqual | ⩾̸ | U+2A7E (10878), U+0338 (824) |

There are 65 other than these three.

StoneCypher commented 3 years ago

> Edit: Sorry, I just looked at the PR and not at the original issue - there you also listed entities with two codepoints. Supporting those entails revamping the entity implementation.

❤️ ❤️ ❤️

Thank you