XCF-Babble / babble

Can't even talk properly anymore.
GNU General Public License v3.0

More default bases in different languages #46

Open yvbbrjdr opened 5 years ago

yvbbrjdr commented 5 years ago

This thread is to discuss and propose new default bases.

yvbbrjdr commented 5 years ago

Japanese

Relative Ratio Character Frequency List

Hiragana and Katakana alone are not enough to fill a 256-character base, so we can include all 71 Hiragana, all 71 Katakana, and 114 Kanji (71 + 71 + 114 = 256).

Proposed Base: あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをんがぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲンガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ日一国会人年大十二本中長出三同時政事自行社見月分議後前民生連五発間対上部東者党地合市業内相方四定今回新場金員九入選立開手米力学問高代明実円関決子動京全目表戦経通外最言氏現理調体化田当八六約主題下首意法不来作性的要用制治度務強気小
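
A proposed base is only usable if it contains exactly 256 distinct symbols, so a quick check is worth having. The sketch below is illustrative only: `parseBase`, `encodeBytes`, and `decodeText` are hypothetical names rather than Babble's actual API, and it assumes that each byte maps to exactly one base symbol. The demo uses a stand-in base of 256 consecutive CJK code points instead of the proposed string.

```javascript
// Hypothetical helpers; none of these names come from the Babble codebase.

// Split a base string into code points and check it has 256 unique symbols.
function parseBase(baseString) {
  const symbols = Array.from(baseString); // iterates by code point, not UTF-16 unit
  if (symbols.length !== 256) {
    throw new Error(`expected 256 symbols, got ${symbols.length}`);
  }
  if (new Set(symbols).size !== 256) {
    throw new Error('base contains duplicate symbols');
  }
  return symbols;
}

// Assumption: one byte maps to one base symbol.
function encodeBytes(bytes, symbols) {
  return Array.from(bytes, (b) => symbols[b]).join('');
}

// Inverse mapping: look each symbol up and recover its byte value.
function decodeText(text, symbols) {
  const index = new Map(symbols.map((s, i) => [s, i]));
  return Uint8Array.from(Array.from(text), (ch) => {
    if (!index.has(ch)) throw new Error(`symbol not in base: ${ch}`);
    return index.get(ch);
  });
}

// Stand-in base: 256 consecutive code points starting at U+4E00.
const demoBase = parseBase(
  Array.from({ length: 256 }, (_, i) => String.fromCodePoint(0x4e00 + i)).join('')
);
const encoded = encodeBytes(Uint8Array.of(0, 1, 255), demoBase);
const decoded = decodeText(encoded, demoBase);
```

Passing the proposed Japanese string to `parseBase` would confirm that it has 256 symbols and no duplicates before it is adopted.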

kkuehlz commented 5 years ago

Here is a link to the frequency chart for Korean characters. The nice thing about Korean is that multiple jamo (each one Unicode code point) can combine into one syllable block (also one Unicode code point). If we are clever about how we lay out the base, we might be able to shorten the encoding length. Decomposing the composed syllable block will also be tricky...

Thanks @mdcha!

http://nlp.kookmin.ac.kr/data/syl-2.txt
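
The composition property mentioned above follows a fixed arithmetic rule in Unicode: a precomposed syllable is determined by the indices of its leading consonant (19 choices), vowel (21 choices), and optional trailing consonant (28 choices, where 0 means none). A minimal sketch of that rule follows; `composeSyllable` is a hypothetical helper name, and how a base would actually exploit it is left open.

```javascript
// Constants from the Unicode Hangul syllable composition rule.
const S_BASE = 0xac00; // first precomposed syllable, U+AC00
const L_COUNT = 19;    // leading consonants
const V_COUNT = 21;    // vowels
const T_COUNT = 28;    // trailing consonants, index 0 = no trailing consonant

// Hypothetical helper: combine jamo indices into one precomposed syllable.
function composeSyllable(l, v, t = 0) {
  if (l < 0 || l >= L_COUNT || v < 0 || v >= V_COUNT || t < 0 || t >= T_COUNT) {
    throw new RangeError('jamo index out of range');
  }
  return String.fromCodePoint(S_BASE + (l * V_COUNT + v) * T_COUNT + t);
}

// Leading index 18, vowel index 0, trailing index 4 give U+D55C.
const han = composeSyllable(18, 0, 4);
```

Since 19 × 21 × 28 = 11,172 syllable blocks exist, a single code point can in principle distinguish more than 2^13 values, which is one possible reading of how the layout could shorten the encoding.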

yvbbrjdr commented 5 years ago

Appreciate it @mdcha!

Proposed Korean Base: 이다의는에을하한고가로기지사서은도를대정리자수시으있어구인나제국과그해전부것일적아연라성들상원여보장화주소동공조스경계용위우게학만개면되관문유선중산치신회발비분생내방무와세니물등할실통었미모러업교체진재안야명민간며단당요년거마금된오본했법합식없각였결영행때데력반설터려속운양현차종말형음술석바입역임않작히및건질표외강두까백권트르직불호심따처타태출파천남람던점감저난후포또특최크달예같능변북드프래책김노함박배추환열평증매울품약집군향근알초온급목더료른론확준토록활련격월광판키청습험번절류규루복량많피새레응받령란날편

kkuehlz commented 5 years ago

We will also need to use this library to decompose Hangul characters into jamo before decrypting. It is pretty lightweight.
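
For reference, the decomposition step itself is plain arithmetic on the code point. The sketch below does not use the library mentioned above; it applies the standard Unicode decomposition rule directly, and `decomposeSyllable` and `decomposeText` are hypothetical names.

```javascript
// Constants from the Unicode Hangul syllable decomposition rule.
const S_BASE = 0xac00;  // first precomposed syllable
const L_BASE = 0x1100;  // first conjoining leading consonant
const V_BASE = 0x1161;  // first conjoining vowel
const T_BASE = 0x11a7;  // one before the first conjoining trailing consonant
const V_COUNT = 21;
const T_COUNT = 28;
const S_COUNT = 19 * V_COUNT * T_COUNT; // 11172 precomposed syllables

// Hypothetical helper: split one precomposed syllable into conjoining jamo.
// Characters outside the syllable block are returned unchanged.
function decomposeSyllable(ch) {
  const sIndex = ch.codePointAt(0) - S_BASE;
  if (sIndex < 0 || sIndex >= S_COUNT) return [ch];
  const l = Math.floor(sIndex / (V_COUNT * T_COUNT));
  const v = Math.floor((sIndex % (V_COUNT * T_COUNT)) / T_COUNT);
  const t = sIndex % T_COUNT;
  const jamo = [String.fromCodePoint(L_BASE + l), String.fromCodePoint(V_BASE + v)];
  if (t !== 0) jamo.push(String.fromCodePoint(T_BASE + t)); // index 0 means no trailing consonant
  return jamo;
}

// Decompose every character of a string, code point by code point.
function decomposeText(text) {
  return Array.from(text).flatMap(decomposeSyllable);
}
```

The built-in `String.prototype.normalize('NFD')` performs the same canonical decomposition for Hangul syllables, so it may be an alternative to an extra dependency if the conjoining jamo form is acceptable.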

kkuehlz commented 5 years ago

After some thinking, I do believe that emoji support is possible, although it will require a great deal of engineering effort. We would need to redesign the website API to support special decoding, and each site would need its own emoji translation table. There are well over 256 emoji. To make this as site-agnostic as possible, we would want to choose 256 standard emoji for our base so that a valid translation exists on every site.
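
One way to picture the per-site translation table is a shared list of standard emoji, where the list position is the byte value, plus a map per site from that site's identifier to the shared key. Everything in the sketch below is hypothetical: the names `standardEmoji`, `siteTables`, and `emojiToByte`, the choice of code point strings as keys, and the three sample entries, which use Slack-style names purely for illustration.

```javascript
// Hypothetical shared base: array position is the byte value.
// Only three entries are shown; a real base would list 256.
const standardEmoji = ['1f600', '1f642', '1f44d'];

// Hypothetical per-site tables: site-specific identifier -> shared key.
const siteTables = {
  slack: new Map([
    [':grinning:', '1f600'],
    [':slightly_smiling_face:', '1f642'],
    [':+1:', '1f44d'],
  ]),
};

// Translate a site-specific emoji identifier into its byte value.
function emojiToByte(site, siteName) {
  const table = siteTables[site];
  if (!table) throw new Error(`no translation table for site: ${site}`);
  const key = table.get(siteName);
  const byte = key === undefined ? -1 : standardEmoji.indexOf(key);
  if (byte === -1) throw new Error(`no translation for ${siteName} on ${site}`);
  return byte;
}
```

Keeping the shared list separate from the site maps means that supporting a new site only requires a new map, not a new base.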

Let's first take a look at Slack. For a simple emoji, such as :smile:, the rendered markup is the following.

<span class="c-emoji c-emoji__medium c-emoji--inline" data-qa="emoji" delay="300" aria-describedby="slack-kit-tooltip">
  <img src="https://a.slack-edge.com/production-standard-emoji-assets/10.2/google-medium/1f642.png" aria-label="slightly smiling face emoji" alt=":slightly_smiling_face:" data-stringify-type="emoji" data-stringify-emoji=":slightly_smiling_face:">
</span>

Luckily, the alt accessibility text makes our lives a lot easier and serves as the key to the translation table. It can be fetched programmatically by running the following:

document.querySelectorAll('[class=c-message__body]')[20].querySelectorAll('[data-stringify-type=emoji]')[0].alt

The downside here is that we would need to introduce an HTML parser into the decoding pipeline, since Slack renders every emoji as an img wrapped in a span, and that markup will need to be stripped.
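
If decoding runs where the message is already a DOM subtree, the stripping step could be a small tree walk rather than a full parser. The sketch below assumes exactly the markup shown above (an img carrying `data-stringify-type="emoji"` and an `alt` attribute, wrapped in a span); `flattenMessage` is a hypothetical name, and it relies only on the standard `nodeType`, `nodeValue`, `tagName`, `childNodes`, and `getAttribute` members.

```javascript
// Standard DOM node type values.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: flatten a message subtree into an ordered list of
// text runs and emoji keys, discarding the wrapper markup around each emoji.
function flattenMessage(node, out = []) {
  for (const child of node.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      out.push({ type: 'text', value: child.nodeValue });
    } else if (child.nodeType === ELEMENT_NODE) {
      const isEmoji =
        child.tagName === 'IMG' &&
        child.getAttribute('data-stringify-type') === 'emoji';
      if (isEmoji) {
        // The alt text is the key into the translation table.
        out.push({ type: 'emoji', value: child.getAttribute('alt') });
      } else {
        // Wrapper such as the emoji span: descend and keep only its contents.
        flattenMessage(child, out);
      }
    }
  }
  return out;
}
```

In a page context, the root passed in would be a message body element such as one of those selected by the `c-message__body` query above; other sites would need their own emoji test in place of the `data-stringify-type` check.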