KamWithK / exSTATic

Zero effort language learning reading tracker with graphs and stats
GNU General Public License v3.0

Some Japanese typographic symbols are counted while others are not #27

Open pinntokuru opened 5 months ago

pinntokuru commented 5 months ago

In calculations.js the ignore variable contains a list of typographic symbols to ignore for character counting purposes. I've found two relatively common characters that are not in this list: the fullwidth full stop ． and the katakana middle dot ・. The middle dot is also on the Wikipedia page that the code refers to (https://en.wikipedia.org/wiki/List_of_Japanese_typographic_symbols).

I'm sure there are many other characters that can show up that this list does not cover, and adding them all one by one is not very feasible. In that case I think going the other way and having a regex allow list is a better idea.

I went through the Unicode blocks for Japanese and fullwidth Roman characters and came up with a set of ranges. The blocks contain special marks as well as characters that should be counted, so instead of using each entire block I took only the parts that should count as characters.

Hiragana: U+3041 to U+3096

Katakana: U+30A1 to U+30FA

Fullwidth numbers: U+FF10 to U+FF19

Fullwidth Roman uppercase letters: U+FF21 to U+FF3A

Fullwidth Roman lowercase letters: U+FF41 to U+FF5A

Half-width katakana (not sure if these should be included): U+FF66 to U+FF9D

CJK unified ideographs (common and uncommon kanji): U+4E00 to U+9FAF

CJK unified ideographs Extension A (rare kanji): U+3400 to U+4DBF
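
To make the allow list idea concrete, here is a rough sketch of how the ranges above could be turned into a single regex. The names COUNTED_CHARS and countCharacters are made up for illustration and are not taken from calculations.js, so treat this as a starting point rather than a drop-in patch.

```javascript
// Hypothetical allow list built from the ranges listed in this issue.
// Neither COUNTED_CHARS nor countCharacters exists in exSTATic.
const COUNTED_CHARS = new RegExp(
  "[" +
    "\\u3041-\\u3096" + // Hiragana
    "\\u30A1-\\u30FA" + // Katakana
    "\\uFF10-\\uFF19" + // Fullwidth numbers
    "\\uFF21-\\uFF3A" + // Fullwidth Roman uppercase letters
    "\\uFF41-\\uFF5A" + // Fullwidth Roman lowercase letters
    "\\uFF66-\\uFF9D" + // Half-width katakana (optional, see above)
    "\\u4E00-\\u9FAF" + // CJK unified ideographs
    "\\u3400-\\u4DBF" + // CJK unified ideographs Extension A
    "]",
  "g"
);

// Count only the characters that match the allow list.
function countCharacters(text) {
  const matches = text.match(COUNTED_CHARS);
  return matches ? matches.length : 0;
}

console.log(countCharacters("こんにちは、世界。")); // 7: 、 and 。 are skipped
console.log(countCharacters("ア・イ")); // 2: the middle dot is skipped
```

With these ranges the fullwidth full stop and the middle dot drop out without having to be listed anywhere. Plain ASCII letters and digits are not counted either, since they fall outside every range.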

Or, another idea might be to make it so the user can provide their own characters or regex to match in the settings page?

KamWithK commented 3 months ago

Hmmm you do raise an interesting point here

I never thought about this

Sometimes there are little bits of English too though, which I would argue should be counted as well

What if someone wanted to use it for Korean or Chinese or some other language though? The solution should be flexible enough that it doesn't completely block them from doing so
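
One possible shape for that flexibility, purely as a sketch: per-language presets written with Unicode script property escapes, an optional user-supplied pattern that overrides the preset, and a switch for counting Latin letters and digits. PRESETS, makeCounter and the option names are all hypothetical, nothing here exists in exSTATic today. Note that the script properties are broader than the hand-picked ranges earlier in this issue, so the counts will not match that allow list exactly.

```javascript
// Hypothetical presets; the keys and patterns are assumptions for illustration.
const PRESETS = {
  japanese: "[\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Han}]",
  chinese: "\\p{Script=Han}",
  korean: "[\\p{Script=Hangul}\\p{Script=Han}]",
};

// Build a counting function from a preset or a user-supplied pattern.
// customPattern (e.g. from a settings page) takes priority over the preset.
function makeCounter({ language, customPattern, countLatin = true } = {}) {
  const base = customPattern || PRESETS[language];
  if (!base) {
    throw new Error("No pattern available for language: " + language);
  }
  // Optionally also count Latin letters and decimal digits,
  // so short bits of English inside the text are included.
  const source = countLatin
    ? `(?:${base}|[\\p{Script=Latin}\\p{Nd}])`
    : base;
  // The "u" flag is required for \p{...} property escapes.
  const regex = new RegExp(source, "gu");
  return (text) => (text.match(regex) || []).length;
}

const countJapanese = makeCounter({ language: "japanese" });
console.log(countJapanese("こんにちは、世界。Hello!")); // 12

const countKorean = makeCounter({ language: "korean" });
console.log(countKorean("안녕하세요, 세계!")); // 7
```

An invalid user pattern makes the RegExp constructor throw a SyntaxError, so a settings page would probably want to catch that and show a message instead of saving the pattern.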