Open pinntokuru opened 5 months ago
Hmmm you do raise an interesting point here
I never thought about this
Sometimes there are little bits of English too, which I would argue should be counted as well
What if someone wanted to use it for Korean or Chinese or some other language, though? The solution should be flexible enough that it doesn't completely block them from doing so
In calculations.js the ignore variable contains a list of typographic symbols to ignore for character counting purposes. I've found two relatively common characters, the fullwidth full stop ． and the katakana middle dot ・, that are not in this list. The middle dot is also listed on the Wikipedia page that the code refers to (https://en.wikipedia.org/wiki/List_of_Japanese_typographic_symbols).
I'm sure there are many other characters that can show up that this list does not cover, and adding them all one by one is not very feasible. In that case I think going the other way and having a regex allow list is a better idea.
I went through the Unicode blocks for Japanese and Roman characters and came up with a set of ranges. The blocks contain special marks as well as characters that should be counted, so instead of using the entire block I took only the parts that should count as characters.
Hiragana U+3041 to U+3096
Katakana U+30A1 to U+30FA
Fullwidth Numbers U+FF10 to U+FF19
Fullwidth Roman Uppercase Letters U+FF21 to U+FF3A
Fullwidth Roman Lowercase Letters U+FF41 to U+FF5A
Half-width Katakana (not sure if should be included) U+FF66 to U+FF9D
CJK unified ideographs - Common and uncommon kanji: U+4E00 to U+9FAF
CJK unified ideographs Extension A - Rare kanji: U+3400 to U+4DBF
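To make the allow list idea concrete, here is a rough sketch of how the ranges above could be turned into a single regex. The names COUNTED_CHARS and countAllowedChars are made up for illustration and are not taken from calculations.js; I haven't checked how the existing counting code is structured, so treat this as a starting point only. Note that plain ASCII letters and digits are not in the ranges above, so they would not be counted unless another range is added.

```javascript
// Illustrative sketch only: an allow list built from the ranges proposed above.
// COUNTED_CHARS and countAllowedChars are hypothetical names, not project code.
const COUNTED_CHARS = new RegExp(
  "[" +
    "\\u{3041}-\\u{3096}" + // Hiragana
    "\\u{30A1}-\\u{30FA}" + // Katakana
    "\\u{FF10}-\\u{FF19}" + // Fullwidth digits
    "\\u{FF21}-\\u{FF3A}" + // Fullwidth Roman uppercase
    "\\u{FF41}-\\u{FF5A}" + // Fullwidth Roman lowercase
    "\\u{FF66}-\\u{FF9D}" + // Half-width katakana (optional, see list above)
    "\\u{4E00}-\\u{9FAF}" + // CJK unified ideographs
    "\\u{3400}-\\u{4DBF}" + // CJK unified ideographs Extension A
  "]",
  "gu" // g: find every match, u: enable \u{...} code point escapes
);

// Count only the characters that fall inside the allow list.
function countAllowedChars(text) {
  const matches = text.match(COUNTED_CHARS);
  return matches === null ? 0 : matches.length;
}

console.log(countAllowedChars("カタカナ・テスト")); // the middle dot is skipped
```

With this approach symbols like ． and ・ are skipped automatically because they are simply not in any of the ranges, so the ignore list no longer has to be kept up to date.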
Or, another idea might be to make it so the user can provide their own characters or regex to match in the settings page?
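If the settings route is preferred, something along these lines might work. This is only a sketch: makeCharCounter, DEFAULT_PATTERN and the idea of the setting being stored as a plain string are all assumptions on my part, not existing project code. An empty or invalid pattern falls back to a Japanese default, and a different pattern (Hangul syllables, U+AC00 to U+D7A3, in the example) lets other languages work.

```javascript
// Illustrative sketch only: DEFAULT_PATTERN and makeCharCounter are
// hypothetical names; the real settings storage may look quite different.
const DEFAULT_PATTERN = "[\\u3041-\\u3096\\u30A1-\\u30FA\\u4E00-\\u9FAF]";

// Build a counting function from a user-supplied pattern string.
// Falls back to the default when the pattern is empty or fails to compile.
function makeCharCounter(userPattern) {
  let regex;
  try {
    regex = new RegExp(userPattern || DEFAULT_PATTERN, "gu");
  } catch (err) {
    // Invalid regex syntax in the setting: use the default instead.
    regex = new RegExp(DEFAULT_PATTERN, "gu");
  }
  return (text) => {
    const matches = text.match(regex);
    if (matches === null) {
      return 0;
    }
    // Drop empty matches so a pattern like "a*" cannot inflate the count.
    return matches.filter((m) => m.length > 0).length;
  };
}

// Example: a user who reads Korean supplies a Hangul syllable range.
const countHangul = makeCharCounter("[\\uAC00-\\uD7A3]");
console.log(countHangul("한국어 abc"));
```

The downside is that users would need to know a bit of regex, so a sensible default plus an optional override is probably the friendliest combination.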