Character frequency groups

donkirkby / pinyincushion

Chinese learning tool for editing simple texts with pronunciation guide

https://donkirkby.github.io/pinyincushion/

Character frequency groups #2

Closed donkirkby closed 8 years ago

donkirkby commented 8 years ago

Use the character frequency data to decide whether to highlight each character that the user types. Let the user choose the frequency rank at which highlighting starts: 100, 500, 2000, or 5000, maybe?

- [x] color character background by frequency
- [x] draw a legend
- [x] ~~try colouring the editor instead of the display~~ split out to issue #10.

donkirkby commented 8 years ago

To determine whether to highlight a character, we need to know if it's Chinese, English, punctuation, or something else. In Java, I used Unicode blocks. There might be something similar in JavaScript.
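
A minimal sketch of that classification in JavaScript, assuming an engine that supports Unicode property escapes in regular expressions (ES2018 or later). The function name `classifyChar` and the category labels are hypothetical, and this is not necessarily how the project does it. On older engines, a code point range check such as U+4E00 to U+9FFF could stand in for the Han test.

```javascript
// Hypothetical helper: classify one character for highlighting purposes.
// Assumes ES2018 Unicode property escapes (\p{...} with the "u" flag).
function classifyChar(ch) {
  if (/\p{Script=Han}/u.test(ch)) {
    return 'chinese';      // Han ideographs
  }
  if (/[A-Za-z]/.test(ch)) {
    return 'english';      // basic Latin letters only
  }
  if (/\p{P}/u.test(ch)) {
    return 'punctuation';  // any Unicode punctuation, including full-width marks
  }
  return 'other';          // digits, whitespace, symbols, other scripts
}
```

Only characters classified as `'chinese'` would then be looked up in the frequency data; everything else can be left unhighlighted.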

donkirkby commented 8 years ago

Sorry, that's not quite what I was thinking of. You're displaying the character frequency within the text. What I think would be useful for teachers would be to highlight characters that are outside students' vocabulary. If a teacher estimates that the average student knows roughly 500 characters, they will select "top 500" as the allowable characters. Any character that is less frequent than the top 500 gets highlighted, so the teacher can think about a different way to say it with more common characters.

It's kind of like the highlighting on easypronunciation.com, and it would be a Chinese version of the highlighting in XKCD's Simple Writer.

zyxue commented 8 years ago

Sorry, I misunderstood. Where can I find the database for determining which character lies in which frequency band then? Maybe this information can be incorporated into the pronunciation database, too?

donkirkby commented 8 years ago

You can download the frequency data from chtsai.org. I thought about putting it in the same structure as the pronunciation, but I think it might be more efficient to just store a string for each frequency band. If you want to put it in the same file with the pronunciation data, that's fine with me. Just change the name to chardata.js or something. If we want to get fancy, we could sort each frequency band and use binary search, but let's leave that for later. It might not even be faster than a single call to indexOf().
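
For reference, the "fancy" option could look roughly like the sketch below. The helper names `sortBand` and `containsSorted` are hypothetical, and the sketch assumes every character fits in a single UTF-16 code unit, which holds for common Chinese characters but not for rare ones outside the Basic Multilingual Plane. Whether it beats `indexOf()` would need measuring.

```javascript
// Hypothetical helper: sort a band's characters by UTF-16 code unit,
// which is the order that both the default sort() and the < operator use.
function sortBand(band) {
  return band.split('').sort().join('');
}

// Hypothetical helper: binary search for ch in a string prepared by sortBand().
function containsSorted(sortedBand, ch) {
  var lo = 0;
  var hi = sortedBand.length - 1;
  while (lo <= hi) {
    var mid = (lo + hi) >> 1;
    var current = sortedBand.charAt(mid);
    if (current === ch) {
      return true;
    }
    if (current < ch) {
      lo = mid + 1;   // target sorts after the midpoint
    } else {
      hi = mid - 1;   // target sorts before the midpoint
    }
  }
  return false;
}
```

The bands would be sorted once when the data file is built, so only `containsSorted()` runs per typed character.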

donkirkby commented 8 years ago

As an example, here's an equivalent structure for the frequency bands of English letters:

var frequencies = {5: 'etaoi', 10: 'nshrd'};

To find the frequency band of a letter, look to see if it's in each band using indexOf(). If it's not in any of the top bands, it must be in the lowest band (the one we don't store). Then you highlight all characters that are in lower frequency bands than the user selected.
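
The lookup described above can be sketched as follows. The function names `findBand` and `shouldHighlight` are hypothetical, the English letters are only placeholder data, and the sketch assumes each key is the upper rank of its band.

```javascript
// Illustrative data only: English letters standing in for Chinese characters.
// Assumption: each key is the upper rank of a band, each value lists its characters.
var frequencies = {5: 'etaoi', 10: 'nshrd'};

// Return the limit of the band that contains ch.
// Characters in no stored band belong to the lowest band, reported as Infinity.
function findBand(ch, bands) {
  var limits = Object.keys(bands).map(Number).sort(function (a, b) {
    return a - b;
  });
  for (var i = 0; i < limits.length; i++) {
    if (bands[limits[i]].indexOf(ch) !== -1) {
      return limits[i];
    }
  }
  return Infinity;
}

// Highlight when the character's band is less frequent than the one the user allows.
function shouldHighlight(ch, bands, allowedLimit) {
  return findBand(ch, bands) > allowedLimit;
}
```

With the user's choice set to 5, `shouldHighlight('e', frequencies, 5)` is false and `shouldHighlight('n', frequencies, 5)` is true, because `n` sits in the 10 band.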

donkirkby commented 8 years ago

I'd like to try something a bit different. XKCD's Simple Writer highlights the low-frequency words in the editor, and I think that's much easier. When I tried to simplify a text in our editor, I had to look in the display to find low-frequency words, then find where to change those words in the editor. It looks like XKCD uses the Ace editor, so maybe we can try that.

donkirkby commented 8 years ago

The remaining tasks are in issue #10, so I'm closing this one.