Closed (dper closed this 10 years ago)
This is done.
Right now there are no safeguards against bad input ... the program will of course error out ... but it would be cool to detect non-kanji characters and either ignore them or display a warning. But that would be a separate issue in any case.
I just did what you mentioned above (filtering out non-kanji characters) using regular expressions. There is still an issue with multi-kanji compound words, though, since each kanji is matched to a dictionary entry separately. What about putting each entry to be looked up on a separate line, or using commas to separate entries?
I can submit a pull request later if you are interested in adding this to the script.
@jfsantos I also added some filters for non-kanji a while back.
def remove_unwanted_characters(characters)
  characters = characters.gsub(/[[:ascii:]]/, '')
  characters = characters.gsub(/[[:blank:]]/, '')
  characters = characters.gsub(/[[:cntrl:]]/, '')
  characters = characters.gsub(/[[:punct:]]/, '')
  return characters
end
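For illustration, here is roughly how that filter behaves on a mixed string. The sample input is made up, and the function is repeated only so the snippet runs on its own. Note that kana pass through untouched, since they are neither ASCII nor punctuation.

```ruby
# The filter from the comment above, reproduced so this example is self-contained.
def remove_unwanted_characters(characters)
  characters = characters.gsub(/[[:ascii:]]/, '')
  characters = characters.gsub(/[[:blank:]]/, '')
  characters = characters.gsub(/[[:cntrl:]]/, '')
  characters = characters.gsub(/[[:punct:]]/, '')
  characters
end

# Hypothetical sample: ASCII letters, a full-width space (U+3000),
# Japanese punctuation, kana, and kanji.
sample = "abc 漢字、\u3000かな。\n日本"
filtered = remove_unwanted_characters(sample)

# ASCII, blanks, and punctuation are gone; kana and kanji remain.
puts filtered
```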
As for what you mean about multi-kanji compounds, I'm confused. The script makes flash cards where the front of the card is supposed to be just one single kanji, not a jukugo.
If your goal is flash cards for jukugo, you can do that, but you'd want to change other aspects of the cards, too. Stroke order and grade level would make no sense, and you'd have a harder time making a list of examples. The script itself looks up kanji in kanjidic, which only indexes single kanji, so this kind of thing would require major revision.
You are absolutely right about jukugo, the card structure would have to be changed.
But regarding the filters, I added the following lines to drop Hiragana and Katakana and remove duplicates (after all your filters and before the return statement):
characters = characters.gsub(/([\p{Hiragana}\p{Katakana}]+)/, '')
characters = characters.chars.uniq.join()
That way, the use case where you want to generate a deck for all the kanji in a given text file is supported (except for the jukugo). There is a regular expression called \p{Han} too, but I believe it would match all CJK characters and not only kanji.
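One way to check the \p{Han} question is to try it on a mixed string. In this sketch, the helper name `extract_kanji` and the sample text are made up for illustration. \p{Han} matches the Unicode Han script, so it skips kana and punctuation, but it would also accept Han characters that are only used in Chinese, so it is a whitelist of "Han characters" rather than strictly "Japanese kanji".

```ruby
# Hypothetical helper: keep only Han-script characters and drop duplicates,
# preserving the order in which each character first appears.
def extract_kanji(text)
  text.scan(/\p{Han}/).uniq.join
end

# Made-up sample mixing kanji, hiragana, katakana, and punctuation.
sample = "日本語のテキスト、日本の漢字。"
kanji = extract_kanji(sample)

puts kanji
```

Compared with stripping unwanted classes one by one, a whitelist like this needs only one pattern, at the cost of letting through any Han character that kanjidic may not know about.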
That's marvelous! I had no idea those regular expressions existed. Will add them here, too.
If you're looking at learning words instead of standalone characters, I've done some work on that (and I use the script on a regular basis): https://github.com/dper/kanjiscripts. I like flash cards with sentences. Memorizing lists of words on their own is OK, but the real goal is to learn to read, so I prefer words in short sentences.
The kanji we want to make Anki entries for should be specified in a text file in the current directory. The script should read from that file, which should be a file consisting of only kanji with nothing else (except maybe white space or line breaks that we can remove easily enough).
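A minimal sketch of that input step might look like the following. The helper name `read_kanji_list` is an assumption for illustration and is not taken from the script itself; a temporary file stands in for the real list.

```ruby
require 'tempfile'

# Hypothetical helper: read a kanji list file and return its characters
# with all whitespace (spaces, full-width spaces, line breaks) removed.
def read_kanji_list(path)
  File.read(path, encoding: 'UTF-8').gsub(/[[:space:]]/, '').chars
end

# Demonstration using a temporary file in place of the real kanji list.
Tempfile.create('kanji_list') do |file|
  file.write("日本\n語 漢\u3000字\n")
  file.flush
  p read_kanji_list(file.path)
end
```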