Read kanji list from text file

dper / kanjiforanki

Takes a list of kanji and generates Anki flash cards for each of them.

MIT License

Read kanji list from text file #1

Closed (dper closed this issue 10 years ago)

dper commented 10 years ago

The kanji we want to make Anki entries for should be specified in a text file in the current directory. The script should read from that file, which should be a file consisting of only kanji with nothing else (except maybe white space or line breaks that we can remove easily enough).
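A minimal sketch of that reading step, not the script's actual code. The helper name `read_kanji_list` is hypothetical, and it assumes the file is UTF-8 encoded. In Ruby, `[[:space:]]` also covers the full-width space (U+3000), which `\s` does not.

```ruby
# Hypothetical helper: reads a file that should contain only kanji and
# returns the characters as an array, with all white space removed.
# Assumes the file is UTF-8 encoded.
def read_kanji_list(path)
  text = File.read(path, encoding: 'UTF-8')
  # [[:space:]] matches spaces, tabs, line breaks, and the full-width space.
  text.gsub(/[[:space:]]/, '').chars
end
```

It could be called as `read_kanji_list('kanji.txt')`, where the file name is only an example.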

dper commented 10 years ago

This is done.

Right now there are no safeguards against bad input ... the program will of course error out ... but it would be cool to detect non-kanji characters and either ignore them or display a warning. But that would be a separate issue in any case.
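One way such a safeguard might look, as a sketch only. The helper name `split_kanji` is hypothetical, and `\p{Han}` is used as an approximation of "kanji": it matches Han-script characters in general, including ones not used in Japanese.

```ruby
# Hypothetical helper: keeps Han-script characters and warns about the rest.
# \p{Han} is an approximation of "kanji", not an exact test.
def split_kanji(text)
  characters = text.gsub(/[[:space:]]/, '').chars.uniq
  kanji, other = characters.partition { |c| c.match?(/\p{Han}/) }
  # Report the rejected characters on standard error instead of failing.
  warn "Ignoring non-kanji characters: #{other.join}" unless other.empty?
  kanji
end
```

Returning the kanji and only warning about the rest keeps the run going on slightly dirty input. Raising an error instead would be the stricter choice.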

jfsantos commented 10 years ago

I just did what you mentioned above (filtering out non-kanji characters) using regular expressions. There is still an issue with multi-kanji compound words, though, since each kanji is matched to a dictionary entry separately. What about putting each entry to be looked up on a separate line, or using commas to separate entries?

I can submit a pull request later if you are interested in adding this to the script.

dper commented 10 years ago

@jfsantos I also added some filters for non-kanji a while back.

    # Removes ASCII, blank, control, and punctuation characters from the input.
    def remove_unwanted_characters(characters)
        characters = characters.gsub(/[[:ascii:]]/, '')
        characters = characters.gsub(/[[:blank:]]/, '')
        characters = characters.gsub(/[[:cntrl:]]/, '')
        characters = characters.gsub(/[[:punct:]]/, '')
        return characters
    end

As for what you mean about multi-kanji compounds, I'm confused. The script makes flash cards where the front of the card is supposed to be just one single kanji, not a jukugo.

If your goal is flash cards for jukugo, you can do that, but you'd want to change other aspects of the cards, too. Stroke order and grade level would make no sense, and you'd have a harder time making a list of examples. The script itself looks up kanji in kanjidic, which only indexes single kanji, so this kind of thing would require major revision.

jfsantos commented 10 years ago

You are absolutely right about jukugo; the card structure would have to be changed.

But regarding the filters, I added the following lines to drop Hiragana and Katakana and remove duplicates (after all your filters and before the return statement):

    characters = characters.gsub(/([\p{Hiragana}\p{Katakana}]+)/, '')
    characters = characters.chars.uniq.join()

That way, the use case where you want to generate a deck for all the kanji in a given text file is supported (except for the jukugo). There is also a \p{Han} character property, but I believe it would match all CJK characters and not only kanji.
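To compare the two approaches, here is a sketch that puts the removal filters from this thread next to a `\p{Han}` allow-list. Both function names are hypothetical, and neither is the script's actual code. The sample string is made up for illustration.

```ruby
# Deny-list approach: mirrors the filters quoted in this thread.
def filter_by_removal(text)
  text = text.gsub(/[[:ascii:]]/, '')
  text = text.gsub(/[[:blank:]]/, '')
  text = text.gsub(/[[:cntrl:]]/, '')
  text = text.gsub(/[[:punct:]]/, '')
  text = text.gsub(/[\p{Hiragana}\p{Katakana}]+/, '')
  text.chars.uniq.join
end

# Allow-list approach: keeps only Han-script characters. This also accepts
# Han characters that are not used in Japanese.
def filter_by_han(text)
  text.scan(/\p{Han}/).uniq.join
end

# Made-up sample containing full-width Latin letters, which are neither
# ASCII nor kana, so the deny-list does not remove them.
sample = "漢字ＡＢＣとカナ漢字"
puts filter_by_removal(sample)
puts filter_by_han(sample)
```

The deny-list has to name every unwanted class, so anything it does not name, such as full-width Latin letters, passes through. The allow-list is shorter, at the cost of accepting any Han character.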

dper commented 10 years ago

That's marvelous! I had no idea those regular expressions existed. Will add them here, too.

dper commented 10 years ago

If you're looking at learning words instead of standalone characters, I've done some work on that, and I use those scripts on a regular basis: https://github.com/dper/kanjiscripts. I like flash cards with sentences. Memorizing lists of words on their own is OK, but the real goal is to learn to read, so I prefer words in short sentences.