Open RichardLitt opened 9 years ago
Here’s a brain dump of projects that might be good to grab data from or that can serve as inspiration. Some of these are in use in Jisho and others I’m looking at including.
My main knowledge is around Japanese so I know very few Chinese resources and don’t have any for Korean or Vietnamese unfortunately.
If I think of anything else I’ll add it to this thread. Also feel free to ask anything about Jisho.
http://www.edrdg.org/kanjidic/kanjd2index.html
Jim Breen’s Japanese kanji database file. Accessible through Jisho.org or Jim's WWWJDIC. Contains a lot of data.
http://www.csse.monash.edu.au/~jwb/kradinf.html
Also from Jim Breen. Data file that breaks down kanji into radicals/components.
http://shop.oreilly.com/product/9780596514471.do
Not a site, but this is THE book on how to work with CJKV characters.
Database file with stroke and shape information for kanji. This powers the stroke order diagrams on Jisho.org.
http://en.wiktionary.org/wiki/家
Wiktionary contains a lot of kanji/hanzi/hanja data and etymology.
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=91D1&useutf8=true
Unicode’s database of all han characters.
http://kanji-database.sourceforge.net/index.html
A multitude of database files with data for kanji/hanzi/hanja. I just discovered this recently and haven’t had time to explore it much, but it looks nice. Data is on GitHub: https://github.com/cjkvi
http://en.glyphwiki.org/wiki/GlyphWiki:MainPage
Wiki with glyph information for a multitude of kanji/hanzi/hanja. Contains glyphs that are not in Unicode.
https://github.com/mifunetoshiro/kanjium
A fork and expansion of Kanjidic2/Radkfile.
http://www.chise.org/index.html.en
This is supposed to be a repository of character information, but I don’t know much about it.
http://zhongwen.com/d/174/x97.htm
Chinese character relationships.
http://www.chineseetymology.org/CharacterEtymology.aspx?characterInput=家
Pictures of old forms of hanzi.
Handwriting recognition engine.
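As a rough illustration of how a radical/component breakdown file like the one listed above could be used, here is a minimal Python sketch. It assumes a plain-text layout of `kanji : component component ...` with `#` comment lines. The sample rows, the function names, and the in-memory string are all made up for illustration; check the real file's format and encoding before relying on any of this.

```python
from collections import defaultdict

# Hypothetical sample rows in an assumed "kanji : part part ..." layout.
# These are illustrative only and are not copied from the real data file.
SAMPLE = """\
# comment lines start with '#'
亜 : ｜ 一 口
唖 : ｜ 一 口
家 : 宀 豕
"""


def parse_components(text):
    """Map each kanji to its list of components."""
    kanji_to_parts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        kanji, sep, parts = line.partition(" : ")
        if not sep:
            continue  # skip lines that do not match the assumed layout
        kanji_to_parts[kanji] = parts.split()
    return kanji_to_parts


def invert(kanji_to_parts):
    """Build a component -> set of kanji index."""
    index = defaultdict(set)
    for kanji, parts in kanji_to_parts.items():
        for part in parts:
            index[part].add(kanji)
    return index


def find_kanji(index, parts):
    """Return the kanji that contain every one of the given components."""
    sets = [index.get(part, set()) for part in parts]
    return set.intersection(*sets) if sets else set()


data = parse_components(SAMPLE)
index = invert(data)
print(find_kanji(index, ["一", "口"]))
```

The inverted index is what makes a "pick some components, get matching kanji" lookup cheap: each query is just a set intersection.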
So, this is awesome, and will take me a lot of time to digest. Like, a lot. I really appreciate it and will write back when I've had the time to go through it and/or plan out how this is going to work.
Thanks!!!
Glad I can help!
I forgot one important project: Source Han Sans, the pan-CJK font that Adobe and Google collaborated on: https://github.com/adobe-fonts/source-han-sans
It's managed by Ken Lunde, who wrote the CJKV Information Processing book, and it's an absolutely fantastic font with glyphs for Chinese (Simplified and Traditional), Japanese, and Korean.
Here's an introductory blog post: http://blog.typekit.com/2014/07/15/introducing-source-han-sans/, and the project's Readme: https://github.com/adobe-fonts/source-han-sans/raw/release/SourceHanSansReadMe.pdf
Also consider:
CJK Decomposition Data: 75,000 Han characters broken down, similar to CJKVI ("Kanji Database Project" in @Kimtaro's post above) and Chise. https://cjkdecomp.codeplex.com/
Pomax's Indigo kanji decomposition: 4000+ kanji decompositions. http://pomax.nihongoresources.com/index.php?entry=1225052300 (search page for 'online for downloading')
IDSgrep: A powerful grep-like query system for Han character decomposition databases. It ships with the ability to search KanjiVG, Chise, and CJKVI, and should be compatible with the above two sources. (I wrote a tutorial and also ported it to JavaScript in case you want to play with it in the browser.) http://tsukurimashou.sourceforge.jp/idsgrep.php.en
Sljfaq kanji handwriting recognizer: Has a 'strict' mode where the number of stroke orders has to be correct and with no look-ahead prediction, which is very useful for practicing correct stroke orders. The author has added an iframe-based solution for embedding in other websites. It is based on JavaDict, and is very different from and much simpler than the SVM-based machine learning approach of Zinnia. I would love to see this re-implemented with a permissive license. http://kanji.sljfaq.org/
Kanjium: A very wide-ranging collection of kanji-related data with a permissive license. Contains corrections to its various sources, including KanjiVG (the author usually pushes the corrections upstream). https://github.com/mifunetoshiro/kanjium
Kakijun: Kanji stroke order database that serves, for me (and the Kanjium author), as a gold standard for stroke order correctness. http://kakijun.jp/
Awesome! Thanks @fasiha and thanks again @Kimtaro. Will look into these.
From @rtxanson.