RichardLitt / kanjin

Idea repo for a site for making a registry of all kanji with OCR capabilities
Look into jisho.org #1

Open RichardLitt opened 9 years ago

RichardLitt commented 9 years ago

From @rtxanson.

Kimtaro commented 9 years ago

Here’s a brain dump of projects that might be good to grab data from or that can serve as inspiration. Some of these are in use in Jisho and others I’m looking at including.

My main knowledge is around Japanese so I know very few Chinese resources and don’t have any for Korean or Vietnamese unfortunately.

If I think of anything else I’ll add it to this thread. Also feel free to ask anything about Jisho.

Kanjidic2

http://www.edrdg.org/kanjidic/kanjd2index.html

Jim Breen’s Japanese kanji database file. Accessible through Jisho.org or Jim's WWWJDIC. Contains a lot of data.

Radkfile

http://www.csse.monash.edu.au/~jwb/kradinf.html

Also from Jim Breen. Data file that breaks down kanji into radicals/components.
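To make the idea of a kanji-to-components file concrete, here is a minimal Python sketch. It assumes a plain-text layout of one kanji per line, followed by ` : ` and a space-separated list of components, with `#` marking comment lines. The sample lines are hand-written for illustration and are not copied from the real file, and the function names `parse_components` and `invert` are hypothetical. Check the actual file's documentation (including its text encoding) before relying on any of this.

```python
from collections import defaultdict

# Hand-written sample in the ASSUMED layout; real entries may differ.
SAMPLE = """\
# kanji : components (illustrative only)
明 : 日 月
時 : 日 寸 土
林 : 木
"""


def parse_components(text: str) -> dict[str, list[str]]:
    """Map each kanji to its list of components."""
    components: dict[str, list[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        kanji, sep, parts = line.partition(" : ")
        if not sep:
            continue  # line does not match the assumed layout
        components[kanji] = parts.split()
    return components


def invert(components: dict[str, list[str]]) -> dict[str, set[str]]:
    """Build the reverse index: component -> kanji that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for kanji, parts in components.items():
        for part in parts:
            index[part].add(kanji)
    return dict(index)


by_kanji = parse_components(SAMPLE)
by_component = invert(by_kanji)
print(by_kanji["明"])
print(sorted(by_component["日"]))
```

The reverse index is the part that would matter for a lookup-by-component search: picking a component narrows the candidate kanji to the set stored under it.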

CJKV Information Processing

http://shop.oreilly.com/product/9780596514471.do

Not a site, but this is THE book on how to work with CJKV characters.

KanjiVG

http://kanjivg.tagaini.net

Database file with stroke and shape information for kanji. This powers the stroke order diagrams on Jisho.org.

Wiktionary

http://en.wiktionary.org/wiki/家

Wiktionary contains a lot of kanji/hanzi/hanja data and etymology.

Unihan

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=91D1&useutf8=true

Unicode’s database of all han characters.
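The `codepoint=91D1` in the URL above is the hexadecimal Unicode code point of the character being looked up (U+91D1 is 金). As a small sketch, this is how such a lookup URL could be built from a character in Python. The helper name `unihan_url` is hypothetical, and the query parameters are simply the ones visible in the URL above; nothing here is an official API.

```python
from urllib.parse import urlencode

# Base address taken from the link above.
UNIHAN_LOOKUP = "http://www.unicode.org/cgi-bin/GetUnihanData.pl"


def unihan_url(char: str) -> str:
    """Return the Unihan lookup URL for a single character."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    # The lookup takes the code point as uppercase hex, without the "U+" prefix.
    codepoint = format(ord(char), "X")
    query = urlencode({"codepoint": codepoint, "useutf8": "true"})
    return f"{UNIHAN_LOOKUP}?{query}"


print(unihan_url("金"))
# → http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=91D1&useutf8=true
```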

Kanji Database Project

http://kanji-database.sourceforge.net/index.html

A multitude of database files with data for kanji/hanzi/hanja. I just discovered this recently and haven’t had time to explore it much, but it looks nice. Data is on GitHub: https://github.com/cjkvi

GlyphWiki

http://en.glyphwiki.org/wiki/GlyphWiki:MainPage

Wiki with glyph information for a multitude of kanji/hanzi/hanja. Contains glyphs that are not in Unicode.

Kanjium

https://github.com/mifunetoshiro/kanjium

A fork and expansion of Kanjidic2/Radkfile.

CHISE project

http://www.chise.org/index.html.en

This is supposed to be a repository of character information, but I don’t know much about it.

zhongwen.com

http://zhongwen.com/d/174/x97.htm

Chinese character relationships.

Chinese Etymology

http://www.chineseetymology.org/CharacterEtymology.aspx?characterInput=家

Pictures of old forms of hanzi.

Zinnia

http://zinnia.sourceforge.net

Handwriting recognition engine.

RichardLitt commented 9 years ago

So, this is awesome, and will take me a lot of time to digest. Like, a lot. I really appreciate it and will write back when I've had the time to go through it and/or plan out how this is going to work.

Thanks!!!

Kimtaro commented 9 years ago

Glad I can help!

Kimtaro commented 9 years ago

I forgot one important project: Source Han Sans, the pan-CJK font that Adobe and Google collaborated on: https://github.com/adobe-fonts/source-han-sans

It's managed by Ken Lunde, who wrote the CJKV Information Processing book, and it's an absolutely fantastic font with glyphs for Chinese (Simplified and Traditional), Japanese, and Korean.

Here's an introductory blog post: http://blog.typekit.com/2014/07/15/introducing-source-han-sans/, and the project's Readme: https://github.com/adobe-fonts/source-han-sans/raw/release/SourceHanSansReadMe.pdf

fasiha commented 9 years ago

Also consider:

CJK Decomposition Data: 75,000 Han characters broken down similarly to CJKVI ("Kanji Database Project" in @Kimtaro's post above) and CHISE. https://cjkdecomp.codeplex.com/

Pomax's Indigo kanji decomposition: 4000+ kanji decompositions. http://pomax.nihongoresources.com/index.php?entry=1225052300 (search the page for 'online for downloading')

IDSgrep: A powerful grep-like query system for Han character decomposition databases. It ships with the ability to search KanjiVG, CHISE, and CJKVI, and should be compatible with the above two sources. (I wrote a tutorial and also ported it to JavaScript in case you want to play with it in the browser.) http://tsukurimashou.sourceforge.jp/idsgrep.php.en

Sljfaq kanji handwriting recognizer: Has a 'strict' mode where the number of stroke orders has to be correct and with no look-ahead prediction, which is very useful for practicing correct stroke orders. The author has added an iframe-based solution for embedding in other websites. It is based on JavaDict and takes a very different and much simpler approach than the SVM-based machine learning of Zinnia. I would love to see this re-implemented with a permissive license. http://kanji.sljfaq.org/

Kanjium: A very wide-ranging collection of kanji-related data with a permissive license. Contains corrections to its various sources, including KanjiVG (the author usually pushes the corrections upstream). https://github.com/mifunetoshiro/kanjium

Kakijun: Kanji stroke order database that serves, for me (and the Kanjium author), as a gold standard for stroke order correctness. http://kakijun.jp/

RichardLitt commented 9 years ago

Awesome! Thanks @fasiha and thanks again @Kimtaro. Will look into these.