cburgmer / cjklib

Han character library for CJKV languages

State of the cjklib / understanding our datasets #3

Closed tony closed 12 years ago

tony commented 12 years ago

I think it'd be good to take stock of where we stand on cjklib's current codebase. Do we want to use it? As it stands, I'm not sure whether I'm failing to grasp the complexities of commingling our data, or whether there are architectural mistakes that would be best fixed by a rewrite.

If that is the case, I wonder if you could take some time to document what is what from a data perspective. Here are a few questions that'd be helpful to have answers to:

More specifically, what is the following:

and

What are the above? Why are some included while others are downloaded remotely? Can we package any or all of the remote data in cjklib? Is it a matter of licensing, or of ensuring that fresh data is downloaded?

Which data in the above datasets intersect, and where?

Where the data intersects, I'm assuming we massage it in some way so we can match it for lookups? Maybe it'd help to have a spreadsheet / table on this?

I think that if we mapped the data we have to a spreadsheet, it'd offer us all a better view of the picture, imo. Then we can step back from legacy assumptions and be in a better position to make pull requests for larger architecture changes.
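To make the idea concrete, here is a minimal Python sketch of such an overlap table. Everything in it is an assumption for illustration: the dataset names, the toy character sets, and the helper names `overlap_table` and `format_table` are hypothetical and do not reflect cjklib's actual files or API. It only shows the shape of the table I have in mind: for every pair of datasets, which characters appear in both.

```python
from itertools import combinations

# Hypothetical toy data: each dataset name maps to the set of characters
# it has entries for. The real cjklib data files are NOT reproduced here.
datasets = {
    "readings": {"好", "中", "文", "字"},
    "decomposition": {"好", "字", "林"},
    "strokes": {"中", "字", "林", "木"},
}


def overlap_table(datasets):
    """Return {(name_a, name_b): shared_characters} for every dataset pair."""
    return {
        (a, b): datasets[a] & datasets[b]
        for a, b in combinations(sorted(datasets), 2)
    }


def format_table(table):
    """Render the overlap table as tab-separated text, one pair per row."""
    lines = ["dataset A\tdataset B\tshared\tcharacters"]
    for (a, b), shared in sorted(table.items()):
        lines.append(f"{a}\t{b}\t{len(shared)}\t{''.join(sorted(shared))}")
    return "\n".join(lines)


table = overlap_table(datasets)
print(format_table(table))
```

The tab-separated output can be pasted straight into a spreadsheet. With the real data, the sets would be built from whatever key each file uses for lookups, which is exactly the part I'd like to see documented.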

I realize the above is a pretty time-consuming thing, think you could take a bite at it though?

cburgmer commented 12 years ago

Tony, sorry for making you wait for so long.

While I feel that your questions are valid, a bug tracker might be the wrong place for discussing those. If we continue discussing, could you please take it to the mailing list? It might even make me respond quicker: https://groups.google.com/forum/?fromgroups#!forum/cjklib-devel

All the data files that live in this project are handcrafted for use with cjklib. You can use the Python API to access all the data.

So to answer some of your questions:

edict, cedict, cedictgr, handedict, cfdict are all dictionaries. They are downloaded on the fly (so they are up-to-date) and can be queried via cjklib's dictionary API.

The files you mention cover different things: for example, readings of Chinese languages (Mandarin, Cantonese, Shanghainese) in some of their respective romanisation schemes. Some files describe Chinese characters, their composition out of smaller elements, and their strokes.

I did make sure to document what those lists were, and where the data comes from.

kanjidic2 and Unihan are used to derive information that none of our own sources cover, or don't cover to that extent. However, for Unihan I can say that it doesn't provide the quality needed for the use case I developed cjklib for, so in general having more "own" data would be good.

So, the data is captured inside cjklib, and is not very visible to people from other language backgrounds, or to non-programmers. Ideally the data would go onto some sort of web page, independent of cjklib.

tony commented 12 years ago

@cburgmer: 哪里哪里 (not at all)! Thank you for the response. I'll note that the Google Groups discussion list is preferred.

I am not strong enough in Python to write something the Pythonic way myself, but having a high-level overview of cjklib's Python code would be nice. Have you ever seen http://www.aosabook.org/en/index.html? If a sage were to write up an overview of cjklib in that style, it'd be cool.

In the meantime, if I delve into this subject or other things further, I will bring it to the list.