batterseapower / pinyin-toolkit

A plugin for the Anki Spaced Repetition System (http://ichi2.net/anki/)
http://batterseapower.github.com/pinyin-toolkit/

Add Support to get 'Part of Speech' From Dictionary #23

Open batterseapower opened 15 years ago

batterseapower commented 15 years ago

Twin check: 1) use HanDeDict to look for type fields (details to follow) 2) if CEDICT contains "to " then assume verb, if contains "a " then assume noun

Nick3C commented 15 years ago

I had a true brainwave. I noticed that many words in the dictionary are extremely technical. I cross-checked some with CC-CEDICT and found that they have no entry. Thus the brainwave:

The short answer is: yes, much better... the long answer will take a while...

Consider: we can do this (assuming auto-fill of category and type tags too)

CC-CEDICT: [no entry], so we query Google Translate:
Chinese: 早期基督教
CN-EN gTrans: Early Christian

Now, I happen to know that 早期 means "early time period", 基督 is a short form for Christian and 教 means teaching. So the translation is probably closer to "early Christianity". Not bad, Google: helpful, but not quite there.

So what happens if, instead of passing the entry to Google, we had passed it to HanDeDict first? We would have got (with auto-tag expansion):
German: n Alte Kirche [rel]
gTrans German-EN: n Old Church [religion]

Okay, now this is actually correct (because we are talking about the church as an institution rather than as a physical thing). It isn't perfect (still ambiguous) but it is an improvement.

How about, instead of doing this, we return both:
Query: 早期基督教
CC-CEDICT: [nothing]
Result in field: (1) Old Church religion (2) Early Christian [HanDeDict, gTrans]

From this entry we can actually get much closer to the meaning I just gave. I know it's still not perfect, but what I am suggesting essentially exploits several facts:
1) HanDeDict is a specialist dictionary with many good quality German-Chinese translations
2) Google Translate (while good) is a generalist service
3) It is more likely that HanDeDict will be correct than Google [from Chinese into whatever language]
4) The loss of meaning between English and German is likely to be small compared to the loss of meaning between English and Chinese
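A rough sketch of the "return both" idea, in Python. Everything named here is hypothetical: `cedict` is assumed to be a plain dict of English meanings, and `handedict_lookup` and `translate` are callables the caller supplies (no real Google Translate or HanDeDict API is assumed). It leaves out the category tags and only shows the fallback order and the numbered field.

```python
def fallback_meanings(word, cedict, handedict_lookup, translate):
    """Return (meanings, sources) for a word, preferring CC-CEDICT.

    cedict           -- dict mapping a Chinese word to a list of English meanings
    handedict_lookup -- callable: word -> list of German meanings (may be empty)
    translate        -- callable: (text, source_lang, target_lang) -> str
    All three are stand-ins supplied by the caller.
    """
    if word in cedict:
        return list(cedict[word]), ["CC-CEDICT"]

    meanings, sources = [], []

    # HanDeDict first: German meaning, machine-translated into English.
    german_entries = handedict_lookup(word)
    if german_entries:
        sources.append("HanDeDict")
    for german in german_entries:
        english = translate(german, "de", "en")
        if english not in meanings:
            meanings.append(english)

    # Then the direct Chinese-to-English machine translation.
    direct = translate(word, "zh", "en")
    if direct not in meanings:
        meanings.append(direct)
    sources.append("gTrans")

    return meanings, sources


def format_field(meanings, sources):
    """Render '(1) first (2) second [source, source]'."""
    numbered = " ".join("(%d) %s" % (i, m) for i, m in enumerate(meanings, 1))
    return "%s [%s]" % (numbered, ", ".join(sources))
```

Passing the lookups in as callables keeps the sketch independent of however the dictionaries and the translation service end up being accessed.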

Let's try a few more...

Query: 增值税提高
CC-CEDICT: [nothing]
Nick's translation: Ummm, errrr, literally: increase tax improvement [I wouldn't have a clue though!]
gT: Improve value-added tax
HanDeDict via gT: (1) VAT Receipt economics (2) Receipt with the VAT

This is an even better example. I am pretty sure that HanDeDict beats both myself and Google Translate here because it makes sense and provides two consistent examples!

Let's try again...
Query: DA汇流排闭合
CC-CEDICT: [nothing]
Nick's translation: I would guess something like D-A
gT: DA bus closure
HanDeDict via gT: DA-track closed (verb)

Ok, looks like we didn't get a hit here. Having said that, with the two meanings together I might be inclined to think it was a German highway-related term.

Ok, next thing... I had previously been considering searching for "to" in the dictionary to generate data for a field called "type" with things such as verb, noun, etc. (I use this on my decks to make testing from Pinyin/Audio to English/Chinese possible; without it, it can simply be too hard to work out what a word is. After all, you get this context in real life.)

Anyway, it occurred to me that I had seen some data in this form in HanDeDict, so I took a look at it. Now it does have the data, which is great. I have decoded some of it... things like:
(V) verb
(S) noun
(Adj) adjective
(V,[...] verb plus another tag
(Eig, Geo) proper noun, geography
(S, Mus) noun, music
(Pers) person
(S, Met) meteorology
(S, Tech) technology
(s, Chem) chemistry
(EDV) computers
(S, Wirtsch) economics
(Org) organisation
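To make the tag list concrete, here is a minimal Python sketch of decoding such a tag group. It only knows the tags listed in this comment; the function name, the case-insensitive matching, and the split into "part of speech" versus "category" (for example, treating Pers and Org as categories) are my assumptions, not something HanDeDict documents here.

```python
import re

# Only the tags decoded above. Keys are lower-cased so "(s, Chem)" and
# "(S, Chem)" are treated alike (an assumption about the data).
POS_TAGS = {
    "v": "verb",
    "s": "noun",
    "adj": "adjective",
    "eig": "proper noun",
}
CATEGORY_TAGS = {
    "geo": "geography",
    "mus": "music",
    "met": "meteorology",
    "tech": "technology",
    "chem": "chemistry",
    "edv": "computers",
    "wirtsch": "economics",
    "org": "organisation",
    "pers": "person",
}

# First parenthesised group in the text, e.g. "(S, Mus)".
TAG_GROUP = re.compile(r"\(([^()]*)\)")


def decode_tags(text):
    """Return (parts_of_speech, categories, unknown_tags) for one entry."""
    pos, categories, unknown = [], [], []
    match = TAG_GROUP.search(text)
    if match is None:
        return pos, categories, unknown
    for raw in match.group(1).split(","):
        tag = raw.strip()
        if not tag:
            continue
        key = tag.lower()
        if key in POS_TAGS:
            pos.append(POS_TAGS[key])
        elif key in CATEGORY_TAGS:
            categories.append(CATEGORY_TAGS[key])
        else:
            # Keep undecoded tags so they can be reviewed later.
            unknown.append(tag)
    return pos, categories, unknown
```

Unknown tags are returned rather than dropped, since the list above is only partly decoded.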

This is good. It means HanDeDict can be used to do this, even when in English or other mode (can default to returning type in Chinese, and offer customisation). You will notice also that there are categories (which are fixed and standard and can be quite useful). However, this was not my main point...

batterseapower commented 15 years ago

I think this is usually called the "part of speech"

Nick3C commented 15 years ago

A combination approach is probably the best way.... Carry out each of the tests below, appending to the POS variable (check to see if the type is there first) using a known list of types:
1) check HanDeDict and get the type back from this
2) look for a "to " in CEDICT (i.e. verb) [prefaced by line start or space]
3) look for a capital letter in the pinyin (proper noun)
4) look for an "a " or "an " in CEDICT [prefixed by line start or space]
5) ...
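The tests listed above could be sketched in Python roughly as follows. The function name and argument shapes are made up for illustration: `handedict_types` is assumed to be a list of already-decoded types, and `cedict_meanings` a list of English meaning strings. The "to " and "a "/"an " checks are crude and will give false positives (for example "go to school").

```python
import re

# "to " / "a " / "an " at the start of a meaning or after whitespace.
TO_RE = re.compile(r"(?:^|\s)to ")
ARTICLE_RE = re.compile(r"(?:^|\s)an? ")


def guess_pos(handedict_types, pinyin, cedict_meanings):
    """Collect part-of-speech guesses in test order, each type only once."""
    pos = []

    def add(label):
        # "check to see if the type is there first"
        if label not in pos:
            pos.append(label)

    # 1) types already found in HanDeDict
    for label in handedict_types:
        add(label)

    # 2) a "to " in any CEDICT meaning suggests a verb
    if any(TO_RE.search(meaning) for meaning in cedict_meanings):
        add("verb")

    # 3) a capital letter in the pinyin suggests a proper noun
    if any(ch.isupper() for ch in pinyin):
        add("proper noun")

    # 4) an "a " or "an " in any CEDICT meaning suggests a noun
    if any(ARTICLE_RE.search(meaning) for meaning in cedict_meanings):
        add("noun")

    return pos
```

For instance, `guess_pos(["verb"], "pao3 bu4", ["to run", "a run"])` keeps the HanDeDict verb and adds a noun guess from the article test.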

batterseapower commented 15 years ago

Well, handedict seems to have all the info we need anyway, so if we use a SQL DB we can just extract that for use with all languages. I'm not really sure how to exploit this though, because there will be many parts of speech for any given character.

Nick3C commented 15 years ago

HanDeDict seems to have a limited set of categorisations which fails to draw distinctions between different types. For instance verb objects seem to be treated as verbs (which is not really accurate) and other concepts such as attributive do not exist (or I did not find them because I obviously need to translate them before working out what they are :).

Thus handedict is a base for this (because it has a lot of data) but unlikely to be the be all and end all.

The problem with Chinese is that words are very often in more than one category. My own (self-generated) POS are something like:
verb / noun [1 char]
verb (usually) / noun (rarely) [1 char]
verb object / (adjective) [2 char]
verb (add 为 character) [2 char]

and so on. Thus I would suggest this is a protected field (like audio) that the user can modify.

To generate the data we can search for a type and append it to a list. If one character is a verb in one meaning and a noun in another then we would grab both (but only once for each). This feature will require some user interaction. At higher levels I think this is essential.

Part of speech can also usefully be suffixed with [x characters], which makes it much easier to do translations from English into Chinese when there are multiple Chinese words matching the English meaning.
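A small Python sketch of building such a field, assuming the "verb / noun [1 char]" layout from my own decks. The function name is hypothetical, and the count is simply the length of the word string, so any Latin letters in a word (as in DA汇流排闭合) would be counted too.

```python
def format_pos(labels, hanzi):
    """Join unique part-of-speech labels and append the character count.

    labels -- part-of-speech strings, possibly with repeats
    hanzi  -- the Chinese word; its length is used as the character count
    """
    unique = []
    for label in labels:
        # grab each type only once, keeping first-seen order
        if label not in unique:
            unique.append(label)
    if not unique:
        return ""
    return "%s [%d char]" % (" / ".join(unique), len(hanzi))
```

An empty result is returned when nothing was found, which leaves the field blank for the user to fill in by hand.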

Also we should have an option to use either English or Chinese for our tags (just make a dictionary so that (n) is ["名词", "noun", "[German]", "[French]"]) and assume users want Chinese (better practice) unless they select local language in settings.
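Something like the following Python sketch would do for the tag dictionary. The names and the language codes are assumptions, only Chinese and English are filled in (the German and French entries are left out because they are still placeholders above), and 名词 is used as the usual Chinese word for "noun".

```python
# Hypothetical label table: internal tag -> label per language code.
POS_LABELS = {
    "n": {"zh": "名词", "en": "noun"},
    "v": {"zh": "动词", "en": "verb"},
    "adj": {"zh": "形容词", "en": "adjective"},
}


def pos_label(tag, language="zh"):
    """Return the label for a tag, defaulting to Chinese.

    Unknown tags are returned unchanged; a language with no label
    for the tag falls back to the Chinese one.
    """
    names = POS_LABELS.get(tag)
    if names is None:
        return tag
    return names.get(language, names["zh"])
```

The `language` argument would come from the user's settings, with Chinese used whenever nothing else is selected.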