batterseapower / pinyin-toolkit

A plugin for the Anki Spaced Repetition System (http://ichi2.net/anki/)
http://batterseapower.github.com/pinyin-toolkit/

Simplified Chinese to Traditional Chinese (and vice versa) converter #58

Closed Nick3C closed 15 years ago

Nick3C commented 15 years ago

I think the easiest way to do this will be to use expression as a source. Then look for a field called simplified or traditional. If one is found then do a double convert (to make sure whatever is in expression is the opposite) and then convert into the traditional / simplified field.

I am quite keen to get this working by 0.06 because I would like to use it to practice my traditional sentence reading :p

There is an obvious choice between using a local dictionary and google translate. Obviously the former is better if it is easy to implement. I suspect, from when we wrote the first gTrans(), that it will be very easy to change gTrans to do this for us.

batterseapower commented 15 years ago

See also:

Not massively hard to do this reasonably well locally (and this should be preferred IMHO). Mapping traditional -> simplified phrases looks a bit harder unfortunately.

Doing a double translate is probably a bit fragile because the mappings in each direction are not necessarily injective or surjective. We could probably write a method determining if some text is predominantly simplified or traditional in nature however, and then translate to get the other variant.
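
A minimal sketch of such a "predominantly simplified or traditional" check, purely for illustration. The two character sets and the name `classify_script` are hypothetical; a real version would need a full variant table (for example one derived from Unihan) rather than this handful of characters.

```python
# Hypothetical sample tables: characters that only occur in one script.
# A real implementation would load a complete variant mapping instead.
SIMPLIFIED_ONLY = set("们这发国学书见车马")
TRADITIONAL_ONLY = set("們這發國學書見車馬")


def classify_script(text):
    """Return 'simplified', 'traditional' or 'unknown' by majority vote."""
    simp = sum(1 for ch in text if ch in SIMPLIFIED_ONLY)
    trad = sum(1 for ch in text if ch in TRADITIONAL_ONLY)
    if simp > trad:
        return "simplified"
    if trad > simp:
        return "traditional"
    # No distinguishing characters, or an even mix of both scripts.
    return "unknown"
```

Characters shared by both scripts carry no vote, so a phrase made only of shared characters, or an evenly mixed phrase, comes back as `unknown` and would need some other way of deciding.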

Nick3C commented 15 years ago

ok, it does have benefits, but I think it is a lot more work. I guess google handles this fairly well already, which means it could have been easily implemented. Their engine is designed to do this after all and I have never seen an error in a phrase.

anyway, challenges: 1) characters with a single form in simplified that have multiple forms in traditional (i.e. many-to-one simplification)

2) The one-to-many problem - the simplification of traditional into simplified Chinese meant that some characters merged. This is fine, however when we go from simplified to traditional it creates problems. It's gonna mean we need to do a full word-lookup in the same way, and for the same reasons, as pinyin.

3) a "probably" approach just isn't going to cut it I'm afraid. There is too much chance that a user (thinking it is a new simplified character they don't know) will add one or two traditional characters into a phrase, from the web, etc. It is very easily done. It's going to get horrible if we guess based on the content of the phrase.

What we can do is look at what the user wants and meet that demand. If he has one field called Chinese (our main field for other purposes) and then an extra field called traditional, it is very safe to presume he wants traditional only in that field. If there is a field called Simplified then the same applies in reverse.
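
Roughly what the field check could look like, as a sketch only. The function name `conversion_targets` and the idea of passing in a plain list of field names are assumptions for illustration, not how the plugin actually reads Anki fields.

```python
def conversion_targets(field_names):
    """Map each target field found on the note to the script it should hold.

    Matching is case-insensitive, so 'Traditional' and 'traditional'
    are treated the same. Fields with any other name are ignored.
    """
    wanted = {"simplified": "simplified", "traditional": "traditional"}
    targets = {}
    for name in field_names:
        key = name.strip().lower()
        if key in wanted:
            targets[name] = wanted[key]
    return targets
```

A note with fields `Expression`, `Traditional` and `Reading` would give `{"Traditional": "traditional"}`, and a note with neither field gives an empty dict, so nothing is converted unless the user has asked for it by adding the field.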

Using traditional as an example, what we can do is create a trad->simp dict and a simp->trad dict and then pass every word in the phrase (must be words, not phrases because of many-to-one problem) and guarantee we will get traditional at the end. We can then reverse that to guarantee that we get simplified in the original field (removing any traditional we might inadvertently have in that original phrase).
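
Here is a small sketch of the two-dict idea, assuming a greedy longest-match split into words. The tiny `SIMP_TO_TRAD` table and the `convert` helper are made up for illustration; the real dicts would come from a proper word list.

```python
# Hypothetical word-level table; word entries settle ambiguous characters
# such as 发, which is 髮 in 头发 but 發 in 发展.
SIMP_TO_TRAD = {"头发": "頭髮", "发展": "發展", "我们": "我們", "的": "的"}
# The reverse table is built by swapping keys and values.
TRAD_TO_SIMP = {trad: simp for simp, trad in SIMP_TO_TRAD.items()}


def convert(text, table):
    """Convert text using greedy longest-match lookup against table.

    Anything not found in the table is copied through unchanged, so
    characters already in the target script survive the conversion.
    """
    longest = max(len(word) for word in table)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No entry starts here: keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this, a mixed phrase like `我們的头发` converts to `我們的頭髮` for the traditional field, and converting that result back with `TRAD_TO_SIMP` gives `我们的头发` for the original field.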

this really is going to be the only way to do it which is why I wanted to let google translate handle it (much simpler :p ). See what I mean though? Makes sense?

batterseapower commented 15 years ago

Nick implemented this already. We can come back to replacing Google with a local translation once we get Unihan integrated.

Nick3C commented 15 years ago

yes that was the plan.

gtrans should be right like 80 to 90% of the time, 1 or 2 in 10 isn't too bad I guess :)

Nick3C commented 15 years ago

Thinking about it, this is a prime example of where google search could be used to pick the more likely of two outcomes. I had a cambridge compsci friend who ran a project on this. I should get in touch with him.

He was using google search as a spell check. His program would send two possible spellings to google and use the result count to determine which was more likely to be right.

We could do the same thing, i.e. if characters have multiple possible forms we could take the phrase and iterate through the various combinations, pass them all to google and see which occurs most often. That would be most likely to be correct.
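
A rough sketch of the combination step. The `CANDIDATES` table is a hypothetical sample, and `hit_count` is a placeholder for whatever would actually ask a search engine for a result count; no real search API is used or assumed here.

```python
from itertools import product

# Hypothetical sample: simplified characters with their possible
# traditional forms. Characters not listed are left unchanged.
CANDIDATES = {"发": ["發", "髮"], "头": ["頭"], "面": ["面", "麵"]}


def candidate_phrases(text, candidates):
    """Return every combination of possible forms for the phrase."""
    options = [candidates.get(ch, [ch]) for ch in text]
    return ["".join(combo) for combo in product(*options)]


def most_frequent(text, candidates, hit_count):
    """Pick the candidate phrase for which hit_count returns the most.

    hit_count is any callable taking a phrase and returning a number,
    e.g. a wrapper around a search query (not implemented here).
    """
    return max(candidate_phrases(text, candidates), key=hit_count)
```

For `头发` this produces the two candidates `頭發` and `頭髮`, and whichever gets the higher count wins. The number of candidates multiplies with every ambiguous character, so long phrases would mean a lot of queries.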

How cool would that be? :)

Nick3C commented 15 years ago

come to think of it, we might be able to use this method to improve pinyin accuracy too. If we convert our phrase to traditional then each character has fewer possible pronunciations.

Therefore, if we took a simplified phrase and converted it to traditional, ran it through google to see which was most likely, and then looked up the traditional instead of the simplified the pinyin generation should be somewhat better.

Little complex but I absolutely love the idea of using google in this way.