Paperfeed / LiuChan

A Chinese mouseover dictionary extension for Chrome
https://paperfeed.github.io/LiuChan

Added Cantonese support, fixed some bugs. #11

Open jckt opened 6 years ago

jckt commented 6 years ago

Added Cantonese pronunciation support

Main features added:

Fixed bugs, notably:

Paperfeed commented 6 years ago

Which dictionary are you using for Cantonese? Is it CC-Canto from http://cantonese.org/?

edit: Thanks for the contribution by the way :)

jckt commented 6 years ago

The raw data comes from CC-Canto and the CC-CEDICT Cantonese readings (both from cantonese.org), which were processed into a single file.

You're welcome!

gkovacs commented 5 years ago

This is great! I tested it, and it resolves an issue that has been constantly annoying me: words like 捨棄 failing to be looked up. Is there anything blocking this from being merged?

gkovacs commented 5 years ago

I noticed that with this branch, jyutping seems to be unavailable for 律 and all words containing it, e.g. 法律, 律师, 旋律, 音律, 因果律, 定律, 菲律宾. I'm not sure why.

gkovacs commented 5 years ago

Found the reason for the above error: it looks like the script that generates cedict_combined.u8 might have some bugs, as it doesn't seem to include jyutping everywhere. See the entry below (the jyutping should be between the { }):

法律 法律 [fa3 lu:4] { } /law/CL:條|条[tiao2], 套[tao4], 個|个[ge4]/

gkovacs commented 5 years ago

Oh, this seems to impact every word containing a character whose pinyin contains ü (written v or u:), like 女, 绿, 吕, 驴. Presumably it is an issue with the script that generates cedict_combined.u8 (which unfortunately doesn't seem to be included in the repository).
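To gauge how many entries are affected, a quick scan of the file for blank { } fields might help. The following is only a rough sketch: the line layout (traditional, simplified, [pinyin], {jyutping}, optional /definitions/) is inferred from the two sample entries quoted in this thread, and `ENTRY_RE` and `missing_jyutping` are hypothetical names, not part of LiuChan or the generator scripts.

```python
import re

# Assumed layout, inferred from the entries quoted in this thread:
#   TRAD SIMP [pinyin] {jyutping} /def1/def2/
# The trailing /definitions/ field is treated as optional.
ENTRY_RE = re.compile(
    r"^(?P<trad>\S+)\s+(?P<simp>\S+)\s+"
    r"\[(?P<pinyin>[^\]]*)\]\s*"
    r"\{(?P<jyutping>[^}]*)\}"
    r"(?:\s*/(?P<defs>.*)/)?\s*$"
)

def missing_jyutping(lines):
    """Yield (trad, pinyin) for parsed entries whose { } field is blank."""
    for line in lines:
        m = ENTRY_RE.match(line.strip())
        if m and not m.group("jyutping").strip():
            yield m.group("trad"), m.group("pinyin")

sample = [
    "法律 法律 [fa3 lu:4] { } /law/CL:條|条[tiao2], 套[tao4], 個|个[ge4]/",
    "法律 法律 [fa3 lv4] {faat3 leot6}",
]
result = list(missing_jyutping(sample))
```

Running the same function over the lines of cedict_combined.u8 (opened with `encoding="utf-8"`) would list every entry with an empty jyutping field, which should show whether the gap is limited to ü syllables.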

jckt commented 5 years ago

I wrote a big message just now about how, in general, I've tried to avoid autocompleting jyutping on a per-character basis (it leads to many errors; even the Pleco dictionary on iPhone has them, and it uses a better version of the CC-Canto sources AFAIK). But you're right: in this case it's my fault, and there is a bug in the generator scripts. In fact, the entry is double-entered; somewhere else in the file there is:

法律 法律 [fa3 lv4] {faat3 leot6}

So there are now two ways of expressing ü in the dictionary (I forget if this is a problem; I'll check again when I have the time). In this case I guess one could either condense the two entries, or leave both entries but auto-clean the pinyin and the / / definition field. Condensing is easy here, since the entry above is malformed: it has no / / field for a (blank) definition, so the regex misses it completely, which is why it doesn't even show up as a definition-free entry. I'll try to fix it as soon as I have the time.
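The "condense the two entries" option could look roughly like the sketch below. This is not the code from the generator scripts: the line layout is inferred from the two entries quoted in this thread, rewriting `u:` as `v` is an arbitrary choice of a single spelling for ü, and `ENTRY_RE`, `normalize_pinyin` and `merge_entries` are hypothetical names.

```python
import re

# Assumed layout: TRAD SIMP [pinyin] {jyutping} /defs/
# The /defs/ field is optional so that entries without it still parse.
ENTRY_RE = re.compile(
    r"^(?P<trad>\S+)\s+(?P<simp>\S+)\s+"
    r"\[(?P<pinyin>[^\]]*)\]\s*"
    r"\{(?P<jyutping>[^}]*)\}"
    r"(?:\s*/(?P<defs>.*)/)?\s*$"
)

def normalize_pinyin(pinyin):
    """Use one spelling for ü: 'u:' becomes 'v' (an arbitrary choice)."""
    return pinyin.replace("u:", "v")

def merge_entries(lines):
    """Merge entries sharing headword and normalized pinyin.

    The first non-empty jyutping and the first non-empty definition
    field seen for a key are kept; unparseable lines are skipped.
    """
    merged = {}
    for line in lines:
        m = ENTRY_RE.match(line.strip())
        if not m:
            continue
        key = (m["trad"], m["simp"], normalize_pinyin(m["pinyin"]))
        entry = merged.setdefault(key, {"jyutping": "", "defs": ""})
        entry["jyutping"] = entry["jyutping"] or m["jyutping"].strip()
        entry["defs"] = entry["defs"] or (m["defs"] or "")
    return [
        f"{t} {s} [{p}] {{{e['jyutping']}}} /{e['defs']}/"
        for (t, s, p), e in merged.items()
    ]

sample = [
    "法律 法律 [fa3 lu:4] { } /law/CL:條|条[tiao2], 套[tao4], 個|个[ge4]/",
    "法律 法律 [fa3 lv4] {faat3 leot6}",
]
combined = merge_entries(sample)
```

For the two 法律 lines this yields a single entry that carries both the jyutping and the definitions. Keying on the normalized pinyin keeps words with several distinct readings as separate entries.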

For now, I've attached the dictionary generator scripts. I didn't include them in the branch since I thought I would quickly clean them up and include some autocomplete system that also gave correct results (but that's actually a much harder problem than I thought it was).

Thanks again for pointing this out.

generators.zip

orientalperil commented 4 years ago

@Paperfeed Any chance this can get merged and deployed to the Chrome Web Store? I'm interested in being able to use Cantonese and can help push this along if more changes are needed.