Kottakji / chinese-firefox-quantum

Cantonese Additional Words #4

Open Kottakji opened 6 years ago

Kottakji commented 6 years ago

Import additional words that are specific to Cantonese.

chickendude commented 3 years ago

These can just be added to the Cantonese .dat file, correct? Merging the data from CC-Canto would probably provide a very good foundation; at the moment it seems like a lot of common words and characters are still missing. I could work on either incorporating the new definitions into the current .dat file or perhaps just converting the dictionary to the format Perapera uses. According to their site it's released under a CC license:

All three titles are open-source and distributed under a Creative Commons Attribution-ShareAlike 3.0 license.

Kottakji commented 3 years ago

@chickendude Yeah, are you able to do this? If you need some help getting it to run locally, I can help you. Otherwise, if you can help me test, maybe I can try to find some time, but I don't speak Cantonese, so it's hard for me to verify it.

chickendude commented 3 years ago

I should be able to get them added, especially if it's just formatting them to the current format and appending them to the .dat file. A lot of common words like 边个 ("who/which") are missing from the current dictionary. CC-Canto, while not perfect, has a lot more Cantonese-specific entries and would make it much more useful as a Cantonese pop-up dictionary. I'll take a look at it this weekend; I don't imagine it'll be too complicated or take much time.
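For reference, a rough sketch of the parsing half of that job. It assumes CC-Canto lines follow the CC-CEDICT layout with an extra Jyutping field in braces (`traditional simplified [pinyin] {jyutping} /gloss/gloss/`); the sample line is hand-written for illustration, not copied from the file, and `CantoEntry` and `parse_cc_canto_line` are made-up names. How each entry is then serialized depends on the .dat layout Perapera expects, which is not covered here.

```python
import re
from typing import NamedTuple, Optional


class CantoEntry(NamedTuple):
    """One dictionary entry (hypothetical container, not part of the extension)."""
    traditional: str
    simplified: str
    pinyin: str
    jyutping: str
    glosses: list


# Assumed line layout: trad simp [pinyin] {jyutping} /gloss 1/gloss 2/
_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\{([^}]*)\}\s+/(.*)/\s*$")


def parse_cc_canto_line(line: str) -> Optional[CantoEntry]:
    """Parse one line; return None for blanks, comments, or unexpected layouts."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, jyutping, body = match.groups()
    glosses = [gloss for gloss in body.split("/") if gloss]
    return CantoEntry(traditional, simplified, pinyin, jyutping, glosses)


# Hand-written sample line, for illustration only.
SAMPLE = "邊個 边个 [bian1 ge4] {bin1 go3} /who/which/"
entry = parse_cc_canto_line(SAMPLE)
```

Returning `None` for lines that don't match keeps the import loop simple: anything unexpected can be logged and skipped rather than written into the .dat file in a broken state.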

chickendude commented 3 years ago

@JorisKok I got a bit of time today to work on this, and after adding in the CC-Canto dictionary Perapera becomes much more useful for Cantonese. I tried reading some fairy tales and most of the words are recognized now. That's great! I see you have some scripts to scrape the Sheik dictionaries; I'm not sure if this is something you would do regularly.

There are a couple of odd things, though. A lot of the entries use Yale romanization with tone marks, which can sometimes cause the romanization not to show. For example:

孤仙 [wùh sīn]\n(fox spirit)\nsomeone with armpit odor

...shows the tones just fine (and even seems to convert from Yale to Jyutping), whereas:

烏吓烏吓 [wū háh wū háh]\nmessy, sloppy, untidy, badly dressed, stupid looking

...doesn't show any romanization at all. If I change it to wu1 haa5 wu1 haa5 (Jyutping + numbers for tones) it shows up just fine.

I'm wondering whether I should go in and try to convert them, or if that would just make it more complicated if you decide to scrape the Sheik dictionaries again. Also, tone coloring seems to be missing for one of the tones (tone 6) in the charcoal/paper themes. I can add this in as well.
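If converting turns out to be the way to go, the tone part is fairly mechanical. Below is a minimal sketch that only derives the Jyutping tone number from a Yale syllable with tone marks, using the usual Yale convention (macron, acute, grave or no mark, plus an "h" after the vowel for the low register). It does not touch spelling differences such as Yale "a" vs. Jyutping "aa", and `yale_tone_number` is a made-up helper, not something from the extension.

```python
import re
import unicodedata

# Combining diacritics used as Yale tone marks.
_MARKS = {"\u0304": "macron", "\u0301": "acute", "\u0300": "grave"}

# (tone mark, low register?) -> Jyutping tone number.
_TONES = {
    ("macron", False): 1,
    ("grave", False): 1,   # high falling, written as tone 1 in Jyutping
    ("acute", False): 2,
    (None, False): 3,
    ("grave", True): 4,
    ("acute", True): 5,
    (None, True): 6,
}

# onset, nucleus (vowels or a syllabic nasal), optional low-register "h", coda.
_SYLLABLE = re.compile(r"^(?:[^aeiou]*?)(?:[aeiou]+|ng|m)(h?)(?:ng|[mnptk])?$")


def yale_tone_number(syllable: str) -> int:
    """Return the Jyutping tone number (1-6) for one Yale syllable."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    mark = None
    base_chars = []
    for ch in decomposed:
        if ch in _MARKS:
            mark = _MARKS[ch]
        elif not unicodedata.combining(ch):
            base_chars.append(ch)
    match = _SYLLABLE.match("".join(base_chars))
    if match is None:
        raise ValueError(f"not a recognizable Yale syllable: {syllable!r}")
    low_register = match.group(1) == "h"
    tone = _TONES.get((mark, low_register))
    if tone is None:
        raise ValueError(f"unexpected tone mark in {syllable!r}")
    return tone


tones = [yale_tone_number(s) for s in "wū háh wū háh".split()]
```

For the second entry above this gives tones 1, 5, 1, 5, which lines up with the wu1 haa5 wu1 haa5 spelling that displays correctly.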

Edit: Just noticed it doesn't show romanization for 五. Its romanization is ng5, so the lack of a vowel might be throwing it off.
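To illustrate the guess (this is not the extension's actual code, which I haven't checked): if the romanization check requires at least one vowel, syllabic nasals like ng5 would be rejected. Both patterns below are hypothetical and only show the difference.

```python
import re

# Hypothetical check that insists on a vowel: rejects "ng5" and "m4".
VOWEL_REQUIRED = re.compile(r"^[a-z]*[aeiou]+[a-z]*[1-6]$")

# Same check, but also accepting the syllabic nasals "ng" and "m".
WITH_SYLLABIC_NASALS = re.compile(r"^(?:[a-z]*[aeiou]+[a-z]*|ng|m)[1-6]$")


def is_jyutping_syllable(text: str, allow_syllabic_nasals: bool = True) -> bool:
    """Loose validity check for one numbered Jyutping syllable."""
    pattern = WITH_SYLLABIC_NASALS if allow_syllabic_nasals else VOWEL_REQUIRED
    return pattern.match(text) is not None
```

With the stricter pattern, `is_jyutping_syllable("ng5", allow_syllabic_nasals=False)` is false while "haa5" and "wu1" still pass, which would match the behaviour described here.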

Edit2: Just made a PR.