cschiller / zhongwen

Official source code of the "Zhongwen" Chrome extension
https://chrome.google.com/webstore/detail/zhongwen-chinese-english/kkmlkkjojmombglmlpbpapmhcaljjkde
GNU General Public License v2.0

Add CedPane dictionary supplement? #60

Open ssb22 opened 3 years ago

ssb22 commented 3 years ago

Hi, I maintain a public-domain Chinese-English dictionary supplement (currently about 64,000 entries). Its data is not usually in "normal" dictionaries but is still useful to have in a reader (mostly names of people and places). If you include these extra entries, I would suggest labelling them in some way to differentiate them from the "main" CEDICT, as it seems the CEDICT editors are not sure they want to merge in CedPane entries en masse (and anyway I'm still writing it).

If you do want to merge in, I think the best starting point would be the CedPane ChinaScribe file because the format of that is quite similar to CEDICT. The main difference is that some of the pinyin syllables are separated with _ instead of space: this indicates a word boundary in a multi-word phrase; if you can't cope with multi-word phrases then these entries are possibly best dropped. And some of the definitions are in <...> to indicate an environment (e.g. PRC, TW, netspeak, etc). Other than that it's basically the same apart from the sort order.
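A minimal sketch of reading such a file, assuming each line follows the usual CEDICT layout (`TRAD SIMP [pinyin] /def/def/`). The exact layout of the ChinaScribe file is not shown in this thread, so the regular expression, the `Entry` type, and the `parse_line` name are all illustrative assumptions. Definitions in `<...>` are passed through untouched.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed CEDICT-style layout: TRAD SIMP [pin1 yin1] /def 1/def 2/
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


class Entry(NamedTuple):
    traditional: str
    simplified: str
    pinyin: str
    definitions: List[str]
    is_multiword: bool  # True when the pinyin contains a "_" word boundary


def parse_line(line: str) -> Optional[Entry]:
    """Parse one dictionary line; return None for comments, blanks, or mismatches."""
    if not line.strip() or line.startswith("#"):
        return None
    m = LINE_RE.match(line)
    if m is None:
        return None
    trad, simp, pinyin, defs = m.groups()
    return Entry(trad, simp, pinyin, defs.split("/"), "_" in pinyin)
```

A consumer that cannot cope with multi-word phrases could then skip every entry whose `is_multiword` flag is set.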

I don't know what your current method is for generating cedict.idx and your modified cedict_ts.u8 from upstream (do you have scripts to do this, and can you put them in the repo for reference?), but I'd imagine merging in another source (and perhaps labelling every definition with [CedPane] or similar so as to differentiate them from mainline CEDICT) shouldn't be difficult.
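As a rough illustration of the labelling idea, the sketch below appends a label to every definition of a CEDICT-style line and concatenates the labelled supplement onto the main list. The function names `label_definitions` and `merge` are hypothetical, the input is assumed to be in the `TRAD SIMP [pinyin] /def/def/` layout, and regenerating cedict.idx is not covered because its format is not described in this thread.

```python
from typing import Iterable, List


def label_definitions(line: str, label: str = "[CedPane]") -> str:
    """Append `label` to each /definition/ of one CEDICT-style line."""
    text = line.rstrip("\n")
    head, sep, rest = text.partition(" /")
    if not sep:
        return text  # no definition field found; leave the line alone
    defs = [d for d in rest.strip("/").split("/") if d]
    return head + " /" + "/".join(f"{d} {label}" for d in defs) + "/"


def merge(main_lines: Iterable[str], supplement_lines: Iterable[str],
          label: str = "[CedPane]") -> List[str]:
    """Return main lines unchanged, followed by labelled supplement lines."""
    merged = [l.rstrip("\n") for l in main_lines]
    for l in supplement_lines:
        if l.strip() and not l.startswith("#"):  # skip blanks and comments
            merged.append(label_definitions(l, label))
    return merged
```

Keeping the label inside each definition means a popup that shows definitions verbatim would display the source without any further code changes.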

It's nice that you are able to push out several updates a year. I currently tend to publish my CedPane edits on the last day of each month, although that's not a guarantee. If you do a git pull from my repo as part of your normal update script, that should work. Alternatively on the CedPane home page there's always a "Last update" and entry count.

(I used to make a text file of CedPane available for download from the home page, but then a developer in China thought it was a good idea to write a lookup extension that re-downloads CedPane from my server every time it was used, which caused hundreds of gigabytes of traffic—when I say “keep up to date” I don't mean that much☺ You can still get it from the home page but it's now a ZIP file which I hope will discourage our extension-writing friend from hammering my server. Meanwhile it also lives on the major Git providers which have more bandwidth. Including the data in the extension with periodic updates, as you do, does seem to be a better way.)

ssb22 commented 3 years ago

Sorry, I forgot to mention that I wrote this ticket because a user of your extension emailed me asking me to fork your extension into a version with CedPane added. I'd rather avoid creating a fork if it's something that can be done upstream.

ienablemuch commented 3 years ago

@ssb22

I appreciate that users will no longer spend a great deal of time analyzing an enigmatic series of Chinese words when those words are merely people's transliterated names, compound words, company names, colloquial phrases, or idioms. In fact, I included your dictionary in my Chinese Words Separator extension for Chrome https://chrome.google.com/webstore/detail/chinese-words-separator/gacfacdpfimbkgcnlegknnmcccjgcbnp

It will save a lot of Chinese language learners the time they would otherwise spend over-analyzing a series of hanzi. Here's an example of my extension's output, before and after I included your CedPane dictionary:

image

However, there are phrases that I feel should not be in the CedPane dictionary, for instance:

允许安装来自未知来源的应用

Yǔnxǔ ānzhuāng láizì wèizhī láiyuán de yìngyòng

I feel that it's neither a compound word nor a colloquial phrase that Chinese language learners should learn by heart. So, to exclude it, I limited my code's compound-word look-ahead to a certain length, which keeps such lengthy phrases out of the extension's compound-words mechanism. There are more phrases that I think should not be in the CedPane dictionary.

I believe the hesitancy of some Chinese dictionary tool makers to include CedPane in their dictionaries stems from examples like these.

image

ienablemuch commented 3 years ago

@ssb22 It would be a good idea to add a field to the CedPane dictionary to indicate whether an entry is a name, compound word, colloquial phrase, or idiom. Entries such as 允许安装来自未知来源的应用 could instead be marked as an accurate translation of a commonly occurring phrase.

ienablemuch commented 3 years ago

@ssb22 Here's another output of the Chinese Words Separator extension with your CedPane dictionary included:

image

Without CedPane:

image

Thanks :)

ssb22 commented 3 years ago

Thanks, I think the easiest way to omit the "phrase" entries is simply to omit any entry that has a _ in it in the ChinaScribe format (or a space in the pinyin column in the main format). That works better than using a length limit, because a length limit may omit entries like 图斯潘德伯拉尼奥斯镇 (the town of Tuxpan de Bolaños, yes I saw that in a news article in 2018).

Phrases like "allow installation for unknown sources" are included because we occasionally need them for translating English into Chinese. I find people tend not to understand my technical instructions unless I can quote the exact wording that's displayed on their screen, not just a paraphrase of it, so yes we do want to be able to look up how things like that are worded in Chinese. But they are not meant to be displayed without spaces, which is why I include spaces in the pinyin field of cedpane.txt, and _ characters in the ChinaScribe format.

If anyone has code that can process phrases including spaces, I'd rather they include the multi-word phrases because some of these entries are used to "clear up" what would otherwise be a difficult case for a computer to get right. For example, the entry 万国都 is 2 words, and it is meant to clarify that, in the texts I've seen, 万国都 should be written as "wànguó dōu" (all nations + all), rather than "wàn guódū" (myriad + capital cities). Otherwise, software like Wenlin might incorrectly put "wàn guódū" because 国都 has a higher usage frequency than 万国 (usage frequency is the wrong signal to use in this case, so I added an 'override' phrase entry).
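To illustrate how an 'override' phrase entry can supply the segmentation directly, here is a small sketch that turns word-boundary pinyin into a word split of the hanzi. It assumes the ChinaScribe convention described above (spaces between syllables, `_` between words) and, as a simplification, exactly one pinyin syllable per character; `split_words` is a hypothetical helper name.

```python
from typing import List


def split_words(hanzi: str, pinyin: str) -> List[str]:
    """Split `hanzi` into words using "_" word boundaries in `pinyin`.

    Assumes one pinyin syllable per character, which will not hold for
    entries containing erhua, digits, or Latin letters.
    """
    words, pos = [], 0
    for word_pinyin in pinyin.split("_"):
        n = len(word_pinyin.split())  # syllables in this word
        words.append(hanzi[pos:pos + n])
        pos += n
    if pos != len(hanzi):
        raise ValueError("syllable count does not match character count")
    return words


print(split_words("万国都", "wan4 guo2_dou1"))  # ['万国', '都']
```

A segmenter could consult such phrase entries before falling back to frequency-based matching, so that 万国都 is split as 万国 + 都 rather than 万 + 国都.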

CEDICT also has a few 'long phrase' entries (like 金窩銀窩不如自己的狗窩) which I don't think should be written without spaces. Unfortunately, CEDICT doesn't have the _ characters I use to indicate spaces, so about the only thing you can do with that data is to have a length limit. But for CedPane you can look out for the _ characters (or spaces in the pinyin field of cedpane.txt) to identify this type of entry.
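The two filtering strategies can be compared side by side. This is only a sketch: the function names, the threshold of 8 characters, and the pinyin string used for the town name are assumptions made for illustration, not values taken from CedPane or CEDICT.

```python
def is_phrase_cedpane(pinyin: str) -> bool:
    """CedPane (ChinaScribe format): a "_" marks a multi-word phrase."""
    return "_" in pinyin


def is_phrase_by_length(headword: str, max_len: int = 8) -> bool:
    """Fallback for data without word boundaries: treat long entries as phrases.

    max_len is an arbitrary illustrative threshold.
    """
    return len(headword) > max_len


# The long CEDICT entry is caught by the length limit...
print(is_phrase_by_length("金窩銀窩不如自己的狗窩"))  # True (11 characters)
# ...but so is a long place name (10 characters), a false positive.
print(is_phrase_by_length("图斯潘德伯拉尼奥斯镇"))  # True
# Hypothetical pinyin for the place name: no "_", so the marker test keeps it.
print(is_phrase_cedpane("Tu2 si1 pan1 de2 bo2 la1 ni2 ao4 si1 zhen4"))  # False
```

The marker test depends on the editor's judgement recorded in the data, whereas the length test can only guess, which is why it misclassifies long single-word names.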

cschiller commented 3 years ago

Hi @ssb22, thanks for getting in touch. You obviously put a lot of work into compiling your dictionary and the result is very impressive.

I would actually prefer it if you made your work available by publishing it through CC-CEDICT. That dictionary already includes a number of names of famous people and well-known places. I believe this approach would have several benefits:

Anyway, I respect the amount of work you've put into this. By working together with the CC-CEDICT team you could make it available to an even wider audience and it would be a win for everybody.

ienablemuch commented 3 years ago

"But they are not meant to be displayed without spaces, which is why I include spaces in the pinyin field of cedpane.txt, and _ characters in the ChinaScribe format"

I overlooked the file (CedPane-ChinaScribe.txt) that has word boundaries delimited by the underscore _ character; I used cedpane.txt initially. I use the ChinaScribe file now, and I have added CedPane's phrases back into the Chinese Words Separator extension. Besides CedPane's names and compound words, the commonly occurring phrases now also jump out of the screen, at least with the use of an extension.

image

Is there a version or fork of CC-CEDICT that is in ChinaScribe format? It's neat when pinyin has word boundaries marked with something like the underscore _, not just 'syllable' boundaries. Indeed, the idiom "there's no place like home" is rendered with no spaces, as it is treated as one word because the CC-CEDICT source dictionary has no word boundaries :)

image

ssb22 commented 3 years ago

Thanks @cschiller. I believe I was ostracized by the CEDICT team after an email misunderstanding 4 years ago, and I wouldn't want to annoy them by trying again now.

The CEDICT license doesn't let developers mix CEDICT data with other data, unless that other data also has a CC license. I was in the awkward position of having been given special permission to use certain proprietary data in a zero-cost zero-profit Android app, but I didn't have permission to CC-license that data, therefore I could not mix in CEDICT (unless CEDICT gave me an exception to the "must CC it" rule, which they didn't). I did try Adsotrans data for a while, since Adso's license did allow mixing without a CC requirement. But I found issues with the quality of Adso's data, so ended up going my own way instead.

Pleco has an innovative way of keeping dictionaries separate while still letting you use several, so Pleco is able to use both CC-CEDICT and proprietary dictionaries if you want. But not all apps can be written like Pleco (and I could not figure out how to make my Annotator Generator work anything like Pleco) so I couldn't just do it that way. I felt public-domain data would be least likely to cause problems for developers.

Sure I'm happy for CEDICT to use the data as long as they don't try to stop me from keeping the public-domain version available as well. They would probably want to review everything before inclusion, which could end up being a lot of work. In the short term it may make more sense to have CedPane as a separate source, and perhaps label your entries so everyone knows which of them have been edited by CEDICT versus which of them have only been edited by me. I suppose it's not impossible CEDICT could decide my editing is good enough to import without further review, but that is not my call to make! At the very least, I'd want to draw their attention to:

etc. Marking all entries as "from CedPane" until reviewed could be one way to shift any blame.

@ienablemuch The only other dictionary I know of that uses ChinaScribe format is the one bundled with ChinaScribe itself, which is commercial Windows software with a free trial (it sometimes works on WINE depending on the version). The License Agreement that pops up when you install it says: "Many ChinaScribe entries are derived from the following sources: CC-CEDICT Chinese-English dictionary. Available free of charged and licensed under a Creative Commons Attribution-Share Alike 3.0 License." So I suppose that means, although ChinaScribe is commercial software, its dictionary is a CC-CEDICT fork and can be used with other programs if you can get the data out of ChinaScribe. (The paid-up version has File / Export dictionary entries, but this refuses to run on the unpaid version. The internal file is typically drive_c/users/Public/Application Data/ChinaScribe/MainDict.cs1 but it's in some binary format I haven't figured out.)

(Edit: formatting)

ssb22 commented 3 years ago

Incidentally ChinaScribe merged in CedPane in 2017, but I haven't checked to what extent they're keeping up since then. (I do keep an entry in CedPane for CedPane itself, with the date on it—see for example Ce.html—I figured that this entry could be used to check when a project that imported CedPane last did so, assuming they kept that entry. ChinaScribe doesn't seem to have it at the moment, which might perhaps mean their last import predates when I first added it.)