cschiller / zhongwen

Official source code of the "Zhongwen" Chrome extension
https://chrome.google.com/webstore/detail/zhongwen-chinese-english/kkmlkkjojmombglmlpbpapmhcaljjkde
GNU General Public License v2.0
312 stars 52 forks

Is the script used to generate the indices available? #81

Closed metzkorn closed 3 years ago

metzkorn commented 3 years ago

cedict.idx seems to have been generated from cedict_ts.u8. I modified cedict_ts.u8, so I need to update the indices. I can probably go ahead and write a script to do so myself, but before I do, I figured I'd ask here.

metzkorn commented 3 years ago

The way the indices are calculated seems kind of bizarre (at least from what my text editors are showing me). There seems to be an extra character in every line at first, but then that breaks down.

metzkorn commented 3 years ago

Most lines seem to be off by one extra character, but not every line. I'm going to try a different editor. Might this be affecting the actual app too? At least some characters towards the bottom of cedict_ts.u8 aren't detectable. For instance, 𬭳 isn't detectable, and neither are the other 10 final entries.

metzkorn commented 3 years ago

Upon further investigation, this seems to be an error with the encoding scheme used. The one extra character per line comes from the \r\n line endings. The other extra characters appear because characters like 𪨊 get rendered as ��, so each of them is counted as two characters.
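
To illustrate the two effects described above, here is a minimal sketch that compares line start offsets counted in Unicode code points with offsets counted in UTF-16 code units (what JavaScript's `String.length` reports). The assumption that the extension's offsets are UTF-16 based is mine and is not confirmed in this thread; the sample entries and the helper names `utf16_units` and `line_offsets` are made up for illustration.

```python
def utf16_units(text: str) -> int:
    """Length of text in UTF-16 code units (what JavaScript's String.length reports)."""
    return len(text.encode("utf-16-le")) // 2


def line_offsets(data: str, measure=len) -> list[int]:
    """Start offset of every non-empty line, sizing each line with `measure`."""
    offsets, pos = [], 0
    for line in data.split("\n"):      # each piece keeps its trailing "\r", if any
        if line:                       # skip empty pieces, e.g. after the final newline
            offsets.append(pos)
        pos += measure(line) + 1       # +1 for the "\n" removed by split()
    return offsets


# Hypothetical two-line sample in CEDICT-like layout with \r\n line endings.
sample = "𪨊 𪨊 [xx5] /placeholder/\r\n中 中 [zhong1] /middle/\r\n"

print(len("𪨊"), utf16_units("𪨊"))       # one code point, two UTF-16 code units
print(line_offsets(sample))               # offsets counted in code points
print(line_offsets(sample, utf16_units))  # offsets counted in UTF-16 code units
```

In this sample the second line starts two units later under UTF-16 counting, one for each 𪨊 on the first line, and the `\r` adds one position per line compared with what an editor that hides it would suggest.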

metzkorn commented 3 years ago

I have written a Python script that updates the indices. I'll consider refactoring it so it's easy for others to use, but everything anyone would need to update the indices is in updateidx.py in my fork of the project.
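
For readers who want a starting point without the fork, the following is a rough sketch of how such an index could be rebuilt. It is not updateidx.py and makes several assumptions that this thread does not confirm: that each index line has the form `word,offset[,offset...]`, that both the traditional and the simplified headword are indexed, that offsets are UTF-16 code unit positions into the dictionary text with `\r\n` kept as-is, and that keys are sorted in UTF-16 code unit order. The function name `build_index` is hypothetical. Check the output against the existing cedict.idx before relying on it.

```python
from collections import defaultdict


def utf16_units(text: str) -> int:
    """Length of text in UTF-16 code units (what JavaScript's String.length reports)."""
    return len(text.encode("utf-16-le")) // 2


def build_index(dict_text: str) -> str:
    """Build 'word,offset,...' lines from CEDICT-style text (assumed format, see above)."""
    entries = defaultdict(list)
    pos = 0
    for line in dict_text.split("\n"):          # pieces keep their trailing "\r"
        if line and not line.startswith("#"):   # skip blank pieces and comment lines
            fields = line.split(" ", 2)         # traditional, simplified, rest of entry
            if len(fields) >= 2:
                # dict.fromkeys drops the duplicate when both headwords are identical
                for word in dict.fromkeys(fields[:2]):
                    entries[word].append(pos)
        pos += utf16_units(line) + 1            # +1 for the "\n" removed by split()
    # Sorting on big-endian UTF-16 bytes reproduces UTF-16 code unit ordering.
    keys = sorted(entries, key=lambda k: k.encode("utf-16-be"))
    return "\n".join(f"{k}," + ",".join(map(str, entries[k])) for k in keys)


if __name__ == "__main__":
    # Read with newline="" so "\r\n" is not translated and offsets stay faithful.
    with open("cedict_ts.u8", encoding="utf-8", newline="") as f:
        index = build_index(f.read())
    with open("cedict.idx", "w", encoding="utf-8", newline="") as f:
        f.write(index)
```

Counting in UTF-16 units rather than Python code points is the design choice that matters here: with code point counting, every entry after a character such as 𪨊 or 𬭳 would be shifted relative to what a JavaScript consumer expects.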