Closed metzkorn closed 3 years ago
The way the indices are calculated seems kind of bizarre (at least from what my text editors are showing me). There seems to be an extra character in every line at first, but then that breaks down.
Most lines seem to be counted as having one extra character, but not every line. Going to try using a different editor. This might be affecting the actual app too? At least some characters towards the bottom of cedict_ts.u8 aren't detectable. For instance, 𬭳 isn't detectable, and neither are the other 10 final entries.
Upon further investigation, this seems to be an error with the encoding scheme used. The one extra character per line comes from \r\n. But other extra characters appear because characters like 𪨊 get rendered as ��.
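To make the discrepancy concrete, here is a minimal Python sketch (not taken from the project) that measures the same line in three different units. The helper name `offsets` is made up for illustration. It shows that a `\r\n` line ending costs two units where an editor typically shows one line break, and that a character outside the Basic Multilingual Plane such as 𪨊 takes two UTF-16 code units or four UTF-8 bytes, which would be consistent with it showing up as two replacement characters.

```python
def offsets(text: str) -> dict:
    """Measure a string in three units that an index might be based on."""
    return {
        # What Python's len() reports: one per Unicode code point.
        "code_points": len(text),
        # What a JavaScript string index counts: UTF-16 code units.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # What a raw file offset counts: bytes of the UTF-8 encoding.
        "utf8_bytes": len(text.encode("utf-8")),
    }


if __name__ == "__main__":
    # A supplementary-plane character, a space, an ASCII letter, then CRLF.
    print(offsets("𪨊 x\r\n"))
    # The same content with a bare LF and no supplementary-plane character.
    print(offsets("a x\n"))
```

Whichever unit the index was originally written in, the tool that regenerates it has to count in the same one, otherwise offsets drift further with every such line.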
I have written a Python script that updates the indices. I'll consider refactoring it so it's easier for others to use, but everything anyone would need to update the indices is in updateidx.py in my fork of the project.
cedict.idx seems to have been generated from cedict_ts.u8. I modified cedict_ts.u8, so I need to update the indices. I can probably go ahead and write a script to do so myself, but before I do, I figured I'd ask here.
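For anyone wanting to try the same thing, here is a rough sketch of how line-start offsets could be recomputed. It is not the project's actual generator and not the updateidx.py mentioned in this thread. It assumes the usual CC-CEDICT line layout (`Traditional Simplified [pinyin] /definition/`, with `#` starting a comment line), and it only returns a headword-to-offsets mapping, since the exact on-disk format of cedict.idx and the unit it counts in are not stated here. The names `build_index`, `width`, and the `unit` parameter are hypothetical.

```python
from collections import defaultdict


def width(s: str, unit: str) -> int:
    """Length of s in the chosen unit (which unit cedict.idx uses is an assumption)."""
    if unit == "code_points":
        return len(s)
    if unit == "utf16":
        return len(s.encode("utf-16-le")) // 2
    if unit == "utf8":
        return len(s.encode("utf-8"))
    raise ValueError(f"unknown unit: {unit}")


def build_index(text: str, unit: str = "utf16") -> dict:
    """Map each headword to the offsets of the lines that define it."""
    index = defaultdict(list)
    offset = 0  # running offset of the current line start, in `unit`
    pos = 0     # position in `text`, in Python code points
    while pos < len(text):
        end = text.find("\n", pos)
        end = len(text) if end == -1 else end + 1
        line = text[pos:end]  # keeps "\r\n" or "\n" so it is counted
        if line.strip() and not line.startswith("#"):
            fields = line.split(" ", 2)
            if len(fields) >= 2:
                traditional, simplified = fields[0], fields[1]
                index[traditional].append(offset)
                if simplified != traditional:
                    index[simplified].append(offset)
        offset += width(line, unit)
        pos = end
    return dict(index)


if __name__ == "__main__":
    sample = "# comment\r\n中國 中国 [Zhong1 guo2] /China/\r\n"
    print(build_index(sample))
```

When reading the real file, opening it with `open(path, encoding="utf-8", newline="")` keeps any `\r\n` endings intact, so they are counted rather than silently translated to `\n`.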