LuteOrg / lute-v3

LUTE = Learning Using Texts: learn languages through reading. Python/Flask.
MIT License

Add Kobo dictionary support (requires issue 5 to be done) #6

Open jzohrab opened 1 year ago

jzohrab commented 1 year ago

This is a good idea: a simple offline-style dict.

jzohrab commented 1 year ago

See jzohrab/lute-v3#5 for initial notes.

The Kobo dictionaries at https://www.epubor.com/kobo-dictionary-download-and-install.html are a good start, but you need to change the http links to https.

When the dict is downloaded and the zip is decompressed, it contains a bunch of files, e.g. co.html, but these are in fact gzip-compressed data. You can decompress them, e.g.

cp co.html hack_co.html
gzip -S .html -d hack_co.html
mv hack_co hack_co.data

and this results in a file called hack_co.data with data like the following:

<w><p><a name="correr"/><b>correr</b> [koˈreɾ]<br/><br/>
<p>Del latín <i>currere</i></p><br/><ol><li>Desplazarse rápidamente ....</li>
...
<variant name="corra"/>
<variant name="corre"/>
...
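
The same decompression can probably be done from Python with the standard gzip module, which would avoid the copy/rename steps. A minimal sketch: the function name is hypothetical, and UTF-8 encoding of the payload is an assumption (the sample above contains IPA characters, so it is at least not plain ASCII).

```python
import gzip
from pathlib import Path


def decompress_kobo_file(src: Path, dest: Path) -> str:
    """Gunzip one Kobo dictionary file (e.g. co.html) into dest; return its text.

    Hypothetical helper, not part of Lute.
    """
    raw = gzip.decompress(src.read_bytes())
    dest.write_bytes(raw)
    # Assumption: the decompressed payload is UTF-8 text.
    return raw.decode("utf-8")
```

For example, decompress_kobo_file(Path("co.html"), Path("hack_co.data")) should produce the same hack_co.data as the shell commands.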

So these files could be pre-processed to build an initial index that maps each word, and all (??) of its variants, to the headword used in the data files. A Lute-Kobo lookup could then look like this: given the input word fui, the pre-processed file initial_index_fu.data contains something like fui: ir (fui being one of the variants of ir, we hope!), and the actual lookup is then done using ir to get the definition.

I don't know how this would/should work for ambiguous mappings. Perhaps something like gato: gato; gatar (if there is a word like gatar).
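
A rough sketch of what that pre-processing might look like, including the ambiguous case. Several things here are assumptions, not verified against the real files: that each entry is wrapped in `<w>...</w>` (only the opening tag appears in the sample), that the `<variant>` tags sit inside that wrapper, and that `<a name="..."/>` always carries the headword. It uses regexes rather than a real HTML parser, and all names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumption: one <w>...</w> block per dictionary entry.
ENTRY_RE = re.compile(r"<w>(.*?)</w>", re.DOTALL)
HEADWORD_RE = re.compile(r'<a name="([^"]+)"\s*/>')
VARIANT_RE = re.compile(r'<variant name="([^"]+)"\s*/>')


def build_variant_index(data: str) -> Dict[str, List[str]]:
    """Map every headword and variant to the headword(s) it belongs to."""
    index: Dict[str, List[str]] = defaultdict(list)
    for entry in ENTRY_RE.findall(data):
        head = HEADWORD_RE.search(entry)
        if head is None:
            continue  # entry without a recognizable headword; skip it
        headword = head.group(1)
        # The headword indexes itself, followed by each of its variants.
        for form in [headword] + VARIANT_RE.findall(entry):
            if headword not in index[form]:
                index[form].append(headword)
    return dict(index)


def format_index_line(form: str, headwords: List[str]) -> str:
    """Render one index line, e.g. 'gato: gato; gatar' for an ambiguous form."""
    return f"{form}: {'; '.join(headwords)}"
```

With this shape, an ambiguous form simply ends up with more than one headword in its list, and the lookup side can decide whether to show all of the definitions or only the first.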

If nothing is found, just return 'not found'.
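
The two-step lookup could then be sketched roughly as follows. This assumes the index and the definitions have already been loaded into plain dicts; how they get loaded, and the function and variable names, are hypothetical. For an ambiguous word it just returns every definition it finds, which is one possible answer to the question above, not a decision.

```python
from typing import Dict, List

NOT_FOUND = "not found"


def kobo_lookup(
    word: str,
    index: Dict[str, List[str]],
    definitions: Dict[str, str],
) -> str:
    """Resolve word to its headword(s) via the index, then fetch definitions."""
    headwords = index.get(word, [])
    found = [definitions[h] for h in headwords if h in definitions]
    if not found:
        return NOT_FOUND
    # Ambiguous forms (several headwords) return every definition, one per line.
    return "\n".join(found)
```

So kobo_lookup("fui", index, definitions) would first resolve fui to ir through the index and then return the definition stored under ir.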

The pre-processing could be done outside of Lute, or as a heavyweight initial load. Outside is better, I think: less crap to go wrong in the app, separate concern.