baskerville / plato

Document reader

Some StarDict dictionaries have their HTML tags exposed when converted. #253

Closed: foldfree closed this issue 2 years ago

foldfree commented 2 years ago

--version Plato 0.9.30 --device Kobo Libra H2O

Following these instructions, I added StarDict dictionaries from ebook-reader-dict to the /dictionaries/ folder. This worked fine a couple of months ago, but since I updated the dictionaries today, I have a formatting issue where the HTML tags for font styling are visible (see the attached image, photo_2022-08-27_16-06-16).

Not sure if I did something wrong or if BoboTiG changed the formatting of their dictionaries.

baskerville commented 2 years ago

This particular definition breaks the HTML detection routine: the first non-blank character is \ and not <.

foldfree commented 2 years ago

Sadly, all the definitions are formatted the same way, in the English dictionary as well. I'm guessing the other languages provided by the repo are too. Would it be possible to fix it?

baskerville commented 2 years ago

You can try the fix by amending convert-dictionary.sh or you can wait for a new version to be released. In both cases you'll have to let the StarDict dictionary be reconverted.

BoboTiG commented 1 year ago

Out of curiosity, should we change something on our side in https://github.com/BoboTiG/ebook-reader-dict to prevent using hacks or reconverting dicts?

foldfree commented 1 year ago

> Out of curiosity, should we change something on our side in https://github.com/BoboTiG/ebook-reader-dict to prevent using hacks or reconverting dicts?

I am not a dev, but I am guessing that replacing backslashes (\) with the HTML entity &#92; could be a solution?

baskerville commented 1 year ago

> Out of curiosity, should we change something on our side in https://github.com/BoboTiG/ebook-reader-dict to prevent using hacks or reconverting dicts?

If the first non-blank character in the definition is <, then the definition is seen as HTML by Plato's Dictionary application, otherwise just text. Since some of the definitions start with raw pronunciation strings (for example \ˈsɪɡ.mə ˈæl.dʒɪ.bɹə\), I've had to wrap those strings in a tag.
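For readers following along, the detection rule described above can be sketched in shell. This is only an illustration of the rule as stated in this comment, not Plato's actual code; the function name looks_like_html is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper illustrating the rule described above (not Plato's code):
# a definition counts as HTML only if its first non-blank character is '<'.
looks_like_html() {
  # Strip leading whitespace with parameter expansion:
  # ${1%%[![:space:]]*} is the leading run of whitespace, which is then
  # removed from the front of the string.
  trimmed="${1#"${1%%[![:space:]]*}"}"
  case "$trimmed" in
    '<'*) return 0 ;;   # starts with a tag: treated as HTML
    *)    return 1 ;;   # anything else: treated as plain text
  esac
}

# A definition that starts with a raw pronunciation string is seen as text,
# so any tags after it would be displayed literally.
for def in '  <p>/abc/ <i>noun</i></p>' '\abc\ <i>noun</i>'; do
  if looks_like_html "$def"; then
    printf 'html: %s\n' "$def"
  else
    printf 'text: %s\n' "$def"
  fi
done
```

Under this rule, wrapping the leading pronunciation string in a tag is enough to move a definition from the text branch to the HTML branch.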

MoTem commented 1 year ago

Since the issue has yet to be resolved by either Plato or BoboTiG, can someone guide me on what exactly to edit in convert-dictionary.sh?

baskerville commented 1 year ago

> Since the issue has yet to be resolved by either Plato or BoboTiG, can someone guide me on what exactly to edit in convert-dictionary.sh?

On Plato's side, the issue was resolved on August 27, 2022 by https://github.com/baskerville/plato/commit/67bd7fba69737840f7fbfea0864c8141e912e0d7.

occivink commented 1 year ago

I've imported these dictionaries (version 3.0.0 from 2023-05-01) into Plato 0.9.35 and the issue persists. This occurs with all the dictionaries I've tried (English, French, and German). There doesn't seem to be anything of interest in Plato's log. (Screenshot attached.)

occivink commented 1 year ago

Unfortunately, the fix worked for English but not for all dictionaries. I had not realized that it relies on the pronunciation delimiters, which appear to be \abc\ for French, /abc/ for English, and [abc] for German. I had a brief look at Wiktionary, and it looks like those three should cover most languages. I'm not sure whether this should be normalized in the export, with the pronunciation wrapped in a paragraph (@BoboTiG), or whether the workaround should be extended. For now I've done the latter by changing the sed call to sed 's|^\([\[/].*\)|<p>\1</p>|', but that looks like a fragile fix to me.
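As a rough illustration of what that sed expression does, here is a small sketch run on made-up sample lines (the samples and the wrap_pronunciation name are hypothetical, not taken from the dictionaries or from convert-dictionary.sh). It assumes POSIX bracket-expression semantics, where a backslash inside [...] is a literal character, so the set [\[/] matches a leading \, [ or /.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed call quoted above: any line that
# starts with \, [ or / is wrapped whole in <p>...</p>; other lines pass
# through unchanged.
wrap_pronunciation() {
  sed 's|^\([\[/].*\)|<p>\1</p>|'
}

# Made-up sample lines using the three delimiter styles mentioned above,
# plus one line that already starts with a tag.
printf '%s\n' \
  '\abc\ <i>nom</i>' \
  '/abc/ <i>noun</i>' \
  '[abc] <i>Substantiv</i>' \
  '<p>already wrapped</p>' | wrap_pronunciation
```

The expression only looks at the first character of each line, so a definition that starts with some other character before the pronunciation would still be treated as plain text, which is one way the workaround is fragile.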

baskerville commented 1 year ago

I just noticed another bug in the English dictionary: all the definitions end with a closing HTML tag but have no opening tag.