Generate SVG rather than GIF for embedded pictures

BoboTiG commented 2 years ago

A successful experiment was done in https://github.com/BoboTiG/ebook-reader-dict/issues/1182#issuecomment-1027245425 on moving embedded pictures from GIF to SVG. The results are much better, so let's make the move.

We first need to ensure this works with PyGlossary and StarDict display.

Note: PyGlossary 4.4.2 or newer is required.

BoboTiG commented 2 years ago

@Moonbase59 can you test dict-fr-fr.zip :pray:?

BoboTiG commented 2 years ago

Assigned @lasconic just to not steal your work ;)

lasconic commented 2 years ago

Base64-encoded SVG is really huge (2 to 3x the size of GIF/PNG) and doesn't compress well. I would like to find a way to use only the plain SVG. So far, I have failed to find a way that works on Kobo.

The GIFs we currently have are:

- Math/chem formulas generated with LaTeX. I have a solution with base64-encoded SVG. See https://github.com/BoboTiG/ebook-reader-dict/issues/1182#issuecomment-1027245425
- The hieroglyphs. Here the sources are bad-looking PNGs... So now that we know we can use PNGs, we could replace the GIFs with PNGs, but that's it. Using SVG is not possible here, unless we generate SVGs from the PNGs... which would be, for the most part, a manual process...

Moonbase59 commented 2 years ago

@Moonbase59 can you test dict-fr-fr.zip :pray:?

Using Standard and ilius’ themes:

cercle unité - GoldenDict_002

cercle unité - GoldenDict_003

Btw, I think we shouldn’t store the SVGs in /res as .svg+xml but as .svg.
Also, I wouldn’t try to generate SVG from raster images—this usually never works out (or embeds the raster image within the SVG).
Anything we can convert ourselves from LaTeX—fine. Maybe even MathML, but it’s questionable if we should try to convert that. Don’t really know what the Wiktionaries use mostly.
Plus, we might have to find a way to set SVG fill color, for people using night modes or different styles. (We can see our SVGs show "black-on-black" using ilius’ GoldenDict theme.)

Moonbase59 commented 2 years ago

I wonder if we do have to use base64 for svg, or maybe can use <svg>…</svg> directly?

Moonbase59 commented 2 years ago

KOReader (on Linux) only shows [image]:

Altherrenjagd - KOReader_001

There was an issue once with KOReader about displaying [image] when MuPDF (the renderer used) can’t find the images, so I still assume KOReader can, in principle, display SVG.

But @Frenzie said: »As an aside, even if it's still an issue in 1.17/1.18, it'll support base64-encoded images which you always could use as a workaround.«

Moonbase59 commented 2 years ago

Foliate (Linux)—doesn’t currently support HTML StarDict, only "m" type:

Test Fraktur_005

lasconic commented 2 years ago

And the PNG one? Does it work in KOReader?

Moonbase59 commented 2 years ago

I’m testing too many things at once, lol … got a link or a quick test file?

Frenzie commented 2 years ago

@Moonbase59 The most pertinent reference would be https://github.com/koreader/koreader/pull/7057 regardless.

Changing the dictionary itself to use base64 is surely a lot more effort — and in any case the MuPDF upgrade is a bit on hold atm unfortunately.

Moonbase59 commented 2 years ago

@Frenzie: Thanks for chiming in! Is that function built into KOReader? And if yes, could it be extended to also handle png and svg, ideally within KOReader, to avoid [image] being displayed?

From your standpoint, how should an image link within a StarDict dict look like, ideally?

src="image.svg"
src="res/image.svg"
?

Frenzie commented 2 years ago

Yes, see https://github.com/koreader/koreader/wiki/Dictionary-support#html-encoding-within-stardict-dictionaries-supported for how to use it.

From your standpoint, how should an image link within a StarDict dict look like, ideally?

Given that it's supposed to be rref, a transformation function like that should likely be applied by default, but I've merely provided some scaffolding so that someone who has such dictionaries can better investigate the situation on the ground.

Frenzie commented 2 years ago

See https://github.com/huzheng001/stardict-3/blob/96b96d89eab5f0ad9246c2569a807d6d7982aa84/dict/doc/StarDictFileFormat#L409-L410 and https://github.com/huzheng001/stardict-3/blob/96b96d89eab5f0ad9246c2569a807d6d7982aa84/dict/doc/StarDictFileFormat#L431-L444 for the specification.

Moonbase59 commented 2 years ago

@Frenzie: Thanks for the pointers, much appreciated. Since you’re surely involved more deeply in that than I am, a further question, if I may:

Does that mean we either have to use sametypesequence=hr and provide a resource dict, or can just assume it’ll find everything in the /res folder if not providing a resource dict and using only sametypesequence=h?

The image reference should then look more like <rref type="image">image.svg</rref>? Instead of using <img>?

Moonbase59 commented 2 years ago

@BoboTiG, @lasconic, @ilius: This would probably mean some special handling for StarDict type output. Could we setup some reliable test case?

Guess the others (Kobo) will be happy with pure HTML and base64-encoded images, right?

Frenzie commented 2 years ago

Since you’re surely involved more deeply in that than I am

I wouldn't be so sure. If I'm reading something more challenging in French, I prefer to read with my paper Petit Larousse and/or Petit Robert next to me. ;-)

Does that mean we either have to use sametypesequence=hr and provide a resource dict, or can just assume it’ll find everything in the /res folder if not providing a resource dict and using only sametypesequence=h?

If you're specifically concerned with sdcv in combination with KOReader, you should use sametypesequence=h. Do you have a sample of one of these hr types somewhere?

The image reference should then look more like <rref type="image">image.svg</rref>? Instead of using <img>?

It doesn't matter as long as it's something that can be easily matched by string.gsub() or equivalent. No attempt to do anything like that is built in at the moment.

Guess the others (Kobo) will be happy with pure HTML and base64-encoded images, right?

Note that assuming what I wrote was correct, that means at least in KOReader that won't work right now. The MuPDF version isn't 1.17/1.18/1.19, but still 1.13.

Moonbase59 commented 2 years ago

If you're specifically concerned with sdcv in combination with KOReader, you should use sametypesequence=h. Do you have a sample of one of these hr types somewhere?

What is sdcv? No example of 'hr', sorry. Seems nobody uses the more exotic forms of StarDict. ;-) Although there are some possibilities hidden there… I always wondered if a combined 'hm' version was possible, for instance.

It doesn't matter as long as it's something that can be easily matched by string.gsub() or equivalent. No attempt to do anything like that is built in at the moment.

Does this mean we (or pyglossary) would have to provide a dictname.lua, containing any such conversions?

Note that assuming what I wrote was correct, that means at least in KOReader that won't work right now. The MuPDF version isn't 1.17/1.18/1.19, but still 1.13.

So you’re saying the current KOReader can’t show any base64-encoded images in HTML StarDicts? Any ETA for a newer MuPDF being integrated?

Since KOReader seems to become a more and more important platform, I feel this project should generate dictionaries for, in order of priority

Kobo (the original intention, I guess)
further conversion (DictFile)
StarDict (KOReader, most other readers and dictionary software can use these)
Tolino (.quickdic v6 format; not yet supported but many Tolino e-readers out there)

lasconic commented 2 years ago

Regarding SVG and base64, it seems Kobo can support non-base64-encoded SVG generated with the following code (again using the latest Kobo software, 4.31.19086):

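    # Body of the formula-rendering function. It expects these imports:
    #   from io import BytesIO
    #   import urllib.parse
    #   from sympy import preview
    # `expr`, `packages` and `IMG_CSS` come from the surrounding project code.
    dvioptions = []
    with BytesIO() as buf:
        preview(
            f"${expr}$",
            output="svg",
            viewer="BytesIO",
            outputbuffer=buf,
            dvioptions=dvioptions,
            packages=tuple(packages),
        )

        buf.seek(0)
        raw = buf.read()  # raw SVG bytes

    # Percent-encode the SVG instead of base64-encoding it.
    return f'<img style="{IMG_CSS}" src="data:image/svg+xml;charset=utf8,{urllib.parse.quote(raw)}"/>'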

The zipped kobo en dictionary with "graph" is 7,037 bytes with this code vs 4,052 bytes with GIF vs 12,342 bytes with base64 encoded SVG.
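To reproduce this kind of comparison on a single image, a minimal sketch could build both data URIs and measure them after compression. The helper names `svg_data_uris` and `compressed_size` are made up for illustration, and zlib is only a rough stand-in for the zipped dictionary, so the absolute numbers will differ from the ones above.

```python
import base64
import urllib.parse
import zlib


def svg_data_uris(raw_svg: bytes) -> dict:
    """Return the same SVG as a percent-encoded and as a base64 data URI."""
    quoted = "data:image/svg+xml;charset=utf8," + urllib.parse.quote(raw_svg)
    b64 = "data:image/svg+xml;base64," + base64.b64encode(raw_svg).decode("ascii")
    return {"quoted": quoted, "base64": b64}


def compressed_size(text: str) -> int:
    """Size in bytes after zlib compression (approximation of the zipped size)."""
    return len(zlib.compress(text.encode("ascii"), 9))


if __name__ == "__main__":
    # Hypothetical sample; replace with a real SVG produced from LaTeX.
    sample = (
        b'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
        b'<path d="M1 1h8v8h-8z"/></svg>'
    )
    for name, uri in svg_data_uris(sample).items():
        print(name, "raw:", len(uri), "compressed:", compressed_size(uri))
```

Base64 always grows the payload by about a third before compression, and it hides the repeated substrings of the SVG text from the compressor, which is one plausible reason for the gap reported here.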

My understanding is that pyglossary can now handle base64 encoded SVG (meaning extract them) and create a stardict file but it would need to be changed to support non base64 encoded SVG.

lasconic commented 2 years ago

@BoboTiG How do you want to continue with this matter ?

Frenzie commented 2 years ago

@Moonbase59

What is sdcv?

For better or worse, https://github.com/Dushistov/sdcv is what's used behind the scenes. We ask it for results in a JSON format, which among a few other features was added specifically by/for us.

Does this mean we (or pyglossary) would have to provide a dictname.lua, containing any such conversions?

If you want it to work right this very second, yes. But otherwise there's no need for that as long as you do something sensible. I would assume that <rref type="image"> and <img src=""> are the primary candidates for something sensible. I'll make sure it works, and don't be afraid to ping me. ;-)

[edit] Actually these SVG files are too complex for any version of MuPDF I'm afraid, presumably due to the use of SVG fonts. Anyway, for the test file supplied above, a quick working proof of concept would be something like:

return function(html, dict_path)
    html = html:gsub('src="%./([^"].-)"', 'src="'..dict_path..'res/%1"')
    return html
end

But for use in KOReader they should be converted to simple paths (afaict that's pretty much all these are in the first place, doesn't exactly look like it's using the font part of the font — no wait, one of them uses some brackets twice :-) ), or a PNG would also do the trick. [edit]
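For anyone who wants to apply the same rewrite at dictionary build time rather than in KOReader, here is a rough Python equivalent of the Lua proof of concept. The function name `rewrite_image_paths` is hypothetical, and the regular expression is a direct translation of the Lua pattern, not something KOReader or PyGlossary provides.

```python
import re


def rewrite_image_paths(html: str, dict_path: str) -> str:
    """Point relative ./image references at the dictionary's res/ folder."""

    def repl(match):
        # A function replacement avoids escaping issues if dict_path
        # happens to contain backslashes.
        return f'src="{dict_path}res/{match.group(1)}"'

    # Same idea as the Lua pattern: src="./<name>" with a lazy match up to the quote.
    return re.sub(r'src="\./([^"].*?)"', repl, html)
```

For example, `rewrite_image_paths('<img src="./da1264de.svg">', '/dicts/fr/')` yields `<img src="/dicts/fr/res/da1264de.svg">`.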

So you’re saying the current KOReader can’t show any base64-encoded images in HTML StarDicts? Any ETA for a newer MuPDF being integrated?

If what I wrote back in 2020 is correct. At a quick glance it looks like 1.15 was the first to do it officially but there's a small chance it'd only take a couple of minutes to backport since base64 support itself has been in MuPDF for much longer. But base64 will inflate file sizes, so I don't think there's much of an upshot to it in most scenarios either way.

There's no ETA because the developer who was working on it kind of disappeared; I hope they're okay. None of us has really had the time to take over.

Moonbase59 commented 2 years ago

@Frenzie: Thanks for being so helpful! Guess we keep on testing, since SVG for formulae would really be a killer feature.

@lasconic: Looks like a tightrope walk between functionality and compatibility with most readers. I’d love to see SVG but we must make sure it works on Kobo, readers using StarDict, and maybe eventually Tolinos.

As the "next bad" alternative, I’d suggest transparent PNGs.

BoboTiG commented 2 years ago

Could you recap what is supported on what please? External SVG files seem the best choice, or I missed something?

Frenzie commented 2 years ago

@BoboTiG External SVG files are supported in KOReader, but the MuPDF library used for rendering only supports a limited subset of SVG. The simplest way to check what they will look like is to check in MuPDF 1.13 available here (and 1.18/1.19 if you want to see what it will look like in the future).

These specific SVGs use SVG fonts, which don't seem to be supported by MuPDF (nor is fallback text rendering apparently, unless there's yet another factor involved). For that matter neither do Firefox or Chromia, and the Firefox people expressed incredulity that I wanted it to support SVG fonts 15 years ago, but that aside.

A conversion to paths should do the trick and given the overhead from the font definitions I'm seeing in these samples it might even be smaller to boot. Unfortunately it looks like Inkscape has some issues with the SVG fonts too, but I mean the equivalent of something like this, except with the correct result:

inkscape --export-text-to-path da1264de.svg -o out.svg

Inkscape simply strips out the fonts there…

Moonbase59 commented 2 years ago

I think—but @lasconic has done all the testing—what we might need is:

base64 encoding for Kobo dicthtml, since file:/// and dict:/// usually fail one way or the other.
external image and sound files for StarDict, in the res folder
PNG (using transparency) preferred over GIF in case of raster images

SVG needs some more testing. Some thoughts:

The easiest way might be best: Try to embed as <svg>…</svg> in the dict HTML and see if both Kobo and readers using StarDict can render these. This might also be advantageous for CSS.
We currently should generate a type of SVG that can be rendered using MuPDF 1.13, for KOReader (which definitely gains a hefty user base). It looks like we use dvisvgm to create the SVGs, how about adding --no-fonts to its command line?
If this doesn’t work out, we might need external files for StarDict.
We definitely need a way to set an SVG’s fill color, using CSS, for
- readers that support color
- non E-Ink readers
This is needed, for example, to set a "night mode". E-Ink readers usually just invert the (b/w) display image, so we are in luck here, but color devices and LCD displays need the color of the SVG fill changed.
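One way to make the fill controllable from CSS is to set `fill="currentColor"` on the root element, so an inline `<svg>` inherits the text color of the surrounding HTML. This is a sketch under assumptions: the helper name `use_current_color` is made up, it only helps for inline `<svg>` (not for `<img>`), and whether Kobo or MuPDF honor `currentColor` has not been verified here.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Keep the usual prefixes when serializing instead of ns0/ns1.
ET.register_namespace("", SVG_NS)
ET.register_namespace("xlink", "http://www.w3.org/1999/xlink")


def use_current_color(svg_text: str) -> str:
    """Make shapes without their own fill follow the CSS text color."""
    root = ET.fromstring(svg_text)
    # fill is inherited, so paths without an explicit fill pick this up.
    root.set("fill", "currentColor")
    return ET.tostring(root, encoding="unicode")
```

With that in place, a night-mode stylesheet would only need to change `color` on the surrounding element.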

Moonbase59 commented 2 years ago

Here’s a quick-n-dirty test I hacked together (generate no-fonts SVG from LaTeX math): svg-test.zip

Result: file

fill works, too (here in the SVG, due to GH restraints): file-red

Formula stolen from FR Wiktionary, "cercle unité".

@Frenzie: Care to test? ;-)

Frenzie commented 2 years ago

That works perfectly in current MuPDF, but not in the older one. :-( I'm fairly sure <defs> should be supported though; I'll have to investigate it later.

It looks like I may have been wrong about the SVG fonts in some sense btw; MuPDF has had code for converting symbols into paths for its own internal use for some 8 years now.

I'll try to investigate what specifically is breaking some other day when I have more time.

lasconic commented 2 years ago

It looks like we use dvisvgm to create the SVGs, how about adding --no-fonts to its command line?

Sure, no problem. The test here https://github.com/BoboTiG/ebook-reader-dict/issues/1198#issuecomment-1028357724 is done with --no-fonts, the default if dvioptions is empty.
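When dvisvgm is called directly rather than through a wrapper, the option can be passed explicitly. The sketch below assumes dvisvgm is installed and on the PATH; the function names are hypothetical, and only the `--no-fonts` and `--stdout` options are used.

```python
import subprocess


def dvisvgm_command(dvi_path: str, no_fonts: bool = True) -> list:
    """Build the dvisvgm argument list for one DVI file."""
    cmd = ["dvisvgm", "--stdout"]  # write the SVG to stdout instead of a file
    if no_fonts:
        cmd.append("--no-fonts")  # emit glyphs as paths, no embedded fonts
    cmd.append(dvi_path)
    return cmd


def dvi_to_svg(dvi_path: str) -> str:
    """Run dvisvgm and return the SVG text (raises if dvisvgm fails)."""
    result = subprocess.run(
        dvisvgm_command(dvi_path), check=True, capture_output=True, text=True
    )
    return result.stdout
```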

The best for Kobo is

'<img style="{IMG_CSS}" src="data:image/svg+xml;charset=utf8,{urllib.parse.quote(rawsvg)}"/>'

PyGlossary would need to be changed so it can export the SVG to a file. Then the question is which file format: should it keep SVG or convert to PNG for the best StarDict support? I guess that's what @Frenzie and @Moonbase59 are trying to figure out.

Frenzie commented 2 years ago

Well, I can tell you without testing anything that PNG will almost certainly have wider compatibility than SVG. I don't think http://stardict-4.sourceforge.net/ supports SVG, for example. A (massaged) SVG would be for KOReader and GoldenDict. PNG is the safe choice, but it is sometimes, or even usually, a bit blurry.

Frenzie commented 2 years ago

@Moonbase59 The answer seems to be fairly simple: MuPDF <1.15 either has trouble with negative values in the viewBox or ignores the viewBox completely. Unfortunately I can't quickly patch that in. I'm not really sure why dvisvgm is coming up with such a complex viewBox in the first place though?

Moonbase59 commented 2 years ago

@Frenzie Don’t know, but I see many tools coming up with such negative values. Maybe done to generate something that better stays on a text line? Just wild assumptions. But we probably won’t be able to avoid it.

You’re surely right in saying "wider compatibility", but it has to start somewhere…

I spent the whole day (and night) yesterday to come up with something that works so-so. And fight all the bugs like tools generating the same IDs in SVGs, which leads to real odd effects when more than one SVG is used on the same page (even across articles, because GoldenDict, for instance, can show more than one entry on the same page).

@lasconic Why is an <img> better than a "pure" <svg> for Kobo? In my testing I found that actually in-place SVGs worked better (at least with GoldenDict, the supposed successor of StarDict). It also can be styled much easier using CSS.

I wrote some command-line tooling that lets my Rexx script have the Wiktionary LaTeX <math>…</math> generate SVGs and embed them, producing a StarDict that looks (and scales) much better than any GIF or PNG. Plus, it can have CSS to set the SVG fill for, say, ilius’ "night mode" for GoldenDict. (Only 7 didn’t render from the German Wiktionary, due to bad/old LaTeX commands used.)

Normalverteilung - GoldenDict_001

Normalverteilung - GoldenDict_002

Normalverteilung - GoldenDict_003

See how bad positioning is, and how nice transparency, scaling and CSS’ing the color are? :-)

Pyglossary seems to have no problems keeping these in the files (without generating externals), and even generates a good-looking .df from my (tab-file) source. Since I have no Kobo, I plan to later try and generate a dicthtml too, and post the StarDict, .df and dicthtml somewhere for you to try. Seems the most complicated problem is actually the correct positioning of the SVGs in the text (as is with GIF and PNG), and avoiding same IDs. Too bad the German Wiktionary hasn’t as many formulas as the French one, but we’ll surely find words to check out.

What I do is let my script convert the entries, and whenever it stumbles over a <math>…</math>, it calls the command-line renderer to render the LaTeX math and return an optimized and minified SVG. I use a combo of LaTeX, dvisvgm and svgo with a modified svgo.config.js for that.
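The replace-while-converting step described above could look roughly like this in Python. It is only a sketch: `replace_math` and the `render` callable are hypothetical names, the real script is written in Rexx, and the renderer (LaTeX, dvisvgm, svgo) is deliberately left pluggable.

```python
import html
import re
from typing import Callable

# Matches <math>…</math>, optionally with attributes on the opening tag.
MATH_TAG = re.compile(r"<math[^>]*>(.*?)</math>", re.DOTALL)


def replace_math(entry_html: str, render: Callable[[str], str]) -> str:
    """Replace each <math> element with whatever render() returns for its LaTeX."""

    def repl(match):
        latex = html.unescape(match.group(1)).strip()
        try:
            return render(latex)
        except Exception:
            # Bad or outdated LaTeX: keep the original markup untouched.
            return match.group(0)

    return MATH_TAG.sub(repl, entry_html)
```

Keeping the original markup on failure means a handful of unrenderable formulas do not abort the whole conversion.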

Sure it’s not yet the output of our project, but the best (and quickest) way to test if SVG would be feasible. After all, there are many people out there that use even more platforms. In Germany, the market is pretty much dominated by Kindle, Tolino, and Pocketbook (in that order), but people also use Kobo, B&N Nook, Boyue, Onyx Books, and so forth. Android apps like CoolReader, Moon Reader, Librera, and KOReader are also often used. On the desktop, we have things like GoldenDict, Foliate, Thorium Reader, KOReader (for testing) and others. Many of these support StarDict.

Let me test a little more here and then upload my German test dicts in StarDict, .df and dicthtml for you to test, ok?

If anyone’s interested, I can also share my render-svg commandline stuff, but it’s rather hacky still, you’d need to install some stuff (svgo) and adapt a few paths.

It will accept a LaTeX math string and return a minified SVG on stdout, like in

render-svg 'a+b'

and return

<svg xmlns:xlink="http://www.w3.org/1999/xlink" width="34.053" height="12.287" viewBox="0 -9.215 25.54 9.215" xmlns="http://www.w3.org/2000/svg"><defs><path id="1iTH_svg__b" d="M4.77-2.762h3.3c.167 0 .382 0 .382-.215 0-.227-.203-.227-.382-.227h-3.3v-3.3c0-.167 0-.382-.215-.382-.227 0-.227.203-.227.382v3.3h-3.3c-.167 0-.382 0-.382.215 0 .227.203.227.382.227h3.3v3.3c0 .167 0 .383.215.383.227 0 .227-.204.227-.383v-3.3Z"/><path id="1iTH_svg__a" d="M3.599-1.423c-.06.204-.06.227-.228.455-.263.334-.789.848-1.35.848-.49 0-.766-.442-.766-1.147 0-.658.37-1.997.598-2.499.407-.837.968-1.267 1.435-1.267.789 0 .944.98.944 1.076 0 .012-.036.167-.048.191L3.6-1.423Zm.765-3.06c-.132-.311-.455-.79-1.076-.79-1.351 0-2.81 1.746-2.81 3.516C.478-.574 1.172.12 1.985.12c.657 0 1.219-.515 1.554-.909.12.705.681.909 1.04.909s.645-.216.86-.646c.192-.407.36-1.136.36-1.184 0-.06-.049-.107-.12-.107-.108 0-.12.06-.168.239-.179.705-.406 1.458-.896 1.458-.347 0-.37-.31-.37-.55 0-.274.035-.406.143-.872.083-.3.143-.562.239-.909.442-1.793.55-2.223.55-2.295a.299.299 0 0 0-.311-.3c-.383 0-.478.42-.502.563Z"/><path id="1iTH_svg__c" d="M2.762-7.998c.012-.048.036-.12.036-.18 0-.119-.12-.119-.144-.119-.012 0-.442.036-.657.06-.204.012-.383.036-.598.048-.287.024-.37.036-.37.25 0 .12.119.12.238.12.61 0 .61.108.61.227 0 .084-.096.43-.144.646l-.286 1.148c-.12.478-.801 3.192-.85 3.407-.059.299-.059.502-.059.658C.538-.514 1.219.12 1.997.12c1.386 0 2.82-1.782 2.82-3.515 0-1.1-.62-1.877-1.517-1.877-.622 0-1.184.514-1.411.753l.873-3.479ZM2.008-.12c-.382 0-.8-.286-.8-1.219 0-.394.035-.622.25-1.459.036-.155.228-.92.275-1.075.024-.096.73-1.16 1.543-1.16.526 0 .765.526.765 1.148 0 .573-.335 1.924-.634 2.546-.299.646-.849 1.22-1.399 1.22Z"/></defs><g transform="translate(-12.77 3.695)"><use xlink:href="#1iTH_svg__a" x="12.77" y="-4.608"/><use xlink:href="#1iTH_svg__b" x="21.572" y="-4.608"/><use xlink:href="#1iTH_svg__c" x="33.333" y="-4.608"/></g></svg>

test

lasconic commented 2 years ago

@lasconic Why is an <img> better than a "pure" <svg> for Kobo?

It doesn't display anything with just <svg>.

Moonbase59 commented 2 years ago

Phew, that’s a show stopper! Why do we always have to cope with quirks, old software, and incompatibilities? Sigh…

ilius commented 2 years ago

Phew, that’s a show stopper! Why do we always have to cope with quirks, old software, and incompatibilities? Sigh…

I still compile GoldenDict with Qt4 on Linux because of bad rendering of Persian/Arabic text with Qt5!

lasconic commented 2 years ago

To solve the issue at hand in this project, I would propose the following:

Export data:image/svg+xml;charset=utf8,{urllib.parse.quote(raw_svg)} for Kobo dictionary format. It will result in clean SVG for kobo users
Export "data:image/png;base64,{b64encode(raw_png).decode()}" for df files and pyglossary will use PNG to create StarDict with PNGs for the broader support.

@ilius Qt6 is here :)

Moonbase59 commented 2 years ago

@ilius For debugging, is it easily possible to keep the xx.raw.html files produced in dicthtml generation? Or are they no longer produced?

Moonbase59 commented 2 years ago

Here is a quick one for you I just generated for testing. Try "Normalverteilung" or such.

dicthtml-de-de-svg.zip

ilius commented 2 years ago

@ilius For debugging, is it easily possible to keep the xx.raw.html files produced in dicthtml generation? Or are they no longer produced?

You mean in converting to Kobo with PyGlossary? *.html files in the zip are actually gzipped html, so you can view them with zless or zcat (also zgrep) in Linux. Not sure about BSD / Mac
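On systems without zcat or zless, the same inspection can be done with Python's standard library. A minimal sketch, assuming the layout described above (gzipped `*.html` members inside the dicthtml zip); the function name `iter_dicthtml_pages` is made up.

```python
import gzip
import io
import zipfile


def iter_dicthtml_pages(zip_source):
    """Yield (member name, decoded HTML) for each gzipped .html file in the zip.

    zip_source may be a path, a file-like object, or the zip file's raw bytes.
    """
    if isinstance(zip_source, bytes):
        zip_source = io.BytesIO(zip_source)
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith(".html"):
                yield name, gzip.decompress(archive.read(name)).decode("utf-8")
```

For example, `for name, page in iter_dicthtml_pages("dicthtml-de-de.zip"): ...` would let you search each page for a given entry.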

Moonbase59 commented 2 years ago

Thanks! Didn’t even know I had zcat and zless, cool!

Moonbase59 commented 2 years ago

Export data:image/svg+xml;charset=utf8,{urllib.parse.quote(raw_svg)} for Kobo dictionary format. It will result in clean SVG for kobo users

I wonder if it’s possible to use CSS on these, like in:

svg {
  vertical-align: middle;
  display: inline-block;
  fill: white;
}

Frenzie commented 2 years ago

@Moonbase59

Maybe done to generate something that better stays on a text line? Just wild assumptions.

No, you can easily adjust the viewBox and paths, even with an argument to dvisvgm itself. I.e., if you pass for example --translate=1,15 when the viewBox starts with something like -1 -15 then it ends up as something close but not quite right. That is, it doesn't end up at 0 0.

The viewBox can be relevant, but here it definitely isn't and it's just weird that it doesn't start at 0 0. I did find this but ultimately it doesn't say much I didn't already know.

Anyway, while I might be tempted to patch old MuPDF to parse the viewBox properly (or at least properly enough to deal with these very specific dvisvgm files), or to do the same adjustment in Lua, I definitely won't promise that.
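The adjustment itself is simple to express outside MuPDF: a viewBox of `min-x min-y w h` is equivalent to `0 0 w h` with all content translated by `(-min-x, -min-y)`. The sketch below does that at build time in Python; `zero_origin_viewbox` is a hypothetical helper, and it has not been tested against MuPDF 1.13, so whether it actually fixes the rendering there is an open assumption.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Keep the usual prefixes when serializing instead of ns0/ns1.
ET.register_namespace("", SVG_NS)
ET.register_namespace("xlink", "http://www.w3.org/1999/xlink")


def zero_origin_viewbox(svg_text: str) -> str:
    """Rewrite the viewBox to start at 0 0 by translating the content instead."""
    root = ET.fromstring(svg_text)
    view_box = root.get("viewBox")
    if view_box is None:
        return svg_text
    min_x, min_y, width, height = view_box.replace(",", " ").split()
    if float(min_x) == 0 and float(min_y) == 0:
        return svg_text

    # Wrap all existing children in one group that undoes the viewBox offset.
    group = ET.Element(f"{{{SVG_NS}}}g")
    shift_x = 0.0 - float(min_x)
    shift_y = 0.0 - float(min_y)
    group.set("transform", f"translate({shift_x:g} {shift_y:g})")
    for child in list(root):
        root.remove(child)
        group.append(child)
    root.append(group)

    root.set("viewBox", f"0 0 {width} {height}")
    return ET.tostring(root, encoding="unicode")
```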

Moonbase59 commented 2 years ago

Just had a video conference with a friend who happens to own a Kobo Aura HD, and we tried my (pure) SVG variant of dicthtml-de-de I linked above. She had firmware 3.19.5613, we later upgraded to 4.31.19086, and both versions displayed the SVGs just fine, albeit a little small.

Maybe we really should go for SVG, and hope the readers will follow… I’d love it.

I post a few screenshots here (made with an old smartphone, apologies).

Before upgrade (3.19.5613):

IMG-20220204-WA0010
"Normalverteilung" has 3 SVGs: a larger formula, and two inside the text.

IMG-20220204-WA0011
It is too small, but perfectly rendered. Maybe the reader’s CSS?

After upgrade (4.31.19086):

IMG-20220204-WA0026

IMG-20220204-WA0025

Moonbase59 commented 2 years ago

Do we have latex and dvisvgm in our build environment? And could we also have svgo please?

Reason being, SVG is a monster and can apparently only function correctly as <svg>, not <img>. I’m working on a Python3 module that can make "good" SVGs from LaTeX math. These will have their width and height in em units (so they will scale with font-size) plus an embedded style="vertical-align:-x.xxxxem" to facilitate correct vertical alignment on the baseline. This means formulae won’t "jump up and down" in between text anymore.

I need svgo to produce embeddable minified "oneliner" SVGs without any XML declarations and/or comments. Also, svgo shall produce (pseudo-)unique IDs in the SVGs, so that nothing bad happens when more than one SVG is used on a page.
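To illustrate the ID problem independently of svgo, here is a small Python sketch that prefixes every `id` in a self-contained SVG, together with the `href="#…"` and `url(#…)` references to it. The name `prefix_svg_ids` is hypothetical and this is not how svgo does it; it is a plain regex pass that assumes double-quoted attributes and no references to IDs outside the SVG.

```python
import hashlib
import re


def prefix_svg_ids(svg_text: str) -> str:
    """Prefix all ids and their references with a short content-derived tag."""
    digest = hashlib.sha1(svg_text.encode("utf-8")).hexdigest()[:6]
    prefix = f"s{digest}_"  # leading letter keeps the id a valid XML name
    # Definitions: id="x"
    svg_text = re.sub(r'(\s)id="([^"]+)"', rf'\1id="{prefix}\2"', svg_text)
    # References: href="#x" (also covers xlink:href) and url(#x)
    svg_text = re.sub(r'href="#([^"]+)"', rf'href="#{prefix}\1"', svg_text)
    svg_text = re.sub(r"url\(#([^)]+)\)", rf"url(#{prefix}\1)", svg_text)
    return svg_text
```

Because the prefix is derived from the SVG's content, two different formulas on the same page get different IDs, while the same formula repeated twice gets identical IDs with identical definitions, which should be harmless.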

First tests in a browser look very promising:

Mozilla Firefox_162

BoboTiG commented 2 years ago

Which svgo are we talking about? Can you share a link?

Anyway, there are no real restrictions on dependencies other than ease of installation.

Moonbase59 commented 2 years ago

Sure, just a mo…

https://github.com/svg/svgo

Unfortunately requires Node, but what the heck. ;-) The only one I could find that really works well.

New status of testing:

Mozilla Firefox_163

A (newer) HTML file for you to experiment with: test.html.zip

Moonbase59 commented 2 years ago

I’m currently running the whole German Wiktionary through this and consider it stable enough to use for the moment. I’ll post my (script-generated) StarDict and Kobo versions and would ask you to verify the output on your actual devices please (I’ll list some words to check).

If your feedback is positive, I could put it on GitHub, so you can pull it and maybe do some experiments with the project’s code. So far at least pyglossary seems to handle the embedded & minified SVGs gracefully on conversions.

The code is still a little hacky and has hefty I/O due to all the things involved (LaTeX, dvisvgm, svgo) but it works. Plus, we have the ability to include some workarounds for "bad" Wiktionary LaTeX code in latex2svg’s LaTeX preamble. The whole German Wiktionary (799,031 definitions) now only has one LaTeX error left (using \Zi), yay!

Next step could be (shiver…) to let it loose on the huge French Wiktionary, which altogether has many more formulae than the German.

Moonbase59 commented 2 years ago

Here’s an example how it can be used in Python3:

Python 3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from latex2svg import latex2svg
>>> out = latex2svg(r'$B \subset \bigcup_{i \in I} A_i$')
>>> print(out['width'], out['height'], out['valign'])
4.692758 0.934191 -0.231567
>>> print(out['svg'])
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="4.693em" height=".934em" viewBox="-0.12 -8.369 55.893 11.127" style="vertical-align:-.231567em"><defs><path id="XbVm_a" d="M5.1-.07v-.437H2.569C1.72-.507.89-1.267.89-2.245c0-.926.725-1.738 1.677-1.738H5.1v-.437H2.603C1.389-4.42.454-3.406.454-2.245S1.39-.07 2.603-.07H5.1Z"/><path id="XbVm_h" d="M4.227-.07v-.437h-1.66c-.785 0-1.554-.646-1.659-1.52H4.08v-.436H.909c.095-.83.785-1.52 1.659-1.52h1.66v-.437H2.602C1.389-4.42.454-3.406.454-2.245S1.39-.07 2.603-.07h1.624Z"/><path id="XbVm_b" d="M2.681-2.472c-.14 0-.2-.017-.2-.061 0-.017 0-.035.017-.052l1.267-2.219h.017l.358 2.332H2.681ZM1.406-.69l.734-1.266c.061-.114.122-.149.323-.149H4.21l.218 1.476a.242.242 0 0 1 .01.079c0 .192-.14.27-.429.297l-.244.017c-.035 0-.062.026-.07.07L3.677 0l.009.017C4 .01 4.472 0 4.796 0c.34 0 .68.009.986.017L5.8 0l.017-.166c0-.044-.026-.07-.07-.07L5.66-.245C5.328-.28 5.188-.41 5.136-.75l-.804-4.734c-.026-.166-.052-.262-.174-.262s-.201.078-.341.314L1.066-.89C.742-.349.498-.262.13-.236c-.044 0-.07.026-.079.07L.026 0l.009.017C.349.01.489 0 .812 0c.34 0 .786.009 1.092.017L1.93 0l.018-.166c.008-.044-.018-.07-.053-.07L1.66-.253c-.219-.018-.315-.105-.315-.236a.38.38 0 0 1 .061-.201Z"/><path id="XbVm_c" d="M2.533-4.979c.052-.288.07-.375.804-.375.41 0 .908.2.908.847 0 .096-.018.192-.035.306-.149.742-.708 1.004-1.406 1.004h-.62l.349-1.782Zm-.402 2.08h.577c.987 0 1.397.497 1.397 1.135 0 .104-.017.21-.035.323C3.94-.804 3.59-.28 2.315-.28c-.498 0-.638-.078-.638-.279 0-.035.009-.079.017-.122L2.131-2.9ZM.175-.167.14 0l.017.017C.55.01.891 0 1.223 0s.445.017 1.196.017c1.669 0 2.28-.873 2.429-1.624.017-.114.035-.218.035-.323 0-.638-.42-1.022-1.014-1.214v-.018c.463-.218.917-.672 1.005-1.144.017-.105.035-.21.035-.323 0-.507-.297-1.022-1.572-1.022-.297 0-.682.017-1.022.017-.324 0-.664-.008-1.057-.017l-.026.017-.035.175c0 .035.008.061.052.061l.245.018c.288.017.384.052.384.21 0 
.043-.009.095-.018.156L1.004-.62c-.06.306-.148.34-.506.367l-.253.017c-.044 0-.062.026-.07.07Z"/><path id="XbVm_i" d="M1.86-4.97 1.013-.664C.961-.419.83-.297.472-.262L.2-.236C.166-.2.148-.06.183.017.576.01.891 0 1.232 0c.331 0 .724.009 1.109.017.026-.034.052-.21.026-.253L2.07-.262c-.34-.026-.437-.14-.384-.402l.847-4.306c.052-.244.183-.367.541-.402l.271-.026c.035-.035.053-.175.018-.253-.393.009-.708.017-1.048.017-.332 0-.725-.008-1.11-.017-.026.035-.052.21-.026.253l.297.026c.34.027.437.14.384.402Z"/><path id="XbVm_g" d="M1.756-4.979c0 .228.192.42.419.42s.42-.192.42-.42-.193-.419-.42-.419-.42.192-.42.42Zm.384 2.184c.079-.34.183-.681.183-.83 0-.122-.07-.183-.183-.183-.28 0-.673.061-1.083.113-.061.07-.044.166.009.245l.349.026c.105.009.157.087.157.175 0 .078-.017.2-.078.445l-.41 1.8c-.062.279-.15.637-.15.803s.097.288.359.288c.419 0 .812-.297 1.214-.812-.026-.079-.079-.148-.201-.148-.227.262-.48.436-.568.436-.061 0-.087-.043-.087-.148 0-.079.043-.271.096-.498l.393-1.712Z"/><path id="XbVm_f" d="M9.183 6.539V0h-.596v6.61c0 1.727-1.524 3.43-3.501 3.43-1.87 0-3.514-1.488-3.514-3.442V0H.977v6.55c0 2.3 1.905 4.074 4.109 4.074S9.183 8.85 9.183 6.54Z"/><use id="XbVm_e" xlink:href="#XbVm_a" transform="scale(1.36364)"/><use id="XbVm_j" xlink:href="#XbVm_b" transform="scale(1.36364)"/><use id="XbVm_d" xlink:href="#XbVm_c" transform="scale(1.36364)"/></defs><use x="-.12" xlink:href="#XbVm_d"/><use x="10.757" xlink:href="#XbVm_e"/><use x="22.04" y="-8.369" xlink:href="#XbVm_f"/><use x="31.887" y="2.662" xlink:href="#XbVm_g"/><use x="35.083" y="2.662" xlink:href="#XbVm_h"/><use x="39.826" y="2.662" xlink:href="#XbVm_i"/><use x="45.377" xlink:href="#XbVm_j"/><use x="53.179" y="1.793" xlink:href="#XbVm_g"/></svg>
>>> with open('test.svg', 'w') as f:
...     f.write(out['svg'])
... 
3667
>>>

test

As a command line tool, it reads stdin and writes to stdout:

echo '$B \subset \bigcup_{i \in I} A_i$' | ./latex2svg.py > test.svg

produces the exact same result.

Moonbase59 commented 2 years ago

Current versions of my script output (German Wiktionary):

dicthtml-de-de-svg.zip (rename to dicthtml-de-de.zip)
MCH DE-DE Stardict.zip

Some phrases to test (which include formulae):

Wort
Bruch
Alphabet
Palindrom
alternierend
Periode
konfluent
leeres Wort
rekursiv
formale Sprache
Normalverteilung
Integrationsproblem

Looks rather usable on my friend’s Kobo Aura HD:

IMG-20220206-WA0000 IMG-20220206-WA0001 IMG-20220206-WA0002 IMG-20220206-WA0003 IMG-20220206-WA0004

Moonbase59 commented 2 years ago

So here we go… https://github.com/Moonbase59/latex2svg

Let me know what you think.

BoboTiG commented 2 years ago

So here we go… https://github.com/Moonbase59/latex2svg

Let me know what you think.

I'll have time to have a look later in the week hopefully.

lasconic commented 2 years ago

Did you try scour ? https://github.com/scour-project/scour

BoboTiG / ebook-reader-dict

Generate SVG rather than GIF for embedded pictures #1198